sklearn.utils.resample stratify parameter is not working #17321

tuhinsharma121 · 2020-05-24T06:15:29Z

I am using

Python 3.6.8
scikit-learn==0.23.1

I am using sklearn.utils.resample for stratified sampling. Here is my code:

from sklearn.utils import resample
y = [1, 1, 2, 3, 2, 2, 1, 3, 1, 1, 2, 3, 2, 2, 1, 3, 4, 4]
sample = resample(y, n_samples=5, replace=False, stratify=y,
                  random_state=0)
print(sample)

This gives me the output:

[3, 2, 2, 1, 1]

Expected output:
The output should contain all the distinct values in y, so at least one "4" should be present. For example:

[3,2,2,1,4]

Am I missing something? Thank you for your help in advance.


alfaro96 · 2020-05-24T17:01:11Z

@tuhinsharma121 AFAIK the sklearn.utils.resample function computes the approximate mode of the multivariate hypergeometric distribution (see Hypergeometric Distribution for details), that is, the number of samples to take from each class. In this case, for classes 1 to 4:

[2, 2, 1, 0]

This means that class 4 would not be in the stratified sample. What is your specific use case? (Maybe we can find an alternative.)
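For illustration, the per-class counts above can be reproduced by hand with a largest-remainder rule. This is only a minimal sketch, not scikit-learn's own code: the helper name `largest_remainder_allocation` is hypothetical, and ties are broken by class order here, which is an assumption (the library may break ties differently, for example at random). The class counts [6, 6, 4, 2] are those of classes 1 to 4 in the `y` from the original post.

```python
import numpy as np


def largest_remainder_allocation(class_counts, n_draws):
    """Hypothetical helper: split n_draws across classes proportionally.

    Each class first gets the floor of its proportional share; the
    leftover draws go to the classes with the largest fractional parts.
    """
    class_counts = np.asarray(class_counts, dtype=float)
    continuous = n_draws * class_counts / class_counts.sum()
    floored = np.floor(continuous).astype(int)
    remainder = continuous - floored
    need_to_add = int(n_draws - floored.sum())
    # Stable sort on the negated remainder: on ties the earlier class wins
    # (an assumption of this sketch).
    order = np.argsort(-remainder, kind="stable")
    floored[order[:need_to_add]] += 1
    return floored.tolist()


counts = [6, 6, 4, 2]  # occurrences of classes 1, 2, 3, 4 in y
print(largest_remainder_allocation(counts, 4))  # [1, 1, 1, 1]
print(largest_remainder_allocation(counts, 5))  # [2, 2, 1, 0]
print(largest_remainder_allocation(counts, 6))  # [2, 2, 1, 1]
```

With 5 draws the proportional shares are roughly [1.67, 1.67, 1.11, 0.56], so the two leftover draws go to classes 1 and 2 and class 4 ends up with none. With 4 or 6 draws, class 4 has one of the largest fractional parts and receives a draw.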

NicolasHug · 2020-05-25T13:44:55Z

This is the expected output considering i) the class proportions in y and ii) the number of samples that you ask for. With n_samples=6 you should start getting 4s.

jnothman · 2020-05-25T21:07:22Z

I'm guessing this is not well tested or implemented with replace=False. It looks like a bug in the use of `_approximate_mode`. Tests and fix are welcome.

jnothman · 2020-05-25T21:08:52Z

Sorry I'd not seen the previous comments.

Yeah stratification with small samples might still imply that we should have a minimum of 1 sample per class...
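One way to picture a "minimum of 1 sample per class" rule is sketched below. It is purely illustrative and not how scikit-learn behaves: the function name `sample_with_min_one_per_class` is hypothetical, it relies only on NumPy, and after reserving one item per class it fills the remaining slots uniformly at random, so the result is only roughly proportional rather than strictly stratified.

```python
import numpy as np


def sample_with_min_one_per_class(y, n_samples, random_state=None):
    """Hypothetical helper: sample without replacement, >= 1 item per class."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    classes = np.unique(y)
    if not len(classes) <= n_samples <= len(y):
        raise ValueError(
            "n_samples must be between the number of classes and len(y)"
        )
    # Reserve one randomly chosen index for every class.
    guaranteed = np.array(
        [rng.choice(np.flatnonzero(y == c)) for c in classes]
    )
    # Fill the remaining slots uniformly from the indices not yet used.
    remaining = np.setdiff1d(np.arange(len(y)), guaranteed)
    extra = rng.choice(
        remaining, size=n_samples - len(classes), replace=False
    )
    picked = np.concatenate([guaranteed, extra])
    return y[picked].tolist()


y = [1, 1, 2, 3, 2, 2, 1, 3, 1, 1, 2, 3, 2, 2, 1, 3, 4, 4]
sample = sample_with_min_one_per_class(y, n_samples=5, random_state=0)
print(sample)  # five values; every class appears at least once
```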

tuhinsharma121 · 2020-05-26T04:55:53Z

First of all thank you all for the explanations. I really appreciate the efforts made by the scikit-learn community to make an awesome ml package. I shall dig a bit deeper to see if I can find any bugs.

@alfaro96 This is the problem I am trying to solve:

I want to split a DataFrame into 4 parts with stratified sampling, making sure that every category from column 'B' is present in each chunk. If a category does not have enough records for all chunks, the same record should be copied into the remaining chunks.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17),
                   'D': np.random.randn(17)})

print(df)

      A      B         C         D
0   foo    one  0.960627  0.318723
1   bar    one  0.269439 -0.945565
2   foo    two  0.210376  0.765680
3   bar  three -0.375095 -1.617334
4   foo    two -1.910716 -0.532117
5   bar    two -0.277426  0.019717
6   foo    one -0.260074  1.384464
7   foo  three  0.072119 -1.077725
8   foo    one  0.093446 -0.683513
9   bar    one -0.154885 -1.453996
10  foo    two -1.258207  1.406615
11  bar  three -0.003332 -0.083092
12  foo    two  1.250562  0.519337
13  bar    two -0.837681 -1.465363
14  foo    one -0.403992 -0.133496
15  foo  three -0.757623 -0.459532
16  bar   four -2.071840  0.802953

The output should look like the chunks below (every category from column 'B' should be present in each chunk; the index doesn't matter):

     A      B         C         D
0   foo    one  0.200466 -0.394136
2   foo    two  0.086008 -0.528286
3   bar  three -1.979613 -1.345405
8   foo    one -1.195563 -0.832880
15  foo  three -0.737060 -0.437047
16  bar   four -2.071840  0.802953

     A      B         C         D
1   bar    one  1.177119  0.693766
4   foo    two  0.452803 -0.595433
7   foo  three  1.285687  1.107021
12  foo    two  1.746976  1.449390
16  bar   four -2.071840  0.802953

     A      B         C         D
6   foo    one -0.095485  0.129541
5   bar    two  0.803417 -0.219461
7   foo  three  1.285687  1.107021
13  bar    two  1.166246 -1.711505
16  bar   four -2.071840  0.802953

     A      B         C         D
9   bar    one  2.001238 -0.283411
10  foo    two  0.865580  0.052533
11  bar  three -0.437604 -0.652073
14  foo    one -0.655985 -0.942792
16  bar   four -2.071840  0.802953

I thought I could solve the problem using resample.
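A possible way to get such chunks without resample is to deal the rows of each category out to the chunks in round-robin fashion, repeating rows when a category has fewer rows than chunks. The sketch below is an illustration under stated assumptions, not an established recipe: `split_with_all_categories` is a hypothetical name, the column is assumed to have no missing values, and the smaller frame here keeps only an id column and 'B' from the example above.

```python
import numpy as np
import pandas as pd


def split_with_all_categories(df, column, n_chunks, random_state=None):
    """Hypothetical helper: split df so each chunk holds every category."""
    rng = np.random.default_rng(random_state)
    chunks = [[] for _ in range(n_chunks)]
    offset = 0  # rotates the starting chunk so chunk sizes stay balanced
    # GroupBy.indices maps each category to the integer positions of its rows.
    for positions in df.groupby(column).indices.values():
        positions = rng.permutation(positions)
        if len(positions) < n_chunks:
            # Too few rows: repeat them cyclically so every chunk gets one.
            positions = np.resize(positions, n_chunks)
        for i, pos in enumerate(positions):
            chunks[(offset + i) % n_chunks].append(int(pos))
        offset = (offset + len(positions)) % n_chunks
    return [df.iloc[sorted(rows)] for rows in chunks]


df = pd.DataFrame({
    "row_id": range(17),
    "B": ["one", "one", "two", "three",
          "two", "two", "one", "three",
          "one", "one", "two", "three",
          "two", "two", "one", "three", "four"],
})

chunks = split_with_all_categories(df, "B", n_chunks=4, random_state=0)
for chunk in chunks:
    print(chunk, end="\n\n")
```

For this data, the single 'four' row is copied into all four chunks, while the rows of the other categories are spread across the chunks without duplication.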

MaxwellLZH · 2022-02-09T08:49:18Z

@NicolasHug I noticed that if I reduce n_samples to 4, I also get a 4 in the sample, which is somewhat unexpected.

from sklearn.utils import resample
y = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4]
resample(y, n_samples=4, replace=False, stratify=y, random_state=0)
# [2, 1, 4, 3]
# 4 appears in sample

resample(y, n_samples=5, replace=False, stratify=y, random_state=0)
# [3, 2, 2, 1, 1]
# 4 not in sample

tuhinsharma121 · 2024-04-02T02:36:12Z

@glemaitre @jnothman Can I work on a PR for this?

tuhinsharma121 · 2024-04-03T11:01:57Z

take

tuhinsharma121 · 2024-04-03T11:35:20Z

I see. take does not work.

adrinjalali · 2024-04-23T12:14:36Z

@tuhinsharma121 you can simply open a PR, no need to assign.

tuhinsharma121 added the Bug: triage label May 24, 2020

tuhinsharma121 changed the title ~~sklearn.utils,resample stratify paramerter is not working~~ sklearn.utils,resample stratify parameter is not working May 24, 2020

tuhinsharma121 changed the title ~~sklearn.utils,resample stratify parameter is not working~~ sklearn.utils.resample stratify parameter is not working May 24, 2020

cmarmo added the module:utils label May 25, 2020

glemaitre added Bug and removed Bug: triage labels Dec 22, 2021


