sklearn.utils.resample stratify parameter is not working #17321

Open
tuhinsharma121 opened this issue May 24, 2020 · 10 comments

Comments

@tuhinsharma121
Contributor

tuhinsharma121 commented May 24, 2020

I am using

Python 3.6.8
scikit-learn==0.23.1

I am using sklearn.utils.resample for stratified sampling. Here is my code:

from sklearn.utils import resample
y=[1,1,2,3,2,2,1,3,1,1,2,3,2,2,1,3,4,4]
sample = resample(y, n_samples=5, replace=False, stratify=y,
         random_state=0)
print(sample)

This gives me the output:

[3, 2, 2, 1, 1]

Expected output:
The output should contain every distinct value in y, so at least one "4" should be present. For example:

[3,2,2,1,4]

Am I missing something? Thank you for your help in advance.

@tuhinsharma121 tuhinsharma121 changed the title sklearn.utils,resample stratify paramerter is not working sklearn.utils,resample stratify parameter is not working May 24, 2020
@tuhinsharma121 tuhinsharma121 changed the title sklearn.utils,resample stratify parameter is not working sklearn.utils.resample stratify parameter is not working May 24, 2020
@alfaro96
Member

alfaro96 commented May 24, 2020

@tuhinsharma121 AFAIK the sklearn.utils.resample function computes the approximate mode of the multivariate hypergeometric distribution (see Hypergeometric Distribution for details), that is, the number of samples to take from each class. In this case:

[2, 2, 1, 0]

This means that class 4 would not appear in the stratified sample. What is your specific use case? (Maybe we can find an alternative.)
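
To make those counts concrete, here is a minimal sketch of a largest-remainder allocation: each class gets the floor of n_samples * class_count / total, and the leftover draws go to the classes with the largest fractional parts. This is an illustration, not scikit-learn's actual code. The name `approximate_allocation` is hypothetical, and breaking ties by class order is an assumption made here to keep the result deterministic.

```python
from collections import Counter
from fractions import Fraction
from math import floor


def approximate_allocation(labels, n_samples):
    """Hypothetical helper: per-class sample counts by largest remainder."""
    counts = Counter(labels)
    classes = sorted(counts)
    total = len(labels)

    # Exact (fractional) share of the draw for each class.
    exact = {c: Fraction(counts[c] * n_samples, total) for c in classes}
    # Start from the floor of each share.
    alloc = {c: floor(exact[c]) for c in classes}

    # Hand the leftover draws to the largest fractional parts.
    # Assumption: ties are broken by class order (sorted() is stable).
    leftover = n_samples - sum(alloc.values())
    by_remainder = sorted(classes, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


y = [1, 1, 2, 3, 2, 2, 1, 3, 1, 1, 2, 3, 2, 2, 1, 3, 4, 4]
for n in (4, 5, 6):
    print(n, approximate_allocation(y, n))
# 4 {1: 1, 2: 1, 3: 1, 4: 1}
# 5 {1: 2, 2: 2, 3: 1, 4: 0}
# 6 {1: 2, 2: 2, 3: 1, 4: 1}
```

With 5 samples the exact shares are 5/3, 5/3, 10/9 and 5/9, so the two leftover draws go to classes 1 and 2 and class 4 gets nothing. With 4 or 6 samples the fractional part for class 4 is large enough to earn a draw.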

@NicolasHug
Member

This is the expected output considering i) the class proportions in y and ii) the number of samples that you ask for. With n_samples=6 you should start getting 4s.

@jnothman
Member

jnothman commented May 25, 2020 via email

@jnothman
Member

Sorry I'd not seen the previous comments.

Yeah stratification with small samples might still imply that we should have a minimum of 1 sample per class...

@tuhinsharma121
Contributor Author

First of all thank you all for the explanations. I really appreciate the efforts made by the scikit-learn community to make an awesome ml package. I shall dig a bit deeper to see if I can find any bugs.

@alfaro96 This is the problem I am trying to solve:

I want to split a DataFrame into 4 parts with stratified sampling, making sure that every category from column 'B' is present in each chunk. If a category does not have enough records for all chunks, the same record should be copied into the remaining chunks.

import numpy as np
import pandas as pd

# 17 rows; column 'B' holds the categories to stratify on
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo',
          'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three',
          'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'four'],
    'C': np.random.randn(17),
    'D': np.random.randn(17),
})

print(df)

      A      B         C         D
0   foo    one  0.960627  0.318723
1   bar    one  0.269439 -0.945565
2   foo    two  0.210376  0.765680
3   bar  three -0.375095 -1.617334
4   foo    two -1.910716 -0.532117
5   bar    two -0.277426  0.019717
6   foo    one -0.260074  1.384464
7   foo  three  0.072119 -1.077725
8   foo    one  0.093446 -0.683513
9   bar    one -0.154885 -1.453996
10  foo    two -1.258207  1.406615
11  bar  three -0.003332 -0.083092
12  foo    two  1.250562  0.519337
13  bar    two -0.837681 -1.465363
14  foo    one -0.403992 -0.133496
15  foo  three -0.757623 -0.459532
16  bar   four -2.071840  0.802953

The output should look like the following (all categories from column 'B' should be present in each chunk; the index doesn't matter):

     A      B         C         D
0   foo    one  0.200466 -0.394136
2   foo    two  0.086008 -0.528286
3   bar  three -1.979613 -1.345405
8   foo    one -1.195563 -0.832880
15  foo  three -0.737060 -0.437047
16  bar   four -2.071840  0.802953

     A      B         C         D
1   bar    one  1.177119  0.693766
4   foo    two  0.452803 -0.595433
7   foo  three  1.285687  1.107021
12  foo    two  1.746976  1.449390
16  bar   four -2.071840  0.802953

     A      B         C         D
6   foo    one -0.095485  0.129541
5   bar    two  0.803417 -0.219461
7   foo  three  1.285687  1.107021
13  bar    two  1.166246 -1.711505
16  bar   four -2.071840  0.802953

     A      B         C         D
9   bar    one  2.001238 -0.283411
10  foo    two  0.865580  0.052533
11  bar  three -0.437604 -0.652073
14  foo    one -0.655985 -0.942792
16  bar   four -2.071840  0.802953

I thought I could solve the problem using resample.
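
One possible way to build such chunks without resample is to deal each category's rows round-robin across the chunks and reuse rows whenever a category has fewer rows than there are chunks. The following is only a rough sketch under those assumptions: `split_with_all_categories` is a hypothetical name, not a scikit-learn or pandas function, and it works on a plain list of labels and returns row positions rather than DataFrames.

```python
import random
from collections import defaultdict


def split_with_all_categories(labels, n_chunks, seed=0):
    """Hypothetical helper: return n_chunks lists of row positions,
    each containing at least one row of every category."""
    rng = random.Random(seed)
    rows_by_category = defaultdict(list)
    for pos, label in enumerate(labels):
        rows_by_category[label].append(pos)

    chunks = [[] for _ in range(n_chunks)]
    offset = 0  # rotate the starting chunk so chunk sizes stay balanced
    for rows in rows_by_category.values():
        rng.shuffle(rows)
        # Deal the rows of this category round-robin across the chunks.
        for k, pos in enumerate(rows):
            chunks[(offset + k) % n_chunks].append(pos)
        # Too few rows for every chunk: copy existing rows into the rest.
        for k in range(len(rows), n_chunks):
            chunks[(offset + k) % n_chunks].append(rows[k % len(rows)])
        offset += len(rows)
    return chunks


b = ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three',
     'one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'four']
for chunk in split_with_all_categories(b, n_chunks=4):
    print(sorted(chunk), sorted({b[i] for i in chunk}))
```

With a DataFrame, `df.iloc[chunk]` would select the rows for each chunk. The split is only as proportional as round-robin dealing allows, so treat it as a starting point rather than a replacement for proper stratified splitting.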

@glemaitre glemaitre added Bug and removed Bug: triage labels Dec 22, 2021
@MaxwellLZH
Contributor

@NicolasHug I noticed that if I reduce n_samples to 4, I also get a 4 in the sample, which is kind of unexpected.

from sklearn.utils import resample
y = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4]
resample(y, n_samples=4, replace=False, stratify=y, random_state=0)
# [2, 1, 4, 3]
# 4 appears in sample

resample(y, n_samples=5, replace=False, stratify=y, random_state=0)
# [3, 2, 2, 1, 1]
# 4 not in sample

@tuhinsharma121
Contributor Author

@glemaitre @jnothman Can I work on a PR for this?

@tuhinsharma121
Contributor Author

take

@tuhinsharma121
Contributor Author

I see. take does not work.

@adrinjalali
Member

@tuhinsharma121 you can simply open a PR, no need to assign.
