sklearn.utils.resample stratify parameter is not working #17321
Comments
@tuhinsharma121 AFAIK the
This means that the class |
This is an expected output considering i) the class proportions in |
I'm guessing this is not well tested or implemented with replace=False. It
looks like a bug in the use of `_approximate_mode`.
Tests and fix are welcome.
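To make the allocation issue concrete, here is a minimal sketch of a largest-remainder allocation applied to the class counts from this issue. It is hand-written for illustration and does not call scikit-learn's private `_approximate_mode`; the function name `largest_remainder_allocation` is hypothetical, and the assumption is only that the real helper follows the same general floor-then-top-up idea (possibly with different tie-breaking).

```python
from collections import Counter
from math import floor


def largest_remainder_allocation(class_counts, n_draws):
    """Split n_draws across classes in proportion to class_counts.

    Hypothetical helper for illustration: floor each class's exact
    share, then hand the leftover draws to the largest remainders.
    """
    total = sum(class_counts.values())
    classes = sorted(class_counts)
    exact = {c: n_draws * class_counts[c] / total for c in classes}
    alloc = {c: floor(exact[c]) for c in classes}
    leftover = n_draws - sum(alloc.values())
    # Classes with the biggest fractional part get the extra draws first.
    by_remainder = sorted(classes, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


y = [1] * 6 + [2] * 6 + [3] * 4 + [4] * 2
counts = Counter(y)

alloc_4 = largest_remainder_allocation(counts, 4)
alloc_5 = largest_remainder_allocation(counts, 5)
print(alloc_4)  # {1: 1, 2: 1, 3: 1, 4: 1}
print(alloc_5)  # {1: 2, 2: 2, 3: 1, 4: 0}
```

Under this scheme class 4 has an exact share of about 0.56 when drawing 5 samples, which floors to 0, and the two leftover draws go to classes 1 and 2 because their remainders (about 0.67) are larger. That is consistent with class 4 showing up for `n_samples=4` but not for `n_samples=5`.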
Sorry, I hadn't seen the previous comments. Yeah, stratification with small samples might still imply that we should have a minimum of 1 sample per class...
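For reference, a rough sketch of what "a minimum of 1 sample per class" could look like outside of scikit-learn, using only the standard library. The function name `sample_with_every_class` and its `seed` parameter are made up for this example. One index is reserved per class and the rest are drawn uniformly from what is left, so the result is only approximately proportional to the class frequencies.

```python
import random
from collections import defaultdict


def sample_with_every_class(y, n_samples, seed=0):
    """Draw n_samples labels without replacement, at least one per class.

    Hypothetical helper: not how sklearn.utils.resample works.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    if not len(by_class) <= n_samples <= len(y):
        raise ValueError("n_samples must be between n_classes and len(y)")
    # Reserve one randomly chosen index for every class.
    chosen = [rng.choice(indices) for indices in by_class.values()]
    taken = set(chosen)
    # Fill the remaining slots uniformly from the indices not yet used.
    remaining = [i for i in range(len(y)) if i not in taken]
    chosen += rng.sample(remaining, n_samples - len(chosen))
    return [y[i] for i in chosen]


y = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4]
sample = sample_with_every_class(y, n_samples=5)
print(sorted(set(sample)))  # [1, 2, 3, 4]
```

Whatever the seed, every distinct value of `y` appears at least once, which is the behaviour the original report expected.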
First of all, thank you all for the explanations. I really appreciate the efforts made by the scikit-learn community to make an awesome ML package. I shall dig a bit deeper to see if I can find any bugs. @alfaro96 This is the problem I am trying to solve: I want to split a DataFrame into 4 parts with stratified sampling, making sure every category from column 'B' is present in each chunk. If a category does not have enough records for all chunks, the same record should be copied into the remaining chunks.
The output should look like the following (all categories from column 'B' should be present in each chunk; the index doesn't matter):
I thought using
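The chunking requirement described in this comment could be sketched roughly as follows. This is plain Python over a list of row dicts rather than a DataFrame, and `split_with_all_categories` is a hypothetical name. Rows of each category are dealt round-robin across the chunks, and when a category has fewer rows than chunks, its existing rows are reused to fill the gaps.

```python
from collections import defaultdict


def split_with_all_categories(rows, key, n_chunks):
    """Split rows into n_chunks so every value of `key` is in each chunk.

    Hypothetical helper: rows are duplicated only for categories that
    have fewer rows than there are chunks.
    """
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[key]].append(row)

    chunks = [[] for _ in range(n_chunks)]
    for cat_rows in by_category.values():
        # Deal this category's rows round-robin across the chunks.
        for i, row in enumerate(cat_rows):
            chunks[i % n_chunks].append(row)
        # Chunks that received nothing get a copy of an existing row.
        for i in range(len(cat_rows), n_chunks):
            chunks[i].append(cat_rows[i % len(cat_rows)])
    return chunks


rows = [{"A": i, "B": b} for i, b in enumerate("aaaaabbbc")]
chunks = split_with_all_categories(rows, key="B", n_chunks=4)
for chunk in chunks:
    print(sorted({row["B"] for row in chunk}))  # ['a', 'b', 'c'] each time
```

There is no shuffling here, so the split is deterministic; shuffling each category's rows before dealing them would give a randomized variant.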
@NicolasHug I noticed that if I reduce
from sklearn.utils import resample
y = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4]
resample(y, n_samples=4, replace=False, stratify=y, random_state=0)
# [2, 1, 4, 3]
# 4 appears in the sample
resample(y, n_samples=5, replace=False, stratify=y, random_state=0)
# [3, 2, 2, 1, 1]
# 4 is not in the sample
@glemaitre @jnothman Can I work on a PR for this?
take
I see.
@tuhinsharma121 you can simply open a PR, no need to assign.
I am using sklearn.utils.resample for stratified sampling. Here is my code:
It gives me this output:
Expected output:
The output should contain all the distinct values in y. At least one "4" should be present in the output. For example:
Am I missing something? Thank you in advance for your help.