[Feature-Request] Add a flag to StratifiedKFold to force classes with only 1 sample in training #10767

Open
akhilkedia opened this issue Mar 7, 2018 · 4 comments

Comments

@akhilkedia

Description

Add a flag to StratifiedKFold that ensures every class is present in the training set.
With StratifiedKFold.split, if a class has only 1 sample, that sample may currently end up in the test split rather than the training split. (sklearn does emit a warning.)
For some applications this is acceptable, but a flag that forces single-sample classes to always be in the training split would be helpful.

Steps/Code to Reproduce

from sklearn.model_selection import StratifiedKFold
import numpy as np
X = np.array([0, 1, 2])
y = np.array([0, 0, 1])


skf = StratifiedKFold(n_splits=2, random_state=0, shuffle=True)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Expected Results

There should be a flag in StratifiedKFold so that at least 1 element of class 1 is always present in train (in this case, the element at index 2).

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2.
  % (min_groups, self.n_splits)), Warning)
TRAIN: [0 2] TEST: [1]
TRAIN: [1 2] TEST: [0]

Actual Results

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2.
  % (min_groups, self.n_splits)), Warning)
TRAIN: [0 2] TEST: [1]
TRAIN: [1] TEST: [0 2]
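
The gap between the expected and actual results can be sketched as a post-processing step over any sequence of splits. This is only an illustration of the requested behaviour, not an existing scikit-learn option: the helper name `force_singletons_into_train` is hypothetical, and the input splits are hard-coded from the actual results reported above.

```python
from collections import Counter


def force_singletons_into_train(splits, y):
    """Yield (train, test) index lists with single-sample classes moved to train.

    `splits` is any iterable of (train_indices, test_indices) pairs.
    This helper is hypothetical and not part of scikit-learn.
    """
    counts = Counter(y)
    # Indices of samples whose class occurs exactly once in y.
    singletons = {i for i, label in enumerate(y) if counts[label] == 1}
    for train, test in splits:
        train = [int(i) for i in train]
        test = [int(i) for i in test]
        moved = [i for i in test if i in singletons]
        kept = [i for i in test if i not in singletons]
        yield sorted(train + moved), kept


y = [0, 0, 1]
# Splits copied from the "Actual Results" output in this issue.
actual_splits = [([0, 2], [1]), ([1], [0, 2])]
fixed = list(force_singletons_into_train(actual_splits, y))
for train, test in fixed:
    print("TRAIN:", train, "TEST:", test)
```

With scikit-learn installed, the same helper should accept `skf.split(X, y)` in place of the hard-coded list, since it only iterates over index pairs. Note that the single-sample class is then never evaluated in any test fold.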

Versions

Linux-4.13.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1
@akhilkedia
Author

I can submit a PR for the same if needed.

@jnothman
Member

jnothman commented Mar 7, 2018

I've been tempted to phase out the current StratifiedKFold implementation, as I noted at #10274 (comment). I think it should be implemented as a stable sort on y followed by a round-robin, because:

Whether this is offered through a separate class, or through a strategy or method option on the same class, I'd be interested in seeing it implemented, and then maybe I can persuade the rest of the core devs, and ideally persuade them to eventually make it the default StratifiedKFold behaviour despite breaking backwards compatibility...
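
One possible reading of "a stable sort on y followed by a round-robin" is sketched below in plain Python. It is an assumption about what the comment describes, not scikit-learn's implementation; the function names `round_robin_stratified_folds` and `splits_from_folds` are hypothetical.

```python
def round_robin_stratified_folds(y, n_splits):
    """Assign each sample to a fold: stable sort by label, then deal round-robin."""
    # sorted() is stable, so samples with equal labels keep their original order.
    order = sorted(range(len(y)), key=lambda i: y[i])
    fold_of = [0] * len(y)
    for position, index in enumerate(order):
        # The counter is not reset between classes, so fold sizes
        # differ by at most one sample overall.
        fold_of[index] = position % n_splits
    return fold_of


def splits_from_folds(fold_of, n_splits):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for k in range(n_splits):
        test = [i for i, f in enumerate(fold_of) if f == k]
        train = [i for i, f in enumerate(fold_of) if f != k]
        yield train, test


y = [1, 0, 1, 0, 1, 0]
fold_of = round_robin_stratified_folds(y, n_splits=3)
for train, test in splits_from_folds(fold_of, 3):
    print("TRAIN:", train, "TEST:", test)
```

In this example every fold receives one sample of each class. A class with a single sample would still land in exactly one test fold under this scheme, so it does not by itself provide the flag requested in this issue.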

@akhilkedia
Author

Issue stale. I do not know if this is fixed or not fixed, but I no longer need this feature.
Closing.

@mfeurer
Contributor

mfeurer commented Sep 13, 2021

Hey @akhilkedia would you mind re-opening this issue? I just ran into the same problem and would like to not open an identical issue. As a workaround I created a custom solution at https://github.com/automl/auto-sklearn/pull/1244/files but would rather see this in scikit-learn itself.
