FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.arange(20)
    y = [0] * 10 + [1] * 10
    StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
    # all the test folds in the split will be [a, b, 10+a, 10+b]
(2) When there are only two samples in the groups, users will always get the same splits, whatever random_state they pass.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.arange(10)
    y = [0] * 5 + [1] * 5
    StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
    # the test folds will always be [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
There's no test here, and I'm quite confused about what's going on.
It seems that what's going on is we currently pass each class's samples to
I will also note that I think there are two design errors here that have been raised previously which make this hard to work with:
I think the solution is simpler than the title of this thread suggests.
My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all the problems; however, it renders shuffle=True useless.
Therefore, I think the appropriate solution is to have StratifiedKFold() shuffle the data with the given random seed (we only have one random seed), and then go about things as normal. In this case, KFold() won't need to shuffle, so that option can be turned off.
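For concreteness, here is a minimal sketch of the shuffle-first approach described above. The helper name `preshuffled_stratified_splits` and the `seed` parameter are hypothetical and only for illustration; this is not how scikit-learn implements `shuffle=True`. It draws one permutation from one seed, splits the permuted labels with `shuffle=False`, and maps the resulting positions back to the original indices.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def preshuffled_stratified_splits(y, n_splits=5, seed=0):
    """Hypothetical helper: shuffle once, split unshuffled, map indices back."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(y))  # a single permutation from a single seed
    skf = StratifiedKFold(n_splits=n_splits, shuffle=False)
    # X is only a placeholder here; stratification depends on the labels.
    for train_pos, test_pos in skf.split(np.zeros(len(y)), y[perm]):
        # Positions refer to the permuted order, so indexing perm with them
        # recovers the original sample indices.
        yield np.sort(perm[train_pos]), np.sort(perm[test_pos])


y = [0] * 10 + [1] * 10
folds = [test for _, test in preshuffled_stratified_splits(y, n_splits=5, seed=42)]
```

Each entry of `folds` holds original indices, so callers never see the permuted order.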
What is "the data" you want to shuffle? We want shuffled indices within each class. Shuffling the data, i.e. y, requires inverting the permutation before returning... So it's not really any simpler. But I agree that the description of the fix goes into too much detail about the implementation.
I agree that this can solve the problem. I chose another way because that is what we did previously, and I guess there's not much difference between the two.
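To illustrate what "shuffled indices within each class" could look like, here is a rough NumPy-only sketch. The function `per_class_shuffled_folds` is hypothetical and is not the code in this pull request or in scikit-learn. It advances a single `RandomState` across classes, so each class is permuted differently, and then deals each class's indices into the folds.

```python
import numpy as np


def per_class_shuffled_folds(y, n_splits, seed=0):
    """Hypothetical sketch: permute each class's indices, then deal into folds."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    test_folds = [[] for _ in range(n_splits)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # original indices of this class
        rng.shuffle(idx)  # the shared RNG advances, so classes are permuted independently
        for fold, chunk in enumerate(np.array_split(idx, n_splits)):
            test_folds[fold].extend(chunk.tolist())
    return [sorted(fold) for fold in test_folds]


y = [0] * 10 + [1] * 10
test_folds = per_class_shuffled_folds(y, n_splits=5, seed=0)
```

Because the indices are shuffled directly, nothing has to be mapped back before returning.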