Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124

Merged
merged 9 commits into from Feb 27, 2019

Conversation

7 participants
@qinhanmin2014
Copy link
Member

qinhanmin2014 commented Feb 9, 2019

Fixes #13110
Our StratifiedKFold shuffle each class's samples with same random_state (i.e., in the same way). This is not good IMO:
(1) We can only get certain splits.

import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(20)
y = [0] * 10 + [1] * 10
StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
# all the test folds in the split will be [a, b, 10+a, 10+b]

(2) When there're only two samples in the groups, users will always get the same splits with different random_state.

import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(10)
y = [0] * 5 + [1] * 5
StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
# the test folds will always be [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]

Revert part of #7823, I think it's a regression introduced in 0.19, ping @jnothman @amueller

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Feb 10, 2019

I would add a test that if shuffle is True then results depends on random_state for all cv classes that support this.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 11, 2019

PR #7823 was included in 0.19.0, so if it's due to that, I don't think it's a regression since 0.19.x.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 11, 2019

There's no test here, and I'm quite confused about what's going on.

It seems that what's going on is we currently pass each class's samples to KFold(self.n_splits, shuffle=self.shuffle, random_state=self.random_state). This PR changes it so that instead of passing self.random_state we pass a RandomState that is progressed in each call to KFold. I don't really see why the current code causes the stated bug.

I will also note that I think there are two design errors here that have been raised previously which make this hard to work with:

  1. In the sklearn.cross_validation -> sklearn.model_selection rewrite, we did not carefully design what to do with random states. Ideally check_random_state should be called in some fit method, or perhaps in __init__, such that each call to split is different. Instead we landed up calling check_random_state in split, such that you get the surprising behaviour that passing in a RandomState will give different results for each split but passing in an int will give fixed results for each split.
  2. StratifiedKFold uses this "apply K fold to each class" which gives some weird (imbalanced) results. We previously had a sort-and-round-robin approach, which I think was rejected for the wrong reasons (all we needed was a stable sort), and it should be reinstated/available as a much simpler approach.

@jnothman jnothman removed this from the 0.20.3 milestone Feb 11, 2019

@qinhanmin2014 qinhanmin2014 changed the title FIX Enable StratifiedKFold to produce different splits FIX Shuffle each stratification with different random_state in StratifiedKFold Feb 11, 2019

@qinhanmin2014

This comment has been minimized.

Copy link
Member Author

qinhanmin2014 commented Feb 11, 2019

@jnothman I'm aware that there's some design issues (maybe errors) in StratifiedKFold, but my intention here is to fix bugs.
Apologies I didn't describe the problem very well. I've updated the PR body. Please take some time to have a look.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 11, 2019

Do you still believe it's a regression since 0.19?

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 11, 2019

@qinhanmin2014

This comment has been minimized.

Copy link
Member Author

qinhanmin2014 commented Feb 11, 2019

Do you still believe it's a regression since 0.19?

#7823 is included in 0.19.X (not included in 0.18.X), so I think it's a regression.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 11, 2019

Right, so a regression introduced in 0.19... so we should not be scheduling it as a 0.20.x fix.

@NicolasHug
Copy link
Contributor

NicolasHug left a comment

LGTM!

Show resolved Hide resolved doc/whats_new/v0.21.rst Outdated
Show resolved Hide resolved sklearn/model_selection/tests/test_split.py Outdated
@jnothman
Copy link
Member

jnothman left a comment

Thanks.

Show resolved Hide resolved doc/whats_new/v0.21.rst
Show resolved Hide resolved sklearn/model_selection/tests/test_split.py Outdated
@qinhanmin2014

This comment has been minimized.

Copy link
Member Author

qinhanmin2014 commented Feb 12, 2019

not each stratification, per se, but each class's samples

this term comes from the doc and I've also updated the doc accordingly.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 12, 2019

@Arjannikov

This comment has been minimized.

Copy link

Arjannikov commented Feb 12, 2019

I think the solution is simpler than the title of this thread suggest.

My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless.

Therefore, I think the appropriate solution is to have StratifiedKFold() shuffle the data with the given random seed (we only have one random seed), and then go about things as normal. In this case, KFold() won't need to shuffle, so that option can be turned off.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 12, 2019

@qinhanmin2014

This comment has been minimized.

Copy link
Member Author

qinhanmin2014 commented Feb 12, 2019

My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless.

I agree that this can solve the problem. I choose another way because we did so previously and I guess there's not too much difference between these two ways.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 13, 2019

@qinhanmin2014

This comment has been minimized.

Copy link
Member Author

qinhanmin2014 commented Feb 13, 2019

I don't understand

I mean we can either shuffle the entire dataset or shuffle each class's samples (with a different random_state). I choose the second way because we did so previously (and it's consistent with our doc).

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 13, 2019

Oh, I wrote "I don't understand" in response to a comment by @Arjannikov that appears to have since been deleted

@Arjannikov

This comment has been minimized.

Copy link

Arjannikov commented Feb 13, 2019

Oh, my comment was. We only have one random seed. So it makes more sense to do the shuffle once, at the very beginning. But later I thought about using the same random seed for different subsets of the data would work also. So, I deleted my comment. Sorry.

@Arjannikov

This comment has been minimized.

Copy link

Arjannikov commented Feb 13, 2019

What confuses me is @qinhanmin2014 note: "(with different random_state)".

@qinhanmin2014 qinhanmin2014 changed the title FIX Shuffle each stratification with different random_state in StratifiedKFold FIX Shuffle each class's samples with different random_state in StratifiedKFold Feb 14, 2019

@Arjannikov

This comment has been minimized.

Copy link

Arjannikov commented Feb 16, 2019

@qinhanmin2014, where do you get "different random_state"? StartifiedKFold() takes only one integer as random_state parameter.

@qinhanmin2014

This comment has been minimized.

Copy link
Member Author

qinhanmin2014 commented Feb 16, 2019

@qinhanmin2014, where do you get "different random_state"? StartifiedKFold() takes only one integer as random_state parameter.

We can produce different random states using one integer (the random_state parameter).

@qinhanmin2014 qinhanmin2014 added this to To do in Sprint Paris 2019 via automation Feb 27, 2019

@qinhanmin2014 qinhanmin2014 moved this from To do to Needs review in Sprint Paris 2019 Feb 27, 2019

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Feb 27, 2019

We can put this in 0.20.X if people want. It should be merged either way (another pro-forma review anyone, or we can accept @NicolasHug's above?)

@GaelVaroquaux
Copy link
Member

GaelVaroquaux left a comment

LGTM. 👍 for merge.

Merging.

@GaelVaroquaux GaelVaroquaux merged commit afc6cc5 into scikit-learn:master Feb 27, 2019

11 checks passed

LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 92.42%)
Details
codecov/project 92.43% (+0.01%) compared to 9f0b959
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

Sprint Paris 2019 automation moved this from Needs review to Done Feb 27, 2019

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:StratifiedKFold-split branch Feb 27, 2019

@ogrisel ogrisel added this to the 0.20.4 milestone Mar 1, 2019

@ogrisel

This comment has been minimized.

Copy link
Member

ogrisel commented Mar 1, 2019

The 0.20.3 tag has already been pushed without this fix but I think we should include it in the next LTS bugfix release.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Mar 3, 2019

Did I fall to cherry pick this? Sorry :(

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019

FIX Shuffle each class's samples with different random_state in Strat…
…ifiedKFold (scikit-learn#13124)

* Enable StratifiedKFold to produce different splits

* what's new

* redundant statement

* update what's new

* redundant comment

* add a test

* move what's new entry

* review comment

* review comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.