
[MRG] Set random seed in init for CV splitters #15177

Status: Open. Wants to merge 24 commits into base: master.
Conversation

@NicolasHug (Contributor)

NicolasHug commented Oct 10, 2019

Fixes #15149
Towards #14042

Addresses bullet point in roadmap

This PR sets a seed attribute in the CV splitters in __init__, which is then used as the only source of RNG in split.

This is IMO a much more natural and simpler behavior w.r.t. randomness.

This is backward incompatible in the sense that:

  1. Multiple calls to split() on the same instance now result in the same splits (this is the intended behavior, so IMO that's a bugfix)
  2. Since the randomness isn't consumed in the exact same way as in master, the splits generated by the code of this PR are different from the splits generated by the master branch.

I guess this is a good first step towards a more natural handling of randomness in scikit-learn. In terms of code the changes involved here are pretty minimal.
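As an illustration of point 1 above (this is a minimal numpy-only sketch of the intended semantics, not the PR's actual code; the class name is made up): the random state is consumed exactly once in `__init__`, and `split()` is deterministic afterwards.

```python
import numpy as np

class SeededSplitter:
    """Toy splitter sketching the proposed semantics: random_state is
    consumed exactly once, in __init__, to draw a fixed seed."""

    def __init__(self, n_splits=3, random_state=None):
        self.n_splits = n_splits
        # the only source of randomness: one seed drawn at construction time
        self._seed = np.random.RandomState(random_state).randint(
            np.iinfo(np.int32).max)

    def split(self, X):
        # a fresh RNG from the stored seed => every call shuffles the
        # same way, hence yields the same splits
        rng = np.random.RandomState(self._seed)
        indices = rng.permutation(len(X))
        for test in np.array_split(indices, self.n_splits):
            train = np.setdiff1d(indices, test)
            yield train, test

cv = SeededSplitter(n_splits=2, random_state=0)
first = [test.tolist() for _, test in cv.split(range(6))]
second = [test.tolist() for _, test in cv.split(range(6))]
assert first == second  # identical splits across calls, unlike on master
```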

@@ -470,41 +472,6 @@ def test_shuffle_kfold():
assert sum(all_folds) == 300


def test_shuffle_kfold_stratifiedkfold_reproducibility():

@NicolasHug (Author, Contributor) commented Oct 10, 2019

replaced by test_random_state_reproducibility

@@ -991,37 +958,6 @@ def test_repeated_cv_repr(RepeatedCV):
assert repeated_cv_repr == repr(repeated_cv)


def test_repeated_kfold_determinstic_split():

@NicolasHug (Author, Contributor) commented Oct 10, 2019

replaced by test_random_state_reproducibility and test_repeated_folds_are_different

@@ -1038,36 +974,21 @@ def test_get_n_splits_for_repeated_stratified_kfold():
assert expected_n_splits == rskf.get_n_splits()


def test_repeated_stratified_kfold_determinstic_split():

@NicolasHug (Author, Contributor) commented Oct 10, 2019

replaced by test_random_state_reproducibility and test_repeated_folds_are_different

NicolasHug added 2 commits Oct 10, 2019
@@ -1247,22 +1243,16 @@ class RepeatedStratifiedKFold(_RepeatedSplits):
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2,
... random_state=36851234)

@NicolasHug (Author, Contributor) commented Oct 10, 2019

Had to change the seed, else the repetitions would be equal (by chance).

NicolasHug added 6 commits Oct 10, 2019
@NicolasHug NicolasHug changed the title [WIP] Set random seed in init for CV splitters [MRG] Set random seed in init for CV splitters Oct 11, 2019
NicolasHug commented Oct 11, 2019

Pinging everyone who took part in #14042

@rth @ogrisel @jnothman @amueller @thomasjpfan

self.n_splits = n_splits
self.shuffle = shuffle
self.random_state = random_state
self.random_state = random_state # only here for repr to work

@thomasjpfan (Member) commented Oct 11, 2019

What are the benefits of having the repr work? With this PR, we do not need to hold on to self.random_state.

@NicolasHug (Author, Contributor) commented Oct 11, 2019

> What are the benefits of having the repr work?

CI and tests not breaking ^^

If we remove the random_state attribute, it is rendered as random_state=None (no matter the value) by the custom repr.

I feel like a reasonable fix would be to print the seed attribute instead of random_state, but I didn't want to involve too many changes in this PR.

NicolasHug added 3 commits Oct 11, 2019
@adrinjalali adrinjalali self-assigned this Oct 21, 2019
NicolasHug commented Oct 22, 2019

Thanks for the review @thomasjpfan. Are you happy with this, or would you prefer we store the state instead of drawing a seed?

adrinjalali (Member) left a comment

First pass.

doc/modules/cross_validation.rst (resolved)
@@ -287,9 +287,17 @@ def __init__(self, n_splits, shuffle, random_state):
raise TypeError("shuffle must be True or False;"
" got {0}".format(shuffle))

if not shuffle and random_state is not None: # None is the default
raise ValueError(

@adrinjalali (Member) commented Oct 23, 2019

Not sure if this should be an error.

@NicolasHug (Author, Contributor) commented Oct 23, 2019

We usually error when parameter combinations don't make sense, right?

That particular one has always been slightly confusing to me.

@adrinjalali (Member) commented Oct 24, 2019

Yeah, with the other comment, now I see why this is very useful.

But this needs a whats_new entry of its own.

@NicolasHug (Author, Contributor) commented Oct 24, 2019

You know what, maybe I'll make a separate PR that we can quickly review. After all, this is unrelated to the proposed changes.

self.n_splits = n_splits
self.shuffle = shuffle
self.random_state = random_state
self.random_state = random_state # only here for repr to work
self.seed = _get_seed_from_random_state_param(random_state)

@adrinjalali (Member) commented Oct 23, 2019

It's kinda odd to have a public attribute which is not set through __init__ (I know it's not an estimator :p )

I think I'd be happier with _seed.

@NicolasHug (Author, Contributor) commented Oct 23, 2019

No strong opinion here. I'll make it private.

sklearn/model_selection/_split.py (resolved)
sklearn/model_selection/tests/test_search.py (outdated, resolved)
thomasjpfan commented Oct 23, 2019

> Are you happy with this, or would you prefer we'd store the state instead of drawing a seed?

I am happy with storing the seed. I am -1 on storing self.random_state when it is never used. Do you plan on submitting another PR to remove it?

NicolasHug commented Oct 23, 2019

> I am -1 on storing self.random_state when it is never used. Do you plan on submitting another PR to remove it?

Yes. That requires changing the __repr__ of the splitters, and I'd rather bikeshed on that in another PR. It's not a big deal to me to temporarily keep self.random_state.

adrinjalali commented Oct 24, 2019

> I am happy with storing the seed. I am -1 on storing self.random_state when it is never used. Do you plan on submitting another PR to remove it?

The random_state is a public attribute; we can't just remove it without a deprecation. I'd be happy to have it deprecated, and replaced by _seed maybe? It feels odd to pass random_state and then not have it in the instance. Also, I'd rather have a __repr__ result which represents the object in a kind of reproducible way.

thomasjpfan commented Oct 24, 2019

> the random_state is a public attribute, we can't just remove it w/o deprecation.

+1 on deprecating it.

NicolasHug added 2 commits Oct 24, 2019
NicolasHug commented Oct 30, 2019

@GaelVaroquaux you seemed to have concerns about doing this during the meeting?

GaelVaroquaux (Member) left a comment

This is a change of semantics. I am not comfortable with it, because it means that the behavior of random states is no longer homogeneous across the library.

I do not find that the justification in #15149 (makes it easier to implement successive halving) is a legitimate case to introduce inconsistencies in the library.

@@ -298,7 +298,9 @@ def __init__(self, n_splits, shuffle, random_state):

self.n_splits = n_splits
self.shuffle = shuffle
self.random_state = random_state
self.random_state = random_state # only here for repr to work
self._seed = _get_seed_from_random_state_param(random_state)

@GaelVaroquaux (Member) commented Oct 30, 2019

The fact that this is done in the init breaks our convention. Arguably, these conventions are only for estimators, so it's not "outlawed" :).

The reason behind those conventions is to enable people changing public attributes without having surprising behavior.

If we were to adapt these conventions, we would do this call at the beginning of the split method. I haven't given a lot of thought to this choice. Do you have an educated opinion on what would be better?

for _ in range(n_repeats):
# the local rng object will be consumed exactly once in
# self.cv.__init__, hence ensuring that each repetition yields
# different folds.

@GaelVaroquaux (Member) commented Oct 30, 2019

That's a clear change of semantics of what "random_state" means. I am not very comfortable with it, because it means that semantics of arguments vary across the library, which is bad for usability.
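The mechanism described in the hunk's comment can be demonstrated with plain numpy (illustrative only, not the PR's code): a shared rng consumed exactly once per repetition yields a distinct seed, and therefore distinct folds, for each repetition.

```python
import numpy as np

rng = np.random.RandomState(42)  # one RNG shared across repetitions

# each repetition "constructs" a splitter, consuming the shared RNG once,
# so every repetition receives its own seed
seeds = [rng.randint(np.iinfo(np.int32).max) for _ in range(3)]

# each repetition then shuffles with its own seed
folds = [np.random.RandomState(s).permutation(6).tolist() for s in seeds]
```

Because the shared rng is itself seeded, re-running the whole loop reproduces the same sequence of per-repetition seeds.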

NicolasHug commented Oct 30, 2019

Thanks for your comments!

> I am not comfortable about it, because it means that the behavior of random states is no longer homogeneous in the library.

I would definitely want to implement the same logic for the estimators as well (I agree inconsistency should be avoided). You can consider this PR as some sort of proof of concept that the required changes should be relatively small.

Regarding breaking our convention about __init__ parameters: I think we could implement the same logic by only introducing changes to split() (or fit()) with something like

# set the _seed attribute the first time split / fit is called
if not hasattr(self, '_seed'):
    self._seed = _get_seed_from_random_state_param(self.random_state)

> The reason behind those conventions is to enable people changing public attributes without having surprising behavior.

I always thought this was meant for clone to work/be easy to implement*, but that makes sense. If we want to allow users to dynamically change the randomness of their estimators, maybe we can consider having a special case in set_params() for the random_state attribute: this would set the _seed attribute as well, exactly like in the above snippet.

*Sort of related: clone typically doesn't "clone" as we might expect when est.random_state is a RandomState instance, since it could have been consumed and you end up with a different estimator from the initial one. I don't know whether this is something we're happy with when it comes to parameter searches. (the current proposal wouldn't fix this, unless we also special-case clone)
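The set_params special case floated above could be sketched as follows; this is hypothetical (the class and helper names are made up for illustration), not part of the PR.

```python
import numpy as np

def _seed_from(random_state):
    # hypothetical helper: turn None / int / RandomState into one int seed
    rng = (random_state if isinstance(random_state, np.random.RandomState)
           else np.random.RandomState(random_state))
    return rng.randint(np.iinfo(np.int32).max)

class SeedAwareSplitter:
    def __init__(self, random_state=None):
        self.random_state = random_state
        self._seed = _seed_from(random_state)

    def set_params(self, **params):
        # special case: re-draw the seed whenever random_state changes,
        # so users can still change the randomness after construction
        if 'random_state' in params:
            self.random_state = params.pop('random_state')
            self._seed = _seed_from(self.random_state)
        for key, value in params.items():
            setattr(self, key, value)
        return self
```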

GaelVaroquaux commented Oct 30, 2019

NicolasHug commented Oct 31, 2019

> we need to think about how to control user expectations

By that, do you mean we should:

  • provide easy ways for users to implement what they were implementing with the old behavior (e.g. bootstrapping),
  • let them know well in advance that their results might change with the new behavior (I feel like such a change would be a 1.0 thing anyway),
  • or both?
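For the first bullet, a hedged sketch of what the old bootstrap-style usage could migrate to under the proposed semantics (numpy-only, names illustrative): derive a new seed, or equivalently a new splitter instance, per repetition instead of calling split() repeatedly on one instance.

```python
import numpy as np

def resample_indices(seed, n):
    # stand-in for one splitter's shuffling, seeded explicitly
    return np.random.RandomState(seed).permutation(n).tolist()

# Old behavior: repeated split() calls on one instance gave a new
# resample each time. New idiom: one derived seed per repetition.
rng = np.random.RandomState(0)
resamples = [resample_indices(rng.randint(np.iinfo(np.int32).max), 5)
             for _ in range(3)]
```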

Thanks a lot for the feedback so far!
