[MRG + 2] FIX Be robust to non re-entrant/ non deterministic cv.split calls #7660

Merged
merged 20 commits on Oct 30, 2016

Conversation

@raghavrv
Member

raghavrv commented Oct 13, 2016

Fixes #6726

  • To make splits consistent across all parameter settings when the cv iterators are non re-entrant (thanks to Joel for the word!) or non-deterministic across successive split calls, we call split once and store the result as a list that can be iterated multiple times (see the sketch after this list).
  • Similarly, in _CVIterableWrapper we store the list rather than the cv object and iterate over it... (NOTE that this would make previously (intentionally) non-deterministic cv iterables deterministic, with consistent output for successive split calls. Do we want that to happen?)
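
A minimal sketch of the idea (illustrative only; the variable names here are mine, not the PR's):

import numpy as np
from sklearn.model_selection import KFold

X, y = np.ones((20, 2)), np.arange(20) % 2

cv = KFold(n_splits=5, shuffle=True)  # split() reshuffles on every call
cv_iter = list(cv.split(X, y))        # enlist the folds once...
for train, test in cv_iter:           # ...then iterate as often as needed;
    pass
for train, test in cv_iter:           # every pass sees exactly the same folds
    pass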

@raghavrv raghavrv added this to the 0.18.1 milestone Oct 13, 2016

@raghavrv

Member

raghavrv commented Oct 13, 2016

Ping @jnothman @amueller for review...

@raghavrv

Member

raghavrv commented Oct 13, 2016

Also, what kind of tests should I add here? I can make a mock splitter that will iterate only once and pass it as cv to grid search. That would fail on master but pass on this branch. Any other tests?
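
Something like this, perhaps (a rough sketch; the class name OneTimeSplitter is hypothetical):

import numpy as np
from sklearn.model_selection import KFold

class OneTimeSplitter:
    """Non re-entrant splitter: yields folds only on the first split() call."""
    def __init__(self, n_splits=5, n_samples=100):
        self.n_splits = n_splits
        # The generator is created once, so it is exhausted after one pass
        self.indices = KFold(n_splits=n_splits).split(np.ones(n_samples))

    def split(self, X=None, y=None, groups=None):
        for index in self.indices:
            yield index

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits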

@jnothman

Member

jnothman commented Oct 13, 2016

Passing in a generator expression for cv would be a sufficient non-regression test.
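
What such a test could look like (a sketch; the estimator and parameter grid are arbitrary choices, not from the PR):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)
# A raw generator can only be consumed once; with the fix the search
# enlists it up front, so it can traverse the folds repeatedly.
gen_cv = KFold(n_splits=3).split(X, y)
GridSearchCV(LinearSVC(random_state=0), {'C': [0.1, 1]}, cv=gen_cv).fit(X, y)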

Without tests, surely this should be WIP?

@raghavrv raghavrv changed the title from [MRG] FIX Be robust to non re-entrant/ non deterministic cv.split calls to [WIP] FIX Be robust to non re-entrant/ non deterministic cv.split calls Oct 13, 2016

@raghavrv

Member

raghavrv commented Oct 13, 2016

Without tests, surely this should be WIP?

Indeed. Sorry for marking it MRG

@raghavrv raghavrv removed the Needs Review label Oct 13, 2016

@raghavrv raghavrv changed the title from [WIP] FIX Be robust to non re-entrant/ non deterministic cv.split calls to [MRG] FIX Be robust to non re-entrant/ non deterministic cv.split calls Oct 13, 2016

@raghavrv

Member

raghavrv commented Oct 13, 2016

Done! Reviews please ;)

@@ -129,6 +128,7 @@ def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
     X, y, groups = indexable(X, y, groups)
     cv = check_cv(cv, y, classifier=is_classifier(estimator))
+    cv_iter = cv.split(X, y, groups)

@amueller

amueller Oct 13, 2016

Member

why this change?

@raghavrv

raghavrv Oct 14, 2016

Member

Just as a sign of best practice, denoting that we enlist the indices and store them as soon as we check the cv param...

@amueller

amueller Oct 14, 2016

Member

I'm not sure I get the point. cv_iter is a generator, right? This adds a line and doesn't make the code easier to follow.

@raghavrv

raghavrv Oct 14, 2016

Member

argh, sorry, I somehow kept thinking I did list(cv.split(...))...

@@ -384,6 +384,7 @@ def cross_val_predict(estimator, X, y=None, groups=None, cv=None, n_jobs=1,
     X, y, groups = indexable(X, y, groups)
     cv = check_cv(cv, y, classifier=is_classifier(estimator))
+    cv_iter = cv.split(X, y, groups)

@amueller

@raghavrv

raghavrv Oct 14, 2016

Member

(Same comment as above)

Should I leave it as before?

                    'mean_fit_time', 'std_fit_time'):
        gs.cv_results_.pop(key)
        gs2.cv_results_.pop(key)
    np.testing.assert_equal(gs.cv_results_, gs2.cv_results_)

@amueller

amueller Oct 13, 2016

Member

why not scikit-learn testing?

@raghavrv

raghavrv Oct 14, 2016

Member

numpy's testing.assert_equal checks the equivalence of nested dicts. We don't have an equivalent AFAIK...
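
For example (illustrative):

import numpy as np

a = {'rank': np.array([2, 1]), 'params': {'C': [0.1, 1.0]}}
b = {'rank': np.array([2, 1]), 'params': {'C': [0.1, 1.0]}}
# Recurses into the dicts and compares array values element-wise;
# raises an AssertionError on any mismatch.
np.testing.assert_equal(a, b)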

@amueller

Member

amueller commented Oct 13, 2016

you're not testing the consistency of folds in GridSearchCV or learning_curve or validation_curve.

    def __init__(self, n_samples=100):
        self.indices = KFold(n_splits=5).split(np.ones(n_samples))

    def split(self, X=None, y=None, groups=None):

@amueller

amueller Oct 14, 2016

Member

maybe add a comment that says that this split can only be called once.

                     _pop_time_keys(gs2.cv_results_))
    # Check consistency of folds across the parameters
    gs = GridSearchCV(LinearSVC(random_state=0),

@amueller

amueller Oct 14, 2016

Member

Did this test fail before? Maybe also add a one-line comment explaining what's happening.

@raghavrv

raghavrv Oct 16, 2016

Member

Yes it fails on master...

@raghavrv

Member

raghavrv commented Oct 16, 2016

@amueller I have fixed your comments. I'm not able to find a way to test the consistency of cv folds in learning_curve: it seems to filter out duplicate values for train sizes. Otherwise I could have passed two identical train sizes and compared their scores as proof of consistency in splits...

Other than that, I have addressed all your comments, added explanatory comments in tests + all tests pass. One more review please?

cc: @jnothman too if you find time... :)

@@ -1467,7 +1467,7 @@ def get_n_splits(self, X=None, y=None, groups=None):
 class _CVIterableWrapper(BaseCrossValidator):
     """Wrapper class for old style cv objects and iterables."""
     def __init__(self, cv):
-        self.cv = cv
+        self.cv = list(cv)

@raghavrv

raghavrv Oct 16, 2016

Member

@jnothman @amueller

Do we want this change? This would mean check_cv(old_type_one_time_iterator) --> a new-style multi-entrant iterator?
I am not able to dig through the huge pile of conversations, but I recall us choosing the old behaviour for the reason that if the wrapped iterable is non re-entrant, the wrapped splitter should also be non re-entrant...

For instance

import numpy as np

from sklearn.model_selection import KFold
from sklearn.model_selection import check_cv
from sklearn.datasets import make_classification

from sklearn.utils.testing import assert_array_equal

X, y = make_classification(random_state=0)

# At master - This test will pass ------------------------------------------

kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)
# First split call will work
list(kf_iter_wrapped.split(X, y))
# But not subsequent
assert_array_equal(list(kf_iter_wrapped.split(X, y)), [])

# At this PR - This test will pass ------------------------------------------

kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)
# Since the wrapped iterable is enlisted and stored,
# split can be called any number of times to produce
# consistent results
assert_array_equal(
    list(kf_iter_wrapped.split(X, y)),
    list(kf_iter_wrapped.split(X, y)))

@amueller

amueller Oct 17, 2016

Member

I think this change is fine. This is only used internally, right? All sklearn functions worked with a single-run iterator before, and now they will do that again after this PR. So there are no behavior changes, right?
If someone used check_cv they will get something different now, which is not entirely great, but this is a bug-fix on 0.18 so we have to change behavior. We never promised our wrapper would be one-iteration if the original cv was.

@raghavrv

raghavrv Oct 18, 2016

Member

Ok thanks for the comment...

@raghavrv

raghavrv Oct 19, 2016

Member

(Just to note that I've added the 2nd test just to ensure this change has effect and to fail if reverted later...)

@jnothman

jnothman Oct 20, 2016

Member

I suppose infinite CV strategies are unlikely. Expensive ones are a bit more likely, but I think this approach is okay.

@raghavrv

raghavrv Nov 24, 2016

Member

Do we revert this change too? @jnothman @amueller?

@amueller amueller added the Blocker label Oct 17, 2016

@@ -61,6 +61,22 @@
 from sklearn.linear_model import SGDClassifier
+class CustomSplitter():

@amueller

amueller Oct 17, 2016

Member

If this is used in two tests, it should probably be in utils.testing.

@@ -61,6 +61,7 @@
 from sklearn.datasets import make_multilabel_classification
+from sklearn.model_selection.tests.test_split import MockClassifier

@amueller

amueller Oct 17, 2016

Member

I don't think tests should import from other tests.

@raghavrv

raghavrv Oct 18, 2016

Member

There are quite a few MockClassifiers that are specific to their tests and modules. I'm not sure there is much benefit in grouping them into utils.testing...

@amueller

amueller Oct 18, 2016

Member

Well if they are specific to a test, then I think they should stay there. If they are not, then maybe they should be shared? I'm not sure.

@raghavrv

raghavrv Oct 19, 2016

Member

Another option would be to have a separate private file (in the tests directory) for these kinds of classifiers / mock splitters and import from that?

@raghavrv

raghavrv Oct 19, 2016

Member

Well if they are specific to a test, then I think they should stay there.

They are specific to 2 sets of tests in 2 different test files, but not used anywhere else outside the model_selection tests, hence my disinclination to move them to utils.testing :)

Maybe you want them moved to sklearn.model_selection.tests._test_setup?

            cv=CustomSplitter(n_splits=n_splits, n_samples=30),
            train_sizes=np.linspace(0.1, 1.0, 10))
        if len(w) > 0:
            raise RuntimeError("Unexpected warning: %r" % w[0].message)

@amueller

amueller Oct 17, 2016

Member

AssertionError?

@amueller

amueller Oct 17, 2016

Member

Or just do an assert instead of an if?

@amueller

amueller Oct 19, 2016

Member

apart from this LGTM

@jnothman

Member

jnothman commented Oct 19, 2016

Elsewhere we have ./sklearn/cluster/tests/common.py

@raghavrv

Member

raghavrv commented Oct 19, 2016

Elsewhere we have ./sklearn/cluster/tests/common.py

Thanks for checking. I'll move them to model_selection/tests/common.py...

@jnothman

Member

jnothman commented Nov 2, 2016

I wouldn't usually support materialising an iterator as a list unnecessarily, but I think this is the solution of least surprise to the user, especially as CV splits aren't usually expensive to compute or store relative to the rest of the problem, and their number is necessarily finite and predetermined, at least in GridSearchCV / RandomizedSearchCV.

amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016

@amueller

Member

amueller commented Nov 14, 2016

this needs a whatsnew.

@GaelVaroquaux

Member

GaelVaroquaux commented Nov 21, 2016

I think that the solutions used in this PR are not correct: it is important to rely on iterators and not lists. Lists lead to heavy data duplication; see #7919.

@raghavrv

Member

raghavrv commented Nov 21, 2016

Ahh okay should we resort to itertools.tee then?

@jnothman

Member

jnothman commented Nov 21, 2016

itertools.tee effectively stores things. We may just have to prohibit non-reentrant iterators??
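
A quick illustration of why tee does not avoid storage (schematic):

from itertools import tee

it_a, it_b = tee(iter(range(5)))
# Consuming it_a first forces tee to buffer every item internally
# until it_b catches up -- the whole sequence gets stored anyway.
assert list(it_a) == [0, 1, 2, 3, 4]
assert list(it_b) == [0, 1, 2, 3, 4]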

@raghavrv

Member

raghavrv commented Nov 21, 2016

I'm +1 for that. That would simplify a few things + discourage hacked-up cvs that do not conform to our splitter API (though I am not sure whether re-entrancy is implied by our API)...

@jnothman

Member

jnothman commented Nov 22, 2016

I think this still comes back to things like randomization between calls to split(). I'll try think about this more later.

:(

@amueller

Member

amueller commented Nov 23, 2016

@GaelVaroquaux how do you ensure that a generator returns identical iterators when called twice without storing it?
So we could give up on guaranteeing consistency across runs with arbitrary cross-validation iterators, and make sure ours generate consistent sequences?

What do you think should be the semantics of passing KFold(shuffle=True) to GridSearchCV?
Our previous behavior (pre 0.18 and post 0.18.1) is that all parameters have the same folds.
In fact, iterating over KFold(shuffle=True) any number of times used to create exactly the same folds. Since 0.18, each iteration over KFold(shuffle=True) returns different folds.
And that is something we were not really careful enough about :(

I would argue that having it always return the same sequence is somewhat counter-intuitive. What do you think?
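
The behaviour in question, concretely (with random_state left unset):

import numpy as np
from sklearn.model_selection import KFold

X = np.ones(10)
cv = KFold(n_splits=2, shuffle=True)  # no random_state given
first = [test.tolist() for _, test in cv.split(X)]
second = [test.tolist() for _, test in cv.split(X)]
# Since 0.18, first and second will (almost surely) differ; the pre-0.18
# cross_validation.KFold shuffled once at construction, so they matched.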

@GaelVaroquaux

Member

GaelVaroquaux commented Nov 23, 2016

@GaelVaroquaux how do you ensure that a generator returns identical iterators when called twice without storing it?

Well, strictly speaking, a generator cannot. But we could ensure that the split method of the CV object returns the same thing.

I would argue that having it always return the same sequence is somewhat counter-intuitive. What do you think?

In terms of intuition, maybe if you didn't specify the random_state, I would agree with you.

However, that's not a nice semantic, because it will require the user to store the list of generated indices in many cases, for instance when running multiple estimators and wanting to compare the results. It will be easy to get wrong. If you look at the bug that was fixed by @raghavrv, I believe that the code would be wrong without the list if the iteration was not reproducible.
https://github.com/scikit-learn/scikit-learn/pull/7660/files#diff-676ff6854a280680ed7d91103ef4b0b5R553

The list expands the generator, and hence avoids any irreproducibility and side effects. But in a sense it defeats the purpose of having generators. If we consider the extreme case of really large n and small p (say p = 30 and n in the millions), doing 100 iterations of shuffle-split would store more indices in memory than actual data.

@amueller

Member

amueller commented Nov 23, 2016

The semantics of always producing the same result are somewhat nicer than what's there now, and yes, using lists defeats the purpose of iterators.
We can go back to checking the random_state on __init__ if we agree that "deterministic randomness" is fine. We should be very explicit about that, though.

@GaelVaroquaux

Member

GaelVaroquaux commented Nov 23, 2016

We can go back to checking the random_state on __init__ if we agree that "deterministic randomness" is fine. We should be very explicit about that, though.

In hindsight, I think that this is better. The bug and the corresponding fix point it out clearly, IMHO.

@raghavrv

Member

raghavrv commented Nov 23, 2016

Since 0.18, each iteration over KFold(shuffle=True) returns different folds.

If that is the only concern here, we could change the order of iteration (schematically sketched below). It would parallelize over parameters rather than splits. The problem with that approach is that it could be slower when one parameter setting happens to train faster than another, unless joblib is smart enough to figure that out and let the next job take over that core. (For instance, say I iterate over n_estimators=[1, 10, 100, 2000] of a random forest; then on a 4-core machine the first core could be idle unless joblib has some nice heuristics that I am unaware of..)
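
Schematically, the two iteration orders (a sketch, not the actual grid-search code):

from itertools import product

parameters = [{'n_estimators': n} for n in (1, 10, 100, 2000)]
splits = ['split%d' % i for i in range(3)]

jobs_current = list(product(parameters, splits))    # splits are the inner loop
jobs_proposed = list(product(splits, parameters))   # parameters are the inner loop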

@jnothman

Member

jnothman commented Nov 24, 2016

Joblib has no problem with pooling and taking the next job off the queue afaik, so that's not an issue.

@raghavrv

Member

raghavrv commented Nov 24, 2016

Shall I proceed with the PR? Are you okay with that approach, @amueller @GaelVaroquaux?

@raghavrv

Member

raghavrv commented Nov 24, 2016

And for the problem of shuffle in splitters, what do you think of setting a random int as random_state when random_state is None, and then reusing it to regenerate the rng at split?

from sklearn.utils import check_random_state

# At __init__: freeze a seed once when the user passed random_state=None
if random_state is None:
    random_state = check_random_state(random_state).randint(0, 100000)

# At split: rebuild the rng from the frozen seed, so successive
# split() calls shuffle identically
rng = check_random_state(random_state)

This would take care of determinism for successive splits, and would also ensure that the splits are random across multiple initializations but consistent across split calls after initialization.

@raghavrv

Member

raghavrv commented Nov 24, 2016

If, after both my changes, all tests pass, we are good to go I think...

@raghavrv

Member

raghavrv commented Nov 24, 2016

we could change the order of iteration here.

Sweeeet, now we have to worry about one-time parameter samplers / grid iterators ;@ I think copying / changing the parameter grid / sampler will break user code more than explicitly not supporting non re-entrant cvs. A lot of users hack ParameterGrid in a variety of ways that may not be compatible with copying at all...

@raghavrv

Member

raghavrv commented Nov 24, 2016

I think we can have #7935, revert all the list(iter) changes brought about by this PR, and add a note in the cv docs stating that multiple iterations are performed, and hence we expect the cv not to be non re-entrant. If users want, they can materialize the iterator externally... WDYT?

@raghavrv

Member

raghavrv commented Nov 25, 2016

Sweeeet now we have to worry about one time parameter samplers / grid iterators

Wait, since all the parameter dicts are generated and stored anyway, it wouldn't actually hurt to materialize them as a list and store them. If my intuition here is correct, doing this will not consume any additional memory...

@jnothman

Member

jnothman commented Nov 28, 2016

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
