Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Repeated K-Fold and Repeated Stratified K-Fold #8120

Merged
merged 13 commits into from Mar 4, 2017

Conversation

neerajgangwar
Copy link
Contributor

@neerajgangwar neerajgangwar commented Dec 27, 2016

Reference Issue

Fixes #7948

What does this implement/fix? Explain your changes.

Implements RepeatedKFold and RepeatedStratifiedKFold

Any other comments?

For previous discussion on this, please refer to #7960

@jnothman
Copy link
Member

Do you intend to close #7960

@neerajgangwar neerajgangwar changed the title Feature/repeated splits [MRG] Repeated K-Fold and Stratified K-Fold Dec 27, 2016
@neerajgangwar neerajgangwar changed the title [MRG] Repeated K-Fold and Stratified K-Fold [MRG] Repeated K-Fold and Repeated Stratified K-Fold Dec 27, 2016
@neerajgangwar
Copy link
Contributor Author

Yes.

>>> from sklearn.model_selection import RepeatedKFold, KFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_states = [12883823, 28827347]
>>> rkf = RepeatedKFold(KFold(n_splits=2), n_repeats=2, random_states=random_states)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want a solution that can accept a single random_state. It can generate random states for each KFold instance.

I also don't get why you should accept a KFold instance to the constructor of RepeatedKFold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be better to accept n_splits to the constructor of RepeatedKFold and create an instance of KFold inside. Likewise for RepeatedStratifiedKFold.

I'll make the changes.

if n_repeats <= 1:
raise ValueError("Number of repetitions must be greater than 1.")

rng = check_random_state(random_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not certain it's the best design, but currently all the splitters do this in split, not in __init__

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other way to achieve this other than initializing self.random_states = [] in __init__ and generate random states when split is called for the first time? The code 946-950 will move to split inside an if condition?

@jnothman
Copy link
Member

jnothman commented Dec 28, 2016 via email

@neerajgangwar
Copy link
Contributor Author

Do you mean initialize RandomState in each split call with check_random_state and generate random states? In this case, if initial random_state is int, it will work fine as check_random_state will return RandomState with the same initial seed on every call. But if it's None, it will return RandomState with different seed on every call and if it's RandomState, it'll return the same object. In both of these cases, split will produce different splits on different calls. To generate same splits on different split calls, initial state needs to be stored somewhere probably?

Or are you referring to some other way?

@jnothman
Copy link
Member

jnothman commented Dec 29, 2016 via email

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

[1 2] [0 3]
[0 3] [1 2]


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just mention RepeatStratifiedKFold here and under StratifiedKFold. Also in the "see also"s of relevant classes.

@@ -409,6 +432,30 @@ two slightly unbalanced classes::
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]

Repeated Stratified K-Fold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. I think this is overkill

@@ -913,6 +915,238 @@ def get_n_splits(self, X, y, groups):
return int(comb(len(np.unique(groups)), self.n_groups, exact=True))


class _RepeatedSplits(with_metaclass(ABCMeta)):
"""Repeated splits for K-Fold and Stratified K-Fold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for an arbitrary randomized CV splitter"

class _RepeatedSplits(with_metaclass(ABCMeta)):
"""Repeated splits for K-Fold and Stratified K-Fold

Repeats splits for cross-validators n times.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with different randomization

return self._repeated_splits.get_n_repeats()


class RepeatedStratifiedKFold(with_metaclass(ABCMeta)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an ABC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing.

def __init__(self, cv, n_repeats=5, random_state=None):
if not isinstance(cv, (KFold, StratifiedKFold)):
raise ValueError(
"cv must be an instance of KFold or StratifiedKFold.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only KFold and StratifiedKFold use random_state. That's why I added this check. Should I remove it? And also, is there a way to check if cv is an instance of cross-validator with randomized split functionality?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a private class, it doesn't require such validation.

for train_index, test_index in cv.split(X, y, groups):
yield train_index, test_index

def get_n_repeats(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this. n_repeats is already an attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing from all classes.

test : ndarray
The testing set indices for that split.
"""
cv = self.cv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead we should be constructing new CV objects in here (i.e. in split). Thus _RepeatedSplits.__init__ would take a constructor for cv rather than cv.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice as it would remove the dependency of KFold from RepeatedKFold in terms of parameters. I am thinking something like def __init__(self, cv, n_repeats=5, random_state=None, **cvargs):. It will be called as _RepeatedSplits(KFold, n_repeats, random_state, n_splits=n_splits). This is what you meant, right?

I have one doubt though. Since shuffle should always be True and random_state will be generated inside split function, would it be okay to just mention that user should not pass these arguments and one random_state that is passed does not correspond to random_state parameter of KFold?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have **cvargs in the _RepeatedSplits class while still only allowing specified named args in RepeatedKFold

train, test = next(splits)
assert_array_equal(train, [0, 1, 2])
assert_array_equal(test, [3, 4])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also check that a second call to split produces the same sets.

Copy link
Member

@jnothman jnothman Jan 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment to explain why this is repeated.
perhaps use a loop or a helper function to avoid duplicated code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also don't check here that the iterator is exhausted after 4 elements


def test_repeated_stratified_kfold_errors():
# n_repeats is not integer or <= 1
assert_raises(ValueError, RepeatedStratifiedKFold, n_repeats=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in a loop together with the RepeatedKFold case.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking much better, thanks!

See also
--------
RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove blank line

train, test = next(splits)
assert_array_equal(train, [0, 1, 2])
assert_array_equal(test, [3, 4])

Copy link
Member

@jnothman jnothman Jan 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment to explain why this is repeated.
perhaps use a loop or a helper function to avoid duplicated code.

train, test = next(splits)
assert_array_equal(train, [0, 1, 2])
assert_array_equal(test, [3, 4])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also don't check here that the iterator is exhausted after 4 elements

@neerajgangwar
Copy link
Contributor Author

Thanks @jnothman for the review. And a very happy new year :)

@jnothman
Copy link
Member

jnothman commented Jan 1, 2017

LGTM!

@jnothman jnothman changed the title [MRG] Repeated K-Fold and Repeated Stratified K-Fold [MRG+1] Repeated K-Fold and Repeated Stratified K-Fold Jan 1, 2017
raise ValueError("Number of repetitions must be of Integral type.")

if n_repeats <= 1:
raise ValueError("Number of repetitions must be greater than 1.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never check values in __init__. Move it to split.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't error be thrown at the construction time if there is some discrepancy with the parameters passed? In _BaseKFold also, values are checked in __init__.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In sklearn for the estimator, we never check the error in init because of set_params but these classes are not estimators. I imagine this rule is not applied here.

@jnothman As I'm not 100% sure can you confirm that ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, at least, CV splitters are a bit special in this regard. Checking in __init__ is consistent with other splitters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thx @jnothman


**cvargs : additional params
Constructor parameters for cv. Must not contain random_state
and shuffle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an obligation as _RepeatedSplits is private but can you raise an error in split to check that ?

rng = check_random_state(self.random_state)

for idx in range(n_repeats):
random_state = rng.randint(np.iinfo(np.int32).max)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but why directly send rng to random_state ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer random state generated by rng is sent as random_state. Do you have any other way in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was not clear.
Can you remove the line random_state = rng.randint(np.iinfo(np.int32).max) and change random_state by rng later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean after creating the object for cv? If yes, how would it make a difference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean : cv = self.cv(random_state=rng, shuffle=True, **self.cvargs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code is similar to the code following :

In [1]: from sklearn.utils import check_random_state

In [2]: class Foo:
   ...:     def __init__(self, random_state):
   ...:         self.rng = check_random_state(random_state)
   ...:         
   ...:     def fit(self):
   ...:         print(self.rng.randint(1000))
   ...:         

In [3]: rng = check_random_state(0)

In [4]: f1 = Foo(rng)

In [5]: f2 = Foo(rng)

In [6]: f1.fit()
684

In [7]: f1.fit()
559

In [8]: f2.fit()
629

In [9]: f2.fit()
192

rng is an object and it will be modified every time you call cv.split. So for me is not necessary to generate a specific random_state at each iteration. Maybe I am missing something ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_random_state does not create a copy of rng if rng is a random_state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any case which we'll miss by this approach. But I think current implementation keeps the use of random_state clean. I am not really sure.

@jnothman thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think passing rng directly should be okay. If it's not okay, we need to be able to construct a test case that proves so!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not able to find any such testcase. So making the changes. Thanks!

Copy link
Contributor

@tguillemot tguillemot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Circle seems unrelated.


Parameters
----------
n_splits : int, default=3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with the other estimators but seem pretty useless in practice. I would make 10 times 10-fold the default. Why would you want to do 3x5 instead of 10 fold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing it to 5x10 (n_splits x n_repeats) by default. Is that fine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good apart from minor changes, in particular default parameters. How do others feel about adding this to check_cv? We could have a mxn syntax to have m repetitions of n-fold with automatically detecting stratified vs not, so you could do cv="10x10". Not entirely sure about that though.

---------------

:class:`RepeatedKFold` repeats K-Fold n times. It can be used when one
requires to run :class:`KFold` n times, producing different splits in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "it can be used to run KFold multiple times to increase the fidelity of the estimate? Or can we say to decrease the variance? Is that accurate?

Parameters
----------
cv : callable
Constructor of cross-validator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the cross validation class itself? That seems more natural than passing the __init__ method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not passing the __init__ method. It's called as RepeatedSplits(KFold, n_repeats, random_state, n_splits=n_splits). I am changing description to "Cross-validator class.".

cv : callable
Constructor of cross-validator.

n_repeats : int, default=5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably do 10x10 by default, or maybe 5 times 10 fold. This is something you use when you care about accuracy but not necessarily time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing default values of n_splits to 5 and n_repeats to 10.

if n_repeats <= 1:
raise ValueError("Number of repetitions must be greater than 1.")

if any(key in cvargs for key in ('random_state', 'shuffle')):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if set(cvargs).intersection({'random_state', 'shuffle'})? Though not really shorter :-/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the same as both are of equal length. :P

rng = check_random_state(self.random_state)

for idx in range(n_repeats):
cv = self.cv(random_state=rng, shuffle=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we maybe want to raise nice errors if these arguments are not present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get you. Which arguments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I have no idea what I meant.... ?

@amueller
Copy link
Member

amueller commented Mar 4, 2017

LGTM. Can you add an entry to whatsnew.rst?

@neerajgangwar
Copy link
Contributor Author

@amueller Conflict in doc/whats_new.rst. Help?

@amueller
Copy link
Member

amueller commented Mar 4, 2017

Fixed it (which you could have also done locally ;) omg this github feature is amazing !!

@amueller
Copy link
Member

amueller commented Mar 4, 2017

It's customary to add your username to the entry, but you don't have to.

@codecov
Copy link

codecov bot commented Mar 4, 2017

Codecov Report

Merging #8120 into master will increase coverage by <.01%.
The diff coverage is 98.61%.

@@            Coverage Diff             @@
##           master    #8120      +/-   ##
==========================================
+ Coverage   95.48%   95.48%   +<.01%     
==========================================
  Files         342      342              
  Lines       60913    60985      +72     
==========================================
+ Hits        58160    58231      +71     
- Misses       2753     2754       +1
Impacted Files Coverage Δ
sklearn/model_selection/tests/test_split.py 95.96% <100%> (+0.25%)
sklearn/model_selection/init.py 100% <100%> (ø)
sklearn/model_selection/_split.py 98.6% <96%> (-0.17%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 75c892c...338997e. Read the comment docs.

@neerajgangwar
Copy link
Contributor Author

@amueller Added my name :)

@amueller amueller merged commit af1796e into scikit-learn:master Mar 4, 2017
@neerajgangwar neerajgangwar deleted the feature/repeated-splits branch March 5, 2017 03:39
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
herilalaina pushed a commit to herilalaina/scikit-learn that referenced this pull request Mar 26, 2017
…8120)

* Add _RepeatedSplits and RepeatedKFold class

* Add RepeatedStratifiedKFold and doc for repeated cvs

* Change default value of n_repeats

* Change input parameters of repeated cv constructor to n_splits, n_repeats, random_state

* Generate random states in split function rather than store it beforehand

* Doc changes, inheriting RepeatedKFold, RepeatedStratifiedKFold from _RepeatedSplits and other review changes

* Remove blank line, put testcases for deterministic split in loop and add StopIteration check in testcase

* Using rng directly as random_state param to create cv instance and added a check for cvargs

* Fix pep8 warnings

* Changing default values for n_splits and n_repeats and add entry in changelog

* Adding name to the feature

* Missing space
massich pushed a commit to massich/scikit-learn that referenced this pull request Apr 26, 2017
…8120)

* Add _RepeatedSplits and RepeatedKFold class

* Add RepeatedStratifiedKFold and doc for repeated cvs

* Change default value of n_repeats

* Change input parameters of repeated cv constructor to n_splits, n_repeats, random_state

* Generate random states in split function rather than store it beforehand

* Doc changes, inheriting RepeatedKFold, RepeatedStratifiedKFold from _RepeatedSplits and other review changes

* Remove blank line, put testcases for deterministic split in loop and add StopIteration check in testcase

* Using rng directly as random_state param to create cv instance and added a check for cvargs

* Fix pep8 warnings

* Changing default values for n_splits and n_repeats and add entry in changelog

* Adding name to the feature

* Missing space
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…8120)

* Add _RepeatedSplits and RepeatedKFold class

* Add RepeatedStratifiedKFold and doc for repeated cvs

* Change default value of n_repeats

* Change input parameters of repeated cv constructor to n_splits, n_repeats, random_state

* Generate random states in split function rather than store it beforehand

* Doc changes, inheriting RepeatedKFold, RepeatedStratifiedKFold from _RepeatedSplits and other review changes

* Remove blank line, put testcases for deterministic split in loop and add StopIteration check in testcase

* Using rng directly as random_state param to create cv instance and added a check for cvargs

* Fix pep8 warnings

* Changing default values for n_splits and n_repeats and add entry in changelog

* Adding name to the feature

* Missing space
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…8120)

* Add _RepeatedSplits and RepeatedKFold class

* Add RepeatedStratifiedKFold and doc for repeated cvs

* Change default value of n_repeats

* Change input parameters of repeated cv constructor to n_splits, n_repeats, random_state

* Generate random states in split function rather than store it beforehand

* Doc changes, inheriting RepeatedKFold, RepeatedStratifiedKFold from _RepeatedSplits and other review changes

* Remove blank line, put testcases for deterministic split in loop and add StopIteration check in testcase

* Using rng directly as random_state param to create cv instance and added a check for cvargs

* Fix pep8 warnings

* Changi