
[MRG+1] Change CV defaults to 5 #11557


Merged: 31 commits merged into scikit-learn:master on Jul 19, 2018

Conversation

aboucaud
Contributor

Reference Issues/PRs

Fixes #11129 and takes over stalled PR #11139

What does this implement/fix? Explain your changes.

Add a warning for models that do not specify an explicit value for cv or n_splits, to prepare for a deprecation of the current default of 3 and a change of the default value to 5.
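As a rough sketch of the mechanism (hypothetical helper name `resolve_cv`; the warning text is taken from the messages proposed later in this thread, not from the final scikit-learn code):

```python
import warnings

CV_WARNING = (
    "You should specify a value for 'cv' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

def resolve_cv(cv="warn"):
    """Resolve the cv argument, warning when the caller relied on the default."""
    if cv == "warn":
        # The caller did not pass cv explicitly: warn about the upcoming
        # change and fall back to the current default of 3.
        warnings.warn(CV_WARNING, FutureWarning)
        cv = 3
    return cv
```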

@glemaitre
Member

I see this is still WIP, but it is worth mentioning that you will have to decorate the tests that use the default cv with @pytest.mark.filterwarnings to avoid showing the deprecation warning.
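For illustration, such a decorated test could look like this (a sketch; the test body and warning message are made up for the example):

```python
import warnings

import pytest

@pytest.mark.filterwarnings("ignore::FutureWarning")
def test_uses_default_cv():
    # Code relying on the default cv would emit a FutureWarning here;
    # the mark above keeps it out of pytest's warning summary.
    warnings.warn("The default value of cv will change from 3 to 5.",
                  FutureWarning)
    assert True
```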

@aboucaud aboucaud changed the title [WIP] Change CV defaults to 5 [MRG] Change CV defaults to 5 Jul 17, 2018
Member

@GaelVaroquaux GaelVaroquaux left a comment


Looks great so far. A couple minor comments.

@@ -498,7 +499,7 @@ two slightly unbalanced classes::

>>> from sklearn.model_selection import StratifiedKFold

>>> X = np.ones(10)
>>> X = np.ones(10)
Member

This looks strange.


- The default number of cross-validation folds ``cv`` and the default number of
splits ``n_splits`` in the :class:`model_selection.KFold`-like splitters will change
from 3 to 5 in 0.22 to account for good practice in the community.
Member

"to account for good practice in the community." => "as 3-fold has a lot of variance".

@@ -49,6 +49,17 @@
'check_cv']


NSPLIT_WARNING = (
"You should specify a value for 'n_splits' instead of relying on the "
"default value. Note that this default value of 3 is deprecated in "
Member

Instead of "Note...", I would say "This default value will change from 3 to 5 in version 0.22."

@GaelVaroquaux
Member

I canceled the travis build as @aboucaud is pushing a new version soon.

Member

@GaelVaroquaux GaelVaroquaux left a comment

LGTM.

+1 for merge.

@@ -406,8 +420,11 @@ class KFold(_BaseKFold):
RepeatedKFold: Repeats K-Fold n times.
"""

def __init__(self, n_splits=3, shuffle=False,
def __init__(self, n_splits=None, shuffle=False,
Member

I thought we're gonna use 'warn' from now on?

Member

You want to replace all None by "warn"? Fine with me.

Contributor Author

@amueller for n_splits only, or cv as well?

@amueller
Member

looks good apart from None as sentinel vs 'warn'.
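A minimal sketch of the 'warn' sentinel pattern under discussion (toy class, not the real KFold; the message text follows the diff shown above):

```python
import warnings

NSPLIT_WARNING = (
    "You should specify a value for 'n_splits' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

class SimpleKFold:
    """Toy splitter showing why a 'warn' string beats None as the sentinel:
    None could plausibly be a meaningful user value, while 'warn' is
    unambiguous and easy to grep for when the deprecation is removed."""

    def __init__(self, n_splits="warn", shuffle=False):
        if n_splits == "warn":
            warnings.warn(NSPLIT_WARNING, FutureWarning)
            n_splits = 3  # current default, until 0.22 changes it to 5
        self.n_splits = n_splits
        self.shuffle = shuffle
```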

@amueller
Member

lgtm

@GaelVaroquaux
Member

We'll merge when travis is ready.

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Change CV defaults to 5 [MRG+1] Change CV defaults to 5 Jul 17, 2018
@amueller
Member

test errors :-/

splits ``n_splits`` in the :class:`model_selection.KFold`-like splitters will change
from 3 to 5 in 0.22 as 3-fold has a lot of variance.
:issue:`11129` by :user:`Alexandre Boucaud <aboucaud>`.

Member

should be the number of the PR not the issue, right ?

Contributor Author

dunno, you tell me sprint master.

Member

confirmed

@aboucaud
Contributor Author

Off to bed, will finish this tomorrow. Most of the work should be behind us now.

@aboucaud
Contributor Author

Green ✌️!
It was tougher than expected.

I still did not properly address @amueller's comment, since I only added # 0.22

can you please add a comment that this is about iid and add 0.22 so that we can grep for it once we need to remove it?

The difficulty is to separate the cases that were about cv from those about n_splits, since I have two different messages, and I was not brave enough to do that.

I could try to unify the warning messages (below) so that a larger common part of the message can be caught in filterwarnings:

NSPLIT_WARNING = (
    "You should specify a value for 'n_splits' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

CV_WARNING = (
    "You should specify a value for 'cv' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

WDYT? @GaelVaroquaux @amueller
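For what it's worth, the two messages above already share a prefix, so one filter pattern can silence both; a quick check with the plain warnings module (the filterwarnings mark accepts the same message-prefix syntax):

```python
import warnings

NSPLIT_WARNING = (
    "You should specify a value for 'n_splits' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

CV_WARNING = (
    "You should specify a value for 'cv' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # One pattern matches both warnings thanks to the shared prefix.
    warnings.filterwarnings("ignore", message="You should specify a value")
    warnings.warn(NSPLIT_WARNING, FutureWarning)
    warnings.warn(CV_WARNING, FutureWarning)

assert caught == []  # both warnings were filtered out
```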

@jeremiedbb
Member

Is skipping all the doctests the right way to make travis green?
I may be misinterpreting what you did.

@aboucaud
Contributor Author

aboucaud commented Jul 18, 2018

When I merged master into this branch, I saw that others implemented that workaround, since warnings now raise errors.

I agree it is probably not a good thing.

Many of these failing tests used the default value for cv or n_splits, which was set to 3 and will change to 5, but statically setting cv=5 also means increasing the size of the X and y arrays and adapting the expected results.

I ended up with so many modifications in this PR that cannot be properly checked or reviewed that I would be in favor of having a follow-up PR address the doctests, using # doctest: +SKIP as an anchor.
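For context on the pragmas being discussed, a self-contained sketch with the stdlib doctest module (the docstring and its examples are made up):

```python
import doctest

DOCSTRING = """
>>> 1.0 / 3  # doctest: +ELLIPSIS
0.333...
>>> print(" 1  2  3 ")  # doctest: +NORMALIZE_WHITESPACE
1 2 3
>>> missing_function()  # doctest: +SKIP
this example is never executed or checked
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, {}, "directives_demo", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)

# +SKIP means the third example is neither run nor counted.
assert runner.failures == 0
assert runner.tries == 2
```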

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 18, 2018
@@ -312,6 +312,10 @@ class RFECV(RFE, MetaEstimatorMixin):
Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.

.. deprecated:: 0.20
Member

Can you make this versionchanged instead of deprecated? (because the keyword itself is not deprecated)

Contributor Author

@aboucaud aboucaud Jul 18, 2018

done.

we should add a line in the contributing.rst then to specify that.

@jorisvandenbossche
Member

I ended up with so many modifications in this PR that cannot be properly checked or reviewed that I would be in favor of having a follow-up PR address the doctests, using # doctest: +SKIP as an anchor.

Guillaume: if you set it to 5 manually in the doc examples, is it still necessary to skip?

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 18, 2018 via email

@aboucaud
Contributor Author

Ok, I'm on it

@aboucaud
Contributor Author

@GaelVaroquaux can you interrupt the build on the first commit to let the last one build?

@jeremiedbb
Member

It restarts automatically each time you push

@@ -99,10 +99,10 @@ Usage examples:
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = svm.SVC(gamma='scale', random_state=0)
>>> cross_val_score(clf, X, y, scoring='recall_macro') # doctest: +ELLIPSIS
>>> cross_val_score(clf, X, y, scoring='recall_macro') # doctest: +SKIP
Member

still skipping this one

@@ -150,7 +150,8 @@ the :func:`fbeta_score` function::
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
... scoring=ftwo_scorer) # doctest: +SKIP
Member

same

>>> # Getting the test set true positive scores
>>> print(cv_results['test_tp']) # doctest: +NORMALIZE_WHITESPACE
>>> print(cv_results['test_tp']) # doctest: +SKIP
Member

same

Member

@GaelVaroquaux GaelVaroquaux left a comment

I saw a few changes in the doctest pragmas that didn't look right.

Aside from that, +1 for merge.

>>> scores = cross_val_score(clf, iris.data, iris.target)
>>> scores.mean() # doctest: +ELLIPSIS
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores.mean()
Member

I am very surprised by the fact that # doctest: +ELLIPSIS was removed.

array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores # doctest: +ELLIPSIS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I think that keeping "+NORMALIZE_WHITESPACE" would be a good idea.

@jeremiedbb
Member

Why did you remove many # doctest: +NORMALIZE_WHITESPACE and +ELLIPSIS directives?

@jeremiedbb
Member

@GaelVaroquaux alex made the requested changes. I think it's good to go now.

@GaelVaroquaux
Member

LGTM. Merging

@GaelVaroquaux GaelVaroquaux merged commit f158e2d into scikit-learn:master Jul 19, 2018
@aboucaud aboucaud deleted the cv-default-5 branch July 19, 2018 12:47
@amueller
Member

Ohhh yeahhh!

Successfully merging this pull request may close these issues: Change cv default to 5

7 participants