[MRG+1] Change CV defaults to 5 #11557
Conversation
I see this is still WIP, but it could be worth mentioning that you will have to decorate the tests which use the cv default.
Looks great so far. A couple minor comments.
doc/modules/cross_validation.rst
Outdated
@@ -498,7 +499,7 @@ two slightly unbalanced classes::

    >>> from sklearn.model_selection import StratifiedKFold

    >>> X = np.ones(10)
    >>> X = np.ones(10)
This looks strange.
doc/whats_new/v0.20.rst
Outdated
- The default number of cross-validation folds ``cv`` and the default number of
  splits ``n_splits`` in the :class:`model_selection.KFold`-like splitters will change
  from 3 to 5 in 0.22 to account for good practice in the community.
"to account for good practice in the community." => "as 3-fold has a lot of variance".
sklearn/model_selection/_split.py
Outdated
@@ -49,6 +49,17 @@
             'check_cv']


NSPLIT_WARNING = (
    "You should specify a value for 'n_splits' instead of relying on the "
    "default value. Note that this default value of 3 is deprecated in "
Instead of "Note...", I would say "This default value will change from 3 to 5 in version 0.22."
I canceled the travis build as @aboucaud is pushing a new version soon.
LGTM.
+1 for merge.
sklearn/model_selection/_split.py
Outdated
@@ -406,8 +420,11 @@ class KFold(_BaseKFold):
    RepeatedKFold: Repeats K-Fold n times.
    """

-    def __init__(self, n_splits=3, shuffle=False,
+    def __init__(self, n_splits=None, shuffle=False,
I thought we're gonna use 'warn' from now on?
You want to replace all None by "warn"? Fine with me.
@amueller for n_splits only or cv as well?
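To illustrate the sentinel question being discussed, here is a toy sketch (not the real KFold) of how a 'warn' string default can flag "the user did not pass anything" while leaving every other value untouched:

    import warnings


    class ToyKFold:
        # Toy illustration of the 'warn' sentinel pattern; the real splitter
        # has more parameters and validation.
        def __init__(self, n_splits='warn', shuffle=False, random_state=None):
            if n_splits == 'warn':
                warnings.warn(
                    "You should specify a value for 'n_splits' instead of "
                    "relying on the default value. The default value will "
                    "change from 3 to 5 in version 0.22.", FutureWarning)
                n_splits = 3
            self.n_splits = n_splits
            self.shuffle = shuffle
            self.random_state = random_state

For cv, None already has a documented meaning (fall back to the default strategy), so a distinct string sentinel avoids overloading it.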
looks good apart from None as sentinel vs 'warn'.
lgtm
We'll merge when travis is ready.
test errors :-/
doc/whats_new/v0.20.rst
Outdated
  splits ``n_splits`` in the :class:`model_selection.KFold`-like splitters will change
  from 3 to 5 in 0.22 as 3-fold has a lot of variance.
  :issue:`11129` by :user:`Alexandre Boucaud <aboucaud>`.
should be the number of the PR, not the issue, right?
dunno, you tell me sprint master.
confirmed
Off to bed, will finish this tomorrow. Most of the work should be behind us now.
Green ✌️! I still did not address properly @amueller's comment since I only added it partially. The difficulty is to separate the cases in which it was about cv or n_splits. I could try to unify the warning messages (below) to catch a better part of the message in the tests:

    NSPLIT_WARNING = (
        "You should specify a value for 'n_splits' instead of relying on the "
        "default value. The default value will change from 3 to 5 "
        "in version 0.22.")

    CV_WARNING = (
        "You should specify a value for 'cv' instead of relying on the "
        "default value. The default value will change from 3 to 5 "
        "in version 0.22.")

WDYT? @GaelVaroquaux @amueller
Is skipping all the doctests the right way to make travis green?
When I merged, many doctests were failing; I agree it is probably not a good thing. Many of these failing tests used the default value for cv. I just end up having so many modifications in this PR that cannot be properly checked or reviewed that I would be in favor of having a following PR address the doctests, using # doctest: +SKIP as an anchor.
sklearn/feature_selection/rfe.py
Outdated
@@ -312,6 +312,10 @@ class RFECV(RFE, MetaEstimatorMixin):
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.

        .. deprecated:: 0.20
Can you make this versionchanged instead of deprecated? (because the keyword itself is not deprecated)
done.
We should add a line in the contributing.rst then to specify that.
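For reference, a sketch of how the directive could sit in a cv parameter docstring; the class and wording are illustrative, not the exact text of the PR:

    class SomeCVEstimator:
        """Toy estimator, only to show where the directive goes.

        Parameters
        ----------
        cv : int, cross-validation generator or an iterable, optional
            Determines the cross-validation splitting strategy.

            .. versionchanged:: 0.20
                The default value of ``cv`` will change from 3-fold to
                5-fold in version 0.22.
        """

        def __init__(self, cv='warn'):
            self.cv = cv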
Guillaume: if you set it to 5 manually in the doc examples, is it then still needed to skip?
> I just end up having so many modifications in this PR that cannot be properly
> checked or reviewed that I would be in favor of having a following PR address
> the doctests, using # doctest: +SKIP as an anchor.

No, this is really bad. We need to have tests passing, not skipped. This
is our guarantee of the quality of the codebase.
Ok, I'm on it
@GaelVaroquaux can you interrupt the build on the first commit to let the last one build
It restarts automatically each time you push
doc/modules/model_evaluation.rst
Outdated
@@ -99,10 +99,10 @@ Usage examples:
    >>> iris = datasets.load_iris()
    >>> X, y = iris.data, iris.target
    >>> clf = svm.SVC(gamma='scale', random_state=0)
-   >>> cross_val_score(clf, X, y, scoring='recall_macro')  # doctest: +ELLIPSIS
+   >>> cross_val_score(clf, X, y, scoring='recall_macro')  # doctest: +SKIP
still skipping this one
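One way to drop the skip, following the suggestion above, is to pass cv=5 explicitly so the snippet keeps executing; a sketch modeled on the example in this diff (output elided with +ELLIPSIS rather than hard-coded):

    >>> from sklearn import datasets, svm
    >>> from sklearn.model_selection import cross_val_score
    >>> iris = datasets.load_iris()
    >>> clf = svm.SVC(gamma='scale', random_state=0)
    >>> cross_val_score(clf, iris.data, iris.target, cv=5,
    ...                 scoring='recall_macro')  # doctest: +ELLIPSIS
    array([...])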
doc/modules/model_evaluation.rst
Outdated
@@ -150,7 +150,8 @@ the :func:`fbeta_score` function::
    >>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
    >>> from sklearn.model_selection import GridSearchCV
    >>> from sklearn.svm import LinearSVC
-   >>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
+   >>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
+   ...                     scoring=ftwo_scorer)  # doctest: +SKIP
same
doc/modules/model_evaluation.rst
Outdated
    >>> # Getting the test set true positive scores
-   >>> print(cv_results['test_tp'])  # doctest: +NORMALIZE_WHITESPACE
+   >>> print(cv_results['test_tp'])  # doctest: +SKIP
same
I saw a few changes in the doctest pragmas that didn't look right.
Aside from that, +1 for merge.
doc/modules/ensemble.rst
Outdated
-   >>> scores = cross_val_score(clf, iris.data, iris.target)
-   >>> scores.mean()  # doctest: +ELLIPSIS
+   >>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
+   >>> scores.mean()
I am very surprised by the fact that # doctest: +ELLIPSIS was removed.
doc/modules/learning_curve.rst
Outdated
    array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
           [0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
           [0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
    >>> valid_scores  # doctest: +ELLIPSIS
Here, I think that keeping "+NORMALIZE_WHITESPACE" would be a good idea.
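As a reminder of what the two pragmas do: +ELLIPSIS lets ... in the expected output match any text (handy for long floats), and +NORMALIZE_WHITESPACE ignores differences in spacing and line breaks. A generic illustration, not taken from this PR:

    >>> 1 / 3  # doctest: +ELLIPSIS
    0.33...
    >>> print({'train_score': 1, 'test_score': 2})  # doctest: +NORMALIZE_WHITESPACE
    {'train_score': 1,
     'test_score': 2}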
Why did you remove many # doctest: +NORMALIZE_WHITESPACE and +ELLIPSIS pragmas?
@GaelVaroquaux alex made the requested changes. I think it's good to go now.
LGTM. Merging
Ohhh yeahhh!
Reference Issues/PRs
Fixes #11129 and takes over stalled PR #11139
What does this implement/fix? Explain your changes.
Add a warning for models that do not specify an explicit value for cv or n_splits, to prepare for a future deprecation of the current default of 3 and an update of the default value to 5.
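A sketch of the intended user-facing effect, assuming a FutureWarning carrying the messages above; on releases from 0.22 onward the default is already 5 and no warning is needed:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    clf = SVC(gamma='scale')

    # Relying on the default cv (3 folds in 0.20) triggers the new warning.
    scores_default = cross_val_score(clf, X, y)

    # Passing cv explicitly silences it and is forward-compatible with 0.22.
    scores_explicit = cross_val_score(clf, X, y, cv=5)
    print(scores_explicit.mean())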