[MRG+1] Change CV defaults to 5 #11557

Merged
merged 31 commits into from Jul 19, 2018

7 participants
@aboucaud
Contributor

aboucaud commented Jul 16, 2018

Reference Issues/PRs

Fixes #11129 and takes over stalled PR #11139

What does this implement/fix? Explain your changes.

Add a warning for models that do not specify an explicit value for cv or n_splits, to prepare for the deprecation of the default value of 3 and its update to 5.

@glemaitre


Contributor

glemaitre commented Jul 16, 2018

I see this is still WIP, but it could be worth mentioning that you will have to decorate the tests that use cv with @pytest.mark.filterwarnings to avoid showing the deprecation warning.
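For readers unfamiliar with that decorator, here is a minimal sketch of the pattern being suggested; kfold_split_stub and the message text are illustrative stand-ins, not the PR's actual code:

```python
import warnings

import pytest


def kfold_split_stub():
    # Hypothetical splitter that warns when the caller relies on the
    # soon-to-change default number of splits.
    warnings.warn(
        "You should specify a value for 'n_splits' instead of relying on the "
        "default value.",
        FutureWarning,
    )
    return 3


@pytest.mark.filterwarnings(
    "ignore:You should specify a value for 'n_splits'")
def test_split_uses_default():
    # The deprecation warning is silenced for this test only, so a global
    # "warnings are errors" setting does not make it fail.
    assert kfold_split_stub() == 3
```

The filter string follows the standard warnings-filter format, with the message part matched as a regex against the start of the warning message.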

@aboucaud aboucaud changed the title from [WIP] Change CV defaults to 5 to [MRG] Change CV defaults to 5 Jul 17, 2018

@GaelVaroquaux

Looks great so far. A couple minor comments.

@@ -498,7 +499,7 @@ two slightly unbalanced classes::
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.ones(10)
>>> X = np.ones(10)


@GaelVaroquaux

GaelVaroquaux Jul 17, 2018

Member

This looks strange.

- The default number of cross-validation folds ``cv`` and the default number of
splits ``n_splits`` in the :class:`model_selection.KFold`-like splitters will change
from 3 to 5 in 0.22 to account for good practice in the community.


@GaelVaroquaux

GaelVaroquaux Jul 17, 2018

Member

"to account for good practice in the community." => "as 3-fold has a lot of variance".

@@ -49,6 +49,17 @@
'check_cv']
NSPLIT_WARNING = (
"You should specify a value for 'n_splits' instead of relying on the "
"default value. Note that this default value of 3 is deprecated in "


@GaelVaroquaux

GaelVaroquaux Jul 17, 2018

Member

Instead of "Note...", I would say "This default value will change from 3 to 5 in version 0.22."

@GaelVaroquaux


Member

GaelVaroquaux commented Jul 17, 2018

I canceled the travis build as @aboucaud is pushing a new version soon.

@GaelVaroquaux

LGTM.

+1 for merge.

@@ -406,8 +420,11 @@ class KFold(_BaseKFold):
RepeatedKFold: Repeats K-Fold n times.
"""
def __init__(self, n_splits=3, shuffle=False,
def __init__(self, n_splits=None, shuffle=False,


@amueller

amueller Jul 17, 2018

Member

I thought we're gonna use 'warn' from now on?


@GaelVaroquaux

GaelVaroquaux Jul 17, 2018

Member

You want to replace all None by "warn"? Fine with me.


@aboucaud

aboucaud Jul 17, 2018

Contributor

@amueller for n_splits only, or cv as well?

@amueller


Member

amueller commented Jul 17, 2018

looks good apart from None as sentinel vs 'warn'.
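The 'warn' sentinel under discussion can be sketched roughly as follows; KFoldSketch and its body are an illustration of the idea, not the PR's actual implementation:

```python
import warnings

NSPLIT_WARNING = (
    "You should specify a value for 'n_splits' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")


class KFoldSketch:
    # Hypothetical, stripped-down splitter. A string sentinel as the default
    # lets us distinguish "the user passed nothing" from "the user explicitly
    # passed a value", which a plain default of 3 cannot do.
    def __init__(self, n_splits='warn', shuffle=False):
        if n_splits == 'warn':
            warnings.warn(NSPLIT_WARNING, FutureWarning)
            n_splits = 3  # current default; becomes 5 in 0.22
        self.n_splits = n_splits
        self.shuffle = shuffle
```

An explicit n_splits=5 skips the warning entirely, which is exactly what the deprecation message asks users to do.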

@amueller


Member

amueller commented Jul 17, 2018

lgtm

@GaelVaroquaux


Member

GaelVaroquaux commented Jul 17, 2018

We'll merge when travis is ready.

@GaelVaroquaux GaelVaroquaux changed the title from [MRG] Change CV defaults to 5 to [MRG+1] Change CV defaults to 5 Jul 17, 2018

@amueller


Member

amueller commented Jul 17, 2018

test errors :-/

splits ``n_splits`` in the :class:`model_selection.KFold`-like splitters will change
from 3 to 5 in 0.22 as 3-fold has a lot of variance.
:issue:`11129` by :user:`Alexandre Boucaud <aboucaud>`.


@jeremiedbb

jeremiedbb Jul 17, 2018

Contributor

should be the number of the PR not the issue, right ?


@aboucaud

aboucaud Jul 17, 2018

Contributor

dunno, you tell me sprint master.


@jeremiedbb

jeremiedbb Jul 17, 2018

Contributor

confirmed

@aboucaud


Contributor

aboucaud commented Jul 18, 2018

Off to bed, will finish this tomorrow. Most of the work should be behind now.

aboucaud added some commits Jul 18, 2018

@aboucaud


Contributor

aboucaud commented Jul 18, 2018

Green ✌️ !
It was tougher than expected.

I still have not properly addressed @amueller's comment, since I only added # 0.22

can you please add a comment that this is about iid and add 0.22 so that we can grep for it once we need to remove it?

The difficulty is separating the cases that concern cv from those that concern n_splits, since there are two different messages, and I was not brave enough to do that.

I could try to unify the warning messages (below) so that a larger part of the message can be matched in filterwarnings:

NSPLIT_WARNING = (
    "You should specify a value for 'n_splits' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

CV_WARNING = (
    "You should specify a value for 'cv' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

WDYT? @GaelVaroquaux @amueller
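For what it's worth, the two messages above could be unified via a shared template so that a single filterwarnings pattern catches both; this is only a sketch of that idea, not what was merged:

```python
# Hypothetical shared template; the PR itself kept two separate strings.
_DEFAULT_CV_TEMPLATE = (
    "You should specify a value for '{param}' instead of relying on the "
    "default value. The default value will change from 3 to 5 "
    "in version 0.22.")

NSPLIT_WARNING = _DEFAULT_CV_TEMPLATE.format(param='n_splits')
CV_WARNING = _DEFAULT_CV_TEMPLATE.format(param='cv')

# A single test decorator could then match the common tail of both messages:
# @pytest.mark.filterwarnings("ignore:.*will change from 3 to 5")
```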

@jeremiedbb


Contributor

jeremiedbb commented Jul 18, 2018

Is skipping all the doctests the right way to make travis green?
I may be misinterpreting what you did.

@aboucaud


Contributor

aboucaud commented Jul 18, 2018

When I merged master into this branch, I saw that others had implemented that workaround, since warnings now raise errors.

I agree it is probably not a good thing.

Many of these failing tests used the default value for cv or n_splits, which was set to 3 and will change to 5, but statically setting cv=5 also means increasing the size of the X and y arrays and adapting the expected results.

I ended up with so many modifications in this PR that they cannot be properly checked or reviewed, so I would be in favor of a follow-up PR addressing the doctests, using # doctest: +SKIP as an anchor.

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 18, 2018

@@ -312,6 +312,10 @@ class RFECV(RFE, MetaEstimatorMixin):
Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.
.. deprecated:: 0.20


@jorisvandenbossche

jorisvandenbossche Jul 18, 2018

Contributor

Can you make this versionchanged instead of deprecated? (because the keyword itself is not deprecated)


@aboucaud

aboucaud Jul 18, 2018

Contributor

done.

we should add a line in the contributing.rst then to specify that.

@jorisvandenbossche


Contributor

jorisvandenbossche commented Jul 18, 2018

I ended up with so many modifications in this PR that they cannot be properly checked or reviewed, so I would be in favor of a follow-up PR addressing the doctests, using # doctest: +SKIP as an anchor.

Guillaume: if you set it to 5 manually in the doc examples, is it then still necessary to skip them?

@GaelVaroquaux


Member

GaelVaroquaux commented Jul 18, 2018

@aboucaud


Contributor

aboucaud commented Jul 18, 2018

Ok, I'm on it

@aboucaud


Contributor

aboucaud commented Jul 18, 2018

@GaelVaroquaux can you interrupt the build on the first commit to let the last one build?

@jeremiedbb


Contributor

jeremiedbb commented Jul 18, 2018

It restarts automatically each time you push

@@ -99,10 +99,10 @@ Usage examples:
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = svm.SVC(gamma='scale', random_state=0)
>>> cross_val_score(clf, X, y, scoring='recall_macro') # doctest: +ELLIPSIS
>>> cross_val_score(clf, X, y, scoring='recall_macro') # doctest: +SKIP


@jeremiedbb

jeremiedbb Jul 18, 2018

Contributor

still skipping this one

@@ -150,7 +150,8 @@ the :func:`fbeta_score` function::
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
... scoring=ftwo_scorer) # doctest: +SKIP


@jeremiedbb

jeremiedbb Jul 18, 2018

Contributor

same

>>> # Getting the test set true positive scores
>>> print(cv_results['test_tp']) # doctest: +NORMALIZE_WHITESPACE
>>> print(cv_results['test_tp']) # doctest: +SKIP


@jeremiedbb

jeremiedbb Jul 18, 2018

Contributor

same

@GaelVaroquaux

I saw a few changes in the doctest pragmas that didn't look right.

Aside from that, +1 for merge.

>>> scores = cross_val_score(clf, iris.data, iris.target)
>>> scores.mean() # doctest: +ELLIPSIS
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores.mean()


@GaelVaroquaux

GaelVaroquaux Jul 18, 2018

Member

I am very surprised that # doctest: +ELLIPSIS was removed.

array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores # doctest: +ELLIPSIS


@GaelVaroquaux

GaelVaroquaux Jul 18, 2018

Member

Here, I think that keeping "+NORMALIZE_WHITESPACE" would be a good idea.

@jeremiedbb


Contributor

jeremiedbb commented Jul 18, 2018

Why did you remove many # doctest: +NORMALIZE_WHITESPACE and +ELLIPSIS pragmas?

aboucaud added some commits Jul 19, 2018

@jeremiedbb


Contributor

jeremiedbb commented Jul 19, 2018

@GaelVaroquaux Alex made the requested changes. I think it's good to go now.

@GaelVaroquaux


Member

GaelVaroquaux commented Jul 19, 2018

LGTM. Merging

@GaelVaroquaux GaelVaroquaux merged commit f158e2d into scikit-learn:master Jul 19, 2018

7 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.29%)
codecov/project 95.3% (+<.01%) compared to 5140762
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@aboucaud aboucaud deleted the aboucaud:cv-default-5 branch Jul 19, 2018

@amueller


Member

amueller commented Jul 20, 2018

Ohhh yeahhh!
