
[MRG+3] edit train/test_size default behavior #7459

Merged: 26 commits into scikit-learn:master on Jun 14, 2017

Conversation

@nelson-liu (Contributor) commented Sep 20, 2016

Reference Issue

Fixes #5948, #4618

What does this implement/fix? Explain your changes.

Changes the default behavior of train_size and test_size in splitters.
The defaults for both parameters are now None

  • If train_size and test_size are both None, train_size is set to 0.9 and test_size is set to 0.1.
  • If only one of train_size or test_size is set, the unset parameter takes the complement (n_samples - set_parameter for integer counts, 1 - set_parameter for fractions).
  • If both are explicitly set, check that each is at most 1 (for fractions) or at most n_samples (for integer counts), and respect that the user wants to subsample the dataset.
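The rules above can be sketched as a small resolution helper. This is a hypothetical illustration (`resolve_sizes` and its exact checks are not the actual scikit-learn implementation):

```python
def resolve_sizes(n_samples, train_size=None, test_size=None):
    """Hypothetical sketch of the proposed default resolution."""
    def to_count(size):
        # A float in (0, 1] is a fraction of n_samples; an int is absolute.
        return int(size * n_samples) if isinstance(size, float) else size

    if train_size is None and test_size is None:
        # Both unset: fall back to the proposed 0.9 / 0.1 defaults.
        train_size, test_size = 0.9, 0.1
    if test_size is None:
        n_train = to_count(train_size)
        return n_train, n_samples - n_train   # test is the complement
    if train_size is None:
        n_test = to_count(test_size)
        return n_samples - n_test, n_test     # train is the complement
    n_train, n_test = to_count(train_size), to_count(test_size)
    if n_train + n_test > n_samples:
        raise ValueError("train_size + test_size exceeds n_samples")
    return n_train, n_test                    # user wants a subsample
```

With both sizes given and their sum below `n_samples`, the remainder is simply left out of both splits.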

Any other comments?

I'm aware that the cross_validation module is being deprecated; does this mean that I shouldn't add these changes to it and only to model_selection?

TODO

  • Edit documentation to clearly explain these new changes


@nelson-liu nelson-liu force-pushed the nelson-liu:edit_train_test_split_api branch from 83a7a05 to 7ea9a8e Sep 20, 2016
@jnothman (Member) commented Sep 20, 2016

Please don't modify cross_validation.

On 20 September 2016 at 13:35, Nelson Liu notifications@github.com wrote:

Reference Issue

Fixes #5948 #5948
What does this implement/fix? Explain your changes.

Changes the default behavior of train_size and test_size in splitters.
The defaults for both parameters are now None

  • if train_size and test_size are both None, train_size is set to 0.1
    and test_size is set to `0.9`
  • if only one of train_size or test_size are set, then the value of
    the unset parameter is n_samples - set_parameter (the complement)
  • If both are explicitly set, check that they are equal to or less
    than 0 and respect that the user wants to subsample the dataset

Any other comments?

I'm aware that the cross_validation module is being deprecated; does this
mean that I shouldn't add these changes to it and only to model_selection?
TODO

  • Edit documentation to clearly explain these new changes


@nelson-liu (Contributor, Author) commented Sep 20, 2016

Please don't modify cross_validation.

ok, thanks.

@lesteve (Member) commented Sep 20, 2016

This needs a rebase on master to fix the conflicts.

@nelson-liu nelson-liu force-pushed the nelson-liu:edit_train_test_split_api branch from f77bdc6 to faa3124 Sep 20, 2016
@nelson-liu (Contributor, Author) commented Sep 20, 2016

This needs a rebase on master to fix the conflicts.

Done, thanks for letting me know. Didn't notice / realize someone else was working on the file as well :)

@nelson-liu nelson-liu changed the title [WIP] edit train/test_size default behavior [MRG] edit train/test_size default behavior Sep 27, 2016
@nelson-liu (Contributor, Author) commented Sep 27, 2016

this is ready to be looked at, missing a whatsnew though (I'd like to verify the current behavior is correct before writing it). Among the changes discussed earlier, GroupShuffleSplit has test_size set to a default of 0.2. This PR changes it to 0.1 to match all the other splitters.

@amueller (Member) commented Sep 30, 2016

I expect the docs of train_test_split also need a change. And you can't change the default behavior of GroupShuffleSplit without a deprecation. We could have done that in 0.18, but now it's too late :-/

@nelson-liu (Contributor, Author) commented Sep 30, 2016

right, sorry for omitting that @amueller . I've actually never "deprecated" by changing the default parameter of a function before, would you mind pointing me to an example / explaining how to do it? It seems to be missing in the contributing docs as well, so I'd be happy to add it there.

@amueller (Member) commented Sep 30, 2016

I'm not sure we should actually change the default from 0.2 to 0.1 in that case.

For reference, the deprecation documentation is here:
http://scikit-learn.org/dev/developers/contributing.html#deprecation

@amueller (Member) commented Sep 30, 2016

I'm always kinda confused by the fact that the "Contributing" link on the help page doesn't go to contributing, but to developer....

@nelson-liu (Contributor, Author) commented Sep 30, 2016

hmm, would

            warnings.warn("The default value of the train_size parameter was "
                          "changed from 0.2 to None, which changes the "
                          "default (train_size, test_size) "
                          "from (0.2, 0.8) to (0.1, 0.9) "
                          "in version 0.19", DeprecationWarning)

be clear? I feel like there's a better way to word it...

@amueller (Member) commented Sep 30, 2016

Usually we just say "The default value of (..) will change from .. to .. in version 0.20" and the None is really just a weird implementation detail

@nelson-liu (Contributor, Author) commented Oct 4, 2016

@amueller we also need a deprecation for train_test_split, right? Considering the defaults will change from (0.25, 0.75) to (0.1, 0.9) in version 0.21.

Also, by "the docs of train_test_split need a change", do you mean the docstrings or the examples / tutorial material?

@nelson-liu (Contributor, Author) commented Oct 4, 2016

Also, I edited the default parameters taken by the __init__ of GroupShuffleSplit to take test_size=None and train_size=None, allowing me to throw the deprecation warning and then accordingly set the previous defaults (test_size=0.2 and train_size=0.8). Should I just not have touched this?

@amueller (Member) commented Oct 5, 2016

We should not change the defaults of train_test_split. We should use the same behavior, though. And I meant the docstrings need to document the new way train_size and test_size interact.

@jnothman (Member) commented Oct 6, 2016

There are a couple of errors in your issue description:

  • if train_size and test_size are both None, train_size is set to 0.1 and test_size is set to 0.9

you mean the opposite

  • If both are explicitly set, check that they are equal to or less than 0 and respect that the user wants to subsample the dataset

I think you mean less than n_samples or less than 1.

Now that model_selection is released we need to be aware of backwards compatibility issues. For now, if a user sets train_size without setting test_size (and train_size+test_size > 1 or n_samples) the best we can do is warn that the behaviour will change in a future release, IMO.

@nelson-liu (Contributor, Author) commented Oct 7, 2016

We should not change the defaults of train_test_split. We should use the same behavior, though. And I meant the docstrings need to document the new way train_size and test_size interact.

Makes sense, I've reverted the change and updated the description accordingly.

There are a couple of errors in your issue description:

Indeed, thanks for catching that.

Now that model_selection is released we need to be aware of backwards compatibility issues. For now, if a user sets train_size without setting test_size (and train_size+test_size > 1 or n_samples) the best we can do is warn that the behaviour will change in a future release, IMO.

To clarify --- we should not implement the changes to the behavior when setting train_size without test_size, but throw a warning instead saying that it will change in future release (sort of like a deprecation)? Also, how is there a case that train_size+test_size > 1 or n_samples? Sorry if these questions have obvious answers.

@jnothman (Member) commented Oct 8, 2016

I'm not now sure about the details of my comment. Point is that all behaviours need to remain identical, except perhaps for a warning or extra user-specified parameters/values handled. I need to take a proper look at what's changing here to give more precise feedback but there's a huge load of issues for me to skim right now...

@jnothman (Member) commented Nov 24, 2016

I hope to look at this soon, @nelson-liu. In the meantime, resolving merge conflicts with master wouldn't hurt.

@amueller amueller added this to the 0.19 milestone Nov 30, 2016
@amueller (Member) commented Nov 30, 2016

@jnothman feel free to hassle me about this, too ;)

@lesteve lesteve force-pushed the nelson-liu:edit_train_test_split_api branch from 0c53fda to 898ab03 Dec 1, 2016
@lesteve (Member) commented Dec 1, 2016

Rebased on master to fix conflicts.

@nelson-liu (Contributor, Author) commented Dec 1, 2016

thanks @lesteve! not sure if this is ready to be looked at actually, I don't think i've addressed @jnothman 's latest comments wrt keeping behaviors identical...

@nelson-liu nelson-liu changed the title [MRG] edit train/test_size default behavior [WIP] edit train/test_size default behavior Dec 1, 2016
@jnothman (Member) left a review

Unfortunately, we need to retain the behaviour that train_test_split(train_size=.5) will sample 10% (or whatever the default is) for test and 50% for train. This behaviour can change in version 0.21. We achieve this by setting test_size to some otherwise inappropriate sentinel by default, such as 'default', which behaves exactly like the current default value (0.1). In 0.21, it can change to None and behave like you specify. In the case that test_size='default' and train_size is a number, we warn that behaviour will change to one where test_size will always complement train_size unless both are specified or both are unspecified.
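The sentinel approach described here can be sketched roughly as follows. This is a hedged illustration of the pattern, not scikit-learn's actual code (`check_test_size` is a hypothetical helper name, and the real warning text differs):

```python
import warnings

def check_test_size(test_size='default', train_size=None):
    """Hypothetical sketch of the deprecation pattern being discussed."""
    if test_size == 'default':
        # Only warn when the user relies on behaviour that will change,
        # i.e. train_size was given but test_size was left at its default.
        if train_size is not None:
            warnings.warn(
                "From version 0.21, test_size will default to the "
                "complement of train_size; pass test_size explicitly "
                "to silence this warning.", FutureWarning)
        test_size = 0.1  # behave exactly like the old fixed default
    return test_size
```

Because `'default'` can never collide with a legitimate user value (a float, an int, or `None`), the code can distinguish "user said nothing" from "user explicitly asked for `None`".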

test_size : float, int, or None, default None
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, and `train_size` is None,

@jnothman (Member), Dec 8, 2016: double backticks

If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, and `train_size` is None,
the value is set to 0.1. If None and `train_size` is not None, the

@jnothman (Member), Dec 8, 2016: double backticks here too

random_state=None):
if test_size is None and train_size is None:
warnings.warn("The default value of the test_size parameter"
"will change from 0.1 to 0.2 in version 0.21.",

@jnothman (Member), Dec 8, 2016: 0.2 is already the default, is it not?

@nelson-liu (Contributor, Author) commented Dec 25, 2016

@jnothman - let me know if i understood what to do correctly. The idea is that the current behavior shouldn't change, and thus this PR should just add the functions that detect when a user invokes behavior that will change in the future (test_size default, unspecified value, train_size some specified value) and warn them that the behavior of the code will change in 0.21. Then when 0.20dev comes along, we'll actually implement the new behavior?

If this is the case, I reverted everything to ensure that it maintains the current behavior and just warns the user appropriately.

@jnothman jnothman changed the title [MRG+1] edit train/test_size default behavior [MRG+3] edit train/test_size default behavior Jun 13, 2017
@jnothman (Member) commented Jun 13, 2017

Sufficient what's new:

In version 0.21, the default behavior of splitters that use the test_size and train_size parameter will change, such that specifying train_size alone will cause test_size to be the remainder.

@codecov (bot) commented Jun 14, 2017

Codecov Report

Merging #7459 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7459      +/-   ##
==========================================
+ Coverage   95.48%   95.48%   +<.01%     
==========================================
  Files         342      342              
  Lines       61013    61033      +20     
==========================================
+ Hits        58259    58279      +20     
  Misses       2754     2754
Impacted Files Coverage Δ
sklearn/model_selection/_split.py 98.65% <100%> (+0.04%) ⬆️
sklearn/model_selection/tests/test_split.py 96% <100%> (+0.03%) ⬆️


@jnothman jnothman merged commit 1f781e6 into scikit-learn:master Jun 14, 2017
0 of 3 checks passed at merge time (the CircleCI, AppVeyor, and Travis CI builds were still in progress)
@jnothman (Member) commented Jun 14, 2017

Thanks @nelson-liu

@nelson-liu (Contributor, Author) commented Jun 14, 2017

sorry for letting this languish @jnothman , and thanks for taking it into your own hands to patch it up.

@nelson-liu nelson-liu deleted the nelson-liu:edit_train_test_split_api branch Jun 14, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
@eliot1785 commented Aug 16, 2017

So what actually changed here, in practice? If the user did nothing they would have 0.75/0.25 before, and it looks like that is still the case.

@nelson-liu (Contributor, Author) commented Aug 16, 2017

So what actually changed here, in practice?

No breaking behavior has been implemented in version 0.19 (hence the FutureWarning). In version 0.21, if you specify train_size without specifying test_size, it will automatically set test_size = 1 - train_size

If the user did nothing they would have 0.75/0.25 before, and it looks like that is still the case.

Yes, that is correct.

@eliot1785 commented Aug 16, 2017

Right, so what I'm confused by is it seems like that's already the behavior. From the doc: "If None, the value is set to the complement of the train size."

@nelson-liu (Contributor, Author) commented Aug 16, 2017

ah, test_size is not None by default (it's now a sentinel string "default", and was the float 0.1 in version 0.18). If you explicitly pass None to it, then it will complement the train size. This is the case in both 0.18 and 0.19.

Does that help clarify?
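For illustration, the resolution order just described could be sketched as follows (`effective_test_size` is a hypothetical helper, not scikit-learn's actual code):

```python
# Hypothetical helper showing how test_size resolves in 0.19,
# per the explanation above:
def effective_test_size(test_size='default', train_size=None):
    if test_size == 'default':   # nothing passed: old fixed default
        return 0.1
    if test_size is None:        # explicit None: complement of train_size
        return None if train_size is None else 1 - train_size
    return test_size             # an explicit value always wins
```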

@eliot1785 commented Aug 16, 2017

That helps, although I would have thought the default was the float 0.25 since the docs also say, "By default, the value is set to 0.25." Was it 0.1 only in version 0.18?

@nelson-liu (Contributor, Author) commented Aug 16, 2017

Also, note that the defaults are different for different splitting functions/classes (e.g. train_test_split does have a default of 0.25 instead of 0.1).

edit: note that the train_test_split default test_size is 0.25 only if train_size and test_size are both None (which is their default value, unlike the CV splitters).

@eliot1785 commented Aug 16, 2017

But you're saying it was 0.1 for one version?

@nelson-liu (Contributor, Author) commented Aug 16, 2017

The default test_size for BaseSplitter has been 0.1 for at least 0.17 and 0.18 (and possibly longer).

As far as I can remember, the default test_size for train_test_split has been 0.25 (it could have been changed from 0.1, though -- you can probably find out by looking at the release history).

In addition: I'm not sure what objects you're referring to by "it" --- the default behavior of the Splitter classes and train_test_split are different (and this PR touches both, so unsure which you're referring to). Hopefully I answered your question though.

@eliot1785 commented Aug 17, 2017

Ok thanks. I just wanted to make sure the defaults for train_test_split hadn't changed recently because if so it would affect my models. It doesn't sound like it, so I'm satisfied. I guess this is a good example of the wisdom of always specifying parameters rather than relying on defaults.

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017