[MRG+1] add shuffle paramater to train_test_split #8845

Merged
merged 20 commits into from Jun 8, 2017

Conversation

Projects
None yet
3 participants
@themrmax
Contributor

themrmax commented May 8, 2017

Reference Issue

What does this implement/fix? Explain your changes.

Any other comments?

Max Flander added some commits May 8, 2017

Max Flander
Max Flander
Max Flander
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 8, 2017

Member

I don't object to this, but it should handle stratification (by doing a stable sort over stratify then taking the appropriate fraction of each class) or at least raise an error if stratify is passed.

Doctests are failing.

This needs tests in sklearn/model_selection/tests/test_split.py. For instance, it only currently handles test_size being a float, but test_size may be an int.

Member

jnothman commented May 8, 2017

I don't object to this, but it should handle stratification (by doing a stable sort over stratify then taking the appropriate fraction of each class) or at least raise an error if stratify is passed.

Doctests are failing.

This needs tests in sklearn/model_selection/tests/test_split.py. For instance, it only currently handles test_size being a float, but test_size may be an int.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 8, 2017

Member

It doesn't make sense to have a shuffle param in *ShuffleSplit, not only because of the name, but because they are oriented towards multiple splits.

Member

jnothman commented May 8, 2017

It doesn't make sense to have a shuffle param in *ShuffleSplit, not only because of the name, but because they are oriented towards multiple splits.

Max Flander added some commits May 9, 2017

@themrmax

This comment has been minimized.

Show comment
Hide comment
@themrmax

themrmax May 9, 2017

Contributor

@jnothman thanks for the feedback, what do you think about removing "shuffle" from the class names, i.e. renaming ShuffleSplit and StratifiedShuffleSplit to Split and StratifiedSplit ? I'm not sure I understand your point about being orientated towards multiple splits.

Contributor

themrmax commented May 9, 2017

@jnothman thanks for the feedback, what do you think about removing "shuffle" from the class names, i.e. renaming ShuffleSplit and StratifiedShuffleSplit to Split and StratifiedSplit ? I'm not sure I understand your point about being orientated towards multiple splits.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 9, 2017

Member
Member

jnothman commented May 9, 2017

Max Flander added some commits May 22, 2017

Max Flander
Max Flander
Max Flander
Max Flander
Max Flander
Max Flander
sklearn/model_selection/_split.py
- CVClass = StratifiedShuffleSplit
+ if shuffle is False:
+ if stratify is not None:
+ raise NotImplementedError()

This comment has been minimized.

@jnothman

jnothman May 23, 2017

Member

I'm okay with not implementing, but there South be a helpful error message

@jnothman

jnothman May 23, 2017

Member

I'm okay with not implementing, but there South be a helpful error message

sklearn/model_selection/_split.py
+ n_train, n_test = _validate_shuffle_split(n_samples, test_size,
+ train_size)
+
+ train = range(n_train)

This comment has been minimized.

@jnothman

jnothman May 23, 2017

Member

We should be returning numpy arrays, not range objects

@jnothman

jnothman May 23, 2017

Member

We should be returning numpy arrays, not range objects

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 24, 2017

Member

If tests passed before this, then tests need improving

Member

jnothman commented on f734e22 May 24, 2017

If tests passed before this, then tests need improving

@jnothman

Apart from checking the return type, this LGTM

sklearn/model_selection/_split.py
@@ -1847,6 +1846,9 @@ def train_test_split(*arrays, **options):
If None, the random number generator is the RandomState instance used
by `np.random`.
+ shuffle : boolean, optional (default=True)
+ Whether or not to shuffle the data before splitting.

This comment has been minimized.

@jnothman

jnothman May 24, 2017

Member

Note that it does not currently work if stratified.

@jnothman

jnothman May 24, 2017

Member

Note that it does not currently work if stratified.

@themrmax

This comment has been minimized.

Show comment
Hide comment
@themrmax

themrmax May 24, 2017

Contributor

@jnothman the thing with my numpy array gaffe is that I wasn't creating the returned arrays, just the arrays used internally as indexers. Still an important change for performance reasons (the change resulted in an 8x speedup) but doesn't actually affect the returned values. might be able to add a a test for it with mock, but it's not immediately clear to me how ...

Contributor

themrmax commented May 24, 2017

@jnothman the thing with my numpy array gaffe is that I wasn't creating the returned arrays, just the arrays used internally as indexers. Still an important change for performance reasons (the change resulted in an 8x speedup) but doesn't actually affect the returned values. might be able to add a a test for it with mock, but it's not immediately clear to me how ...

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 25, 2017

Member

oh, okay. I misunderstood, then. I forgot train_test_split doesn't return indices.

Thanks. LGTM

Member

jnothman commented May 25, 2017

oh, okay. I misunderstood, then. I forgot train_test_split doesn't return indices.

Thanks. LGTM

@jnothman jnothman changed the title from add shuffle paramater to train_test_split to [MRG+1] add shuffle paramater to train_test_split May 25, 2017

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Jun 8, 2017

Member

ok merging. I'll just change to ValueError in master following discussion with @GaelVaroquaux

thx @themrmax

Member

agramfort commented Jun 8, 2017

ok merging. I'll just change to ValueError in master following discussion with @GaelVaroquaux

thx @themrmax

@agramfort agramfort merged commit eaebfe0 into scikit-learn:master Jun 8, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@agramfort

This comment has been minimized.

Show comment
Hide comment
Member

agramfort commented Jun 8, 2017

see 29958d8 and 32f2418

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

@themrmax themrmax deleted the themrmax:8844-train-test-split-shuffle branch Jun 28, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

themrmax Jeremiah Johnson
[MRG+1] add shuffle paramater to train_test_split (#8845)
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment