Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] add shuffle paramater to train_test_split #8845

Merged
merged 20 commits into from Jun 8, 2017

Conversation

@themrmax
Copy link
Contributor

@themrmax themrmax commented May 8, 2017

Reference Issue

What does this implement/fix? Explain your changes.

Any other comments?

Max Flander added 3 commits May 8, 2017
Max Flander
Max Flander
Max Flander
@jnothman
Copy link
Member

@jnothman jnothman commented May 8, 2017

I don't object to this, but it should handle stratification (by doing a stable sort over stratify then taking the appropriate fraction of each class) or at least raise an error if stratify is passed.

Doctests are failing.

This needs tests in sklearn/model_selection/tests/test_split.py. For instance, it only currently handles test_size being a float, but test_size may be an int.

@jnothman
Copy link
Member

@jnothman jnothman commented May 8, 2017

It doesn't make sense to have a shuffle param in *ShuffleSplit, not only because of the name, but because they are oriented towards multiple splits.

Max Flander added 2 commits May 9, 2017
@themrmax
Copy link
Contributor Author

@themrmax themrmax commented May 9, 2017

@jnothman thanks for the feedback, what do you think about removing "shuffle" from the class names, i.e. renaming ShuffleSplit and StratifiedShuffleSplit to Split and StratifiedSplit ? I'm not sure I understand your point about being orientated towards multiple splits.

@jnothman
Copy link
Member

@jnothman jnothman commented May 9, 2017

Max Flander added 7 commits May 22, 2017
Max Flander
Max Flander
Max Flander
Max Flander
Max Flander
Max Flander
CVClass = StratifiedShuffleSplit
if shuffle is False:
if stratify is not None:
raise NotImplementedError()

This comment has been minimized.

@jnothman

jnothman May 23, 2017
Member

I'm okay with not implementing, but there South be a helpful error message

n_train, n_test = _validate_shuffle_split(n_samples, test_size,
train_size)

train = range(n_train)

This comment has been minimized.

@jnothman

jnothman May 23, 2017
Member

We should be returning numpy arrays, not range objects

@jnothman

This comment has been minimized.

Copy link
Member

@jnothman jnothman commented on f734e22 May 24, 2017

If tests passed before this, then tests need improving

Copy link
Member

@jnothman jnothman left a comment

Apart from checking the return type, this LGTM

@@ -1847,6 +1846,9 @@ def train_test_split(*arrays, **options):
If None, the random number generator is the RandomState instance used
by `np.random`.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting.

This comment has been minimized.

@jnothman

jnothman May 24, 2017
Member

Note that it does not currently work if stratified.

@themrmax
Copy link
Contributor Author

@themrmax themrmax commented May 24, 2017

@jnothman the thing with my numpy array gaffe is that I wasn't creating the returned arrays, just the arrays used internally as indexers. Still an important change for performance reasons (the change resulted in an 8x speedup) but doesn't actually affect the returned values. might be able to add a a test for it with mock, but it's not immediately clear to me how ...

@jnothman
Copy link
Member

@jnothman jnothman commented May 25, 2017

oh, okay. I misunderstood, then. I forgot train_test_split doesn't return indices.

Thanks. LGTM

@jnothman jnothman changed the title add shuffle paramater to train_test_split [MRG+1] add shuffle paramater to train_test_split May 25, 2017
@agramfort
Copy link
Member

@agramfort agramfort commented Jun 8, 2017

ok merging. I'll just change to ValueError in master following discussion with @GaelVaroquaux

thx @themrmax

@agramfort agramfort merged commit eaebfe0 into scikit-learn:master Jun 8, 2017
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@agramfort
Copy link
Member

@agramfort agramfort commented Jun 8, 2017

see 29958d8 and 32f2418

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
@themrmax themrmax deleted the themrmax:8844-train-test-split-shuffle branch Jun 28, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
* add shuffle paramater to train_test_split

* fix syntax error

* fix variable name

* fix formatting in doctest output

* fix doctest output

* refactor shuffle paramater into ShuffleSplit and StratifiedShuffleSplit

* include shuffle option in tests

* rollback refactor

* revert to simpler version of unshuffled split

* fix flake8 errors

* revert changes to ShuffleSplit

* revert BaseShuffleSplit

* more reversions

* fix indentation

* remove shuffle parameter from CVclass

* add text to NotImplementedError

* change indexing to use numpy.arange rather than range

* specify precondition for stratify to be None in docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.