
[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes #7506

Conversation


@NarineK NarineK commented Sep 28, 2016

Currently, the training samples for each training size are chosen sequentially, as a prefix of the training indices:
train[:n_train_samples]

If the training data is sorted by the target variable, small training sizes will contain only one label, and model fitting will always fail.
For example:

import numpy as np
from sklearn.learning_curve import learning_curve
from sklearn.svm import SVC

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20, 21]], np.int32)
y = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b'])

The following always fails with: ValueError: The number of classes has to be greater than one; got 1
train_sizes, train_scores, valid_scores = learning_curve(SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3)

The following runs successfully most of the time:

train_sizes, train_scores, valid_scores = learning_curve(SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3, shuffle=True)

An option to shuffle the training indices before taking the first n_train_samples would reduce the chance of fitting the learner on data with a single label and give more label variety.

In this pull request I made a small modification that adds an option to shuffle and randomly choose the indices. We could do the same for incremental learning.

Let me know what you think.

Thanks!




@amueller amueller commented Sep 28, 2016

Why not just pass a cv with shuffle? Like cv=KFold(5, shuffle=True)?
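
A concrete sketch of this suggestion, using X and y from the issue description (imports shown from the non-deprecated sklearn.model_selection module):

import numpy as np
from sklearn.model_selection import KFold, learning_curve
from sklearn.svm import SVC

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12],
              [13, 14, 15], [16, 17, 18], [19, 20, 21]], np.int32)
y = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b'])

# Any splitter instance can be passed as cv.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=cv)
# Caveat: the splitter only shuffles which samples land in each fold; the
# subsampling for each training size is still a prefix of the fold's sorted
# training indices, so the single-class error above can still occur.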


@NarineK NarineK commented Sep 28, 2016

Thank you, @amueller, for the prompt response.
For my specific use case I need to use LabelKFold, but it seems there is no shuffle support:

import numpy as np
from sklearn.cross_validation import LabelKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
labels = np.array([0, 0, 2, 2])
label_kfold = LabelKFold(labels, n_folds=2, shuffle=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'shuffle'


@amueller amueller commented Sep 28, 2016

You are right, there's no shuffle in LabelKFold (which was just renamed GroupKFold).
I think a better solution would be to add optional shuffling to the cross-validation generator.


@NarineK NarineK commented Sep 28, 2016

I see, do you mean adding shuffle=False here?
https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cross_validation.py#L397
Thank you!


@jnothman jnothman commented Sep 28, 2016

I was going to say that I don't mind adding a shuffle parameter, but I think I'm coming to agree with @amueller.

(Also, the sklearn.learning_curve module is deprecated. See model_selection)
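
For reference, the equivalent import from the non-deprecated module is:

from sklearn.model_selection import learning_curve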


@jnothman jnothman commented Sep 28, 2016

Yes, adding shuffling to the equivalent in model_selection/_split.py



@NarineK NarineK commented Sep 28, 2016

Thank you @amueller and @jnothman!

I see GroupKFold in model_selection/_split.py.
Let me give it a try and add shuffle there.


@amueller amueller commented Sep 29, 2016

Great, thanks @NarineK :)


@jnothman jnothman commented Sep 30, 2016

I hope it's okay to start a new PR. Thanks, @NarineK.


@jnothman jnothman commented Oct 5, 2016

I think what you need here is actually a stratify option in learning_curve.


@amueller amueller commented Oct 5, 2016

@jnothman or leave the subsetting to the cv object and use StratifiedShuffleSplit? Hm that actually conflates the subsetting and the cross-validation somewhat...

What would a stratified learning curve look like? Make sure that for each n_samples amount the data is stratified?

And do we then also add a group option to learning curve? Maybe shuffle is a good enough fix?


@amueller amueller commented Oct 5, 2016

Thinking about it more, the way we are doing the learning curves feels weird. @agramfort, do you have literature on cross-validation with learning curves?


@jnothman jnothman commented Oct 5, 2016

We don't need to add a group option: the nice thing about the learning curve implementation is that any questions of dependency between training and test samples are handled by cv. The only question is then how to sample from the training data. I'm coming around to the idea that shuffling it is not so bad. It will mean that there aren't ordering-dependent anomalies in the learning curve.
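
A rough sketch of that idea (a hypothetical helper for illustration, not the exact patch): shuffle each fold's training indices once, then take prefixes of the shuffled array for every requested training size.

import numpy as np
from sklearn.utils import check_random_state

def shuffled_prefixes(train, train_sizes_abs, shuffle=False, random_state=None):
    # Hypothetical helper, for illustration only.
    train = np.asarray(train)
    if shuffle:
        rng = check_random_state(random_state)
        train = rng.permutation(train)
    # Without shuffling this is the current behaviour: train[:n_train_samples].
    return [train[:n_train_samples] for n_train_samples in train_sizes_abs]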


@NarineK NarineK commented Oct 6, 2016

I prefer this option too. Should learning_curve in that case have two additional arguments, random_state=None and shuffle=False?


@amueller amueller commented Oct 7, 2016

@jnothman that's true about correlations between the training and the test set, but that might make for very strange learning curves. I guess we can do the shuffle here. I'm not entirely convinced by the CV based approach but I guess it's too late to change anyhow.

@amueller amueller reopened this Oct 7, 2016

@NarineK NarineK commented Oct 7, 2016

I assume the change is needed in sklearn/model_selection/_validation.py instead of sklearn/learning_curve.py
https://github.com/NarineK/scikit-learn/blob/1263e5acb1b9e729fdd740299a1bf5fe73a6c618/sklearn/model_selection/_validation.py#L652
Or maybe in both places?


@amueller amueller commented Oct 8, 2016

Only the model_selection, I think.

Narine Kokhlikyan added 2 commits Oct 9, 2016

@jnothman jnothman commented Oct 9, 2016

the overall patch is now nothing...?


@NarineK NarineK commented Oct 9, 2016

I merged master to fix the conflicts. I'll push my changes in model_selection soon.


@NarineK NarineK commented Oct 16, 2016

I'll add the test score too.


@NarineK NarineK commented Oct 16, 2016

SGDClassifier itself gives non-deterministic scores, which makes it hard to write test cases for it.
Maybe MultinomialNB is a better option?


@NarineK NarineK commented Oct 16, 2016

MultinomialNB doesn't fail if I set all labels the same. I'm not sure whether this is by design, but it is not consistent with other algorithms, and the output isn't helpful either:

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GroupKFold, learning_curve

estimator = MultinomialNB()   # the estimator under discussion
cv = GroupKFold(n_splits=2)   # cv was not shown in the original snippet; a group-aware splitter is assumed
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [11, 12], [13, 14], [15, 16], [17, 18],
              [19, 20], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16], [17, 18]])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
groups = np.array([1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 4])
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
    groups=groups, shuffle=True, exploit_incremental_learning=True)
>>> train_scores
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
>>> test_scores
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
Narine Kokhlikyan added 2 commits Oct 16, 2016

@NarineK NarineK commented Oct 16, 2016

I couldn't find an estimator for which shuffle=False and exploit_incremental_learning=True would fail. I tried:

        sklearn.naive_bayes.MultinomialNB
        sklearn.linear_model.SGDClassifier
        sklearn.linear_model.PassiveAggressiveClassifier

@jnothman jnothman commented Oct 18, 2016

I've realised why: it gets the list of classes from the whole dataset. So performing the test, despite not getting an error in master, is sufficient.

We still have an underlying problem that the metrics are being calculated incorrectly (assuming fewer classes than there should be), but that's a somewhat different issue, closely related to #6231.
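
For illustration, a minimal sketch of why the incremental path doesn't raise (not the library code verbatim): partial_fit is given the class list computed from the whole dataset, so a batch containing a single label is accepted.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB()
# classes derived from the whole dataset, as learning_curve does for
# exploit_incremental_learning=True
clf.partial_fit(X[:2], y[:2], classes=np.unique(y))  # single-label batch, no error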


@jnothman jnothman left a comment

Otherwise LGTM. Please add an entry to what's new (under 0.19/enhancements)

@@ -779,6 +795,14 @@ def learning_curve(estimator, X, y, groups=None,
return train_sizes_abs, out[0], out[1]


def _shuffle_train_indices(cv_iter, shuffle, random_state):

jnothman Oct 18, 2016

I think we had a little misunderstanding in the creation of this function. Can you please put it back inline? Thanks.

Narine Kokhlikyan added 2 commits Oct 19, 2016
… cases and added new entry under 0.19/enhancements

@jnothman jnothman left a comment

LGTM

@@ -31,6 +31,12 @@ Enhancements
<https://github.com/scikit-learn/scikit-learn/pull/4939>`_) by `Andrea
Esuli`_.

- Added ``shuffle`` and ``random_state`` parameters to shuffle training
data before taking prefixes of it based on training sizes in
``model_selection``s ``learning_curve``.

jnothman Oct 19, 2016

this should be

:func:`model_selection.learning_curve`
data before taking prefixes of it based on training sizes in
``model_selection``s ``learning_curve``.
(`#7506 <https://github.com/scikit-learn/scikit-learn/pull/7506>`_) by
`Narine Kokhlikyan`_.

jnothman Oct 19, 2016

need to add link target at bottom of file

@jnothman jnothman changed the title Learning curve: Add an option to randomly choose indices for different training sizes [MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes Oct 19, 2016

@NarineK NarineK commented Oct 19, 2016

Added the modifications in doc/whats_new.rst
Thanks for reviewing, @jnothman

@@ -713,6 +716,37 @@ def test_learning_curve_with_boolean_indices():
np.linspace(0.1, 1.0, 10))


def test_learning_curve_with_shuffle():
"""Following test case was designed this way to verify the code

amueller Oct 19, 2016

Please use a comment, not a docstring for the test - that makes it easier to find out which test is run.
Also, I'm not sure I understand the test. Can you please add an explanation here?

amueller Oct 19, 2016

After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?


@amueller amueller left a comment

LGTM apart from explaining the test.


estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
groups=groups, shuffle=True, random_state=2,
exploit_incremental_learning=True)
assert_array_almost_equal(train_scores_inc.mean(axis=1),

amueller Oct 19, 2016

Any reason to use the mean here instead of everything?

NarineK Oct 19, 2016 (Author)

Thank you for the review, @amueller
I used mean instead of everything in order to be consistent with other test cases for learning curves.


@amueller amueller commented Oct 19, 2016

thanks @NarineK

@amueller amueller merged commit 829efa5 into scikit-learn:master Oct 19, 2016
3 checks passed: ci/circleci (CircleCI tests passed), continuous-integration/appveyor/pr (AppVeyor build succeeded), continuous-integration/travis-ci/pr (Travis CI build passed)

@NarineK NarineK commented Oct 19, 2016

oh, I haven't addressed this point yet, @amueller

After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?

Is it too late now? I see you merged it.


@amueller amueller commented Oct 19, 2016

Damn, too quick. I'll add the comment in master. I'm right about the intent, though?


@NarineK NarineK commented Oct 19, 2016

Yes, you're right. Thank you.


@jnothman jnothman commented Oct 19, 2016

Either way thanks, @NarineK, for raising the issue and solving it even when we told you to solve it the wrong way at first!


@NarineK NarineK commented Oct 19, 2016

No problem, my pleasure!

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017
…different training sizes (scikit-learn#7506)

* Chooses randomly the indices for different training sizes

* Bring back deleted line

* Rewrote the description of 'shuffle' attribute

* use random.sample instead of np.random.choice

* replace tabs with spaces

* merge to master

* Added shuffle in model-selection's learning_curve method

* Added shuffle for incremental learning + addressed Joel's comment

* Shorten long lines

* Add 2 blank spaces between test cases

* Addressed Joel's review comments

* Added 2 blank lines between methods

* Added non regression test for learning_curve with shuffle

* Fixed indentions

* Fixed space issues

* Modified test cases + small code improvements

* Fix some style issues

* Addressed Joel's comments - removed _shuffle_train_indices, more test cases and added new entry under 0.19/enhancements

* Added some modifications in whats_new.rst
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…different training sizes (scikit-learn#7506)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…different training sizes (scikit-learn#7506)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…different training sizes (scikit-learn#7506)