
[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes #7506

Merged: 21 commits into scikit-learn:master on Oct 19, 2016

Conversation

NarineK (Contributor) commented Sep 28, 2016

Currently, the training subset for each size is taken sequentially from the start of the training fold:
train[:n_train_samples]

If the training data are sorted by the target variable, small training sizes will contain only one label, and model fitting will always fail.
For example:

import numpy as np
from sklearn.learning_curve import learning_curve
from sklearn.svm import SVC

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12],
              [13, 14, 15], [16, 17, 18], [19, 20, 21]], np.int32)
y = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b'])

The following always fails with ValueError: The number of classes has to be greater than one; got 1:

train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3)

The following, with the proposed shuffle option, runs successfully for most tries:

train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3, shuffle=True)

An option to shuffle the indices of the training data before taking the first n_train_samples would increase the chance that small training subsets contain more than one label.

In this pull request I made a small modification that adds an option to shuffle and choose the indices randomly. We could do the same for incremental learning.
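
Roughly, the idea looks like this (a minimal sketch with illustrative names, not the actual patch):

import numpy as np
from sklearn.utils import check_random_state

# Sketch: shuffle the indices of one CV training fold once,
# then take prefixes for each requested training size.
rng = check_random_state(0)
train = np.arange(10)                    # indices of one CV training fold
shuffled = rng.permutation(train)        # shuffle before slicing
for n_train_samples in (3, 6, 10):
    subset = shuffled[:n_train_samples]  # instead of train[:n_train_samples]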

Let me know what you think.

Thanks!



amueller (Member) commented Sep 28, 2016

Why not just pass a cv with shuffle? Like cv=KFold(5, shuffle=True)?
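
For reference, a minimal sketch of that suggestion with the model_selection API, reusing the data from the example above. Note that a shuffling splitter randomizes fold membership, but the index arrays it returns are still sorted, so a short prefix of a training fold can still be single-class:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12],
              [13, 14, 15], [16, 17, 18], [19, 20, 21]], np.int32)
y = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b'])

cv = KFold(n_splits=3, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    # Fold membership is randomized, but the returned arrays are sorted,
    # so a prefix like train[:3] can still be single-class when the data
    # are ordered by label.
    print(train, '-> prefix', train[:3], 'labels', y[train[:3]])

# The splitter would then be passed as learning_curve(..., cv=cv).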

NarineK (Contributor) commented Sep 28, 2016

Thank you, @amueller for the prompt response.
For my specific use case I need to use LabelKFold.
But it seems that there is no support for shuffle.

import numpy as np
from sklearn.cross_validation import LabelKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
labels = np.array([0, 0, 2, 2])
label_kfold = LabelKFold(labels, n_folds=2, shuffle=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'shuffle'

amueller (Member) commented Sep 28, 2016

You are right, there's no shuffle in LabelKFold (which was just renamed GroupKFold).
I think a better solution would be to add optional shuffling to the cross-validation generator.
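
For reference, the renamed class lives in sklearn.model_selection and likewise has no shuffle option; a minimal sketch of the equivalent call:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])

# GroupKFold keeps each group entirely within one side of the split;
# its constructor accepts only n_splits, so there is no shuffle here either.
cv = GroupKFold(n_splits=2)
for train, test in cv.split(X, y, groups):
    print(train, test)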

NarineK (Contributor) commented Sep 28, 2016

I see, do you mean adding shuffle=False here?
https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cross_validation.py#L397
Thank you!

jnothman (Member) commented Sep 28, 2016

I was going to say that I don't mind adding a shuffle parameter, but I think I'm coming to agree with @amueller.

(Also, the sklearn.learning_curve module is deprecated. See model_selection)

jnothman (Member) commented Sep 28, 2016

Yes, adding shuffling to the equivalent in model_selection/_split.py


NarineK (Contributor) commented Sep 28, 2016

Thank you @amueller and @jnothman!

I see GroupKFold in model_selection/_split.py.
Let me give a try and add shuffle there.

amueller (Member) commented Sep 29, 2016

Great, thanks @NarineK :)

jnothman (Member) commented Sep 30, 2016

I hope it's okay to start a new PR, thanks @NarineK.

jnothman (Member) commented Oct 5, 2016

I think what you need here is actually a stratify option in learning_curve.

amueller (Member) commented Oct 5, 2016

@jnothman or leave the subsetting to the cv object and use StratifiedShuffleSplit? Hm that actually conflates the subsetting and the cross-validation somewhat...

What would a stratified learning curve look like? Make sure that for each n_samples amount the data is stratified?

And do we then also add a group option to learning curve? Maybe shuffle is a good enough fix?

amueller (Member) commented Oct 5, 2016

Thinking about it more, I feel that the way we are doing the learning curves feels weird. @agramfort do you have literature on cross-validation with learning curves?

jnothman (Member) commented Oct 5, 2016

We don't need to add a group option: the nice thing about the learning curve implementation is that any questions of dependency between training and test samples are handled by cv. The only question is then how to sample from the training data. I'm coming to the idea that shuffling it is not so bad. It will mean that there aren't ordering-dependent anomalies in the learning curve.

NarineK (Contributor) commented Oct 6, 2016

I prefer this option too. Should learning_curve in that case have two additional arguments, shuffle=False and random_state=None?

amueller (Member) commented Oct 7, 2016

@jnothman that's true about correlations between the training and the test set, but that might make for very strange learning curves. I guess we can do the shuffle here. I'm not entirely convinced by the CV based approach but I guess it's too late to change anyhow.

amueller reopened this Oct 7, 2016

NarineK (Contributor) commented Oct 7, 2016

I assume the change is needed in sklearn/model_selection/_validation.py instead of sklearn/learning_curve.py:
https://github.com/NarineK/scikit-learn/blob/1263e5acb1b9e729fdd740299a1bf5fe73a6c618/sklearn/model_selection/_validation.py#L652
Or maybe in both places?

amueller (Member) commented Oct 8, 2016

Only the model_selection, I think.

NarineK added some commits Oct 9, 2016

jnothman (Member) commented Oct 9, 2016

the overall patch is now nothing...?

NarineK (Contributor) commented Oct 9, 2016

I merged master in to fix the conflicts. I'll push my changes to model_selection soon.

NarineK (Contributor) commented Oct 10, 2016

It seems that incremental training is implemented in a separate method:

https://github.com/NarineK/scikit-learn/blob/61f66bb7d6158fd507f30a5a1a02e555e86bdc26/sklearn/model_selection/_validation.py#L847

We could shuffle it there too.

Review thread on sklearn/model_selection/_validation.py (outdated diff):
if shuffle:
train = rng.permutation(train)
train_test_proportions.append((train[:n_train_samples], test))

NarineK (Contributor) commented Oct 10, 2016

I could have implemented it the following way:

        if shuffle:
            rng = check_random_state(random_state)
            train_test_proportions = [(rng.permutation(train)[:n_train_samples], test)
                                      for train, test in cv_iter
                                      for n_train_samples in train_sizes_abs]
        else:
            train_test_proportions = [(train[:n_train_samples], test)
                                      for train, test in cv_iter
                                      for n_train_samples in train_sizes_abs]

but in this case the line "for train, test in cv_iter for n_train_samples in train_sizes_abs" is duplicated.

jnothman (Member) commented Oct 10, 2016

There are two strategies:

  1. shuffle once and take prefixes
  2. shuffle for each training set

Option 1. could result in smoother learning curves. Is there any benefit to 2.?

Ideally it should be possible to tell which strategy is being taken from the docstring.
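
A minimal sketch of the two strategies (illustrative names only):

import numpy as np
from sklearn.utils import check_random_state

rng = check_random_state(0)
train = np.arange(12)          # indices of one CV training fold
train_sizes_abs = [4, 8, 12]

# 1. Shuffle once per fold and take prefixes: the subsets are nested,
#    so each larger training set extends the previous one.
shuffled = rng.permutation(train)
nested = [shuffled[:n] for n in train_sizes_abs]

# 2. Reshuffle for every training size: the subsets are drawn
#    independently for each point on the curve.
independent = [rng.permutation(train)[:n] for n in train_sizes_abs]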

jnothman (Member) commented Oct 10, 2016

Yes, you need to handle the incremental case too.


amueller (Member) commented Oct 10, 2016

Well, method 2 is closer to what we actually want to estimate...

NarineK (Contributor) commented Oct 10, 2016

I think both of them could work.
For option 2 we have the overhead of shuffling for each training size in every (train, test) pair.

While for option 1 we shuffle once per (train, test) pair and reuse it for multiple sizes.

That's how I understand @jnothman's comment. I've made the change after his comment.

jnothman (Member) commented:

You've not currently tested that shuffle has any effect, I take it? Can we engineer a case like yours where not shuffling in a group k-fold case would train a bad model?

Review thread on sklearn/model_selection/_validation.py (outdated diff):
@@ -718,7 +719,14 @@ def learning_curve(estimator, X, y, groups=None,
verbose : integer, optional
Controls the verbosity: the higher, the more messages.
Returns
shuffle : boolean, optional
Whether to shuffle traing data before using it based on

jnothman (Member) commented Oct 10, 2016

*training

Review thread on sklearn/model_selection/_validation.py (outdated diff):
clone(estimator), X, y, classes, train, test, train_sizes_abs,
scorer, verbose) for train, test in cv.split(X, y, groups))
clone(estimator), X, y, classes, train if not shuffle else
rng.permutation(train), test, train_sizes_abs, scorer, verbose)

jnothman (Member) commented Oct 10, 2016

Can we please pull the generator out, perhaps into a separate function, to be shared by the incremental and non-incremental cases?

jnothman (Member) commented Oct 13, 2016

I suspect you could simplify that case further, but I've not looked at it in detail.

NarineK (Contributor) commented Oct 13, 2016

Sure, where should I add that test case?
Somewhere here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/tests/test_validation.py#L575 or in test_split.py?

jnothman (Member) commented Oct 14, 2016

test_validation, please


amueller (Member) commented Oct 14, 2016

> Cross-validation should then give a measure of the robustness to different sampling approaches.

I don't understand that point. You're saying approach 1 is measuring the increase in performance conditioned on the sample we saw, and approach 2 is trying to estimate performance as a function of training set size, right? I would agree. But I'm not sure what the cross-validation means in the context of 1.

Anyhow, I don't feel strongly either way.

NarineK added some commits Oct 15, 2016

Review thread on sklearn/model_selection/_validation.py (outdated diff):
Returns
shuffle : boolean, optional
Whether to shuffle training data before using it based on
`train_sizes`

jnothman (Member) commented Oct 15, 2016

I think you want double-backticks

Review thread on sklearn/model_selection/_validation.py (outdated diff):
@@ -718,7 +719,14 @@ def learning_curve(estimator, X, y, groups=None,
verbose : integer, optional
Controls the verbosity: the higher, the more messages.
Returns
shuffle : boolean, optional
Whether to shuffle training data before using it based on

jnothman (Member) commented Oct 15, 2016

"using it based on" could be "taking prefixes of it based on"

Review thread on sklearn/model_selection/tests/test_validation.py (outdated diff):
@@ -670,6 +670,31 @@ def test_learning_curve_batch_and_incremental_learning_are_equal():
test_scores_batch.mean(axis=1))
def test_learning_curve_batch_and_incremental_shuffle():

jnothman (Member) commented Oct 15, 2016

Since this test does nothing but check that shuffle=True gives results akin to shuffle=False (rather than ensuring that it's shuffled), I think you should just add for shuffle in [True, False] to tests above, rather than create a new smoke test.
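
A sketch of that parametrization on a self-contained example (the dataset and estimator here are placeholders, and the shuffle parameter is the one proposed in this PR):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.linear_model import PassiveAggressiveClassifier

X, y = make_classification(n_samples=60, random_state=0)
estimator = PassiveAggressiveClassifier(random_state=0)

# Run the same assertions under both settings instead of adding a
# separate smoke test.
for shuffle_train in [False, True]:
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=3, train_sizes=np.linspace(0.1, 1.0, 5),
        shuffle=shuffle_train)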

Review thread on sklearn/model_selection/tests/test_validation.py (outdated diff):
[15, 16], [17, 18]])
y = np.array([1, 1, 1, 2, 3, 4, 1, 1, 2, 3, 4, 1, 2, 3, 4])
groups = np.array([1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 4])
estimator = LogisticRegression(multi_class='multinomial', penalty='l2',

jnothman (Member) commented Oct 15, 2016

If this also fails at master with an estimator supporting partial_fit, such as SGDClassifier, then this test can incorporate tests for incremental and non-incremental approaches. I would rather see both tested.

NarineK (Contributor) commented Oct 16, 2016

Hi @jnothman,
specifically, SGDClassifier with incremental training does not fail without shuffle, but I believe that with shuffle it trains a better model (judging from the scores).
Without shuffle:

>>> train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3,1.0,3), groups=groups, verbose=1, exploit_incremental_learning=True)
>>> train_scores
array([[ 1.        ,  1.        ],
       [ 0.2       ,  0.2       ],
       [ 0.22222222,  0.16666667]])

With shuffle:

>>> train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3,1.0,3), groups=groups, verbose=1, shuffle=True, random_state=2, exploit_incremental_learning=True)
>>> train_scores
array([[ 0.5       ,  0.5       ],
       [ 0.4       ,  0.2       ],
       [ 0.22222222,  0.5       ]])

Review thread on sklearn/model_selection/_validation.py (outdated diff):
@@ -779,6 +796,13 @@ def learning_curve(estimator, X, y, groups=None,
return train_sizes_abs, out[0], out[1]
def _get_train_indices(train, rng):

jnothman (Member) commented Oct 15, 2016

Actually I think the refactoring that I want looks more like:

if shuffle:
    rng = check_random_state(random_state) 
    cv_iter = ((rng.permutation(train), test) for train, test in cv_iter)
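
(That wraps cv_iter in a generator that shuffles each fold's training indices once as they stream out; the downstream prefix-taking code stays unchanged, i.e. strategy 1 applied per split.)
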
Review thread on sklearn/model_selection/tests/test_validation.py (outdated diff):
@@ -713,6 +738,25 @@ def test_learning_curve_with_boolean_indices():
np.linspace(0.1, 1.0, 10))
def test_learning_curve_with_shuffle():
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [11, 12], [13, 14], [15, 16],

jnothman (Member) commented Oct 15, 2016

Add a note here about why the test is designed this way, or perhaps just reference the issue number.

jnothman (Member) commented Oct 15, 2016

Even just making clear that you assert in the test that the shuffle=False case breaks will help the maintainer.

jnothman (Member) commented Oct 16, 2016

I'd rather the contrast between failure and success. Why are you looking at
train_scores rather than test_scores?


NarineK (Contributor) commented Oct 16, 2016

I'll add the test score too.

NarineK (Contributor) commented Oct 16, 2016

SGDClassifier itself gives non-deterministic scores, which makes it hard to write test cases for it. Maybe MultinomialNB is a better option?

NarineK (Contributor) commented Oct 16, 2016

MultinomialNB doesn't fail if I set all labels the same. I'm not sure if this is by design, but it is not consistent with other algorithms. The output isn't helpful either.

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [11, 12], [13, 14], [15, 16],
              [17, 18], [19, 20], [7, 8], [9, 10], [11, 12], [13, 14],
              [15, 16], [17, 18]])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
groups = np.array([1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 4])
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
    groups=groups, shuffle=True, exploit_incremental_learning=True)
>>> train_scores
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
>>> test_scores
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

NarineK added some commits Oct 16, 2016

NarineK (Contributor) commented Oct 16, 2016

I couldn't find an estimator for which (shuffle=False and exploit_incremental_learning=True) would fail. Tried:

        sklearn.naive_bayes.MultinomialNB
        sklearn.linear_model.SGDClassifier
        sklearn.linear_model.PassiveAggressiveClassifier
jnothman (Member) commented Oct 18, 2016

I've realised why: it gets the list of classes from the whole dataset. So performing the test, despite not getting an error in master, is sufficient.

We still have an underlying problem that the metrics are being calculated incorrectly (assuming fewer classes than there should be), but that's a somewhat different issue, closely related to #6231.

jnothman (Member) commented:

Otherwise LGTM. Please add an entry to what's new (under 0.19/enhancements)

Review thread on sklearn/model_selection/_validation.py (outdated diff):
@@ -779,6 +795,14 @@ def learning_curve(estimator, X, y, groups=None,
return train_sizes_abs, out[0], out[1]
def _shuffle_train_indices(cv_iter, shuffle, random_state):

jnothman (Member) commented Oct 18, 2016

I think we had a little misunderstanding in the creation of this function. Can you please put it back inline? Thanks.

NarineK added some commits Oct 19, 2016

Addressed Joel's comments - removed _shuffle_train_indices, more test cases and added new entry under 0.19/enhancements
Review thread on doc/whats_new.rst (outdated diff):
@@ -31,6 +31,12 @@ Enhancements
<https://github.com/scikit-learn/scikit-learn/pull/4939>`_) by `Andrea
Esuli`_.
- Added ``shuffle`` and ``random_state`` parameters to shuffle training
data before taking prefixes of it based on training sizes in
``model_selection``s ``learning_curve``.

jnothman (Member) commented Oct 19, 2016

this should be

:func:`model_selection.learning_curve`
Review thread on doc/whats_new.rst (outdated diff):
data before taking prefixes of it based on training sizes in
``model_selection``s ``learning_curve``.
(`#7506` <https://github.com/scikit-learn/scikit-learn/pull/7506>_) by
`Narine Kokhlikyan`_.

jnothman (Member) commented Oct 19, 2016

need to add link target at bottom of file

jnothman changed the title from "Learning curve: Add an option to randomly choose indices for different training sizes" to "[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes" Oct 19, 2016

NarineK (Contributor) commented Oct 19, 2016

Added the modifications in doc/whats_new.rst
Thanks for reviewing, @jnothman

Review thread on sklearn/model_selection/tests/test_validation.py:
@@ -713,6 +716,37 @@ def test_learning_curve_with_boolean_indices():
np.linspace(0.1, 1.0, 10))
def test_learning_curve_with_shuffle():
"""Following test case was designed this way to verify the code

amueller (Member) commented Oct 19, 2016

Please use a comment, not a docstring for the test - that makes it easier to find out which test is run.
Also, I'm not sure I understand the test. Can you please add an explanation here?

amueller (Member) commented Oct 19, 2016

After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?

amueller (Member) commented:

LGTM apart from explaining the test.


Review thread on sklearn/model_selection/tests/test_validation.py:
estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
groups=groups, shuffle=True, random_state=2,
exploit_incremental_learning=True)
assert_array_almost_equal(train_scores_inc.mean(axis=1),

amueller (Member) commented Oct 19, 2016

Any reason to use the mean here instead of everything?

NarineK (Contributor) commented Oct 19, 2016

Thank you for the review, @amueller.
I used the mean instead of the full score arrays to be consistent with the other learning curve tests.

amueller (Member) commented Oct 19, 2016

thanks @NarineK

amueller merged commit 829efa5 into scikit-learn:master Oct 19, 2016

3 checks passed:
ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
NarineK (Contributor) commented Oct 19, 2016

oh, I haven't addressed this point yet, @amueller

> After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?

Is it too late now? I see you merged it.

amueller (Member) commented Oct 19, 2016

Damn, too quick. I'll add the comment in master. I'm right about the intent, though?

NarineK (Contributor) commented Oct 19, 2016

yes, you're right. Thank you.

jnothman (Member) commented Oct 19, 2016

Either way thanks, @NarineK, for raising the issue and solving it even when we told you to solve it the wrong way at first!

NarineK (Contributor) commented Oct 19, 2016

No problem, my pleasure!

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes (#7506)

* Chooses randomly the indices for different training sizes

* Bring back deleted line

* Rewrote the description of 'shuffle' attribute

* use random.sample instead of np.random.choice

* replace tabs with spaces

* merge to master

* Added shuffle in model-selection's learning_curve method

* Added shuffle for incremental learning + addressed Joel's comment

* Shorten long lines

* Add 2 blank spaces between test cases

* Addressed Joel's review comments

* Added 2 blank lines between methods

* Added non regression test for learning_curve with shuffle

* Fixed indentions

* Fixed space issues

* Modified test cases + small code improvements

* Fix some style issues

* Addressed Joel's comments - removed _shuffle_train_indices, more test cases and added new entry under 0.19/enhancements

* Added some modifications in whats_new.rst

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes (#7506)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes (#7506)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes (#7506)