[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes #7506
Conversation
Why not just pass a …
Thank you, @amueller, for the prompt response. `from sklearn.cross_validation import LabelKFold`
You are right, there's no shuffle in …
I see, do you mean adding `shuffle=False` here?
I was going to say that I don't mind adding a … (Also, the …
Yes, adding shuffling to the equivalent in model_selection/_split.py.
Great, thanks @NarineK :)
I hope it's okay to start a new PR, thanks @NarineK.
I think what you need here is actually a …
@jnothman, or leave the subsetting to the cv object and use … What would a stratified learning curve look like? Make sure that for each … And do we then also add a group option to learning curve? Maybe …
Thinking about it more, I feel that the way we are doing the learning curves is weird. @agramfort, do you have literature on cross-validation with learning curves?
We don't need to add a group option: the nice thing about the learning curve implementation is that any questions of dependency between training and test samples are handled by …
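(A minimal sketch of this point with made-up data: grouping is handled entirely by the splitter passed as `cv`, so `learning_curve` itself needs no group option. Ridge and random regression targets are assumptions chosen only to keep the example self-contained and avoid single-class training prefixes.)

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, learning_curve

rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = rng.rand(60)
groups = np.repeat(np.arange(12), 5)   # 12 groups of 5 samples each

# GroupKFold keeps whole groups out of each test set; learning_curve only
# takes prefixes of the training split it is handed, so train/test
# dependencies are already dealt with by the splitter.
train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, y, groups=groups, cv=GroupKFold(n_splits=3))
```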
I prefer this option too. Should …
@jnothman that's true about correlations between the training and the test set, but that might make for very strange learning curves. I guess we can do the shuffle here. I'm not entirely convinced by the CV-based approach, but I guess it's too late to change anyhow.
I assume the change is needed in sklearn/model_selection/_validation.py instead of sklearn/learning_curve.py.
Only in model_selection, I think.
The overall patch is now nothing…?
I merged master in to fix the conflicts. I'll push my changes in model_selection soon.
I'll add the test score too.
SGDClassifier itself gives non-deterministic scores, which makes it hard to write test cases for.
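(As a general note, and not necessarily what was done in this PR: pinning `random_state` is the usual way to make SGDClassifier scores reproducible in a test.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(random_state=0)

# With the same random_state, repeated fits give identical scores; without
# it, SGD's internal shuffling makes the score vary from run to run.
a = SGDClassifier(random_state=0).fit(X, y).score(X, y)
b = SGDClassifier(random_state=0).fit(X, y).score(X, y)
assert a == b
```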
MultinomialNB doesn't fail if I set all labels the same. I'm not sure if this is by design, but it is not consistent with other algorithms. The output isn't helpful either.
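(A minimal illustration of that behaviour with made-up count data: MultinomialNB fits a single-class `y` without raising, unlike e.g. SVC.)

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(10, 4))   # non-negative counts, as MultinomialNB expects
y = np.zeros(10)                      # every sample carries the same label

clf = MultinomialNB().fit(X, y)       # does not raise
print(clf.predict(X))                 # every prediction is the single known class
```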
I couldn't find an estimator for which shuffle=False and exploit_incremental_learning=True would fail. Tried: …
I've realised why: it gets the list of … We still have an underlying problem that the metrics are being calculated incorrectly (assuming fewer classes than there should be), but that's a somewhat different issue, closely related to #6231.
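(A sketch of what that presumably means in practice, with made-up data: when the full class list is passed to `partial_fit` up front, a batch containing a single label still trains without error, which would explain why nothing failed with `shuffle=False` and `exploit_incremental_learning=True`.)

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 0], [2, 1], [0, 3], [1, 2], [3, 0], [0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = MultinomialNB()
# Supplying all classes up front means this first batch, which contains
# only label 0, does not raise.
clf.partial_fit(X[:3], y[:3], classes=np.unique(y))
clf.partial_fit(X[3:], y[3:])
print(clf.predict(X))
```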
Otherwise LGTM. Please add an entry to what's new (under 0.19/enhancements).
@@ -779,6 +795,14 @@ def learning_curve(estimator, X, y, groups=None,
    return train_sizes_abs, out[0], out[1]


def _shuffle_train_indices(cv_iter, shuffle, random_state):
jnothman
Oct 18, 2016
Member
I think we had a little misunderstanding in the creation of this function. Can you please put it back inline? Thanks.
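(For readers of this thread: a rough, self-contained sketch of what "putting it back inline" could look like — shuffling each fold's training indices directly instead of going through a helper. This is an illustration, not the exact code merged in the PR.)

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=3)
shuffle, random_state = True, 0

cv_iter = list(cv.split(X, y))
if shuffle:
    rng = check_random_state(random_state)
    # Permute only the training indices; the test fold is left untouched.
    cv_iter = [(rng.permutation(train), test) for train, test in cv_iter]

print(cv_iter[0][0][:10])  # first ten (now shuffled) training indices of fold 1
```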
… cases and added new entry under 0.19/enhancements
LGTM
@@ -31,6 +31,12 @@ Enhancements
     <https://github.com/scikit-learn/scikit-learn/pull/4939>`_) by `Andrea
     Esuli`_.

   - Added ``shuffle`` and ``random_state`` parameters to shuffle training
     data before taking prefixes of it based on training sizes in
     ``model_selection``s ``learning_curve``.
jnothman
Oct 19, 2016
Member
this should be
:func:`model_selection.learning_curve`
     data before taking prefixes of it based on training sizes in
     ``model_selection``s ``learning_curve``.
     (`#7506 <https://github.com/scikit-learn/scikit-learn/pull/7506>`_) by
     `Narine Kokhlikyan`_.
jnothman
Oct 19, 2016
Member
need to add link target at bottom of file
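(The requested link target is a line like the following at the bottom of doc/whats_new.rst; the URL here is an assumption based on the contributor's GitHub handle.)

```rst
.. _Narine Kokhlikyan: https://github.com/NarineK
```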
Added the modifications in doc/whats_new.rst.
@@ -713,6 +716,37 @@ def test_learning_curve_with_boolean_indices():
                         np.linspace(0.1, 1.0, 10))


def test_learning_curve_with_shuffle():
    """Following test case was designed this way to verify the code
amueller
Oct 19, 2016
Member
Please use a comment, not a docstring for the test - that makes it easier to find out which test is run.
Also, I'm not sure I understand the test. Can you please add an explanation here?
amueller
Oct 19, 2016
Member
After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?
LGTM apart from explaining the test.
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
        groups=groups, shuffle=True, random_state=2,
        exploit_incremental_learning=True)
    assert_array_almost_equal(train_scores_inc.mean(axis=1),
amueller
Oct 19, 2016
Member
Any reason to use the mean here instead of everything?
NarineK
Oct 19, 2016
Author
Contributor
Thank you for the review, @amueller
I used mean instead of everything in order to be consistent with other test cases for learning curves.
Thanks @NarineK
Oh, I haven't addressed this point yet, @amueller:

"After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?"

Is it too late now? I see you merged it.
Damn, too quick. I'll add the comment in master. I'm right about the intent, though?
Yes, you're right. Thank you.
Either way, thanks, @NarineK, for raising the issue and solving it, even when we told you to solve it the wrong way at first!
No problem, my pleasure!
…different training sizes (scikit-learn#7506)
* Chooses randomly the indices for different training sizes
* Bring back deleted line
* Rewrote the description of 'shuffle' attribute
* use random.sample instead of np.random.choice
* replace tabs with spaces
* merge to master
* Added shuffle in model-selection's learning_curve method
* Added shuffle for incremental learning + addressed Joel's comment
* Shorten long lines
* Add 2 blank spaces between test cases
* Addressed Joel's review comments
* Added 2 blank lines between methods
* Added non regression test for learning_curve with shuffle
* Fixed indentions
* Fixed space issues
* Modified test cases + small code improvements
* Fix some style issues
* Addressed Joel's comments - removed _shuffle_train_indices, more test cases and added new entry under 0.19/enhancements
* Added some modifications in whats_new.rst
Currently, training sizes are chosen sequentially from 0 to n_train_samples:
train[:n_train_samples]
If the training data is sorted by the target variable, then for small training sizes the prefix contains only one label, and model fitting always fails.
For example:
The following always fails with `ValueError: The number of classes has to be greater than one; got 1`:
train_sizes, train_scores, valid_scores = learning_curve(SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3)
The following runs successfully most of the time:
train_sizes, train_scores, valid_scores = learning_curve(SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3, shuffle=True)
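(For readers without the original data: a self-contained sketch of this failure mode, using made-up features, labels sorted by class, and a non-shuffling KFold so the problem is easy to reproduce. In the scikit-learn version this PR targets, the unshuffled call raises; recent releases may instead emit a FitFailedWarning and return NaN scores because of the error_score default.)

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold, learning_curve

rng = np.random.RandomState(0)
X = rng.rand(90, 4)
y = np.repeat([0, 1, 2], 30)      # labels sorted by class
cv = KFold(n_splits=3)            # non-shuffling CV keeps that order

# Each training split has 60 samples; the 0.3 prefix (18 samples) is taken
# from the start of the sorted indices and therefore holds a single class,
# so SVC cannot be fit on it.
try:
    learning_curve(SVC(kernel='linear'), X, y, cv=cv, train_sizes=[0.3, 1.0])
except ValueError as exc:
    print(exc)   # "The number of classes has to be greater than one; ..."

# With shuffle=True the prefix is drawn from shuffled training indices and
# every fit succeeds.
learning_curve(SVC(kernel='linear'), X, y, cv=cv,
               train_sizes=[0.3, 1.0], shuffle=True, random_state=0)
```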
If we had an option to shuffle the indices of the training data before taking the first `n_train_samples`, it would increase our chances of not feeding the learner data with a single label and give us more label variety.
In the following pull request I made a small modification and added an option to shuffle and choose the indices randomly. We could do the same for incremental learning.
Let me know what you think.
Thanks!