
[MRG] add stratify and shuffle variants for GroupKFold #9413

Closed
wants to merge 6 commits

Conversation

@andreasvc (Contributor)

Supersedes #5396

Adds a ``method`` option to GroupKFold to change how groups are distributed over folds. The current default balances the sizes of the folds; this adds the alternatives of stratifying on the y variable, or shuffling the groups to randomize which folds they end up in.
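
A rough sketch of how the proposed option might be used (the ``method`` argument exists only in this PR, not in the released GroupKFold API; the commented-out values are the PR's proposed alternatives):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(20, 2)
y = np.random.rand(20)
groups = np.repeat(np.arange(5), 4)  # 5 groups of 4 samples each

# Default behaviour: balance the fold sizes (method='balance' in this PR).
cv = GroupKFold(n_splits=5)

# Proposed alternatives from this PR:
# cv = GroupKFold(n_splits=5, method='stratify')
# cv = GroupKFold(n_splits=5, method='shuffle')

for train_idx, test_idx in cv.split(X, y, groups):
    # Whatever the method, a group never appears in both train and test.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```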

@jnothman (Member) left a comment


What work do you feel still needs to be done on this PR (i.e. why WIP)? I assume you need more tests of stratify and shuffle.

method : string, default='balance'
    One of 'balance', 'stratify', 'shuffle'.
    By default, try to equalize the sizes of the resulting folds.
    If 'stratify', sort groups according to ``y`` variable and distribute
@jnothman (Member)

It only sorts groups according to the y value of the first sample in each group, not, say, the mean or the mode. I think the stratify case is most useful when there is a many-to-one relationship between group and target. And I wonder if we should enforce that relationship (throw an error if a group has more than one y value) just to make this logic explicable, straightforward, and invariant to reordering the samples.
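
A minimal sketch of the kind of check being suggested here; the helper name is made up for illustration and is not part of the PR:

```python
import numpy as np

def check_single_target_per_group(y, groups):
    """Raise if any group maps to more than one target value."""
    y, groups = np.asarray(y), np.asarray(groups)
    for g in np.unique(groups):
        if len(np.unique(y[groups == g])) > 1:
            raise ValueError(
                f"Group {g!r} has more than one target value; "
                "stratified GroupKFold expects a many-to-one mapping."
            )

check_single_target_per_group([0, 0, 1, 1], ["a", "a", "b", "b"])  # passes
```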

@andreasvc (Contributor, Author)

It works best when there are many groups (i.e., groups are small), but it's not necessary to have a many-to-one relationship with the target variable. I now added an example where the target variable is a continuous variable. The plots show 1000 normally distributed data points, in 100 groups.

[figure: comparison plots of the fold distributions for the 1000-point, 100-group example]

@andreasvc (Contributor, Author)

I'm not sure how to make tests for the stratify and shuffle cases. Maybe the example I added for stratify is enough? Other than that I suppose this PR is ready.

@jnothman (Member)

jnothman commented Jul 20, 2017 via email

@JeanKossaifi (Contributor)

Ideally, shuffling should not alter the balance of the folds too much. Otherwise, what is the difference from GroupShuffleSplit?

@jnothman (Member)

jnothman commented Jul 20, 2017 via email

@andreasvc (Contributor, Author)

I modified "stratify" to use the median value for each group.

Balancing the folds versus stratifying or shuffling is a trade-off, and this PR lets one make that choice explicitly. If a dataset has only a few groups, they strongly constrain the possible folds, and balancing might take priority. When there is a sufficient number of groups, or the groups are all the same size, the fold sizes will be approximately balanced either way, and one can stratify or shuffle the folds instead.
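
A rough sketch of the stratification idea described here, not the PR's actual code: sort groups by their median target value, then deal the sorted groups out round-robin over the folds so each fold spans the range of y:

```python
import numpy as np

def stratified_group_assignment(y, groups, n_splits):
    """Assign each sample's group to a fold, stratifying on per-group median y."""
    y, groups = np.asarray(y, dtype=float), np.asarray(groups)
    unique_groups = np.unique(groups)
    medians = np.array([np.median(y[groups == g]) for g in unique_groups])
    order = np.argsort(medians)
    # Deal sorted groups out over the folds round-robin.
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups[order])}
    return np.array([fold_of_group[g] for g in groups])

rng = np.random.RandomState(0)
groups = np.repeat(np.arange(100), 10)        # 100 groups of 10 samples
y = rng.normal(size=1000) + groups * 0.01     # target loosely tied to group
folds = stratified_group_assignment(y, groups, n_splits=5)
```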

@jnothman (Member)

Tests appear to be failing

@andreasvc andreasvc changed the title [WIP] add stratify and shuffle variants for GroupKFold [MRG] add stratify and shuffle variants for GroupKFold Jul 22, 2017
@andreasvc (Contributor, Author)

All tests pass now, so I went ahead and renamed this MRG.

@amueller (Member)

amueller commented Jul 26, 2017

It's not clear from the docs what this actually does. Can you maybe give an example in the docs?
So it computes the median y value in each group. What does that mean for categories? Wouldn't you want the mode? And for classification this would mostly have an impact if the folds have very different y distributions?

Can you please add a legend to the plots?
And would it be possible to show an effect with 5-fold CV?

And shuffle doesn't seem to assign groups randomly, but greedily in random order.
The description of 'balance' also seems wrong: to truly balance the folds you'd have to solve something like a bin-packing problem, right? We're only using a greedy heuristic.
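
For reference, a sketch of the greedy heuristic being discussed (roughly how GroupKFold's default balancing works, as I understand it; not the library's actual code): process groups from largest to smallest and put each one in the currently lightest fold.

```python
import numpy as np

def greedy_balanced_assignment(groups, n_splits):
    """Greedily assign groups to folds so fold sizes stay roughly equal."""
    groups = np.asarray(groups)
    unique_groups, counts = np.unique(groups, return_counts=True)
    fold_sizes = np.zeros(n_splits, dtype=int)
    fold_of_group = {}
    # Largest groups first; each goes to the fold with the fewest samples so far.
    for g, c in sorted(zip(unique_groups, counts), key=lambda t: -t[1]):
        lightest = int(np.argmin(fold_sizes))
        fold_of_group[g] = lightest
        fold_sizes[lightest] += c
    return np.array([fold_of_group[g] for g in groups])

folds = greedy_balanced_assignment([0, 0, 0, 1, 1, 2, 2, 2, 2, 3], n_splits=2)
```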

@andreasvc (Contributor, Author)

Thanks for the comments. You're right, I didn't think the discrete case through, and the documentation should be improved.

What kind of API should I use to distinguish the discrete and continuous cases? Autodetection is possible but probably not a good idea. I see that the feature_selection module has separate regression and classification functions, but since the code here would be very similar, maybe a parameter is best?

The regular StratifiedKFold only supports the discrete case. Maybe I should add support for the continuous case there too?

You're right that shuffling is not completely random, but otherwise the folds could be very uneven. Did you mean it should be better documented, or that the implementation should be different?
I understand that the balanced option does not give an optimal solution; I didn't mean to imply that it does.

@amueller (Member)

My comments were mostly on the documentation. I think we should be very clear on what we are doing. Your solution seems good for regression, but is not the only possible way to stratify, I think.
I'm not opposed to adding this stratification strategy, but we should try to describe well what the different strategies are doing.

For feature_selection, I think handling regression and classification in the same class has given us a lot of headaches, so in some ways I'd prefer separate classes for regression and classification. On the other hand, I like that stratification is implemented as an option here. It's somewhat different from the other classes, but we already have a real explosion of different classes, and I might prefer to have it as a parameter.
But we can't really have separate classes for regression and classification and also have stratification as a parameter; that makes no sense. We could auto-detect using type_of_target, though that's a bit dangerous: if there are many different ints, how do you know whether they are classes or regression targets?

Maybe method="stratify_classification" and method="stratify_regression" would work, so we don't have redundant parameters?

There is a PR somewhere for stratified cross-validation for regression. It uses binning, though, and I think sorting is better. I haven't looked at it in a while.
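
For contrast, a minimal sketch of the binning approach mentioned here (hypothetical code, not that PR's implementation): discretize the continuous target into quantile bins and stratify on the bin labels with the existing StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.normal(size=100)

# Five quantile bins over the continuous target.
edges = np.quantile(y, np.linspace(0, 1, 6)[1:-1])
y_binned = np.digitize(y, edges)

cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y_binned):
    pass  # each fold gets a similar mix of low and high y values
```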

@jnothman (Member)

jnothman commented Jul 26, 2017 via email

@amueller (Member)

I think both are important.

@amueller (Member)

But yeah, definitely we need a motivation for each case. I think the case for having some stratified option is pretty clear, but I feel there are non-obvious choices to make.

@cmarmo added the Needs Decision label Oct 22, 2020
Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali (Member)

Closing this one as it needs a refresh and motivation, but related: #26821

@adrinjalali adrinjalali closed this Mar 6, 2024