
[MRG] add stratify and shuffle variants for GroupKFold #9413

Closed
wants to merge 6 commits

Conversation

@andreasvc (Contributor)

Supersedes #5396

Adds a ``method`` option to GroupKFold to change how groups are distributed over folds. The current default balances the sizes of the folds; this adds the alternatives of stratifying on the y variable, or shuffling the groups to randomize which folds they end up in.
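
A rough sketch of how the proposed option might be used (the ``method`` argument exists only in this PR, not in the released GroupKFold API; the commented-out values are the PR's proposed alternatives):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(20, 2)
y = np.random.rand(20)
groups = np.repeat(np.arange(5), 4)  # 5 groups of 4 samples each

# Default behaviour: balance the fold sizes (method='balance' in this PR).
cv = GroupKFold(n_splits=5)

# Proposed alternatives from this PR:
# cv = GroupKFold(n_splits=5, method='stratify')
# cv = GroupKFold(n_splits=5, method='shuffle')

for train_idx, test_idx in cv.split(X, y, groups):
    # Whatever the method, a group never appears in both train and test.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```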

@jnothman (Member) left a comment


What work do you feel still needs to be done on this PR (i.e. why WIP)? I assume you need more tests of stratify and shuffle.

method : string, default='balance'
    One of 'balance', 'stratify', 'shuffle'.
    By default, try to equalize the sizes of the resulting folds.
    If 'stratify', sort groups according to ``y`` variable and distribute
@jnothman (Member)

It only sorts groups according to the y value of the first sample in each group, not, say, the mean or the mode. I think the stratify case is most useful when there is a many-to-one relationship between group and target. And I wonder if we should enforce that relationship (throw an error if a group has more than one y value) just to make this logic explicable, straightforward, and invariant to reordering the samples.
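
A minimal sketch of the kind of check being suggested here; the helper name is made up for illustration and is not part of the PR:

```python
import numpy as np

def check_single_target_per_group(y, groups):
    """Raise if any group maps to more than one target value."""
    y, groups = np.asarray(y), np.asarray(groups)
    for g in np.unique(groups):
        if len(np.unique(y[groups == g])) > 1:
            raise ValueError(
                f"Group {g!r} has more than one target value; "
                "stratified GroupKFold expects a many-to-one mapping."
            )

check_single_target_per_group([0, 0, 1, 1], ["a", "a", "b", "b"])  # passes
```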

@andreasvc (Contributor, Author)

It works best when there are many groups (i.e., groups are small), but it's not necessary to have a many-to-one relationship with the target variable. I now added an example where the target variable is a continuous variable. The plots show 1000 normally distributed data points, in 100 groups.

[figure: comparison plots of the fold distributions for the 1000-point, 100-group example]

@andreasvc (Contributor, Author)

I'm not sure how to make tests for the stratify and shuffle cases. Maybe the example I added for stratify is enough? Other than that I suppose this PR is ready.

@jnothman (Member)

jnothman commented Jul 20, 2017 via email

@JeanKossaifi (Contributor)

Ideally, shuffling should not alter the balance of the folds too much. Otherwise, what is the difference from GroupShuffleSplit?

@jnothman (Member)

jnothman commented Jul 20, 2017 via email

@andreasvc (Contributor, Author)

I modified "stratify" to use the median value for each group.

Balancing the folds versus stratifying or shuffling is a trade-off, and this PR lets one make that choice explicitly. If a dataset has only a few groups, they strongly constrain the possible folds, and balancing might take priority. When there is a sufficient number of groups, or the groups are all the same size, the fold sizes will be approximately balanced either way, and one can stratify or shuffle the folds instead.
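
A rough sketch of the stratification idea described here, not the PR's actual code: sort groups by their median target value, then deal the sorted groups out round-robin over the folds so each fold spans the range of y:

```python
import numpy as np

def stratified_group_assignment(y, groups, n_splits):
    """Assign each sample's group to a fold, stratifying on per-group median y."""
    y, groups = np.asarray(y, dtype=float), np.asarray(groups)
    unique_groups = np.unique(groups)
    medians = np.array([np.median(y[groups == g]) for g in unique_groups])
    order = np.argsort(medians)
    # Deal sorted groups out over the folds round-robin.
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups[order])}
    return np.array([fold_of_group[g] for g in groups])

rng = np.random.RandomState(0)
groups = np.repeat(np.arange(100), 10)        # 100 groups of 10 samples
y = rng.normal(size=1000) + groups * 0.01     # target loosely tied to group
folds = stratified_group_assignment(y, groups, n_splits=5)
```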

@jnothman (Member)

Tests appear to be failing

@andreasvc andreasvc changed the title [WIP] add stratify and shuffle variants for GroupKFold [MRG] add stratify and shuffle variants for GroupKFold Jul 22, 2017
@andreasvc (Contributor, Author)

All tests pass now, so I went ahead and renamed this MRG.

@amueller (Member)

amueller commented Jul 26, 2017

It's not clear from the docs what this actually does. Can you maybe give an example in the docs?
So it computes the median y value in each group. What does that mean for categories? Wouldn't you want the mode? And for classification this would mostly have an impact if the folds have very different y distributions?

Can you please add a legend to the plots?
And would it be possible to show an effect with 5-fold CV?

And shuffle doesn't seem to assign groups randomly, but greedily in random order.
The description of 'balance' also seems wrong: to truly balance the folds you'd have to solve something like a bin-packing problem, right? We're only using a greedy heuristic.
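
For reference, a sketch of the greedy heuristic being discussed (roughly how GroupKFold's default balancing works, as I understand it; not the library's actual code): process groups from largest to smallest and put each one in the currently lightest fold.

```python
import numpy as np

def greedy_balanced_assignment(groups, n_splits):
    """Greedily assign groups to folds so fold sizes stay roughly equal."""
    groups = np.asarray(groups)
    unique_groups, counts = np.unique(groups, return_counts=True)
    fold_sizes = np.zeros(n_splits, dtype=int)
    fold_of_group = {}
    # Largest groups first; each goes to the fold with the fewest samples so far.
    for g, c in sorted(zip(unique_groups, counts), key=lambda t: -t[1]):
        lightest = int(np.argmin(fold_sizes))
        fold_of_group[g] = lightest
        fold_sizes[lightest] += c
    return np.array([fold_of_group[g] for g in groups])

folds = greedy_balanced_assignment([0, 0, 0, 1, 1, 2, 2, 2, 2, 3], n_splits=2)
```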

@andreasvc (Contributor, Author)

Thanks for the comments. You're right, I didn't think the discrete case through, and the documentation should be improved.

What kind of API should I use to distinguish the discrete and continuous cases? Autodetection is possible but probably not a good idea. I see that the feature_selection module has separate regression and classification functions, but since the code here would be very similar, maybe a parameter is best?

The regular StratifiedKFold only supports the discrete case. Maybe I should add support for the continuous case there too?

You're right that shuffling is not completely random, but otherwise the folds could be very uneven. Did you mean it should be better documented, or that the implementation should be different?
I understand that the balanced option does not give an optimal solution; I didn't mean to imply that it does.

@amueller (Member)

My comments were mostly on the documentation. I think we should be very clear on what we are doing. Your solution seems good for regression, but is not the only possible way to stratify, I think.
I'm not opposed to adding this stratification strategy, but we should try to describe well what the different strategies are doing.

For feature_selection, I think handling regression and classification in the same class has given us a lot of headaches, so in some ways I'd prefer separate classes for regression and classification. On the other hand, I like that stratification is implemented as an option here. It's somewhat different from the other classes, but we already have a real explosion of different classes, and I might prefer to have it as a parameter.
But we can't really have separate classes for regression and classification and also have stratification as a parameter; that makes no sense. We could auto-detect using type_of_target, though that's a bit dangerous: if there are many different ints, how do you know whether they are classes or regression targets?

Maybe method="stratify_classification" and method="stratify_regression" would work, so we don't have redundant parameters?

There is a PR somewhere for stratified cross-validation for regression. It uses binning, though, and I think sorting is better. I haven't looked at it in a while.
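
For contrast, a minimal sketch of the binning approach mentioned here (hypothetical code, not that PR's implementation): discretize the continuous target into quantile bins and stratify on the bin labels with the existing StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.normal(size=100)

# Five quantile bins over the continuous target.
edges = np.quantile(y, np.linspace(0, 1, 6)[1:-1])
y_binned = np.digitize(y, edges)

cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y_binned):
    pass  # each fold gets a similar mix of low and high y values
```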

@jnothman (Member)

jnothman commented Jul 26, 2017 via email

@amueller (Member)

I think both are important.

@amueller (Member)

But yeah, definitely we need a motivation for each case. I think the case for having some stratified option is pretty clear, but I feel there are non-obvious choices to make.

@cmarmo added the Needs Decision label Oct 22, 2020
Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali (Member)

Closing this one as it needs a refresh and motivation, but related: #26821

@adrinjalali adrinjalali closed this Mar 6, 2024