ENH add shuffle to GroupKFold#28519

Merged
adrinjalali merged 36 commits into scikit-learn:main from zvealey:feature/shuffle-group-k-fold
Oct 28, 2024

@zvealey
Contributor

@zvealey zvealey commented Feb 23, 2024

Reference Issues/PRs

closes #13619
Partially addressing #20520

What does this implement/fix? Explain your changes.

This update introduces a shuffle feature to GroupKFold, along with a new class, RepeatedGroupKFold, which supports repeating GroupKFold n times. When shuffle=False, the behavior remains unchanged, ensuring that groups are split to achieve folds of as equal size as possible. Setting shuffle=True shuffles the unique groups before they are assigned to folds, without prioritizing the size of the individual groups.
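The `shuffle=True` behavior described above can be sketched in plain Python. This is only an illustration of the idea (shuffle the unique group labels, then cut them into `n_splits` nearly equal chunks), not the code added by this PR; the helper name `shuffled_group_folds` and its `seed` parameter are hypothetical.

```python
import random


def shuffled_group_folds(groups, n_splits, seed=None):
    """Return test indices per fold, assigning whole groups after a shuffle.

    Hypothetical helper for illustration; group sizes are deliberately ignored.
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)

    # Cut the shuffled labels into n_splits chunks whose lengths differ by at most one.
    base, extra = divmod(len(unique), n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        stop = start + base + (1 if i < extra else 0)
        chosen = set(unique[start:stop])
        # Every sample of a chosen group lands in this fold's test set.
        folds.append([idx for idx, g in enumerate(groups) if g in chosen])
        start = stop
    return folds


groups = ["a", "a", "a", "b", "c", "c", "d", "e", "e", "e"]
folds = shuffled_group_folds(groups, n_splits=2, seed=0)
for fold in folds:
    print(len(fold), sorted({groups[i] for i in fold}))
```

Each group ends up in exactly one fold, and the folds hold three and two groups respectively, but the number of samples per fold depends on which groups the shuffle happens to place together.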

Any other comments?

Although enabling shuffling does not guarantee that folds will be of similar size, in scenarios where groups are roughly equal in size (a common situation) the size difference between folds should be minimal.

@github-actions

github-actions bot commented Feb 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3fa6a23. Link to the linter CI: here

@NegatedObjectIdentity

I would like to see this functionality in scikit-learn, since right now I use an external function to do exactly this.

@glemaitre glemaitre self-requested a review March 15, 2024 16:58
Member

@glemaitre glemaitre left a comment

I see that I forgot to post my review that I started a while ago. @zvealey would you be able to address the comments?

@zvealey
Contributor Author

zvealey commented Jul 5, 2024

Hi @glemaitre! I missed when you posted your review! I updated the PR based on your comments!

@glemaitre glemaitre self-requested a review July 22, 2024 12:14
Member

@glemaitre glemaitre left a comment

I pushed a small commit just to ensure that one of the tests is stable across a large number of random seeds.

I took the opportunity to address a couple of comments that I would have raised.

@glemaitre glemaitre added this to the 1.6 milestone Jul 23, 2024
@glemaitre
Member

LGTM. We will need a second review.

@adrinjalali adrinjalali enabled auto-merge (squash) October 22, 2024 13:21
@glemaitre glemaitre self-requested a review October 28, 2024 17:48
Member

@glemaitre glemaitre left a comment

I pushed a small fix for the docstring.

@adrinjalali adrinjalali merged commit 0d37bd9 into scikit-learn:main Oct 28, 2024
@avm19
Contributor

avm19 commented Aug 30, 2025

Currently, as a result of this PR, the docstring says:

The folds are approximately balanced in the sense that the number of
samples is approximately the same in each test fold when shuffle is True.

but it should probably say "...when shuffle is False", because when shuffle is True the group sizes are not even computed in the code.

I guess the typo comes from how easy it is to confuse balancing the number of groups with balancing the number of samples. I also got it wrong when writing a now-deleted comment and linking to the discussion above.

The best clarification is found in the cross_validation.rst docs, and I suggest copying this explanation verbatim into GroupKFold's docstring to avoid any confusion:

While :class:`GroupKFold` attempts to place the same number of samples in each
fold when ``shuffle=False``, when ``shuffle=True`` it attempts to place equal
number of distinct groups in each fold (but does not account for group sizes).

Successfully merging this pull request may close these issues.

Shuffled GroupKFold