ENH add shuffle to GroupKFold #28519
Perform shuffling of groups prior to creating folds. Also add corresponding RepeatedGroupKFold class.
…aley/scikit-learn into feature/shuffle-group-k-fold
|
I would like to see such functionality in scikit-learn, since right now I use an external function to perform exactly this. |
Hi @glemaitre! I missed when you posted your review! I updated the PR based on your comments! |
glemaitre
left a comment
I pushed a small commit just to ensure that one of the tests was stable across a large number of random seeds.
I took the opportunity to address a couple of comments that I would have raised.
|
LGTM. We will need a second review. |
glemaitre
left a comment
I pushed a small fix for the docstring.
|
Currently, as a result of this PR, the docstring says:
but it should probably say "...when
I guess the typo comes from how easy it is to confuse balancing the number of groups with balancing the number of samples. I also got it wrong when writing a now-deleted comment and linking to the discussion above. The best clarification is found in the cross_validation.rst docs, and I suggest copying this explanation verbatim into GroupKFold's docstring to avoid any confusion: scikit-learn/doc/modules/cross_validation.rst Lines 670 to 672 in 3fa6a23 |
Reference Issues/PRs
closes #13619
Partially addressing #20520
What does this implement/fix? Explain your changes.
This update introduces a shuffle feature to GroupKFold, along with a new class, RepeatedGroupKFold, which supports repeating GroupKFold n times. When shuffle=False, the behavior remains unchanged, ensuring that groups are split to achieve folds of as equal size as possible. Setting shuffle=True shuffles the unique groups before they are assigned to folds, without prioritizing the size of the individual groups.
Any other comments?
Although enabling shuffling does not guarantee that folds will be of similar size, in scenarios where groups are roughly equal in size (a common situation) the size difference between folds should be minimal.
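For readers who want to see the idea in isolation, here is a minimal standard-library sketch of shuffling groups before assigning them to folds, and of repeating that split several times. It is not the code in this PR: the helper names `shuffled_group_folds` and `repeated_shuffled_group_folds` are hypothetical, and the round-robin assignment of shuffled groups to folds is an assumption chosen for simplicity.

```python
import random


def shuffled_group_folds(groups, n_splits, seed=None):
    """Yield (train_idx, test_idx) pairs that keep each group intact.

    Hypothetical helper: the unique groups are shuffled, then dealt
    round-robin into n_splits folds, ignoring how many samples each
    group contains.
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of groups")
    random.Random(seed).shuffle(unique)
    # The i-th group in shuffled order goes to fold i % n_splits.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


def repeated_shuffled_group_folds(groups, n_splits, n_repeats, seed=0):
    """Repeat the shuffled split n_repeats times, reseeding each round."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        yield from shuffled_group_folds(groups, n_splits, rng.randrange(2**32))


# Six groups of two samples each, split into 3 folds, repeated twice.
groups = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"]
splits = list(repeated_shuffled_group_folds(groups, n_splits=3, n_repeats=2))
for train, test in splits:
    print(sorted({groups[i] for i in test}))
```

Because every group here has the same number of samples, each test fold ends up with exactly four samples regardless of the shuffle order. With groups of unequal size, the fold sizes would drift apart, which is the trade-off described above.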