Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH add shuffle to GroupKFold #28519

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

zvealey
Copy link

@zvealey zvealey commented Feb 23, 2024

Reference Issues/PRs

closes #13619
Partially addressing #20520

What does this implement/fix? Explain your changes.

This update introduces a shuffle feature to GroupKFold, along with a new class, RepeatedGroupKFold, which supports repeating GroupKFold n times. When shuffle=False, the behavior remains unchanged, ensuring that groups are split to achieve folds of as equal size as possible. Setting shuffle=True shuffles the unique groups before they are assigned to folds, without prioritizing the size of the individual groups.

Any other comments?

Although enabling shuffling does not guarantee that folds will be of similar size, in scenarios where groups are roughly equal in size (a common situation) the size difference between folds should be minimal.

Copy link

github-actions bot commented Feb 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 88d599b. Link to the linter CI: here

@NegatedObjectIdentity
Copy link

I would like to see such functionality in scikit-learn, since right now I use a external function to perform exactly this.

@glemaitre glemaitre self-requested a review March 15, 2024 16:58
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that I forgot to post my review that I started a while ago. @zvealey would you be able to address the comments?

Comment on lines +607 to +610
if isinstance(random_state, np.random.RandomState):
self.rng = random_state
else:
self.rng = np.random.RandomState(random_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be here. Instead you should use check_random_state in the _iter_test_indices in the if self.shuffle


def _iter_test_indices(self, X, y, groups):
if groups is None:
raise ValueError("The 'groups' parameter should not be None.")
groups = check_array(groups, input_name="groups", ensure_2d=False, dtype=None)

unique_groups, groups = np.unique(groups, return_inverse=True)
unique_groups, group_inds = np.unique(groups, return_inverse=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group_idx or group_indices would be more inline with what we usually use as a name.

@@ -606,29 +623,39 @@ def _iter_test_indices(self, X, y, groups):
" than the number of groups: %d." % (self.n_splits, n_groups)
)

# Weight groups by their number of occurrences
n_samples_per_group = np.bincount(groups)
if self.shuffle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update the documentation that states:

    The folds are approximately balanced in the sense that the number of
    distinct groups is approximately the same in each fold.

because it is only true when shuffle=False.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user guide also need to be updated because it stated that GroupKFold is not randomized.

into different folds when `shuffle=True`.
:pr:`28519` by :user:`Zachary Vealey <zvealey>`.

- |Feature| :class:`model_selection.RepeatedGroupKFold` has been created for repeating
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do a separate PR for this one. Basically, we prefer to merge a single feature by pull-request. It is more atomic and ease the review process.

@@ -552,6 +552,17 @@ class GroupKFold(GroupsConsumerMixin, _BaseKFold):
.. versionchanged:: 0.22
``n_splits`` default value changed from 3 to 5.

shuffle : bool, default=False
Whether to shuffle the groups before splitting into batches.
Note that the samples within each split will not be shuffled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note that the samples within each split will not be shuffled.
Note that the samples within each split will not be shuffled.
.. versionadded:: 1.6

indices, which controls the randomness of each fold. Otherwise, this
parameter has no effect.
Pass an int for reproducible output across multiple function calls.
See :term:`Glossary <random_state>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
See :term:`Glossary <random_state>`.
See :term:`Glossary <random_state>`.
.. versionadded:: 1.6

@@ -171,6 +171,14 @@ Changelog
- |Enhancement| :term:`CV splitters <CV splitter>` that ignores the group parameter now
raises a warning when groups are passed in to :term:`split`. :pr:`28210` by

- |Feature| :class:`model_selection.GroupKFold` now has the ability to shuffle groups
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to move the entry in 1.6 since we are releasing today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Shuffled GroupKFold
3 participants