-
-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH add shuffle to GroupKFold #28519
base: main
Are you sure you want to change the base?
Conversation
Perform shuffling of groups prior to creating folds. Also add corresponding RepeatedGroupKFold class.
…aley/scikit-learn into feature/shuffle-group-k-fold
I would like to see such functionality in scikit-learn, since right now I use a external function to perform exactly this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that I forgot to post my review that I started a while ago. @zvealey would you be able to address the comments?
if isinstance(random_state, np.random.RandomState): | ||
self.rng = random_state | ||
else: | ||
self.rng = np.random.RandomState(random_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be here. Instead you should use check_random_state
in the _iter_test_indices
in the if self.shuffle
|
||
def _iter_test_indices(self, X, y, groups): | ||
if groups is None: | ||
raise ValueError("The 'groups' parameter should not be None.") | ||
groups = check_array(groups, input_name="groups", ensure_2d=False, dtype=None) | ||
|
||
unique_groups, groups = np.unique(groups, return_inverse=True) | ||
unique_groups, group_inds = np.unique(groups, return_inverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
group_idx
or group_indices
would be more inline with what we usually use as a name.
@@ -606,29 +623,39 @@ def _iter_test_indices(self, X, y, groups): | |||
" than the number of groups: %d." % (self.n_splits, n_groups) | |||
) | |||
|
|||
# Weight groups by their number of occurrences | |||
n_samples_per_group = np.bincount(groups) | |||
if self.shuffle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update the documentation that states:
The folds are approximately balanced in the sense that the number of
distinct groups is approximately the same in each fold.
because it is only true when shuffle=False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user guide also need to be updated because it stated that GroupKFold
is not randomized.
into different folds when `shuffle=True`. | ||
:pr:`28519` by :user:`Zachary Vealey <zvealey>`. | ||
|
||
- |Feature| :class:`model_selection.RepeatedGroupKFold` has been created for repeating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you do a separate PR for this one. Basically, we prefer to merge a single feature by pull-request. It is more atomic and ease the review process.
@@ -552,6 +552,17 @@ class GroupKFold(GroupsConsumerMixin, _BaseKFold): | |||
.. versionchanged:: 0.22 | |||
``n_splits`` default value changed from 3 to 5. | |||
|
|||
shuffle : bool, default=False | |||
Whether to shuffle the groups before splitting into batches. | |||
Note that the samples within each split will not be shuffled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the samples within each split will not be shuffled. | |
Note that the samples within each split will not be shuffled. | |
.. versionadded:: 1.6 |
indices, which controls the randomness of each fold. Otherwise, this | ||
parameter has no effect. | ||
Pass an int for reproducible output across multiple function calls. | ||
See :term:`Glossary <random_state>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See :term:`Glossary <random_state>`. | |
See :term:`Glossary <random_state>`. | |
.. versionadded:: 1.6 |
@@ -171,6 +171,14 @@ Changelog | |||
- |Enhancement| :term:`CV splitters <CV splitter>` that ignores the group parameter now | |||
raises a warning when groups are passed in to :term:`split`. :pr:`28210` by | |||
|
|||
- |Feature| :class:`model_selection.GroupKFold` now has the ability to shuffle groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to move the entry in 1.6 since we are releasing today.
Reference Issues/PRs
closes #13619
Partially addressing #20520
What does this implement/fix? Explain your changes.
This update introduces a shuffle feature to
GroupKFold
, along with a new class,RepeatedGroupKFold
, which supports repeatingGroupKFold
n times. Whenshuffle=False
, the behavior remains unchanged, ensuring that groups are split to achieve folds of as equal size as possible. Settingshuffle=True
shuffles the unique groups before they are assigned to folds, without prioritizing the size of the individual groups.Any other comments?
Although enabling shuffling does not guarantee that folds will be of similar size, in scenarios where groups are roughly equal in size (a common situation) the size difference between folds should be minimal.