StratifiedGroupShuffleSplit and StratifiedGroupKFold #15239

Closed

Contributor

@hermidalc hermidalc commented Oct 13, 2019

This implements stratified versions of the GroupShuffleSplit and GroupKFold cross-validators. Note that GroupKFold differs from StratifiedGroupKFold: GroupKFold attempts to approximately balance the number of groups in each fold regardless of group class, whereas StratifiedGroupKFold attempts to keep the group class percentages in each fold the same as in the entire dataset.
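To make the contrast concrete, here is a minimal pure-Python sketch of the stratified K-fold idea. It is not the code from this PR: the function name `stratified_group_kfold` and the round-robin assignment are assumptions chosen for illustration, and it assumes each group contains samples of a single class.

```python
from collections import defaultdict


def stratified_group_kfold(y, groups, n_splits=3):
    """Yield (train_idx, test_idx) pairs; each group lands in one test fold.

    Illustrative sketch only. Assumes every sample in a group shares one class.
    """
    # Record the class of each group (taken from its first sample).
    group_class = {}
    for label, group in zip(y, groups):
        group_class.setdefault(group, label)

    # Collect the groups that belong to each class.
    class_groups = defaultdict(list)
    for group, label in group_class.items():
        class_groups[label].append(group)

    # Deal each class's groups round-robin so every fold receives
    # roughly the same share of groups from every class.
    fold_of = {}
    for label in sorted(class_groups):
        for position, group in enumerate(sorted(class_groups[label])):
            fold_of[group] = position % n_splits

    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx
```

A plain group K-fold would only balance how many groups go into each fold, so a fold could end up with groups from one class only; dealing groups out per class is what keeps the class mix similar across folds.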

There are two important points regarding the implementation logic:

  1. All samples in each group are of the same class
  2. Stratification is done on the group class level

This makes the logic straightforward and covers a lot of use cases (at least the ones I needed them for in my work).
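The two points above can be sketched in plain Python as follows. This is an illustration under the stated single-class-per-group assumption, not the PR's implementation; `stratified_group_shuffle_split` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_group_shuffle_split(y, groups, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with every group kept on one side.

    Illustrative sketch only. Requires one class per group (point 1).
    """
    # Point 1: map each group to its single class; fail loudly if mixed.
    group_class = {}
    for label, group in zip(y, groups):
        if group_class.setdefault(group, label) != label:
            raise ValueError(f"group {group!r} contains more than one class")

    class_groups = defaultdict(list)
    for group, label in group_class.items():
        class_groups[label].append(group)

    # Point 2: stratify on the group class level by drawing the same
    # fraction of groups from every class.
    rng = random.Random(seed)
    test_groups = set()
    for label in sorted(class_groups):
        members = sorted(class_groups[label])
        rng.shuffle(members)
        n_test = max(1, round(test_size * len(members)))
        test_groups.update(members[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

Note that `max(1, ...)` forces at least one test group per class, so a class with a single group would have no training samples in this sketch; a real implementation would need to validate that case.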

TODO:

  1. Tests

@NicolasHug
Member

Could you please add tests @hermidalc ;)

folds are made by preserving the percentage of groups for each class.

Note: like the StratifiedShuffleSplit strategy, stratified random group
splits do not guarantee that all folds will be different, although this is
Member

I don't understand this sentence. What do you mean by "folds not being different"?

Contributor Author

@hermidalc hermidalc Nov 12, 2019

That text is copied from StratifiedShuffleSplit. It means that shuffle splitting does not guarantee that each split will be different from the others.

Member

Ah. I think in sklearn we call folds the partitions in the split, not the repetitions.

Contributor Author

I think partitions/splits/folds is what is meant here, and I believe the same is true in StratifiedShuffleSplit. With randomized splits there is no guarantee that one partition will be different from the others.
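A small pure-Python illustration of that point, using made-up group names and a hypothetical helper `draw_test_groups`: when each repetition draws its test groups independently at random, nothing prevents two repetitions from drawing the same ones.

```python
import random


def draw_test_groups(class_groups, rng):
    """Pick one test group per class at random (hypothetical helper)."""
    return frozenset(
        rng.choice(sorted(members)) for members in class_groups.values()
    )


# Two classes with two groups each: only 2 * 2 = 4 test sets are possible.
class_groups = {"a": ["g1", "g2"], "b": ["g3", "g4"]}
rng = random.Random(0)
n_repeats = 10

draws = [draw_test_groups(class_groups, rng) for _ in range(n_repeats)]
distinct = set(draws)

# With 10 independent draws from 4 possibilities, repeats are unavoidable.
print(len(draws), "draws,", len(distinct), "distinct test sets")
```

With more groups repeats become less likely, but independent random draws still never rule them out.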

@hermidalc hermidalc changed the title from "Stratified GroupShuffleSplit" to "StratifiedGroupShuffleSplit and StratifiedGroupKFold" Mar 18, 2020
@tedthizzy

Hey @hermidalc, any update on this?

@jnothman
Member

Can we get tests by any chance, @hermidalc?

@hermidalc
Contributor Author

> Can we get tests by any chance, @hermidalc?

See reply in #13621

@cmarmo cmarmo added the Superseded (PR has been replaced by a newer PR) label Oct 20, 2020
Base automatically changed from master to main January 22, 2021 10:51
@cmarmo
Member

cmarmo commented Mar 20, 2021

closed by #18649

@cmarmo cmarmo closed this Mar 20, 2021
Labels
module:model_selection, Superseded (PR has been replaced by a newer PR)
6 participants