Stratified GroupKFold #13621

Closed
aditya1702 opened this issue Apr 11, 2019 · 62 comments · Fixed by #18649

@aditya1702
Contributor

aditya1702 commented Apr 11, 2019

Description

Currently sklearn does not have a stratified group kfold feature. Either we can use stratification or we can use group kfold. However, it would be good to have both.

I would like to implement it, if we decide to have it.
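
To make the gap concrete, here is a minimal sketch using the existing GroupKFold. The toy data (8 groups of 3 rows, with positives confined to 4 groups) is made up for illustration. GroupKFold keeps every group on one side of each split, but it never looks at y, so the number of positives per test fold is whatever the group assignment happens to give.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 groups of 3 rows; only groups 0-3 are positive.
groups = np.repeat(np.arange(8), 3)
y = np.isin(groups, [0, 1, 2, 3]).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant for splitting

folds = list(GroupKFold(n_splits=4).split(X, y, groups))

# Group constraint: no group is shared between train and test.
no_shared_groups = all(
    not (set(groups[train]) & set(groups[test])) for train, test in folds
)

# Class balance is not controlled: inspect the positives per test fold.
positives_per_test_fold = [int(y[test].sum()) for _, test in folds]
print(no_shared_groups, positives_per_test_fold)
```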

@aditya1702
Contributor Author

aditya1702 commented Apr 12, 2019

@TomDLT @NicolasHug What do you think?

@NicolasHug
Member

Might be interesting in theory, but I'm not sure how useful it'd be in practice. We can certainly keep the issue open and see how many people request this feature

@jnothman
Member

Do you assume that each group is in a single class?

@jnothman
Member

See also #9413

@aditya1702
Contributor Author

aditya1702 commented Apr 15, 2019

@jnothman Yes, I had a similar thing in mind. However, I see that the pull request is still open. I meant that a group will not be repeated across folds. If we use IDs as groups, then the same ID will not occur in multiple folds.

@arc12

arc12 commented Apr 29, 2019

I understand this is relevant to use of RFECV.
Currently this defaults to using a StratifiedKFold cv. Its fit() also takes groups=
However: it appears that groups is not respected when executing fit(). No warning (might be considered a bug).

Grouping AND stratification are useful for quite imbalanced datasets with inter-record dependency
(in my case, the same individual has multiple records, but there are still a large number of groups=people relative to the number of splits; I imagine there would be practical problems as the number of unique groups in the minority class gets anywhere near the number of splits).

So: +1!

@jambo6

jambo6 commented May 13, 2019

This would definitely be useful, for instance when working with highly imbalanced time-series medical data: keeping patients separate while (approximately) balancing the imbalanced class in each fold.

I have also found that StratifiedKFold takes groups as a parameter but doesn't group according to them; this should probably be flagged.
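
The behaviour described above can be checked with a short sketch. The toy data is an assumption chosen so that one group is larger than the per-fold share of its class. Depending on the scikit-learn version, passing groups here may additionally emit a warning.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 4 groups of 3 rows, 8 negatives and 4 positives.
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)
groups = np.repeat(np.arange(4), 3)

cv = StratifiedKFold(n_splits=2)
with_groups = [test.tolist() for _, test in cv.split(X, y, groups=groups)]
without_groups = [test.tolist() for _, test in cv.split(X, y)]

# Passing groups changes nothing about the resulting folds.
same_splits = with_groups == without_groups

# At least one group ends up on both sides of a split: group 3 holds
# 3 positives, but each test fold receives only 2 of the 4 positives.
leaky = any(
    set(groups[train]) & set(groups[test]) for train, test in cv.split(X, y)
)
print(same_splits, leaky)
```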

@guillermo-carrasco

Another good use of this feature would be financial data, which is usually very imbalanced. In my case, I have a highly imbalanced dataset with several records for the same entity (just different points in time). We want to use GroupKFold to avoid leakage, but also stratify: due to the high imbalance, we could otherwise end up with folds with very few or no positives.

@amueller
Member

amueller commented Aug 8, 2019

also see #14524 I think?

@hermidalc
Contributor

Another use case for Stratified GroupShuffleSplit and GroupKFold is biological "repeated measures" designs, where you have multiple samples per subject or other parent biological unit. Also in many real world datasets in biology there is class imbalance. Each group of samples has the same class. So it's important to stratify and keep groups together.

@jvel07

jvel07 commented Nov 11, 2019

Description

Currently sklearn does not have a stratified group kfold feature. Either we can use stratification or we can use group kfold. However, it would be good to have both.

I would like to implement it, if we decide to have it.

Hi, I think it would be quite useful for medicine ML. Is it implemented already?

@aditya1702
Contributor Author

@amueller Do you think we should implement this, given that people are interested in this?

@fcoppey

fcoppey commented Nov 12, 2019

I'm very interested too... It would be really useful in spectroscopy, where you have several replicate measurements for each of your samples and they really need to stay in the same fold during cross-validation. And if you have several unbalanced classes that you are trying to classify, you really want to use the stratify feature too. Therefore I vote for it too! Sorry, I'm not good enough to participate in the development, but those who take part can be sure it will be used :-)
Thumbs up for the whole team. Thanks!

@hermidalc
Contributor

Please look at the referenced issues and PRs in this thread, as work has at least been attempted on StratifiedGroupKFold. I've already done a StratifiedGroupShuffleSplit (#15239), which just needs tests; I've already used it quite a bit for my own work.

@amueller
Member

I think we should implement it, but I think I still don't know what we actually want. @hermidalc has a restriction that members of the same group must be of the same class. That's not the general case, right?

It would be good if people that are interested could describe their use-case and what they really want out of this.

There are #15239 #14524 and #9413 which I remember all having different semantics.

@fcoppey

fcoppey commented Nov 12, 2019

@amueller totally agree with you. I spent a few hours today looking through the different versions available (#15239, #14524 and #9413) but couldn't really tell whether any of them would fit my need. So here is my use case, in case it helps:
I have 1000 samples. Each sample has been measured 3 times with a NIR spectrometer, so each sample has 3 replicates that I want to keep together all the way...
These 1000 samples belong to 6 different classes with very different numbers of samples in each:
class 1: 400 samples
class 2: 300 samples
class 3: 100 samples
class 4: 100 samples
class 5: 70 samples
class 6: 30 samples
I want to build a classifier for each class: class 1 vs. all other classes, then class 2 vs. all other classes, etc.
To maximize the accuracy of each of my classifiers, it is important that samples of all 6 classes are represented in each fold. My classes are not very different from each other, so always having the 6 classes represented in each fold really helps to draw an accurate border.

This is why I believe a stratified (my 6 classes always represented in each fold) group (the 3 replicate measurements of each sample always kept together) k-fold seems to be very much what I am looking for here.
Any opinion?

@hermidalc
Contributor

hermidalc commented Nov 12, 2019

My use case and why I wrote up StratifiedGroupShuffleSplit is to support repeated measures designs https://en.wikipedia.org/wiki/Repeated_measures_design. In my use cases members of the same group must be of the same class.

@amueller
Member

amueller commented Nov 12, 2019

@fcoppey For you, the samples within a group always have the same class, right?

@hermidalc I'm not very familiar with the terminology, but from wikipedia it sounds like "repeated measure design" doesn't mean the same group must be within the same class as it says "A crossover trial has a repeated measures design in which each patient is assigned to a sequence of two or more treatments, of which one may be a standard treatment or a placebo."
Relating this to an ML setting, you could either try to predict from measurements whether an individual just received treatment or placebo, or you could try to predict an outcome given the treatment.
For either of those the class for the same individual could change, right?

Irrespective of the name, it sounds to me like you both have the same use case, while I was thinking about a case similar to what's described in the crossover study. Or maybe a bit more simple: you could have a patient become sick over time (or get better), so the outcome for a patient could change.

@amueller
Member

Actually the wikipedia article you link to explicitly says "Longitudinal analysis—Repeated measure designs allow researchers to monitor how participants change over time, both long- and short-term situations.", so I think that means that changing the class is included.
If there's another word that means that the measurement is done under the same conditions then we could use that word?

@hermidalc
Contributor

hermidalc commented Nov 13, 2019

@amueller yes you’re right, I realized I miswrote above where I meant to say in my use cases of this design not in this use case in general.

There can be many quite elaborate types of repeated measures designs, though in the two types I’ve needed StratifiedGroupShuffleSplit the within group same class restriction holds (longitudinal sampling before and after treatment when predicting treatment response, multiple pre-treatment samples per subject at different body locations when predicting treatment response).

I needed something right away that works so wanted to put it out there for others to use and to get something started on sklearn, plus if I’m not mistaken it’s more complicated to design the stratification logic when within group class labels can be different.

@fcoppey

fcoppey commented Nov 13, 2019

@amueller yes, always. They are replicates of the same measurement, taken in order to include the intra-variability of the device in the prediction.

@amueller
Member

@hermidalc yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class.

It should be quite easy to make StratifiedKFold do this. There are two options: ensure that each fold contains a similar number of samples, or ensure that each fold contains a similar number of groups.
The second one is trivial to do (by just pretending each group is a single point and passing to StratifiedKFold). That's what you do in your PR, it looks like.

GroupKFold, I think, heuristically trades off the two of them by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.

Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
The other PRs have slightly different goals. It would be good if someone could write up what the different use-cases are (I probably don't have the time right now).
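
The "each group is a single point" option can be sketched as follows. This is only an illustration under the stated restriction that every group carries a single class; the helper name split_groups_as_points is hypothetical and is not taken from any of the referenced PRs.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def split_groups_as_points(y, groups, n_splits=3):
    """Yield (train_idx, test_idx) by stratifying over groups, not rows.

    Assumes every group carries a single class label.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    # One representative row per group gives the group-level label.
    unique_groups, first_row = np.unique(groups, return_index=True)
    group_y = y[first_row]

    skf = StratifiedKFold(n_splits=n_splits)
    placeholder = np.zeros(len(unique_groups))  # X is not used for splitting
    for _, test_group_pos in skf.split(placeholder, group_y):
        in_test = np.isin(groups, unique_groups[test_group_pos])
        yield np.flatnonzero(~in_test), np.flatnonzero(in_test)


# Labels and groups taken from the docstring example later in this thread.
y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
folds = list(split_groups_as_points(y, groups, n_splits=3))
for train_idx, test_idx in folds:
    print("TEST groups:", groups[test_idx], "classes:", y[test_idx])
```

This balances the number of groups per class across folds, not the number of samples, so folds can differ in size when group sizes vary.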

@jnothman
Member

+1 for separately handling the group constraint where all samples have the same class.

@hermidalc
Contributor

@hermidalc yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class.

I don't fully understand this. A StratifiedGroupShuffleSplit and StratifiedGroupKFold that allow members of a group to be of different classes should have exactly the same split behavior when the user specifies all group members to be of the same class. We can just improve the internals later and the existing behavior will stay the same?

The second one is trivial to do (by just pretending each group is a single point and passing to StratifiedKFold). That's what you do in your PR, it looks like.

GroupKFold, I think, heuristically trades off the two of them by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.

Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
The other PRs have slightly different goals. It would be good if someone could write up what the different use-cases are (I probably don't have the time right now).

I will add StratifiedGroupKFold using the "each group single sample" approach I used.

@mrunibe

mrunibe commented Nov 23, 2019

It would be good if people that are interested could describe their use-case and what they really want out of this.

Very common use-case in medicine and biology when you have repeated measures.
An example: Assume you want to classify a disease, e.g. Alzheimer's disease (AD) vs. healthy controls from MR images. For the same subject, you might have several scans (from follow-up sessions or longitudinal data). Let's assume you have a total of 1000 subjects, 200 of them being diagnosed with AD (imbalanced classes). Most subjects have one scan, but for some of them 2 or 3 images are available. When training/testing the classifier, you want to make sure that images from the same subject are always in the same fold to avoid data leakage.
It's best to use StratifiedGroupKFold for this: stratify to account for class imbalance but with the group constraint that a subject must not appear in different folds.
NB: It would be nice to make it repeatable.

Below is an example implementation, inspired by a Kaggle kernel.

import numpy as np
from collections import Counter, defaultdict
from sklearn.utils import check_random_state

class RepeatedStratifiedGroupKFold():

    def __init__(self, n_splits=5, n_repeats=1, random_state=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state
        
    # Implementation based on this kaggle kernel:
    #    https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def split(self, X, y=None, groups=None):
        k = self.n_splits
        def eval_y_counts_per_fold(y_counts, fold):
            y_counts_per_fold[fold] += y_counts
            std_per_label = []
            for label in range(labels_num):
                label_std = np.std(
                    [y_counts_per_fold[i][label] / y_distr[label] for i in range(k)]
                )
                std_per_label.append(label_std)
            y_counts_per_fold[fold] -= y_counts
            return np.mean(std_per_label)
            
        rnd = check_random_state(self.random_state)
        for repeat in range(self.n_repeats):
            labels_num = np.max(y) + 1
            y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
            y_distr = Counter()
            for label, g in zip(y, groups):
                y_counts_per_group[g][label] += 1
                y_distr[label] += 1

            y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
            groups_per_fold = defaultdict(set)
        
            groups_and_y_counts = list(y_counts_per_group.items())
            rnd.shuffle(groups_and_y_counts)

            for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
                best_fold = None
                min_eval = None
                for i in range(k):
                    fold_eval = eval_y_counts_per_fold(y_counts, i)
                    if min_eval is None or fold_eval < min_eval:
                        min_eval = fold_eval
                        best_fold = i
                y_counts_per_fold[best_fold] += y_counts
                groups_per_fold[best_fold].add(g)

            all_groups = set(groups)
            for i in range(k):
                train_groups = all_groups - groups_per_fold[i]
                test_groups = groups_per_fold[i]

                train_indices = [i for i, g in enumerate(groups) if g in train_groups]
                test_indices = [i for i, g in enumerate(groups) if g in test_groups]

                yield train_indices, test_indices

Comparing RepeatedStratifiedKFold (samples of the same group might appear in both folds) with RepeatedStratifiedGroupKFold:

import matplotlib.pyplot as plt
from sklearn import model_selection

def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=plt.cm.coolwarm,
                   vmin=-.2, vmax=1.2)

    ax.scatter(range(len(X)), [ii + 1.5] * len(X), c=y, marker='_',
               lw=lw, cmap=plt.cm.Paired)
    ax.scatter(range(len(X)), [ii + 2.5] * len(X), c=group, marker='_',
               lw=lw, cmap=plt.cm.tab20c)

    yticklabels = list(range(n_splits)) + ['class', 'group']
    ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+2.2, -.2], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)

    
# demonstration
np.random.seed(1338)
n_splits = 4
n_repeats=5


# Generate the class/group data
n_points = 100
X = np.random.randn(100, 10)

percentiles_classes = [.4, .6]
y = np.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups
g = np.hstack([[ii] * 5 for ii in range(20)])


fig, ax = plt.subplots(1,2, figsize=(14,4))

cv_nogrp = model_selection.RepeatedStratifiedKFold(n_splits=n_splits,
                                                   n_repeats=n_repeats,
                                                   random_state=1338)
cv_grp = RepeatedStratifiedGroupKFold(n_splits=n_splits,
                                      n_repeats=n_repeats,
                                      random_state=1338)

plot_cv_indices(cv_nogrp, X, y, g, ax[0], n_splits * n_repeats)
plot_cv_indices(cv_grp, X, y, g, ax[1], n_splits * n_repeats)

plt.show()

[Figure: RepeatedStratifiedGroupKFold_demo]

@RachelOwl

RachelOwl commented Jan 23, 2020

+1 for StratifiedGroupKFold. I am trying to detect falls of seniors using sensor data from a smart watch. Since we don't have much fall data, we run simulations with different watches that get different classes. I also augment the data before training: from each data point I create 9 points, and these form a group. It is important that a group is not in both train and test, as explained.

@limjiayi

limjiayi commented Jan 25, 2020

I would like to be able to use StratifiedGroupKFold as well. I am looking at a dataset for predicting financial crises, where the years before, during and after each crisis form their own group. During training and cross-validation, members of each group should not leak between the folds.

@mohammadmoein

Is there any way to generalize this to the multilabel scenario (Multilabel_stratifiedGroupKfold)?

@philip-iv

+1 for this. We're analyzing user accounts for spam, so we want to group by user, but also stratify because spam is relatively low-incidence. For our use case, any user who spams once is flagged as a spammer in all data, so a group member will always have the same label.

@dispink

dispink commented Jul 9, 2020

@hermidalc Hope your PhD work has succeeded!
I'm looking forward to seeing this implemented as well, since my PhD work in geosciences needs this stratification feature with group control. I've spent some hours implementing this splitting idea manually in my project, but I gave up finishing it for the same reason... PhD progress. So I can totally understand how PhD work can eat up a person's time. LOL, no pressure. For now, I use GroupShuffleSplit as an alternative.

Cheers

@hermidalc
Contributor

@bfeeny @dispink it's very easy to use the two classes I wrote above. Create a file, e.g. split.py, with the following. Then in your user code, if the script is in the same directory as split.py, you simply import them with: from split import StratifiedGroupKFold, RepeatedStratifiedGroupKFold

from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)
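
The greedy criterion inside _iter_test_indices can be isolated as a small sketch: for each candidate fold, tentatively add the group's class counts, then score the result by the mean (over classes) of the standard deviation (over folds) of each fold's share of that class. The helper names fold_imbalance and best_fold and the toy counts are hypothetical, introduced only for illustration.

```python
import numpy as np


def fold_imbalance(y_counts_per_fold, y_distr):
    """Mean over classes of the std across folds of each class share."""
    shares = np.asarray(y_counts_per_fold, dtype=float) / np.asarray(
        y_distr, dtype=float
    )
    return float(np.mean(np.std(shares, axis=0)))


def best_fold(y_counts_per_fold, group_counts, y_distr):
    """Return the fold index that minimizes the imbalance, plus all scores."""
    scores = []
    for i in range(len(y_counts_per_fold)):
        trial = np.array(y_counts_per_fold, dtype=float)
        trial[i] += group_counts  # tentatively place the group in fold i
        scores.append(fold_imbalance(trial, y_distr))
    # argmin keeps the first minimum, like the strict "<" in the class above.
    return int(np.argmin(scores)), scores


# Toy state: 2 folds, 2 classes, 8 and 4 samples per class overall.
# Fold 0 already holds 4 negatives and 2 positives; fold 1 is empty.
counts = [[4, 2], [0, 0]]
chosen, scores = best_fold(counts, group_counts=[2, 0], y_distr=[8, 4])
print(chosen, scores)
```

In this toy state the new all-negative group goes to the empty fold, since placing it there leaves the class shares more even across folds.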

@dispink

dispink commented Jul 9, 2020

@hermidalc Thank you for the positive reply!
I quickly adopted it as you described. However, I can only get splits that have data in only the training set or only the test set. As far as I understand the code description, there is no parameter to specify the proportion between training and test sets, right?
I know there is a conflict between stratification, group control and dataset proportion... That's why I gave up continuing... But maybe we can still find a compromise to work around it.
[Screenshot]

Sincerely

@hermidalc
Contributor

hermidalc commented Jul 9, 2020

@hermidalc Thank you for the positive reply!
I quickly adopted it as you described. However, I can only get splits that have data in only the training set or only the test set. As far as I understand the code description, there is no parameter to specify the proportion between training and test sets, right?
I know there is a conflict between stratification, group control and dataset proportion... That's why I gave up continuing... But maybe we can still find a compromise to work around it.

To test, I created split.py and then ran this example in IPython, and it works. I've been using these custom CV iterators in my work for a long time and they have no issues on my side. BTW I'm using scikit-learn 0.22.2, not 0.23.x, so I'm not sure if that is the cause of the issue. Could you please try to run the example below and see if you can reproduce it? If you can, then it might be something with the y and groups in your work.

In [6]: import numpy as np 
   ...: from split import StratifiedGroupKFold 
   ...:  
   ...: X = np.ones((17, 2)) 
   ...: y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]) 
   ...: cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=777) 
   ...: for train_idxs, test_idxs in cv.split(X, y, groups): 
   ...:     print("TRAIN:", groups[train_idxs]) 
   ...:     print("      ", y[train_idxs]) 
   ...:     print(" TEST:", groups[test_idxs]) 
   ...:     print("      ", y[test_idxs]) 
   ...:
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]

@jnothman
Member

jnothman commented Jul 9, 2020 via email

@dispink

dispink commented Jul 9, 2020

@hermidalc 'You have to make sure that every sample in the same group has the same class label.' Obviously that's the problem. My samples in the same group don't share the same class. Mmm...it seems to be another branch of development.
Thank you very much anyway.

@hermidalc
Contributor

@hermidalc 'You have to make sure that every sample in the same group has the same class label.' Obviously that's the problem. My samples in the same group don't share the same class. Mmm...it seems to be another branch of development.
Thank you very much anyway.

Yes, this has been discussed in various threads here. It's another, more complex use case that is useful, but many like myself don't currently need it and needed something which keeps groups together yet stratifies on the samples. The requirement of the code above is that all the samples in each group belong to the same class.

Actually @dispink I was wrong, this algorithm does not require that all members of a group belong to the same class. For example:

In [2]: X = np.ones((17, 2)) 
   ...: y =      np.array([0, 2, 1, 1, 2, 0, 0, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0]) 
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]) 
   ...: cv = StratifiedGroupKFold(n_splits=3) 
   ...: for train_idxs, test_idxs in cv.split(X, y, groups): 
   ...:     print("TRAIN:", groups[train_idxs]) 
   ...:     print("      ", y[train_idxs]) 
   ...:     print(" TEST:", groups[test_idxs]) 
   ...:     print("      ", y[test_idxs]) 
   ...:
TRAIN: [1 1 2 2 3 3 3 4 8 8]
       [0 2 1 1 2 0 0 1 1 0]
 TEST: [5 5 5 5 6 6 7]
       [2 1 1 1 0 2 0]
TRAIN: [1 1 4 5 5 5 5 6 6 7 8 8]
       [0 2 1 2 1 1 1 0 2 0 1 0]
 TEST: [2 2 3 3 3]
       [1 1 2 0 0]
TRAIN: [2 2 3 3 3 5 5 5 5 6 6 7]
       [1 1 2 0 0 2 1 1 1 0 2 0]
 TEST: [1 1 4 8 8]
       [0 2 1 1 0]

So I'm not quite sure what is going on with your data, since even with your screenshots I cannot truly see what your data layout is and what might be happening. I would suggest you first reproduce the examples I showed here to make sure it's not a scikit-learn version issue (since I'm using 0.22.2). If you can reproduce them, I would suggest you start from small parts of your data and test those. Troubleshooting with ~104k samples is difficult.

@dispink

dispink commented Jul 10, 2020

@hermidalc Thank you for the reply!
I can actually reproduce the result above, so I'm now troubleshooting with a smaller dataset.

@GustavoGianotti

+1

@marrodion
Contributor

marrodion commented Oct 4, 2020

Anyone mind if I pick this issue up?
It seems that #15239, together with #13621 (comment), already has an implementation and only unit tests are left to do.

@ddofer

ddofer commented Feb 7, 2021

+1

@peterjesus

Hi there, any news about this feature? I'm dealing with a project that requires this kind of folding and have been missing it!

@yoni2k

yoni2k commented Mar 7, 2021

Totally necessary!
I would love to use it for a content recommendation engine!

@justinas-kazanavicius

It makes it really confusing that the documentation for StratifiedKFold and RepeatedStratifiedKFold includes groups as a parameter to the split function, but in reality, this parameter does not affect the splits in any way. Either the solutions in this thread should be merged into the existing classes (so the parameter actually does something), or there should be new classes (StratifiedGroupKFold and RepeatedStratifiedGroupKFold) and the useless group parameter should be taken out of the non-group classes.
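A small check illustrates the point about `groups` having no effect. This is a sketch on made-up toy data; the warning filter is only a precaution in case the installed release warns about the ignored argument.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones((8, 1))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
groups = np.array([1, 2, 1, 2, 3, 4, 3, 4])  # hypothetical group labels

cv = StratifiedKFold(n_splits=2)  # shuffle=False, so splits are deterministic

without_groups = [test.tolist() for _, test in cv.split(X, y)]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    with_groups = [test.tolist() for _, test in cv.split(X, y, groups)]

# Identical test folds with and without groups means groups was not used.
print("same splits:", without_groups == with_groups)
```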

@mdanb

mdanb commented Jun 1, 2021

@hermidalc how can I use StratifiedGroupKFold with GridSearchCV?

@hermidalc
Contributor

hermidalc commented Jun 1, 2021

@hermidalc how can I use StratifiedGroupKFold with GridSearchCV?

Like you would use any other CV iterator with GridSearchCV: pass an instance of it to the cv parameter at construction, e.g. gscv = GridSearchCV(estimator, param_grid, cv=StratifiedGroupKFold()), and pass your group labels when fitting, i.e. gscv.fit(X, y, groups=groups).

@yoni2k

yoni2k commented Jul 18, 2021

@hermidalc
I think my use case is similar but with slight differences:

  1. What if I want to group by 1 feature in X, and stratify by another feature in X (and not by y)?
  2. There seems to be an assumption in this code that what we are stratifying by are consecutive integers starting from 0, right? What if the "label" itself is not an integer / not consecutive?

I think I solved both above with the following code:

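`import random
from collections import Counter, defaultdict

import numpy as np
from tqdm import tqdm  # progress bar only; can be dropped


def stratified_group_k_fold(X, y,
                            categories_stratify_by, # assumed to be a Pandas Series
                            groups, k, seed=None):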
    unique_cats = categories_stratify_by.unique()
    cats_index = {cat:ind for ind, cat in enumerate(unique_cats)}
    cats_num = len(unique_cats)
    cat_counts_per_group = defaultdict(lambda: np.zeros(cats_num))
    cat_distr = Counter()
    for cat, g in zip(categories_stratify_by, groups):
        cat_counts_per_group[g][cats_index[cat]] += 1
        cat_distr[cat] += 1

    cat_counts_per_fold = defaultdict(lambda: np.zeros(cats_num))
    groups_per_fold = defaultdict(set)

    def eval_cat_counts_per_fold(cat_counts, fold):
        cat_counts_per_fold[fold] += cat_counts
        std_per_cat = []
        for cat in unique_cats:
            cat_std = np.std([cat_counts_per_fold[i][cats_index[cat]] / cat_distr[cat] for i in range(k)])
            std_per_cat.append(cat_std)
        cat_counts_per_fold[fold] -= cat_counts
        return np.mean(std_per_cat)
    
    groups_and_cat_counts = list(cat_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_cat_counts)

    for g, cat_counts in tqdm(sorted(groups_and_cat_counts, key=lambda x: -np.std(x[1]))):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_cat_counts_per_fold(cat_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        cat_counts_per_fold[best_fold] += cat_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]

        # Use a distinct name for the sample index so it does not shadow the fold index i.
        train_indices = [idx for idx, g in enumerate(groups) if g in train_groups]
        test_indices = [idx for idx, g in enumerate(groups) if g in test_groups]

        yield train_indices, test_indices`

Does this code seem right to you?

In addition, and mainly, the code is extremely slow (perhaps because the number of unique values in the feature I'm stratifying by is pretty large and not 2 like in the examples above). Any ideas on how to make it more efficient?

Thanks!
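On the speed question, one likely cost (an assumption, not profiled) is that `eval_cat_counts_per_fold` runs a Python loop over every category for every candidate fold and every group. Below is a minimal sketch of the same greedy heuristic with that inner loop replaced by NumPy array operations. The function name `assign_groups_to_folds` and its return format (one fold index per sample) are hypothetical, not part of scikit-learn. Labels are encoded with `np.unique`, so they need not be consecutive integers. Shuffling uses NumPy's generator, so the folds will not match the original code exactly.

```python
import numpy as np


def assign_groups_to_folds(labels, groups, k, seed=None):
    """Greedily assign whole groups to k folds; return a fold index per sample."""
    labels = np.asarray(labels)
    groups = np.asarray(groups)
    uniq_cats, cat_idx = np.unique(labels, return_inverse=True)
    uniq_groups, grp_idx = np.unique(groups, return_inverse=True)
    n_cats, n_groups = len(uniq_cats), len(uniq_groups)

    # counts[g, c] = number of samples of category c in group g
    counts = np.zeros((n_groups, n_cats))
    np.add.at(counts, (grp_idx, cat_idx), 1)
    cat_totals = counts.sum(axis=0)

    # Shuffle, then stable-sort groups by descending std of their category counts.
    order = np.random.default_rng(seed).permutation(n_groups)
    order = order[np.argsort(-counts[order].std(axis=1), kind="stable")]

    fold_counts = np.zeros((k, n_cats))
    fold_of_group = np.empty(n_groups, dtype=int)
    diag = np.arange(k)
    for g in order:
        # candidates[t] is fold_counts with group g tentatively added to fold t.
        candidates = np.broadcast_to(fold_counts, (k, k, n_cats)).copy()
        candidates[diag, diag] += counts[g]
        # Mean over categories of the std across folds, for each tentative fold.
        scores = (candidates / cat_totals).std(axis=1).mean(axis=1)
        best = int(np.argmin(scores))
        fold_counts[best] += counts[g]
        fold_of_group[g] = best
    return fold_of_group[grp_idx]


labels = np.array(list("aabbccaabbccaabbcc"))  # string labels work as-is
groups = np.repeat(np.arange(9), 2)
folds = assign_groups_to_folds(labels, groups, k=3, seed=0)
print(folds)
```

Train and test indices for fold `i` can then be taken as `np.flatnonzero(folds != i)` and `np.flatnonzero(folds == i)`.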

@aisosalo

Adding to the discussion, I have a situation where I have PatientIDs as groups and PatientAge as y. What I would like is to keep medical examinations with the same PatientID in one fold, while also having stratified folds that preserve the percentage of samples for each class according to y (in my case PatientAge).

@hermidalc
Contributor

hermidalc commented Dec 17, 2021

Adding to the discussion, I have a situation where I have PatientIDs as groups and PatientAge as y. What I would like is to keep medical examinations with the same PatientID in one fold, while also having stratified folds that preserve the percentage of samples for each class according to y (in my case PatientAge).

With PatientAge as your target, wouldn't this be a regression rather than a classification problem? Otherwise, if you have subdivided PatientAge into range categories, each with a class label 0..n, then StratifiedGroupKFold will work as intended.

@aisosalo

aisosalo commented Dec 17, 2021

With PatientAge as your target, wouldn't this be a regression rather than a classification problem? Otherwise, if you have subdivided PatientAge into range categories, each with a class label 0..n, then StratifiedGroupKFold will work as intended.

My intention is to split a dataset of radiographs and to make sure samples of different age groups are represented in accordance with the original distribution.

@hermidalc
Contributor

With PatientAge as your target, wouldn't this be a regression rather than a classification problem? Otherwise, if you have subdivided PatientAge into range categories, each with a class label 0..n, then StratifiedGroupKFold will work as intended.

My intention is to split a dataset of radiographs and have samples of different age groups represented.

If I'm understanding correctly and you are converting PatientAge into classes with integer labels 0..n then this CV iterator will work as intended.
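A minimal sketch of that conversion, using `np.digitize` to turn ages into integer class labels. The ages and bin edges below are made-up assumptions for illustration, not values from this thread.

```python
import numpy as np

ages = np.array([5, 18, 39, 40, 64, 65, 90])  # hypothetical PatientAge values
bin_edges = [18, 40, 65]                      # hypothetical age-range boundaries

# Each age gets the index of its range: <18 -> 0, 18-39 -> 1, 40-64 -> 2, >=65 -> 3
age_class = np.digitize(ages, bins=bin_edges)
print(age_class)  # [0 1 1 2 2 3 3]
```

The resulting `age_class` array can then be passed as `y` to the splitter, with PatientID as `groups`.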

@aisosalo

aisosalo commented Dec 17, 2021

It would be ideal if y could be a code word (str), so that more than one categorical variable could be used in making the splits. For example, with two categories, y1 ranging over 1..9 and y2 over 1..4, the codes would be '11', '12', ..., '94', so that the default functionality can be extended by concatenating different categoricals.

@hermidalc
Contributor

It would be ideal if y could be a code word (str), so that more than one categorical variable could be used in making the splits. For example, with two categories, y1 ranging over 1..9 and y2 over 1..4, the codes would be '11', '12', ..., '94', so that the default functionality can be extended by concatenating different categoricals.

I would review https://scikit-learn.org/stable/modules/multiclass.html. Stratified CV iterators in scikit-learn work with multiclass problems. So for any such dataset, each sample should have a class label from 0..n. In your case you would simply create a new class label y = 0..n for each unique combination of y1 and y2 represented in your data?
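One way to build such a combined label, sketched with NumPy on made-up `y1` and `y2` values: each unique (y1, y2) pair that actually occurs is mapped to one integer class.

```python
import numpy as np

y1 = np.array([1, 1, 9, 9, 1])  # hypothetical first categorical (1..9)
y2 = np.array([1, 2, 4, 4, 1])  # hypothetical second categorical (1..4)

pairs = np.column_stack([y1, y2])
# Unique rows are sorted; the inverse index gives one integer label per sample.
_, y = np.unique(pairs, axis=0, return_inverse=True)
y = y.ravel()  # guard against NumPy versions that return a 2-D inverse
print(y)  # [0 1 2 2 0]
```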

@aisosalo

In your case you would simply create a new class label y = 0..n for each unique combination of y1 and y2 represented in your data?

Sure, I can and have added one additional column to the metadata.
