permutation_test_score shuffling when a group enforces a specific label #21543

Open

hermidalc opened this issue Nov 3, 2021 · 15 comments

@hermidalc (Contributor) commented Nov 3, 2021

Describe the bug

I believe there is a bug, or a misunderstanding of the design intent, in the permutation_test_score label shuffling code (_shuffle). There may be a use case for shuffling only within groups, but to me the default behavior should be to shuffle all the samples while ensuring that all samples within a group receive the same shuffled label. When every sample is in its own unique group, this is equivalent to the behavior with no groups at all, which is how I expected it to work.

Steps/Code to Reproduce

from sklearn.model_selection._validation import _shuffle
from sklearn.utils import check_random_state

y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

random_state = check_random_state(0)
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

print()

# this should work the same as if groups=None but it doesn't
groups = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

Expected Results

Labels should be shuffled even when there are label groups. Groups exist to ensure that the samples in a group stay together during CV shuffling, and I expected the same behavior here: the labels of a group should be treated like a single sample that gets shuffled as a unit alongside the other groups' labels.
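
To illustrate with a hand-worked example (not output from any existing function), treating each group as a unit:

# Per-sample data from the reproduction script above:
#   y      = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
#   groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
# Collapsed to one label per group: {1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1}
# One valid permutation reassigns those eight group labels, e.g.
#   {1: 0, 2: 1, 3: 0, 4: 1, 5: 1, 6: 0, 7: 0, 8: 1}
# which expands back to per-sample labels
#   [0, 1, 0, 1, 1, 1, 0, 0, 0, 1]
# Groups 4 and 7 (the replicated ones) still carry a single label each.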

Actual Results

No labels get shuffled:

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]


[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Versions

1.0.1 and all previous versions with permutation_test_score (I looked at the source code; it hasn't changed).

I specifically ran my test on:

System:
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
executable: /home/hermidalc/soft/miniconda3/envs/sklearn-bio-workflows-r36/bin/python
machine: Linux-5.14.14-200.fc34.x86_64-x86_64-with-glibc2.10

Python dependencies:
pip: 21.1.2
setuptools: 49.6.0.post20210108
sklearn: 0.22.2.post1
numpy: 1.21.0
scipy: 1.5.3
Cython: None
pandas: 1.2.5
matplotlib: 3.4.2
joblib: 1.0.1

Built with OpenMP: True

@hermidalc hermidalc changed the title permutation_test_score doesn't shuffle labels as expected when groups are present permutation_test_score doesn't shuffle labels as expected when there are groups Nov 3, 2021
@hermidalc hermidalc changed the title permutation_test_score doesn't shuffle labels as expected when there are groups permutation_test_score doesn't shuffle labels as expected when there are label groups Nov 3, 2021
@hermidalc (Contributor, Author) commented Nov 7, 2021

The original design intent must have been to support the use case where the samples in the same group do not all have the same class label; in that case, the only thing you can do is shuffle within groups.

But I believe this is not the most common, default use case for label groups in CV and model selection. Aren't groups most typically used to keep replicate samples together during CV? Either way, the current implementation doesn't support the use case I described, which I believe is very common.
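
For context, here is the behavior in question, paraphrased from the scikit-learn source of sklearn.model_selection._validation._shuffle (current as of 1.0; this paraphrase assumes groups is already an ndarray):

import numpy as np
from sklearn.utils import _safe_indexing

def _shuffle(y, groups, random_state):
    """Paraphrase of the current _shuffle: permute only within groups."""
    if groups is None:
        indices = random_state.permutation(len(y))
    else:
        indices = np.arange(len(groups))
        for group in np.unique(groups):
            # Permute indices only inside this group; if every member of
            # the group carries the same label, y comes back unchanged.
            this_mask = groups == group
            indices[this_mask] = random_state.permutation(indices[this_mask])
    return _safe_indexing(y, indices)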

@hermidalc (Contributor, Author) commented Nov 8, 2021

Here's an initial replacement implementation for the permutation_test_score _shuffle that constrains shuffling the way I expected label groups to, mirroring how label groups constrain the CV splitters, while still supporting the previous behavior.

import numpy as np
from sklearn.utils import check_random_state, _safe_indexing

def _shuffle(y, groups, random_state):
    """Return a shuffled copy of y."""
    if groups is None or np.unique(groups).size == len(groups):
        # No groups (or every group is a singleton): plain permutation.
        indices = random_state.permutation(len(y))
        return _safe_indexing(y, indices)
    (unique_groups, unique_groups_y), group_indices = np.unique(
        np.vstack((groups, y)), axis=1, return_inverse=True)
    if unique_groups.size != np.unique(groups).size:
        # At least one group contains more than one label: fall back to
        # the previous behavior of shuffling within groups.
        groups = np.asarray(groups)  # needed for elementwise comparison
        indices = np.arange(len(groups))
        for group in unique_groups:
            this_mask = groups == group
            indices[this_mask] = random_state.permutation(indices[this_mask])
        return _safe_indexing(y, indices)
    # Each group maps to exactly one label: permute the labels across
    # groups, then broadcast each shuffled label to every group member.
    shuffled_unique_groups_y = unique_groups_y[
        random_state.permutation(len(unique_groups_y))]
    if isinstance(y, list):
        shuffled_unique_groups_y = list(shuffled_unique_groups_y)
    return _safe_indexing(shuffled_unique_groups_y, group_indices)

y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

random_state = check_random_state(0)
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

Output below. Notice that after collapsing the group replicates there are always still four 0 labels and four 1 labels, and that groups 4 and 7 (the groups with replicate samples) consistently have all of their replicates assigned the same shuffled label.

[0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 1, 0, 1, 1, 1, 0, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 1, 0, 0, 0, 1, 0, 1, 1, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 0, 1, 1, 1, 0, 0, 1, 1, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

@glemaitre (Member) commented

Groups allow you to structure your dataset in some way. Let's take a concrete example:

  • we have data from N hospitals (these define our groups)
  • we acquire data from subjects across the different hospitals who have condition A or B

Making the CV aware of the groups prevents information from "leaking" from one hospital to another, so you actually evaluate how well your model will predict on a completely unseen hospital at test time. Enforcing the shuffling by group has the same spirit: you want to make sure the per-group target distribution stays as-is.

Imagine that hospital X has 30% condition A and 70% condition B, and hospital Y the opposite. If the shuffling is not kept within groups, you break the specific distribution observed in each original group. So it makes sense that the shuffling respects the groups.
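
As a minimal sketch of the leakage argument (toy data, assuming GroupKFold as the group-aware splitter):

import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 subjects drawn from 3 hospitals (the groups).
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
hospitals = np.array([1, 1, 2, 2, 3, 3])

# GroupKFold guarantees no hospital appears in both train and test,
# so each test fold estimates performance on an unseen hospital.
for train, test in GroupKFold(n_splits=3).split(X, y, groups=hospitals):
    assert not set(hospitals[train]) & set(hospitals[test])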

@hermidalc (Contributor, Author) commented Nov 23, 2021

> Groups allow you to structure your dataset in some way. Let's take a concrete example:
>
> • we have data from N hospitals (these define our groups)
> • we acquire data from subjects across the different hospitals who have condition A or B
>
> Making the CV aware of the groups prevents information from "leaking" from one hospital to another, so you actually evaluate how well your model will predict on a completely unseen hospital at test time. Enforcing the shuffling by group has the same spirit: you want to make sure the per-group target distribution stays as-is.
>
> Imagine that hospital X has 30% condition A and 70% condition B, and hospital Y the opposite. If the shuffling is not kept within groups, you break the specific distribution observed in each original group. So it makes sense that the shuffling respects the groups.

I apologize, but I really don't understand your rationale for closing this issue. I just showed a VERY common use case for groups where the permutation test shuffling does not work: it doesn't shuffle any labels at all.

It is very common to use the groups feature to group together repeated measures from the same subject, meaning every sample within a group has the same class label across the entire dataset. The CV splitters, for example, always keep the repeated measures from the same group together in the train/test splits. Remember that in this use case the user should also be passing per-sample weights to the estimator so that subjects with more or fewer repeated measures are weighted appropriately, as sketched below.

In this use case the permutation test should shuffle the labels across groups while making sure that all samples from the same group get the same shuffled label. So I don't understand the rationale for believing this would change the per-group target distribution: it doesn't, because with sample weighting by group you effectively end up in the same scenario as permuting labels with one sample per group (i.e., no groups).
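
A minimal sketch of that weighting (one hypothetical scheme: each subject contributes total weight 1, split evenly over its repeated measures):

import numpy as np

groups = np.array([1, 2, 3, 4, 4, 5, 6, 7, 7, 8])
_, inverse, counts = np.unique(groups, return_inverse=True, return_counts=True)
# Every repeated measure gets 1 / (its group's size), so each subject
# contributes the same total weight regardless of replicate count.
sample_weight = 1.0 / counts[inverse]
print(sample_weight)  # groups 4 and 7 (the replicated ones) get 0.5 each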

@glemaitre (Member) commented

> I apologize, but I really don't understand your rationale for closing this issue.

I was triaging issues, and it can happen that I get sloppy and read only the original summary. But no worries, we still have a button to reopen the issue :)

> shuffle all the samples while ensuring that all samples within a group receive the same shuffled label

I think this is where I was getting confused (and now I see that it was in your second post). Here, you constrain the samples of a group to have the same label. While I can see some use cases where this is true, I can also see some where this constraint is too strong (typically the use case I gave). It all comes back to the level of granularity used to define the groups.

Since your use case imposes a stronger constraint, I would argue that it is potentially not the best default, although your use case is valid as well.

However, I am not sure about the API. Changing behaviour depending on the input seems magical, but it would be necessary in the special case where a group maps specifically to one label.

ping @thomasjpfan @jnothman @ogrisel regarding this case.

@glemaitre glemaitre reopened this Nov 23, 2021
@glemaitre glemaitre changed the title permutation_test_score doesn't shuffle labels as expected when there are label groups permutation_test_score shuffling when a goup enforces a specific label Nov 23, 2021
@hermidalc (Contributor, Author) commented

Yep, I totally understand: there are other, more complex use cases where the class labels within a group are not all the same, but it is also quite common to use the groups functionality for the use case I described (repeated measures by subject).

The enhancement code in #21543 (comment) supports both. It works for my case while still supporting the old behavior of shuffling only within groups for the more complex use cases you described.

@glemaitre (Member) commented

> It works for my case while still supporting the old behavior of shuffling only within groups for the more complex use cases you described.

Yep, I saw it while answering. But I don't know whether we should only switch behaviour via a new parameter, or whether magic would be fine in this case.

@hermidalc (Contributor, Author) commented Nov 24, 2021

> > It works for my case while still supporting the old behavior of shuffling only within groups for the more complex use cases you described.
>
> Yep, I saw it while answering. But I don't know whether we should only switch behaviour via a new parameter, or whether magic would be fine in this case.

The code I added figures out automatically, with that if statement, whether you have the "repeated measures" use case (let's call it that) or not; see the sketch below.

I think it should be automatic, because if every group carries a single class label then the current behavior of shuffling within groups never shuffles any labels at all, and I would guess nobody wants that.
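
Concretely, that check boils down to comparing the number of unique (group, label) pairs with the number of unique groups. A standalone sketch (the helper name is hypothetical):

import numpy as np

def groups_are_label_pure(y, groups):
    """True when every group maps to exactly one class label."""
    n_pairs = np.unique(np.vstack((groups, y)), axis=1).shape[1]
    return n_pairs == np.unique(groups).size

print(groups_are_label_pure([1, 0, 0, 1], [1, 2, 2, 3]))  # True
print(groups_are_label_pure([1, 0, 1, 1], [1, 2, 2, 3]))  # False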

@hermidalc hermidalc changed the title permutation_test_score shuffling when a goup enforces a specific label permutation_test_score shuffling when a group enforces a specific label Nov 24, 2021
@ogrisel (Member) commented Nov 24, 2021

I am not sure what a permutation test is supposed to do in this case. I think we could improve scikit-learn to detect this pathological case and issue a UserWarning in scikit-learn 1.1, then raise a ValueError with an informative error message in scikit-learn 1.3 stating that permutation_test_score is not well-defined in this case.

It could be possible to resample random labels from the empirical prior distribution based on the relative class frequencies measured on the full dataset (sketched below), or to disregard groups and permute the labels across all folds instead of within folds, but I think those two options would stray too far from the original reference linked in the docstring.
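
For what it's worth, the first alternative could look something like this (a sketch only, not an existing scikit-learn API):

import numpy as np

rng = np.random.default_rng(0)
y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Draw a fresh label for every sample i.i.d. from the empirical class
# frequencies of the full dataset, ignoring groups entirely.
classes, counts = np.unique(y, return_counts=True)
y_resampled = rng.choice(classes, size=len(y), p=counts / counts.sum())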

I think users facing this situation should write code that does exactly what they want to achieve, because I don't think there is a single recommended way to evaluate a classifier's non-random performance with this kind of sampling structure in the data.

@hermidalc (Contributor, Author) commented Nov 24, 2021

I've used the code enhancement I wrote in #21543 (comment) successfully on real-world problems, and the permutation_test_score results are in line with what one would expect.

My issue with leaving the current implementation as-is is that if you come with the (I believe common) groups use case I've described, it doesn't shuffle any labels and permutation_test_score doesn't work. It also doesn't shuffle properly if the user happens to define groups with one sample per group, where the behavior should be the same as groups=None.

The code enhancement above fixes both issues while behaving exactly as before in every other case.

@hermidalc (Contributor, Author) commented

I have a related basic Python question about how the CV object is handled in the current permutation_test_score implementation:

# (end of permutation_test_score)
cv = check_cv(cv, y, classifier=is_classifier(estimator))
scorer = check_scoring(estimator, scoring=scoring)
random_state = check_random_state(random_state)

# We clone the estimator to make sure that all the folds are
# independent, and that it is pickle-able.
score = _permutation_test_score(
    clone(estimator), X, y, groups, cv, scorer, fit_params=fit_params
)
permutation_scores = Parallel(n_jobs=n_jobs, verbose=verbose)(
    delayed(_permutation_test_score)(
        clone(estimator),
        X,
        _shuffle(y, groups, random_state),
        groups,
        cv,
        scorer,
        fit_params=fit_params,
    )
    for _ in range(n_permutations)
)
permutation_scores = np.array(permutation_scores)
pvalue = (np.sum(permutation_scores >= score) + 1.0) / (n_permutations + 1)
return score, permutation_scores, pvalue


def _permutation_test_score(estimator, X, y, groups, cv, scorer, fit_params):
    """Auxiliary function for permutation_test_score"""
    # Adjust length of sample weights
    fit_params = fit_params if fit_params is not None else {}
    avg_score = []
    for train, test in cv.split(X, y, groups):
        X_train, y_train = _safe_split(estimator, X, y, train)
        X_test, y_test = _safe_split(estimator, X, y, test, train)
        fit_params = _check_fit_params(X, fit_params, train)
        estimator.fit(X_train, y_train, **fit_params)
        avg_score.append(scorer(estimator, X_test, y_test))
    return np.mean(avg_score)

Since the same CV object instance is passed to Parallel, and cv.split(X, y, groups) is then called inside the parallelized _permutation_test_score, I was wondering: given that objects are passed by reference in Python, why doesn't this produce inconsistent splits across the parallel processes, since the split method is called on the same CV instance? I ran a couple of tests and it turns out I'm wrong, but I don't understand why.

@glemaitre (Member) commented Nov 24, 2021 via email

@hermidalc (Contributor, Author) commented Nov 24, 2021

> Could you be explicit about what you mean by inconsistent splits?

Sorry, I wasn't clear. I really might not be understanding something about Python correctly... but I thought that when you pass mutable objects they get passed by reference, e.g.:

def func(l, n):
    l.append(n)

l = []
func(l, 3)

This appends to l outside the function. So, analogously, since the implementation runs the _permutation_test_score function in parallel, wouldn't each parallel process be calling cv.split() at the same time on the same CV object instance, and thus not produce the same train/test splits within the for loop of each parallel process? Sorry, I could be missing something obvious here...

I did some basic tests and proved myself wrong, but I don't understand why. I thought each parallel process would need its own copy of the CV instance in order to produce the same splits in its for loop as the other processes, rather than pulling the next split iteration from the same shared CV instance on top of each other. It could also be something I'm missing about how CV objects in scikit-learn operate, i.e. that they do not iterate like regular iterators.

@glemaitre (Member) commented Nov 24, 2021 via email

@hermidalc (Contributor, Author) commented

> > Thus not produce the same train/test splits within the for loop of each parallel process?
>
> Each call will produce different splits indeed. But I don't understand why that is an issue. If you really want the CV to be deterministic, you need to set the random_state of the CV object.

Right, and I'm doing that. I just thought that since you pass the same single CV object instance to the parallelized function, and each parallel process runs a cv.split() for loop on that same single instance, then (imagining a CV splitter instantiated with 10 splits) each parallel process would get only part of those 10 splits, in a race condition with the other processes. But it doesn't work like that.
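
A minimal check that demonstrates why, assuming a standard splitter such as KFold: split() is a generator factory, so each call returns a fresh, independent generator built from the splitter's parameters (and with multiprocessing, joblib pickles a copy of cv into each worker anyway):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Two consumers each get their own generator; neither steals
# iterations from the other, and both yield identical splits.
gen_a = cv.split(X)
gen_b = cv.split(X)
for (tr_a, te_a), (tr_b, te_b) in zip(gen_a, gen_b):
    assert np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)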
