permutation_test_score shuffling when a group enforces a specific label #21543
The original design intent must have been to support the use case where samples in the same group do not all have the same class label; in that case the only thing you can do is shuffle within groups. But I don't believe that is the most common, default use case for label groups in CV and model selection. Aren't groups most typically used to keep replicate samples together during CV? Either way, the current implementation doesn't support the use case I described, which I believe is very common.
Here's an initial implementation to replace the internal `_shuffle` function:

```python
import numpy as np

from sklearn.utils import check_random_state, _safe_indexing


def _shuffle(y, groups, random_state):
    """Return a shuffled copy of y"""
    if groups is None or np.unique(groups).size == len(groups):
        # No groups (or every group is a singleton): plain permutation.
        indices = random_state.permutation(len(y))
        return _safe_indexing(y, indices)
    (unique_groups, unique_groups_y), group_indices = np.unique(
        np.vstack((groups, y)), axis=1, return_inverse=True)
    if unique_groups.size != np.unique(groups).size:
        # Some group has more than one label: fall back to the current
        # behavior of shuffling within groups.
        groups = np.asarray(groups)  # ensure boolean masking works for lists
        indices = np.arange(len(groups))
        for group in unique_groups:
            this_mask = groups == group
            indices[this_mask] = random_state.permutation(indices[this_mask])
        return _safe_indexing(y, indices)
    # Each group maps to exactly one label: shuffle the per-group labels
    # and broadcast them back to the samples.
    shuffled_unique_groups_y = unique_groups_y[
        random_state.permutation(len(unique_groups_y))]
    if isinstance(y, list):
        shuffled_unique_groups_y = list(shuffled_unique_groups_y)
    return _safe_indexing(shuffled_unique_groups_y, group_indices)
```

Testing with:

```python
y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
random_state = check_random_state(0)
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")
```

Output below. You will notice there are always still four 0 and four 1 labels if you collapse the group replicates, and groups 4 and 7 (the groups that have replicate samples) consistently have each of their replicates assigned the same shuffled label.

```
[0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 1, 0, 1, 1, 1, 0, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 1, 0, 0, 0, 1, 0, 1, 1, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 0, 1, 1, 1, 0, 0, 1, 1, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
```
Groups allow structuring your dataset in some way. Let's give a concrete example: suppose each group corresponds to a hospital.
Making CV aware of the groups lets you avoid "leaking" information from one hospital to another, so you actually evaluate how well your model will predict on a completely unseen hospital at test time. Enforcing the shuffling by group has the same spirit: you want to keep the target distribution within each group as-is. Imagine that hospital X has 30% of condition A and 70% of condition B, and the opposite for hospital Y.
I apologize, but I really don't understand your rationale for closing this issue. I just showed you a very common use case for groups where the permutation test shuffling does not work: it doesn't shuffle any labels. It is very common to use the groups feature to group together repeated measures from the same subject, meaning each sample within a group has the same class label across the entire dataset; the CV splitters then always keep the repeated measures from the same group together in train/test splits. In this use case the user would also be passing sample weights to the estimator, so that subjects with more or fewer repeated measures are weighted appropriately. Here the permutation test should shuffle the labels across groups, making sure all samples from the same group get the same shuffled label. So in the end I'm not understanding your rationale for believing this would change the target distribution by group. It doesn't, since with sample weighting by group you effectively end up permuting labels as if there were one sample per group (i.e., no groups).
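To make the sample-weighting point concrete, here is a hedged sketch (my own illustration, not anything scikit-learn prescribes) of one common scheme: weight each sample by the inverse of its group size, so every subject contributes the same total weight regardless of how many repeated measures it has.

```python
import numpy as np

# Hypothetical weighting scheme (an assumption for illustration, not a
# scikit-learn API): each sample gets weight 1 / (size of its group).
groups = np.array([1, 2, 3, 4, 4, 5, 6, 7, 7, 8])
_, inverse, counts = np.unique(groups, return_inverse=True,
                               return_counts=True)
sample_weight = 1.0 / counts[inverse]
print(sample_weight.tolist())
# [1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0]
```

With these weights, the duplicated groups 4 and 7 each contribute a total weight of 1.0, the same as every singleton group.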
I was triaging issues, and it can happen that I get sloppy and read only the original summary. But no worries, we still have a button to reopen the issue :)
I think this is where I am getting confused (and now I see that it was your second post). Here, it seems that you constrain the samples of a group to have the same label. While I can see some use cases where this is true, I can also see some use cases where this constraint is too strong (typically the use case that I gave). It all comes back to the level of granularity used to define the groups. Since your use case imposes a stronger constraint, I would argue that it is potentially not the best default. However, I am not sure about the API. Changing behaviour depending on the input seems magical, but it would be necessary in the special case where each group belongs to a specific label. ping @thomasjpfan @jnothman @ogrisel regarding this case.
Yep, I totally understand; there are other more complex use cases where the class labels within a group are not all the same. But it is also quite common to use the groups functionality for the use case I described (repeated measures by subject). The enhancement code in #21543 (comment) supports both. It works for me while still supporting the old behavior of shuffling only within groups for the more complex use cases you described.
Yep, I saw it while answering. But I don't know whether we should switch behaviour only via a new parameter, or whether magic would be fine in this case.
The code I added automatically figures out, with that if statement, whether you have what we might call the "repeated measures" use case or not. I think it should be automatic, because if every group has a single class label, the current behavior of shuffling within groups never shuffles any labels at all, and I would guess no one wants that.
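The detection condition in that if statement can be stated on its own: every group maps to exactly one label if and only if the number of distinct (group, label) pairs equals the number of distinct groups. A small standalone sketch (the helper name is hypothetical, introduced here for illustration):

```python
import numpy as np

# Hypothetical helper: True iff each group carries exactly one label,
# i.e. the count of distinct (group, label) pairs equals the count of
# distinct groups.
def groups_determine_labels(y, groups):
    pairs = np.unique(np.vstack((groups, y)), axis=1)
    return pairs.shape[1] == np.unique(groups).size

print(groups_determine_labels([1, 0, 0, 1, 1, 0, 1, 0, 0, 1],
                              [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]))  # True
print(groups_determine_labels([1, 0], [1, 1]))  # False: group 1 has 2 labels
```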
I am not sure what a permutation test is supposed to do in this case. I think we could improve scikit-learn to detect this pathological case and issue a warning. It could be possible to resample random labels from the empirical prior distribution based on the relative class frequencies measured on the full dataset, or to disregard groups when doing the label permutation across all folds instead of within folds, but I think those two options would be too far away from the original reference linked in the docstring. I think users facing this situation should write code to do exactly what they want to achieve, because I don't think there is a single recommended way to evaluate a classifier's non-random performance with this kind of sampling structure in the data.
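For illustration, the first alternative mentioned above, resampling labels from the empirical prior, could look roughly like this. This is a sketch of the idea only, not a scikit-learn feature, and as noted it departs from the permutation-test procedure in the referenced paper:

```python
import numpy as np

# Sketch of the "resample from the empirical prior" idea: draw each
# pseudo-label i.i.d. with probabilities equal to the observed class
# frequencies, instead of permuting y. Unlike a permutation, the exact
# class counts are no longer preserved in each draw.
rng = np.random.RandomState(0)
y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
classes, counts = np.unique(y, return_counts=True)
resampled = rng.choice(classes, size=len(y), p=counts / counts.sum())
```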
I've used the code enhancement I wrote in #21543 (comment) successfully in real-world problems. The issue I have with doing nothing about the current implementation is that if you come with the (what I believe is a common) groups use case I've described, it doesn't shuffle any labels at all. The code enhancement above fixes this while behaving the same way as before in every other case.
I have a related basic Python question regarding scikit-learn/sklearn/model_selection/_validation.py, lines 1300 to 1337 at f8f77b4.
Since the same CV object instance is being passed to the parallelized function, wouldn't the parallel processes produce inconsistent splits?
Could you be explicit about what you mean by inconsistent splits?
Sorry, I wasn't clear. I really might not be understanding something about Python correctly... but I thought that when you pass mutable objects they get passed by reference, e.g.

```python
def func(l, n):
    l.append(n)

l = []
func(l, 3)
```

This appends to `l` in the caller's scope. I did some basic tests and proved that I'm not correct here, but I don't understand why, as I thought each parallel process would need its own copy of the CV instance so that it produces the same splits in the for loop across parallel processes, rather than the processes pulling the next split iteration from the same CV instance on top of each other. It could be something I'm missing about how CV objects in scikit-learn operate, and they do not iterate like regular iterators.
Thus not producing the same train/test splits on the data within the for loop of each parallel process?
Each call will produce different splits indeed. But I don't understand why it is an issue. If you really want the CV to be deterministic, you need to set the `random_state` of the CV object.
Right, and I'm doing that. I just thought that since you are passing the same single CV object instance to the parallelized function, and in that function each parallel process is doing a `cv.split` call, the processes might interfere with each other.
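The confusion above can be checked directly: scikit-learn CV splitters are not consumable iterators. Each call to `split()` builds a brand-new generator, and the splitter object itself holds no iteration state, so two calls on the same instance (with a fixed `random_state`) yield identical splits rather than consuming each other:

```python
import numpy as np
from sklearn.model_selection import KFold

# Two independent calls to split() on the same splitter instance.
# With a fixed random_state the shuffling is re-seeded on every call,
# so both passes enumerate exactly the same folds.
X = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
splits_a = [test.tolist() for _, test in cv.split(X)]
splits_b = [test.tolist() for _, test in cv.split(X)]
print(splits_a == splits_b)  # True
```

With `random_state=None` and `shuffle=True`, by contrast, each call would produce different folds, which matches the advice above to set `random_state` for determinism.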
Describe the bug
I believe there is a bug, or a design-expectation misunderstanding, with the `permutation_test_score` label shuffle function code. Maybe there is a use case for shuffling only within groups, but to me the default behavior should be to shuffle all the samples while ensuring all samples within a group get the same shuffled label. This would be equivalent to the behavior if no groups existed, i.e. if each sample were in its own unique group, which is the way I expected it to work.
Steps/Code to Reproduce
Expected Results
Labels should be shuffled even when there are label groups. Label groups are there to make sure all labels in a group simply stay together during CV shuffling, and I expected that behavior here too. The labels for a group should be treated like a single sample, shuffled together as a unit with the other groups' labels.
Actual Results
No labels get shuffled:
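A minimal sketch (mimicking, not copying, the current within-group shuffle) shows why: when every group carries a single label, permuting indices within each group can never change `y`.

```python
import numpy as np

# Sketch of within-group label shuffling (my own rewording of the
# behavior described in this issue, not sklearn's exact code): indices
# are only permuted among samples belonging to the same group.
def _shuffle_within_groups(y, groups, rng):
    y = np.asarray(y)
    groups = np.asarray(groups)
    indices = np.arange(len(groups))
    for group in np.unique(groups):
        mask = groups == group
        indices[mask] = rng.permutation(indices[mask])
    return y[indices]

y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
groups = np.array([1, 2, 3, 4, 4, 5, 6, 7, 7, 8])
rng = np.random.RandomState(0)
shuffled = _shuffle_within_groups(y, groups, rng)
# Each group has one label, so within-group permutation is a no-op:
print((shuffled == y).all())  # True
```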
Versions
1.0.1 and all previous versions with `permutation_test_score` (I looked at the source code; it hasn't changed). I specifically ran my test on:
System:
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
executable: /home/hermidalc/soft/miniconda3/envs/sklearn-bio-workflows-r36/bin/python
machine: Linux-5.14.14-200.fc34.x86_64-x86_64-with-glibc2.10
Python dependencies:
pip: 21.1.2
setuptools: 49.6.0.post20210108
sklearn: 0.22.2.post1
numpy: 1.21.0
scipy: 1.5.3
Cython: None
pandas: 1.2.5
matplotlib: 3.4.2
joblib: 1.0.1
Built with OpenMP: True