[TO_REVIEW] Add automatic target label masking to prevent data leakage #330
base: main
…a da_pipeline with SelectSourceTarget
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #330      +/-   ##
==========================================
- Coverage   96.41%   88.77%   -7.65%
==========================================
  Files          63       63
  Lines        6919     7020     +101
==========================================
- Hits         6671     6232     -439
- Misses        248      788     +540
@@ -263,6 +263,7 @@
 PCA(n_components=2),
 SelectSource(SVC()),
 default_selector=SelectSourceTarget,
+mask_target_labels=False,
Why is it false here?
I don't understand why SelectSourceTarget is used here.
When you do that, you have one PCA for source and one for target, but the SVC is trained only on source.
So we should be able to mask target labels with SelectSourceTarget, no? We don't want data leakage even if we have one PCA for source and one for target.
skada/_utils.py
Outdated
    unmasked_idx = y != _DEFAULT_MASKED_TARGET_CLASSIFICATION_LABEL
elif y_type == Y_Type.CONTINUOUS:
    unmasked_idx = np.isfinite(y)
if "sample_domain" in params:
With these two lines, we rule out semi-supervised DA. I think it's a leftover from an earlier version, no?
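For readers following the thread, the mask computed in the `skada/_utils.py` snippet can be sketched as a standalone function. This is only an illustration, not the skada implementation: the helper name `find_unmasked` is hypothetical, the sentinel value `-1` is an assumed value for `_DEFAULT_MASKED_TARGET_CLASSIFICATION_LABEL`, and continuous targets are assumed to be masked with non-finite values, which is what the `np.isfinite(y)` check suggests.

```python
import numpy as np

# Assumed sentinel value; the real constant is not shown in this excerpt.
MASKED_CLASSIFICATION_LABEL = -1


def find_unmasked(y, continuous=False):
    """Return a boolean array that is True where a label is available.

    Hypothetical helper mirroring the two branches quoted in the diff.
    """
    y = np.asarray(y)
    if continuous:
        # Regression: masked labels are assumed to be NaN (or +/-inf).
        return np.isfinite(y)
    # Classification: masked labels carry the sentinel value.
    return y != MASKED_CLASSIFICATION_LABEL


y_clf = np.array([0, 1, -1, -1, 1])
y_reg = np.array([0.5, np.nan, 2.0])
unmasked_clf = find_unmasked(y_clf)
unmasked_reg = find_unmasked(y_reg, continuous=True)
```

In a semi-supervised setting some target samples would keep their labels, so a mask like this one, rather than the domain indicator alone, is what tells labeled and unlabeled samples apart.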
@@ -307,7 +308,11 @@ def predict(self, X, sample_weight=None):
 assert sample_weight is None
 return X


-clf = make_da_pipeline(DensityReweightAdapter(), mediator, FakeEstimator())
+clf = make_da_pipeline(
+    Shared(DensityReweightAdapter(), mask_target_labels=False),
Why is it false here?
If we mask the target labels, it breaks when we use SelectTarget for the standard scaler. That means that with the SelectTarget selector, the source domain is not propagated through the pipeline :/
Something to fix in another issue, I think.
This PR introduces a mechanism to automatically mask target labels in unsupervised domain adaptation settings. This feature prevents data leakage from the target domain when the estimators are fitted.
Key Changes:
- Automatic Label Masking: A new `_auto_mask_target_labels` method has been added to automatically replace target labels with a default masked value before they are passed to the estimators. This is enabled by default to ensure that no data leakage can occur.
- Control via mask_target_labels parameter: The masking behavior can be controlled with the `mask_target_labels` parameter in `make_da_pipeline` and the selectors (`Shared`, `PerDomain`, etc.).
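
The masking step described above can be illustrated with a small NumPy sketch. It is not the code added by this PR: the function name `mask_target_labels_sketch` is hypothetical, and it assumes that target samples are marked by negative values in `sample_domain`, that the classification sentinel is `-1`, and that regression targets are masked with NaN.

```python
import numpy as np

# Assumed sentinel for masked classification labels (not confirmed by this PR excerpt).
MASKED_CLASSIFICATION_LABEL = -1


def mask_target_labels_sketch(y, sample_domain, continuous=False):
    """Return a copy of y in which every target-domain label is masked.

    Assumption: negative entries in sample_domain mark target samples.
    """
    y = np.asarray(y)
    sample_domain = np.asarray(sample_domain)
    is_target = sample_domain < 0
    if continuous:
        # astype returns a new float array, so the input is left untouched.
        y_masked = y.astype(float)
        y_masked[is_target] = np.nan
    else:
        y_masked = y.copy()
        y_masked[is_target] = MASKED_CLASSIFICATION_LABEL
    return y_masked


y = np.array([0, 1, 1, 0])
sample_domain = np.array([1, 1, -2, -2])  # two source samples, two target samples
y_masked = mask_target_labels_sketch(y, sample_domain)
y_reg_masked = mask_target_labels_sketch([0.5, 1.5], [1, -1], continuous=True)
```

Returning a copy keeps the caller's labels intact, which matters if the true target labels are still needed later for scoring.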