[TO_REVIEW] Add automatic target label masking to prevent data leakage #330

Open
wants to merge 13 commits into
base: main
from

Conversation

YanisLalou
Collaborator

This PR introduces a mechanism to automatically mask target labels in unsupervised domain adaptation settings. This feature prevents data leakage from the target domain when the estimators are fitted.

Key Changes:

  • Automatic Label Masking: A new _auto_mask_target_labels method has been added to automatically replace target labels with a default masked value before they are passed to the estimators. This is enabled by default to ensure that no data leakage can occur.

  • Control via mask_target_labels parameter: The masking behavior can be controlled with the mask_target_labels parameter in make_da_pipeline and the selectors (Shared, PerDomain, etc.).
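
To make the masking step concrete, here is a small NumPy sketch. It is not the `_auto_mask_target_labels` implementation from this PR. The helper name `mask_target_labels_sketch`, the convention that target samples carry a negative `sample_domain`, and the masked values (`-1` for classification, `NaN` for regression) are all assumptions made for illustration.

```python
import numpy as np

# Assumed sentinel values, chosen for this sketch only.
MASKED_CLASSIFICATION_LABEL = -1
MASKED_REGRESSION_LABEL = np.nan


def mask_target_labels_sketch(y, sample_domain, regression=False):
    """Return a copy of y in which the labels of target samples are masked.

    Assumes source samples have sample_domain > 0 and target samples
    have sample_domain < 0 (an assumption of this sketch).
    """
    y = np.asarray(y)
    sample_domain = np.asarray(sample_domain)
    is_target = sample_domain < 0
    if regression:
        # astype returns a new array, so the caller's y is left untouched.
        y_masked = y.astype(float)
        y_masked[is_target] = MASKED_REGRESSION_LABEL
    else:
        y_masked = y.copy()
        y_masked[is_target] = MASKED_CLASSIFICATION_LABEL
    return y_masked


# Three source samples followed by two target samples.
y = np.array([0, 1, 1, 0, 1])
sample_domain = np.array([1, 1, 1, -2, -2])
y_masked = mask_target_labels_sketch(y, sample_domain)
```

With masking applied before `fit`, an estimator never sees the true target labels, even when the caller passes them in by mistake.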

codecov bot commented Jun 25, 2025

Codecov Report

Attention: Patch coverage is 98.21429% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.77%. Comparing base (7880eb1) to head (91a8925).

❗ There is a different number of reports uploaded between BASE (7880eb1) and HEAD (91a8925).

HEAD has 1 upload less than BASE

| Flag | BASE (7880eb1) | HEAD (91a8925) |
|------|----------------|----------------|
|      | 2              | 1              |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #330      +/-   ##
==========================================
- Coverage   96.41%   88.77%   -7.65%     
==========================================
  Files          63       63              
  Lines        6919     7020     +101     
==========================================
- Hits         6671     6232     -439     
- Misses        248      788     +540     

@YanisLalou YanisLalou changed the title [WIP] Add automatic target label masking to prevent data leakage [TO_REVIEW] Add automatic target label masking to prevent data leakage Jun 25, 2025
@@ -263,6 +263,7 @@
PCA(n_components=2),
SelectSource(SVC()),
default_selector=SelectSourceTarget,
mask_target_labels=False,
Collaborator

Why is it false here?

Collaborator

I don't get why SelectSourceTarget is used here.

Collaborator

When you do that, you have one PCA for source and one for target, but SVC is trained only on source.

Collaborator

So we should be able to mask target labels with SelectSourceTarget, no? We don't want data leakage even if we have one PCA for source and one for target.

skada/_utils.py Outdated
unmasked_idx = y != _DEFAULT_MASKED_TARGET_CLASSIFICATION_LABEL
elif y_type == Y_Type.CONTINUOUS:
unmasked_idx = np.isfinite(y)
if "sample_domain" in params:
Collaborator

With these two lines, we rule out semi-supervised DA. I think it's a leftover from an earlier version, no?
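
For readers following the excerpt, a minimal sketch of how an "unmasked" index can be derived from the label values alone. It assumes `-1` as the masked classification label and uses a hypothetical `is_continuous` flag in place of the `Y_Type` check. It illustrates the idea and is not the code in `skada/_utils.py`.

```python
import numpy as np

MASKED_CLASSIFICATION_LABEL = -1  # assumed value for this sketch


def unmasked_index_sketch(y, is_continuous):
    """Boolean mask of samples whose label is still usable."""
    y = np.asarray(y)
    if is_continuous:
        # Regression: masked labels are assumed to be NaN (or inf).
        return np.isfinite(y)
    # Classification: masked labels equal the sentinel value.
    return y != MASKED_CLASSIFICATION_LABEL


y_clf = np.array([0, 1, -1, -1, 1])
idx_clf = unmasked_index_sketch(y_clf, is_continuous=False)

y_reg = np.array([0.3, np.nan, 1.2])
idx_reg = unmasked_index_sketch(y_reg, is_continuous=True)
```

Filtering on the label value alone keeps any target sample whose label was left unmasked, which is the case a semi-supervised setting relies on.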

@@ -307,7 +308,11 @@ def predict(self, X, sample_weight=None):
assert sample_weight is None
return X

clf = make_da_pipeline(DensityReweightAdapter(), mediator, FakeEstimator())
clf = make_da_pipeline(
Shared(DensityReweightAdapter(), mask_target_labels=False),
Collaborator

Why is it false here?

Collaborator

If we mask the target labels, it breaks when we use SelectTarget for the standard scaler. That means that with the SelectTarget selector, the source domain is not propagated through the pipeline :/

Collaborator

Something to fix in another issue, I think.


4 participants