# Train/Test Split & Data Leakage — Forensic Engineering Study

This notebook intentionally:

- Breaks ML workflows
- Quantifies metric inflation
- Inspects internal model state
- Demonstrates subtle target leakage

Goal:
Develop leakage detection instincts, not just API fluency.


# The Golden Rule

The moment you split:

$$
\text{Train} \;|\; \text{Test}
$$

A wall is created.

No `fit()` operation may cross that wall.

Allowed:
    fit(train)
    transform(train)
    transform(test)
    predict(test)

Forbidden:
    fit(test)
    fit(full_data_before_split)

Once violated, evaluation is contaminated.
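
One way to make the wall hard to cross by accident is to bundle the preprocessing and the estimator into a scikit-learn pipeline, so that `fit` only ever receives the training rows. The following is a minimal sketch under stated assumptions: the synthetic data and the names `X_demo`, `pipe`, and `fitted_mean` are illustrative and not part of this notebook's experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=0
)

# Split first: everything after this line treats the test rows as unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler and the model on the training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# score() transforms and predicts on the test rows; nothing is refit.
test_accuracy = pipe.score(X_te, y_te)

# The scaler's learned mean should match the training mean.
fitted_mean = pipe.named_steps["standardscaler"].mean_
print(np.allclose(fitted_mean, X_tr.mean(axis=0)))
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call that could be pointed at the wrong data by mistake.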


## Visualization of the Isolation Principle

### Correct Flow: The Gold Standard
In this scenario, the preprocessing parameters (such as the mean and standard deviation used for scaling) are learned **only** from the training set, which prevents the model from "seeing into the future."

```text
        Raw Data
            |
          Split
         /     \
     Train      Test
       |          |
     fit()     (no fit)
       |          |
 transform()  transform()
```

### Incorrect Flow: Leakage

```text
        Raw Data
            |
          fit()   <-- ERROR: Crosses future boundary
            |
          Split
```

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

np.random.seed(42)


In [2]:
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["target"] = y

df.head()


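|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | target |
|---|-----------|-----------|-----------|-----------|-----------|--------|
| 0 | -0.0653 | -0.717214 | 0.393952 | -0.934473 | 1.681514 | 0 |
| 1 | 0.567015 | -0.044606 | 1.612851 | -1.350174 | 2.488878 | 0 |
| 2 | -0.247215 | -0.650569 | -0.7435 | -1.21419 | 0.84111 | 0 |
| 3 | 1.14587 | 0.974224 | 1.562506 | -2.27701 | 2.276521 | 1 |
| 4 | 0.599605 | -0.427545 | 2.374472 | -1.50351 | 3.604959 | 0 |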


# Experiment 1 — Proper Isolation

1. Split first.
2. Fit scaler only on training data.
3. Transform test data without fitting.
4. Train model.
5. Evaluate honestly.


In [3]:
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [4]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [5]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

correct_accuracy = accuracy_score(y_test, y_pred)
correct_accuracy


0.925

# Experiment 2 — Scaling BEFORE Splitting (WRONG)

Here we:

1. Fit scaler on full dataset.
2. Then split.
3. Then train.

This allows test distribution statistics to influence training.


In [6]:
scaler_leak = StandardScaler()

X_scaled_full = scaler_leak.fit_transform(X)


In [7]:
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(
    X_scaled_full, y, test_size=0.2, random_state=42
)


In [8]:
model_leak = LogisticRegression()
model_leak.fit(X_train_l, y_train_l)

y_pred_leak = model_leak.predict(X_test_l)

leak_accuracy = accuracy_score(y_test_l, y_pred_leak)
leak_accuracy


0.925

In [9]:
pd.DataFrame({
    "Correct_Scaler_Mean": scaler.mean_,
    "Leaked_Scaler_Mean": scaler_leak.mean_,
    "Difference": scaler_leak.mean_ - scaler.mean_
})


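|   | Correct_Scaler_Mean | Leaked_Scaler_Mean | Difference |
|---|---------------------|--------------------|------------|
| 0 | 0.188897 | 0.195463 | 0.006566 |
| 1 | -0.008599 | 0.009486 | 0.018086 |
| 2 | 0.464861 | 0.48206 | 0.017199 |
| 3 | -0.532723 | -0.501864 | 0.030859 |
| 4 | 0.823838 | 0.799633 | -0.024205 |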


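The means differ because the leaked scaler saw the 200 test rows
that the correct scaler did not.

That difference is the test-set mean pulling the fitted statistics toward itself:

$$
\mu_{\text{full}} - \mu_{\text{train}} = \frac{n_{\text{test}}}{n}\left(\mu_{\text{test}} - \mu_{\text{train}}\right)
$$

Here both workflows score 0.925, so the leak leaves no trace in the metric.
The contamination is still real; it is visible only in the scaler's internal state.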
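
The gap between the two scalers' means can be checked numerically. Since the full-data mean is a weighted average of the train and test means, the shift should equal the test fraction times the difference between the test mean and the train mean. The sketch below verifies this on hypothetical random data; the array names and the 800/200 split are assumptions chosen to mirror this notebook's 80/20 split, not values taken from its experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 rows, 5 features, split 80/20 by a random permutation.
data = rng.normal(loc=1.0, scale=2.0, size=(1000, 5))
order = rng.permutation(len(data))
train_rows, test_rows = data[order[:800]], data[order[800:]]

mu_train = train_rows.mean(axis=0)
mu_test = test_rows.mean(axis=0)
mu_full = data.mean(axis=0)

# Shift in the scaler mean caused by fitting on all rows instead of train only.
observed_shift = mu_full - mu_train

# The same shift, predicted from the test-set mean and the test fraction.
test_fraction = len(test_rows) / len(data)
predicted_shift = test_fraction * (mu_test - mu_train)

print(np.allclose(observed_shift, predicted_shift))
```

With a 20% test set, only one fifth of the train/test mean gap reaches the scaler, which is one reason this kind of leak can be hard to see in a single accuracy number.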


# Experiment 3 — Subtle Target Leakage

Instead of directly adding the target,
we create a noisy proxy:

$$
\text{proxy} = y + \epsilon
$$

where:

$$
\epsilon \sim \mathcal{N}(0,\, 0.01^2)
$$

This simulates real-world leakage
where a feature is strongly correlated
with the outcome due to post-event information.


In [10]:
df_leaky = df.copy()

df_leaky["proxy_target"] = df_leaky["target"] + np.random.normal(
    0, 0.01, size=len(df_leaky)
)

df_leaky.head()


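|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | target | proxy_target |
|---|-----------|-----------|-----------|-----------|-----------|--------|--------------|
| 0 | -0.0653 | -0.717214 | 0.393952 | -0.934473 | 1.681514 | 0 | 0.004967 |
| 1 | 0.567015 | -0.044606 | 1.612851 | -1.350174 | 2.488878 | 0 | -0.001383 |
| 2 | -0.247215 | -0.650569 | -0.7435 | -1.21419 | 0.84111 | 0 | 0.006477 |
| 3 | 1.14587 | 0.974224 | 1.562506 | -2.27701 | 2.276521 | 1 | 1.01523 |
| 4 | 0.599605 | -0.427545 | 2.374472 | -1.50351 | 3.604959 | 0 | -0.002342 |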


In [11]:
X_leak = df_leaky.drop(columns="target")
y_leak = df_leaky["target"]

X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(
    X_leak, y_leak, test_size=0.2, random_state=42
)

model_target_leak = LogisticRegression()
model_target_leak.fit(X_train_t, y_train_t)

y_pred_t = model_target_leak.predict(X_test_t)

target_leak_accuracy = accuracy_score(y_test_t, y_pred_t)
target_leak_accuracy


1.0

Accuracy jumps from 0.925 to 1.0,
so the model appears exceptional.

But in reality:
- The proxy encodes outcome information.
- That information would not exist at prediction time.

Professional engineers treat
"too good to be true"
as a leakage red flag.
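
A cheap first check when a score looks too good is to screen each feature against the target on its own. The sketch below flags any feature whose absolute Pearson correlation with the target exceeds a cut-off. Everything in it is an assumption for illustration: the synthetic data, the feature names, and the 0.95 threshold, which is arbitrary rather than a standard. In practice such a screen would be run on the training split only, and a flagged feature still needs a human to ask whether it would exist at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000

# Hypothetical binary target and three honest, weakly informative features.
target = rng.integers(0, 2, size=n_rows)
honest = rng.normal(size=(n_rows, 3)) + 0.5 * target[:, None]

# A leaky proxy built like the one in this experiment: target plus tiny noise.
proxy = target + rng.normal(0, 0.01, size=n_rows)

features = np.column_stack([honest, proxy])
names = ["honest_0", "honest_1", "honest_2", "proxy_target"]

# Absolute Pearson correlation of each feature with the target.
abs_corr = np.array(
    [abs(np.corrcoef(features[:, j], target)[0, 1]) for j in range(features.shape[1])]
)

# The 0.95 cut-off is an arbitrary illustrative threshold, not a standard.
suspects = [name for name, c in zip(names, abs_corr) if c > 0.95]
print(suspects)
```

A correlation screen only catches leaks that are close to linear in a single feature; subtler leaks need domain knowledge about when each column is recorded.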


# Metric Comparison Summary

Correct Workflow Accuracy:
$$
0.925
$$

Preprocessing Leakage Accuracy:
$$
0.925
$$

Target Leakage Accuracy:
$$
1.0
$$

High accuracy does not imply correctness.
Isolation discipline determines validity.


# Hard Rules Going Forward

1. Split immediately after loading data.
2. Never fit on the full dataset.
3. Transformers fit ONLY on training data.
4. Treat suspiciously high metrics as leakage until proven otherwise.
5. Inspect internal state when in doubt.
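
Rule 5 can be turned into a quick programmatic check. The sketch below compares a fitted `StandardScaler`'s stored state (`mean_` and `n_samples_seen_`) against the data it was supposed to be fitted on. The helper `fitted_only_on` is hypothetical and written for this illustration, and it assumes dense numeric data with no missing values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def fitted_only_on(scaler, X_reference):
    """Hypothetical helper: True if the scaler's state matches X_reference."""
    X_reference = np.asarray(X_reference, dtype=float)
    # Row count the scaler saw during fit versus the reference row count.
    same_count = int(scaler.n_samples_seen_) == X_reference.shape[0]
    # Stored per-feature mean versus the reference per-feature mean.
    same_mean = np.allclose(scaler.mean_, X_reference.mean(axis=0))
    return bool(same_count and same_mean)


rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 3))
X_train_part, X_test_part = X_all[:160], X_all[160:]

clean_scaler = StandardScaler().fit(X_train_part)  # fitted on train rows only
leaky_scaler = StandardScaler().fit(X_all)         # fitted before splitting

print(fitted_only_on(clean_scaler, X_train_part))
print(fitted_only_on(leaky_scaler, X_train_part))
```

A check like this could sit in a test suite as an assertion, so that a refactor that moves the `fit` call above the split fails loudly instead of silently.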

Leakage rarely throws errors.
It rewards you with false confidence.
