# 03 - Target vs. Features, Leakage, and Data Quality

---

## Learning Objectives

By the end of this notebook you will be able to:

- Define **feature**, **target**, **label**, **predictor**, and **response variable**
- Explain what **data leakage** is and identify its two main types
- Demonstrate the effect of leakage on model performance with code
- Apply correct preprocessing patterns (fit on train, transform on test)
- Identify common **data quality** issues: missing values, duplicates, inconsistent types
- Understand the concept of **generalization**

---

## Prerequisites

- Completed **Notebook 01** (train/test splits) and **Notebook 02** (cross-validation)
- Basic NumPy, Pandas, and scikit-learn usage

---

## Table of Contents

1. [Terminology: Features, Targets, and Friends](#1-terminology-features-targets-and-friends)
2. [What Is Data Leakage?](#2-what-is-data-leakage)
3. [Leakage Demo: Scaling Before vs. After Split](#3-leakage-demo-scaling-before-vs-after-split)
4. [More Leakage Examples](#4-more-leakage-examples)
5. [Generalization: The Big Picture](#5-generalization-the-big-picture)
6. [Data Quality Essentials](#6-data-quality-essentials)
7. [Common Mistakes](#7-common-mistakes)
8. [Exercise](#8-exercise)

---

## 1. Terminology: Features, Targets, and Friends

Machine learning literature uses many names for the same concepts. Here is a reference table:

| ML Term | Statistics Term | scikit-learn Convention | Description |
|---------|----------------|------------------------|-------------|
| **Feature** | Predictor / Independent variable / Covariate | `X` (capital) | Input columns the model uses to make predictions |
| **Target** | Response / Dependent variable / Outcome | `y` (lowercase) | The value we want to predict |
| **Label** | (same as target in classification) | `y` | Categorical target in classification tasks |
| **Sample** | Observation / Data point | One row of `X` | A single example |

In a supervised learning setting:

$$
y = f(X) + \epsilon
$$

- $X \in \mathbb{R}^{n \times p}$ -- feature matrix ($n$ samples, $p$ features)
- $y \in \mathbb{R}^{n}$ -- target vector
- $f$ -- the unknown function we are trying to learn; a fitted model $\hat{f}$ produces predictions $\hat{y} = \hat{f}(X)$
- $\epsilon$ -- irreducible noise
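
To make the notation concrete, here is a minimal sketch on simulated data where the true function is known. The linear form of $f$, the coefficient values in `true_coef`, and the noise scale are assumptions chosen for illustration; they are not part of the dataset used in the rest of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng_sim = np.random.default_rng(0)

# Assumed "true" function: f(X) = X @ true_coef (hypothetical values)
n_sim, p_sim = 200, 3
true_coef = np.array([2.0, -1.0, 0.5])

X_sim = rng_sim.normal(size=(n_sim, p_sim))    # feature matrix, shape (n, p)
noise = rng_sim.normal(scale=0.1, size=n_sim)  # irreducible noise (epsilon)
y_sim = X_sim @ true_coef + noise              # observed target: y = f(X) + epsilon

# The fitted model is an estimate of f; its outputs are the predictions y_hat
f_hat = LinearRegression().fit(X_sim, y_sim)
y_hat = f_hat.predict(X_sim)

print("Estimated coefficients:", f_hat.coef_.round(2))
print("Residual std (should be near the noise scale):", (y_sim - y_hat).std().round(3))
```

The estimated coefficients land close to `true_coef`, while the residuals stay at roughly the noise scale: that remaining error is the part no model can remove.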

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create a synthetic dataset we will use throughout
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=8,
    n_redundant=4,
    n_classes=2,
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["target"] = y

print("Dataset shape:", df.shape)
df.head()

In [None]:
# Clearly separate features (X) from target (y)
feature_cols = [c for c in df.columns if c != "target"]
X = df[feature_cols].values
y = df["target"].values

print(f"Features (X): {X.shape}  -- {len(feature_cols)} predictors")
print(f"Target   (y): {y.shape}  -- binary classification")

---

## 2. What Is Data Leakage?

**Data leakage** occurs when a model is built using information that would not be available at prediction time, such as data from the test set or values derived from the target. This leads to **overly optimistic** performance estimates that do not hold up in production.

### Two main types

| Type | Description | Example |
|------|-------------|---------|
| **Target leakage** | A feature contains information that is derived from or strongly correlated with the target *and would not be available at prediction time* | Using "treatment outcome" to predict "disease diagnosis" |
| **Train-test contamination** | Information from the test set leaks into the training process | Fitting a scaler on the full dataset, then splitting |

---

## 3. Leakage Demo: Scaling Before vs. After Split

This is one of the most common forms of train-test contamination. Let's measure the impact.

### 3a. WRONG: Scale all data, then split

In [None]:
# ---- WRONG approach: scale everything first ----
scaler_wrong = StandardScaler()
X_scaled_all = scaler_wrong.fit_transform(X)  # fit on ALL data (leakage!)

X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_scaled_all, y, test_size=0.20, random_state=42, stratify=y
)

model_wrong = LogisticRegression(max_iter=1000, random_state=42)
model_wrong.fit(X_train_w, y_train_w)
score_wrong = model_wrong.score(X_test_w, y_test_w)

print(f"WRONG (scale-then-split) test accuracy: {score_wrong:.4f}")

### 3b. CORRECT: Split first, then scale

In [None]:
# ---- CORRECT approach: split first, then fit scaler on train only ----
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

scaler_correct = StandardScaler()
X_train_c_scaled = scaler_correct.fit_transform(X_train_c)  # fit on TRAIN only
X_test_c_scaled = scaler_correct.transform(X_test_c)        # transform test

model_correct = LogisticRegression(max_iter=1000, random_state=42)
model_correct.fit(X_train_c_scaled, y_train_c)
score_correct = model_correct.score(X_test_c_scaled, y_test_c)

print(f"CORRECT (split-then-scale) test accuracy: {score_correct:.4f}")

In [None]:
print(f"\nDifference: {abs(score_wrong - score_correct):.4f}")
print(
    "On this clean synthetic dataset the difference is small, but on real-world\n"
    "data with noisy, high-dimensional features the gap can be substantial.\n"
    "The WRONG approach is always theoretically invalid regardless of the gap size."
)
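
One way to see what actually leaks is to compare the statistics each scaler learns. The sketch below regenerates the same synthetic dataset so that it runs on its own, then contrasts `mean_` from a scaler fitted on all rows with one fitted on the training rows only. The variable names (`X_demo`, `scaler_all`, and so on) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_redundant=4,
    n_classes=2, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

scaler_all = StandardScaler().fit(X_demo)  # sees the test rows (leaky)
scaler_train = StandardScaler().fit(X_tr)  # sees training rows only

# Any non-zero difference is information contributed by the test rows
mean_shift = np.abs(scaler_all.mean_ - scaler_train.mean_)
print("Largest shift in a feature mean:", mean_shift.max().round(4))

# With the train-only scaler, the test set is generally NOT exactly centered.
# That is expected: unseen data never matches the training statistics exactly.
X_te_scaled = scaler_train.transform(X_te)
print("Largest |mean| of a scaled test feature:",
      np.abs(X_te_scaled.mean(axis=0)).max().round(4))
```

The shift is small on a large, well-behaved dataset, which is consistent with the small accuracy difference measured above. It tends to grow as the dataset shrinks or the features become more skewed.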

### 3c. Best practice: use a Pipeline

A `Pipeline` guarantees that preprocessing is fitted **only** on training data during cross-validation.

In [None]:
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

cv_scores = cross_val_score(
    pipe, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)

print(f"Pipeline + 5-fold CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print("The Pipeline ensures fit_transform runs ONLY on training folds.")
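
To see what the `Pipeline` is doing for you, the loop below is a rough manual equivalent under the same splitter settings: for each fold it fits the scaler on the training fold only and reuses it to transform the validation fold. This is a sketch of the idea, not scikit-learn's internal implementation, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_redundant=4,
    n_classes=2, random_state=42,
)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    # Fit the scaler on the training fold only
    scaler = StandardScaler().fit(X_demo[train_idx])
    clf = LogisticRegression(max_iter=1000, random_state=42)
    clf.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # Reuse the already-fitted scaler on the validation fold
    manual_scores.append(
        clf.score(scaler.transform(X_demo[val_idx]), y_demo[val_idx])
    )

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
pipe_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring="accuracy")

print("Manual loop:", np.round(manual_scores, 4))
print("Pipeline   :", np.round(pipe_scores, 4))
```

Both approaches produce the same per-fold scores. The pipeline version is shorter and removes the chance of accidentally calling `fit` on the wrong rows.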

---

## 4. More Leakage Examples

### 4a. Using future data (time-based leakage)

If you are predicting tomorrow's stock price, you cannot use tomorrow's trading volume as a feature. This seems obvious, but in practice it happens when:

- A database join brings in records timestamped after the prediction date
- Aggregated features (e.g., "average rating for this product") include future reviews
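
The aggregated-feature case can be sketched with a small, made-up reviews table. The column names and values below are hypothetical. The leaky version averages over all reviews of a product, while the point-in-time version only uses reviews that came strictly before the current row.

```python
import pandas as pd

# Hypothetical review log (illustrative values only)
reviews = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B"],
    "review_date": pd.to_datetime(
        ["2023-01-01", "2023-01-05", "2023-01-09", "2023-01-02", "2023-01-03"]
    ),
    "rating": [5.0, 3.0, 4.0, 2.0, 4.0],
})
reviews = reviews.sort_values(["product", "review_date"]).reset_index(drop=True)

# LEAKY: the average includes the current row and every future review
reviews["avg_rating_leaky"] = reviews.groupby("product")["rating"].transform("mean")

# POINT-IN-TIME: expanding mean, shifted by one row so that each value
# only reflects reviews that happened strictly before the current one
reviews["avg_rating_past_only"] = reviews.groupby("product")["rating"].transform(
    lambda s: s.expanding().mean().shift(1)
)

print(reviews)
```

The first review of each product has no history, so its past-only average is `NaN`. That is the honest answer at that point in time and must be handled like any other missing value.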

### 4b. Proxy variables (target leakage)

A feature that is a **direct consequence** of the target:

- Predicting "will the patient be readmitted?" using "number of follow-up appointments" -- follow-ups happen *because* the patient was readmitted
- Predicting "will the customer churn?" using "cancellation_reason" -- that column is only populated *after* the customer churns

In [None]:
# Demonstrate target leakage with a proxy variable
rng = np.random.RandomState(42)

# Suppose target = 1 means "churned"
n = 500
legit_feature = rng.randn(n)
target = (legit_feature + rng.randn(n) * 0.5 > 0.3).astype(int)

# A leaky feature: "days_since_cancellation" -- only known AFTER churn
leaky_feature = np.where(target == 1, rng.randint(1, 30, n), 0)

# Build a DataFrame
df_leak = pd.DataFrame({
    "legit_feature": legit_feature,
    "days_since_cancellation": leaky_feature,
    "churned": target,
})

# Model WITH the leaky feature
X_leak = df_leak[["legit_feature", "days_since_cancellation"]].values
y_leak = df_leak["churned"].values

scores_leak = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=42),
    X_leak, y_leak,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)

# Model WITHOUT the leaky feature
X_clean = df_leak[["legit_feature"]].values

scores_clean = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=42),
    X_clean, y_leak,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)

print(f"With leaky feature:    accuracy = {scores_leak.mean():.4f}")
print(f"Without leaky feature: accuracy = {scores_clean.mean():.4f}")
print(f"\nThe leaky feature inflates accuracy by {scores_leak.mean() - scores_clean.mean():.4f}")
print("This performance would NOT hold in production where the feature is unavailable.")
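
A simple screening heuristic, sketched below under the same kind of simulated churn data, is to score each feature on its own. A single column that predicts the target almost perfectly is a prompt to investigate, not proof of leakage. The data is regenerated here so the block runs on its own, so the exact numbers differ from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng_demo = np.random.default_rng(42)
n_rows = 500

# Simulated data (assumption: same structure as the churn example)
legit = rng_demo.normal(size=n_rows)
churned = (legit + rng_demo.normal(scale=0.5, size=n_rows) > 0.3).astype(int)
days_since = np.where(churned == 1, rng_demo.integers(1, 30, size=n_rows), 0)

df_screen = pd.DataFrame({
    "legit_feature": legit,
    "days_since_cancellation": days_since,
})

cv_screen = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated accuracy of a model that sees ONE feature at a time
single_feature_scores = {}
for col in df_screen.columns:
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), df_screen[[col]], churned, cv=cv_screen
    )
    single_feature_scores[col] = scores.mean()

for col, score in sorted(single_feature_scores.items(), key=lambda kv: -kv[1]):
    print(f"{col:<26} accuracy = {score:.4f}")
```

When a column stands out like this, ask whether its value would be known at the moment the prediction is made.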

---

## 5. Generalization: The Big Picture

Everything in this module series connects to one central idea: **generalization** -- the model's ability to perform well on data it has never seen.

```
+----------------------------------------------------------------------------+
|                        ALL AVAILABLE DATA                                  |
|                                                                            |
|   +---------------------------+  +--------------+  +-------------------+   |
|   |      TRAINING SET         |  |  VALIDATION  |  |     TEST SET      |   |
|   |                           |  |     SET      |  |                   |   |
|   |  - Learn patterns         |  |  - Tune      |  |  - Final, honest  |   |
|   |  - Fit model parameters   |  |    hyper-    |  |    performance    |   |
|   |  - Fit preprocessors      |  |    params    |  |    estimate       |   |
|   |                           |  |  - Model     |  |  - Touch ONCE     |   |
|   |                           |  |    selection |  |                   |   |
|   +---------------------------+  +--------------+  +-------------------+   |
|                                                                            |
|   Generalization gap = Training performance - Test performance             |
|   Large gap => overfitting                                                 |
|   Both low  => underfitting                                                |
+----------------------------------------------------------------------------+
```

**Key principles**:

- Split data *before* fitting any preprocessing or feature engineering step that learns from the data (scaling, imputation, feature selection)
- Never use test data for any decision (model selection, hyperparameter tuning, feature selection)
- Use cross-validation for a more robust estimate of generalization
- Watch for data leakage -- it creates an illusion of good generalization
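
A minimal sketch of the three-way split, assuming a 60/20/20 ratio (the ratio is a choice made for this example, not a rule). The test set is carved off first and left untouched, and the preprocessing lives inside a pipeline so it is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_redundant=4,
    n_classes=2, random_state=42,
)

# 1) Hold out the test set first (20% of all rows)
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
# 2) Split the remainder: 25% of the remaining 80% = 20% of all rows
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)

print(f"Sizes: train={len(X_train)}, val={len(X_val)}, test={len(X_test)}")
print(f"Train accuracy:      {train_acc:.4f}")
print(f"Validation accuracy: {val_acc:.4f}")
print(f"Difference (train - validation): {train_acc - val_acc:.4f}")
```

The test set is deliberately not scored here. Compare candidate models on the validation set and evaluate on the test set once, at the very end.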

---

## 6. Data Quality Essentials

Before modeling, always audit the raw data for quality issues.

In [None]:
# Create a "messy" DataFrame to illustrate common issues
rng = np.random.RandomState(42)

df_messy = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 30, 25, 52, np.nan, 38, 30],
    "income": [50000, 60000, 45000, 80000, 60000, 50000, 95000, 72000, 68000, 60000],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA", "NYC", "LA", "Chicago", "nyc", "la"],
    "signup_date": [
        "2023-01-15", "2023-02-20", "2023-03-10", "2023/04/05", "2023-02-20",
        "2023-01-15", "2023-06-01", "2023-07-12", "2023-08-25", "2023-02-20",
    ],
    "purchased": [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

print("Raw data:")
df_messy

### 6a. Missing values

In [None]:
print("Missing values per column:")
print(df_messy.isnull().sum())
print(f"\nTotal missing: {df_messy.isnull().sum().sum()} out of {df_messy.size} cells "
      f"({df_messy.isnull().sum().sum() / df_messy.size * 100:.1f}%)")

**Common strategies** for missing values (covered in depth in later modules):

- **Drop rows** -- simple but loses data
- **Impute** with mean/median/mode
- **Impute** with a model (e.g., KNN imputer)
- **Flag** missingness as a feature itself
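
Imputation is itself a fitted preprocessing step, so the fit-on-train rule applies to it as well. The sketch below combines median imputation with a missingness flag using scikit-learn's `SimpleImputer(add_indicator=True)`. The tiny arrays are made up for illustration, with columns standing for age and income.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical rows: [age, income]; age is missing in one row of each set
X_train_demo = np.array([
    [25.0, 50000.0],
    [np.nan, 45000.0],
    [45.0, 80000.0],
    [30.0, 60000.0],
])
X_test_demo = np.array([
    [np.nan, 72000.0],
    [38.0, 68000.0],
])

# add_indicator=True appends a 0/1 column for each feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train_demo)  # medians learned from TRAIN only
X_test_imp = imputer.transform(X_test_demo)        # test rows reuse those medians

print("Medians learned from training data:", imputer.statistics_)
print("Imputed test rows (age, income, age_was_missing):")
print(X_test_imp)
```

The missing test age is filled with the training median, never with a statistic computed from the test rows.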

### 6b. Duplicate rows

In [None]:
n_dupes = df_messy.duplicated().sum()
print(f"Duplicate rows: {n_dupes}")
print("\nDuplicated rows:")
df_messy[df_messy.duplicated(keep=False)]

Duplicates can:

- **Bias** the model towards repeated patterns
- Cause **leakage** if the same row appears in both train and test sets
- Inflate metrics artificially
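
Exact-match duplicate detection can miss rows that differ only in formatting. The sketch below uses a small made-up table (hypothetical values) to show how normalizing text before calling `duplicated()` changes the count, and how to drop the duplicates before any train/test split.

```python
import pandas as pd

df_dupes = pd.DataFrame({
    "age": [30, 30, 30, 25],
    "city": ["LA", "LA", "la", "NYC"],
    "purchased": [0, 0, 0, 1],
})

# Exact duplicates only: "la" is not recognized as a repeat of "LA"
n_raw = int(df_dupes.duplicated().sum())

# Normalize casing first, then look for duplicates again
df_norm = df_dupes.assign(city=df_dupes["city"].str.lower())
n_norm = int(df_norm.duplicated().sum())

# Keep the first occurrence of each row; do this BEFORE splitting
df_dedup = df_norm.drop_duplicates().reset_index(drop=True)

print(f"Duplicates before normalization: {n_raw}")
print(f"Duplicates after normalization:  {n_norm}")
print(f"Rows kept: {len(df_dedup)} of {len(df_dupes)}")
```

Whether repeated rows are true duplicates or legitimate repeat observations depends on the data source, so check before dropping them.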

### 6c. Inconsistent types and values

In [None]:
# City names: inconsistent casing
print("Unique city values (raw):", df_messy["city"].unique())

# Fix: normalize to lowercase
df_clean = df_messy.copy()
df_clean["city"] = df_clean["city"].str.lower()
print("Unique city values (cleaned):", df_clean["city"].unique())

In [None]:
# Date formats: mixed separators
print("Raw dates:")
print(df_messy["signup_date"].values)

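# Fix: normalize the separator, then parse with an explicit format
# (recent pandas versions raise on mixed formats when the format is inferred)
df_clean["signup_date"] = pd.to_datetime(
    df_clean["signup_date"].str.replace("/", "-", regex=False),
    format="%Y-%m-%d",
)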
print("\nParsed dates:")
print(df_clean["signup_date"].values)
print(f"dtype: {df_clean['signup_date'].dtype}")

### Data quality checklist

Before any modeling, run through this quick audit:

1. `df.info()` -- check dtypes and non-null counts
2. `df.isnull().sum()` -- missing values
3. `df.duplicated().sum()` -- duplicate rows
4. `df.describe()` -- check ranges, means, and outliers
5. For categorical columns: `df[col].value_counts()` -- check for inconsistent labels
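
The checklist can be bundled into a small helper. `audit_dataframe` below is a hypothetical function written for this notebook, not a pandas API. It is a sketch that collects the counts in one place and adds a simple check for labels that differ only by casing.

```python
import numpy as np
import pandas as pd


def audit_dataframe(df: pd.DataFrame) -> dict:
    """Collect basic data-quality counts (hypothetical helper)."""
    # String-like columns, where inconsistent labels tend to hide
    text_cols = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]
    return {
        "n_rows": len(df),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isnull().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
        # How many distinct labels collapse once casing is normalized
        "case_variants": {
            c: int(df[c].nunique() - df[c].str.lower().nunique())
            for c in text_cols
        },
    }


# Small made-up example
df_audit = pd.DataFrame({
    "age": [25, np.nan, 25],
    "city": ["NYC", "nyc", "NYC"],
    "purchased": [1, 0, 1],
})

report = audit_dataframe(df_audit)
for key, value in report.items():
    print(f"{key}: {value}")
```

A helper like this does not replace looking at the data; it only makes the first pass repeatable.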

---

## 7. Common Mistakes

| Mistake | Why it is harmful | Fix |
|---------|-------------------|-----|
| **Fitting preprocessor on full dataset** | Test set statistics leak into training; inflates performance | Use `Pipeline` or manually `fit` on train, `transform` on test |
| **Feature derived from target** | Model sees the answer directly or via a strong proxy | Audit features: ask "would I have this at prediction time?" |
| **Using future data as features** | Temporal leakage -- model gets information from the future | Ensure all features are available *before* the prediction timestamp |
| **Ignoring duplicates across splits** | Same sample in train and test leads to overoptimistic scores | Deduplicate before splitting |
| **Not checking data types** | Categorical identifiers stored as numbers (e.g., zip codes) get treated as continuous quantities | Always run `df.info()` and `df.describe()` first |
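
The zip code case from the last row can be sketched as follows. The values are made up, and the five-digit zero-padding assumes US-style zip codes.

```python
import pandas as pd

df_zip = pd.DataFrame({
    "zip_code": [2139, 10001, 94105, 2139],  # stored as integers
    "income": [72000, 65000, 98000, 70000],
})

# This runs without error, but the "average zip code" has no meaning
print("Mean zip code (meaningless):", df_zip["zip_code"].mean())

# Treat the column as an identifier: restore the leading zero lost by
# integer storage, then mark the column as categorical
df_zip["zip_code"] = (
    df_zip["zip_code"].astype(str).str.zfill(5).astype("category")
)

print(df_zip.dtypes)
print("Categories:", list(df_zip["zip_code"].cat.categories))
```

A categorical column still has to be encoded before most scikit-learn models can use it, but the dtype now states that the values are labels rather than quantities.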

---

## 8. Exercise

**Task**: You are given the following DataFrame. Identify and fix the leakage problem, then compare model performance before and after the fix.

1. Create the DataFrame below.
2. Train a `LogisticRegression` using 5-fold stratified CV **with all columns as features** (including the leaky one). Record accuracy.
3. Remove the leaky feature and repeat. Record accuracy.
4. Explain why the first accuracy is unrealistically high.

```python
rng = np.random.RandomState(42)
n = 300
hours_studied = rng.uniform(1, 10, n)
passed = (hours_studied + rng.randn(n) * 1.5 > 5).astype(int)
# Leaky feature: "grade" is assigned AFTER the exam
grade = np.where(passed == 1, rng.uniform(70, 100, n), rng.uniform(20, 60, n))

df_ex = pd.DataFrame({
    "hours_studied": hours_studied,
    "grade": grade,
    "passed": passed,
})
```

In [None]:
# YOUR CODE HERE
# ----------------------------------------------------------------
# Step 1: Create df_ex as shown above
# Step 2: CV with all features -> score_all
# Step 3: CV without 'grade' -> score_no_leak
# Step 4: Print and compare
# ----------------------------------------------------------------