
# Required Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


# 1. Load and Prepare Titanic Dataset



In [None]:
# Load dataset
df = sns.load_dataset('titanic')

# Drop rows with missing target and select a subset of features
df = df.dropna(subset=['survived'])
df = df[['survived', 'pclass', 'sex', 'age', 'fare', 'embarked']]

# Drop rows with missing values for simplicity
# (as a result, the SimpleImputer steps below have nothing left to fill in)
df = df.dropna()

# Split features and target
X = df.drop('survived', axis=1)
y = df['survived']


# ⚠️ 2. Leaky Model (Wrong Approach)

In [None]:
# Encode and scale before splitting (leakage!)
numeric_features = ['age', 'fare']
categorical_features = ['pclass', 'sex', 'embarked']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Fit the transformer on full data (this causes leakage)
X_processed = preprocessor.fit_transform(X)

# Then split
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

# Train model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print("Leaky Accuracy:", accuracy_score(y_test, y_pred))


Leaky Accuracy: 0.7622377622377622
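
To make the leak concrete, the sketch below fits a `StandardScaler` once on all rows and once on the training rows only, then compares what each scaler learned. The data is a hypothetical synthetic column standing in for `fare` (an assumption, not the Titanic data), so the numbers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one skewed numeric column standing in for 'fare'
rng = np.random.default_rng(0)
fare = rng.exponential(scale=30.0, size=(200, 1))

fare_train, fare_test = train_test_split(fare, test_size=0.2, random_state=42)

# Leaky: statistics computed from all 200 rows, including the 40 test rows
leaky_scaler = StandardScaler().fit(fare)

# Clean: statistics computed from the 160 training rows only
clean_scaler = StandardScaler().fit(fare_train)

print("mean seen by leaky scaler:", leaky_scaler.mean_[0])
print("mean seen by clean scaler:", clean_scaler.mean_[0])

# The same test rows end up with different scaled values
max_shift = np.abs(
    leaky_scaler.transform(fare_test) - clean_scaler.transform(fare_test)
).max()
print("largest difference on the test set:", max_shift)
```

The leaky scaler's mean and standard deviation depend on the test rows, which is exactly the information a model should not have access to during training.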


# ✅ 3. Clean Pipeline (Correct Approach)


In [None]:
# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reuse the same preprocessing steps
numeric_features = ['age', 'fare']
categorical_features = ['pclass', 'sex', 'embarked']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Build pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred_clean = pipeline.predict(X_test)
print("Clean Accuracy:", accuracy_score(y_test, y_pred_clean))


Clean Accuracy: 0.7692307692307693
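
Because the preprocessing lives inside the pipeline, the same object can be handed to `cross_val_score` (imported at the top but not used so far), which refits every step on each training fold. The sketch below assumes a small synthetic DataFrame with hypothetical `age`, `fare`, and `sex` columns rather than the real Titanic data, and a simplified preprocessor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the Titanic features (synthetic, not the real data)
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "age": rng.normal(30, 12, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["male", "female"], n),
})

# Synthetic target: follows 'sex' except for roughly 20% randomly flipped labels
is_female = (X_demo["sex"] == "female").to_numpy()
flipped = rng.random(n) < 0.2
y_demo = np.where(flipped, ~is_female, is_female).astype(int)

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each of the 5 folds refits the scaler and encoder on its own training part
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean())
```

Passing pre-transformed features to `cross_val_score` instead would reintroduce the leak, since the transformer would already have seen every fold.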


# 📉 4. Compare Results

In [None]:
print("\n--- Leaky Model Report ---")
print(classification_report(y_test, y_pred))

print("\n--- Clean Model Report ---")
print(classification_report(y_test, y_pred_clean))



--- Leaky Model Report ---
              precision    recall  f1-score   support

           0       0.77      0.81      0.79        80
           1       0.75      0.70      0.72        63

    accuracy                           0.76       143
   macro avg       0.76      0.76      0.76       143
weighted avg       0.76      0.76      0.76       143


--- Clean Model Report ---
              precision    recall  f1-score   support

           0       0.78      0.82      0.80        80
           1       0.76      0.70      0.73        63

    accuracy                           0.77       143
   macro avg       0.77      0.76      0.76       143
weighted avg       0.77      0.77      0.77       143



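Note that in this example the two accuracies are nearly identical: 0.762 for the leaky model and 0.769 for the clean pipeline, a difference of a single test prediction out of 143. The clean pipeline even scores marginally higher, so the leak does not visibly inflate the metric here. That is not surprising for this setup: rows with missing values were dropped, so the imputer has nothing to learn, and random forests are largely insensitive to feature scaling. Keep in mind that this is a toy example on a toy dataset; with messier data or preprocessing that draws more information from the full dataset, leakage can distort results far more.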
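
A quick back-of-the-envelope check shows how small this gap is relative to the noise in a 143-row test set. The correct-prediction counts (109 and 110) are inferred from the printed accuracies rather than taken from the notebook, and the binomial standard error is only a rough approximation.

```python
import math

n_test = 143              # test-set size from the classification reports
acc_leaky = 109 / n_test  # 0.7622..., matches the printed leaky accuracy
acc_clean = 110 / n_test  # 0.7692..., matches the printed clean accuracy

# Accuracy change caused by flipping a single test prediction
one_prediction = 1 / n_test

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
std_err = math.sqrt(acc_clean * (1 - acc_clean) / n_test)

print(f"gap between models:  {acc_clean - acc_leaky:.4f}")
print(f"one test prediction: {one_prediction:.4f}")
print(f"standard error:      {std_err:.4f}")
```

The gap equals one prediction and is several times smaller than the standard error, so this single split cannot tell the two models apart.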

# 🧭 5. Summary Cell

In [None]:
from IPython.display import Markdown

Markdown("""
### ✅ Key Takeaways

- Data leakage can inflate your metrics without throwing any error.
- Always **split your data before fitting** any transformation, so that scalers, encoders, and imputers learn only from the training set.
- Use **pipelines** to encapsulate preprocessing steps and avoid leakage.
- If your model's accuracy feels *too good to be true*... it might be.

""")



### ✅ Key Takeaways

- Data leakage can inflate your metrics without throwing any error.
- Always **split your data before fitting** any transformation, so that scalers, encoders, and imputers learn only from the training set.
- Use **pipelines** to encapsulate preprocessing steps and avoid leakage.
- If your model's accuracy feels *too good to be true*... it might be.
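
For a case where the inflation is easy to see, the sketch below swaps in a different preprocessing step, supervised feature selection, which is not used elsewhere in this notebook. The data is hypothetical pure noise with coin-flip labels, so any accuracy well above chance is an artifact of leakage. The choice of `SelectKBest`, `f_classif`, and `LogisticRegression` is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical data: noise features and random labels, so there is no real signal
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.integers(0, 2, size=100)

# Leaky: pick the 20 "best" features using ALL rows and labels, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# Clean: the selector sits inside the pipeline, so it is refit on each training fold
clean_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
clean_score = cross_val_score(clean_pipeline, X_noise, y_noise, cv=5).mean()

print("Leaky CV accuracy:", leaky_score)
print("Clean CV accuracy:", clean_score)
```

Since the labels are random, the clean estimate should hover around chance, while the leaky estimate typically looks much better than it has any right to.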

