# Temporal Validation & Data Leakage Correction

## Objective
Initial models achieved unrealistically high performance (ROC-AUC ≈ 1.0),
indicating temporal data leakage.

This notebook:
- Diagnoses leakage sources
- Enforces time-aware label availability
- Applies a proper temporal train/test split
- Re-evaluates models under realistic conditions

The goal is a performance estimate that can be trusted before any deployment decision.


In [26]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report, roc_curve

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)
print("Libraries loaded successfully")


Libraries loaded successfully


In [27]:
df = pd.read_csv("../data/processed/customer_features.csv")

df["first_purchase"] = pd.to_datetime(df["first_purchase"])
df["last_purchase"] = pd.to_datetime(df["last_purchase"])

print(f"Dataset Shape: {df.shape}")
df.head()


Dataset Shape: (4312, 22)


Unnamed: 0,customerid,first_purchase,last_purchase,frequency,total_transactions,monetary_value,avg_order_value,total_quantity,avg_quantity_per_txn,unique_products,unique_invoices,min_price,max_price,avg_price,price_std,country_count,recency_days,customer_tenure_days,days_since_first_purchase,days_since_last_purchase,avg_days_between_purchases,churn
0,12346.0,2009-12-14 08:34:00,2010-06-28 13:53:00,11,33,206.36,6.253333,70,2.121212,26,11,1.0,7.49,6.253333,1.682971,1,164,196,360,164,17.818182,1
1,12347.0,2010-10-31 14:20:00,2010-12-07 14:57:00,2,71,162.95,2.29507,828,11.661972,70,2,0.38,12.75,2.29507,1.869887,1,2,37,39,2,18.5,0
2,12348.0,2010-09-27 14:59:00,2010-09-27 14:59:00,1,20,14.39,0.7195,373,18.65,20,1,0.29,1.45,0.7195,0.431856,1,73,0,73,73,0.0,0
3,12349.0,2010-04-29 13:20:00,2010-10-28 08:23:00,3,102,875.34,8.581765,993,9.735294,90,3,0.42,250.0,8.581765,31.299379,1,42,181,224,42,60.333333,0
4,12351.0,2010-11-29 15:23:00,2010-11-29 15:23:00,1,21,49.46,2.355238,261,12.428571,21,1,0.42,12.75,2.355238,2.735753,1,10,0,10,10,0.0,0


## Leakage Diagnosis

Earlier models achieved perfect performance because:
- The train/test split ignored time
- Features such as `recency_days` and `days_since_last_purchase`
  were computed using information from within or after the churn window,
  so they can reveal the label directly

To fix this, we enforce:
1. Label availability cutoff
2. Time-aware train/test split
3. Removal of leakage-prone features
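
One way to look for leakage-prone features is to check which columns, or simple combinations of columns, reproduce recency. The sketch below uses only the five rows printed by `df.head()` above, and `recency_proxy_report` is a hypothetical helper, not part of the original notebook. It also assumes that `churn` is derived from recency, which the notebook does not state.

```python
import pandas as pd

# Five sample rows copied from the df.head() output above (subset of columns).
sample = pd.DataFrame({
    "recency_days": [164, 2, 73, 42, 10],
    "days_since_last_purchase": [164, 2, 73, 42, 10],
    "customer_tenure_days": [196, 37, 0, 181, 0],
    "days_since_first_purchase": [360, 39, 73, 224, 10],
})

def recency_proxy_report(frame, tolerance_days=1):
    """Hypothetical helper: flag columns or column differences that track recency_days."""
    # Exact duplicate of recency under another name.
    duplicate = (frame["recency_days"] == frame["days_since_last_purchase"]).all()
    # Recency rebuilt from two columns that look harmless on their own.
    reconstructed = frame["days_since_first_purchase"] - frame["customer_tenure_days"]
    max_gap = (reconstructed - frame["recency_days"]).abs().max()
    return {
        "days_since_last_purchase_is_duplicate": bool(duplicate),
        "tenure_difference_max_gap_days": int(max_gap),
        "tenure_difference_is_proxy": bool(max_gap <= tolerance_days),
    }

report = recency_proxy_report(sample)
print(report)
```

In these five rows the two recency columns are identical, and `days_since_first_purchase - customer_tenure_days` matches recency to within one day. If the same pattern holds on the full dataset, removing the two recency columns alone may not be enough, because a model could recover recency from the remaining pair.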


In [28]:
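CHURN_WINDOW_DAYS = 90

# Labels are only fully observed for dates at least one churn window
# before the last transaction in the data.
max_date = df["last_purchase"].max()
label_cutoff_date = max_date - pd.Timedelta(days=CHURN_WINDOW_DAYS)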

print("Max transaction date:", max_date)
print("Label cutoff date:", label_cutoff_date)


Max transaction date: 2010-12-09 20:01:00
Label cutoff date: 2010-09-10 20:01:00


In [29]:
# Keep customers whose last purchase falls on or before the label cutoff.
valid_df = df[df["last_purchase"] <= label_cutoff_date].copy()

print("Valid dataset size:", valid_df.shape[0])
print("Overall churn rate:", round(valid_df["churn"].mean(), 3))


Valid dataset size: 1435
Overall churn rate: 0.994
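
A churn rate this close to 1 is what one would expect if the filter selects on the same quantity that defines the label. The synthetic sketch below assumes that churn means "no purchase within the last 90 days of the data"; the notebook does not state the definition, so treat this as an illustration and not as a reconstruction of the actual labeling code. All data here is made up.

```python
import numpy as np
import pandas as pd

CHURN_WINDOW_DAYS = 90

# Synthetic data (illustrative only): one customer per day over the final year.
max_date = pd.Timestamp("2010-12-09")
days_back = np.arange(365)
toy = pd.DataFrame({"last_purchase": max_date - pd.to_timedelta(days_back, unit="D")})

# ASSUMED label definition: churned if inactive for more than the window.
recency = (max_date - toy["last_purchase"]).dt.days
toy["churn"] = (recency > CHURN_WINDOW_DAYS).astype(int)

# Same filter as the cell above: keep last purchases on or before the cutoff.
cutoff = max_date - pd.Timedelta(days=CHURN_WINDOW_DAYS)
filtered = toy[toy["last_purchase"] <= cutoff]

print("Churn rate before filter:", round(toy["churn"].mean(), 3))
print("Churn rate after filter:", round(filtered["churn"].mean(), 3))
```

Under this assumed definition the filter keeps every churner and removes every non-churner except those sitting exactly on the boundary, so the surviving rows are almost all positives regardless of customer behavior.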


In [30]:
valid_df = valid_df.sort_values("last_purchase")


In [31]:
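# Columns excluded from the feature matrix. Not all of them are leakage:
# the list also holds the identifier, the raw dates, and the target.
leakage_cols = [
    "customerid",
    "first_purchase",
    "last_purchase",
    "recency_days",
    "days_since_last_purchase",
    "churn"
]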

X = valid_df.drop(columns=leakage_cols)
y = valid_df["churn"]

print("Final feature count:", X.shape[1])


Final feature count: 16


In [32]:
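# Split at the 70th percentile of last-purchase dates:
# earlier customers go to train, later customers go to test.
split_date = valid_df["last_purchase"].quantile(0.7)

train_df = valid_df[valid_df["last_purchase"] <= split_date]
test_df  = valid_df[valid_df["last_purchase"] > split_date]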

print("Train size:", train_df.shape[0])
print("Test size:", test_df.shape[0])
print("Train churn rate:", round(train_df["churn"].mean(), 3))
print("Test churn rate:", round(test_df["churn"].mean(), 3))


Train size: 1004
Test size: 431
Train churn rate: 1.0
Test churn rate: 0.981


In [42]:
print("Train class distribution:")
print(train_df["churn"].value_counts())

print("\nTest class distribution:")
print(test_df["churn"].value_counts())


Train class distribution:
churn
1    1004
Name: count, dtype: int64

Test class distribution:
churn
1    423
0      8
Name: count, dtype: int64
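
The split and the class check can be combined so that a degenerate split is caught before any modeling code runs. This is a minimal sketch on toy data; `temporal_split` and its report keys are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def temporal_split(frame, date_col, label_col, train_frac=0.7):
    """Hypothetical helper: split on a date quantile and report class counts per side."""
    split_date = frame[date_col].quantile(train_frac)
    train = frame[frame[date_col] <= split_date]
    test = frame[frame[date_col] > split_date]

    report = {}
    for name, part in (("train", train), ("test", test)):
        counts = part[label_col].value_counts()
        report[name] = {
            "rows": len(part),
            "class_counts": counts.to_dict(),
            "has_both_classes": bool(len(counts) >= 2),
        }
    return train, test, report

# Toy data (illustrative only): every early customer is a churner.
toy = pd.DataFrame({
    "last_purchase": pd.date_range("2010-01-01", periods=10, freq="D"),
    "churn": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
})
train, test, report = temporal_split(toy, "last_purchase", "churn")
print(report)
```

A split whose training side reports `has_both_classes` as false cannot be used to fit a classifier, and a test side with only a handful of minority rows will give unstable metrics.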


In [43]:
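# StandardScaler and LogisticRegression are already imported in the first cell.

# Build the feature matrix from the training split only.
X_train = train_df.drop(columns=leakage_cols)
y_train = train_df["churn"]

# A classifier needs at least two classes; stop if the training split has one.
if y_train.nunique() < 2: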
    print("⚠️ Model training aborted.")
    print("Only one class present in training data.")
    print("This confirms temporal leakage in earlier evaluations.")
else:
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    model = LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
        random_state=42
    )
    model.fit(X_train_scaled, y_train)


⚠️ Model training aborted.
Only one class present in training data.
This confirms temporal leakage in earlier evaluations.


## Interpretation of Temporal Validation Results

After enforcing a strict label availability cutoff and a time-aware split,
99.4% of the 1,435 remaining customers are labeled as churned.

This leads to:
- Extremely high churn rates in both sets (100% in train, 98.1% in test)
- Single-class training data (all 1,004 training rows are churners)
- No way to fit a classifier on this split

This outcome indicates that:
- The earlier perfect ROC-AUC scores reflected temporal data leakage, not real predictive power
- A naïve temporal split on `last_purchase` is not sufficient for churn modeling
- Churn labels need to be defined and evaluated using rolling or forward-looking windows
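
A forward-looking window can be sketched as follows: pick a snapshot date, compute features only from purchases up to that date, and derive the label from the window that follows. The sketch assumes access to a transaction-level table with `customerid` and `invoice_date` columns, which this notebook does not load; the table, the column `invoice_date`, and the helper `build_snapshot` are all hypothetical.

```python
import pandas as pd

CHURN_WINDOW_DAYS = 90

def build_snapshot(transactions, snapshot_date, window_days=CHURN_WINDOW_DAYS):
    """Hypothetical helper: features from history up to snapshot_date,
    label from the window that follows it."""
    window_end = snapshot_date + pd.Timedelta(days=window_days)
    if window_end > transactions["invoice_date"].max():
        raise ValueError("Label window is not fully observed for this snapshot date.")

    history = transactions[transactions["invoice_date"] <= snapshot_date]
    future = transactions[
        (transactions["invoice_date"] > snapshot_date)
        & (transactions["invoice_date"] <= window_end)
    ]

    features = history.groupby("customerid").agg(
        frequency=("invoice_date", "nunique"),
        last_purchase=("invoice_date", "max"),
    )
    # Recency is measured relative to the snapshot, so it is known at prediction time.
    features["recency_days"] = (snapshot_date - features["last_purchase"]).dt.days
    features = features.drop(columns="last_purchase")

    # Churn = active before the snapshot, but no purchase inside the label window.
    returned = set(future["customerid"])
    features["churn"] = (~features.index.isin(returned)).astype(int)
    return features.reset_index()

# Toy transactions (illustrative only).
transactions = pd.DataFrame({
    "customerid": [1, 1, 2, 2, 3, 3],
    "invoice_date": pd.to_datetime([
        "2010-03-01", "2010-08-15",   # customer 1: returns inside the label window
        "2010-02-10", "2010-05-20",   # customer 2: no purchase after the snapshot
        "2010-06-01", "2010-12-09",   # customer 3: returns only after the window ends
    ]),
})
snapshot = build_snapshot(transactions, pd.Timestamp("2010-06-30"))
print(snapshot)
```

Because the label comes from a period strictly after the features, both classes can appear at the same snapshot, and recency becomes a legitimate feature. Repeating the call over several snapshot dates would give a series of train/evaluate folds ordered in time.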

This notebook intentionally stops before model training.


## Notebook 6 Summary

✔ Temporal data leakage was identified as the source of the earlier scores  
✔ Leakage-prone features were removed  
✔ Label availability constraints were enforced  
✔ Unrealistic model performance was invalidated  

➡️ **Next:**  
Notebook 7 implements rolling-window cross-validation and
leakage-safe model evaluation for final model selection.
