<a href="https://colab.research.google.com/github/xquynhtrinh/STA_141C_Final_Project/blob/main/Predictive_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predictive Modeling Goals

- Merge binary target (purchased_next90) with new PCA scores (PC1 & PC2)

- Do a train/test split

- Train logistic regression & Random Forest

We use a time-based split instead of a random train-test split. With a random split, we would be using future data to predict past behavior, which is a serious data-leakage error. The time-based split reserves the final 90 days of our dataset as the test period: features are computed only from data before the split date, and the target records whether each customer purchased during the final 90 days.
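As a rough illustration of the idea above, the sketch below applies the same kind of split to a tiny made-up transaction log. The toy rows and the names `tx`, `history`, `future`, and `labels` are assumptions for this example only; they are not part of the project's dataset.

```python
import pandas as pd

# Toy transaction log (hypothetical data, not the project's dataset)
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-11-20", "2011-03-10", "2011-06-01", "2011-08-15"]
    ),
})

# Reserve the final 90 days as the outcome window
max_date = tx["InvoiceDate"].max()
split_date = max_date - pd.Timedelta(days=90)

history = tx[tx["InvoiceDate"] < split_date]   # used to build features
future = tx[tx["InvoiceDate"] >= split_date]   # used only to build the target

# One row per customer seen before the split date
labels = history.groupby("Customer ID").size().rename("n_tx").reset_index()

# Target: 1 if the customer also appears in the outcome window, else 0
labels["purchased_next90"] = (
    labels["Customer ID"].isin(future["Customer ID"]).astype(int)
)
print(labels)
```

Only customer 1 has a transaction after the split date, so only that customer is labeled 1. Nothing from the outcome window enters the feature side.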

In [18]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

sns.set_theme(style="whitegrid")

# load clean transactions data
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/STA 141C/clean_transactions.csv")

# ensure date col is formatted as datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])


## Time-Based Split & Target Creation

In [20]:
# Define the 90 day holdout period
max_date = df["InvoiceDate"].max()
split_date = max_date - pd.Timedelta(days=90)

print(f"Training Data: Before {split_date.date()}")
print(f"Testing Data (Next 90 Days): {split_date.date()} to {max_date.date()}")

Training Data: Before 2011-09-10
Testing Data (Next 90 Days): 2011-09-10 to 2011-12-09


In [28]:
# re-calc RFM on training data to prevent data leakage
train_df = df[df["InvoiceDate"]<split_date]
test_df = df[df["InvoiceDate"]>=split_date]

# calc train RFM
train_rfm = train_df.groupby("Customer ID").agg(
    Recency = ("InvoiceDate", lambda x: (split_date - x.max()).days),
    Frequency = ("Invoice", "nunique"),
    Monetary = ("Revenue", "sum")
).reset_index()

# create binary target from test data
# 1: customer purchases during the test period, 0: otherwise

test_customers = set(test_df["Customer ID"].unique())
train_rfm["purchased_next90"] = train_rfm["Customer ID"].isin(test_customers).astype(int)

# pre-process training features (log-transf + scale)
x_raw = train_rfm[['Recency', 'Frequency', 'Monetary']]
x_log = np.log1p(x_raw)

scaler_pred = StandardScaler()
x_scaled = scaler_pred.fit_transform(x_log)

# Apply PCA
pca_pred = PCA(n_components=2, random_state=42)
x_pca = pca_pred.fit_transform(x_scaled)

y = train_rfm["purchased_next90"]

print(f"\nTotal customers in train set: {len(y)}")
print(f"Customers who bought in next 90 days: {y.sum()} ({(y.sum()/len(y)*100):.1f}%)")


Total customers in train set: 5281
Customers who bought in next 90 days: 2292 (43.4%)


## Model Training & Evaluation

Because all features and the target are defined at a single split date, we have one snapshot of customers rather than several time periods. We therefore use 5-fold cross-validation on this historical snapshot to compare the models rigorously.
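One detail worth considering with cross-validation: if the scaler and PCA are fitted on all rows before the folds are made, each validation fold has already influenced the preprocessing. A possible way to avoid this is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, so every step is refitted inside each fold. The sketch below assumes synthetic RFM-like data (`X_demo`, `y_demo` are made up for illustration) and is not the notebook's actual procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic RFM-like features (illustrative only)
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.integers(1, 365, n),    # Recency-like (days)
    rng.integers(1, 50, n),     # Frequency-like (invoice count)
    rng.gamma(2.0, 200.0, n),   # Monetary-like (revenue)
]).astype(float)
# Made-up target: recent customers are more likely to buy again
y_demo = (X_demo[:, 0] + rng.normal(0, 80, n) < 150).astype(int)

# log1p -> scale -> PCA -> logistic regression, all refitted per fold
pipe = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    PCA(n_components=2, random_state=42),
    LogisticRegression(class_weight="balanced", random_state=42),
)

cv = cross_validate(pipe, X_demo, y_demo, cv=5, scoring=["roc_auc", "f1"])
mean_auc = cv["test_roc_auc"].mean()
print(f"Mean ROC-AUC: {mean_auc:.3f}")
```

With only three features and a few thousand customers, the difference from fitting the scaler once up front is probably small, but the pipeline version keeps the comparison clean.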

In [32]:
from sklearn.model_selection import cross_validate

# Init models
log_reg = LogisticRegression(class_weight='balanced', random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                            class_weight='balanced', random_state=42)

# Define metrics
scoring = ['roc_auc', 'f1', 'precision', 'recall']

# Evaluate logistic reg.
cv_lr = cross_validate(log_reg, x_pca, y, cv=5, scoring=scoring)

# eval random forest
cv_rf = cross_validate(rf, x_pca, y, cv=5, scoring=scoring)

# summary table
results = pd.DataFrame({
    'Metric (5-Fold CV Avg)': ['ROC-AUC', 'F1-Score', 'Precision', 'Recall'],
    'Logistic Regression (PCA)': [
        cv_lr['test_roc_auc'].mean(),
        cv_lr['test_f1'].mean(),
        cv_lr['test_precision'].mean(),
        cv_lr['test_recall'].mean()
    ],
    'Random Forest (PCA)': [
        cv_rf['test_roc_auc'].mean(),
        cv_rf['test_f1'].mean(),
        cv_rf['test_precision'].mean(),
        cv_rf['test_recall'].mean()
    ]
}).round(3)

results

  Metric (5-Fold CV Avg)  Logistic Regression (PCA)  Random Forest (PCA)
0                ROC-AUC                      0.789                0.793
1               F1-Score                      0.676                0.676
2              Precision                      0.681                0.692
3                 Recall                      0.672                0.661


## Improving the ROC-AUC Score

In [34]:
# Train LR on PCA (to prevent multicollinearity)
cv_lr = cross_validate(log_reg, x_pca, y, cv=5, scoring=scoring)

# Train RF on the scaled RFM data directly (Trees don't need PCA!)
cv_rf = cross_validate(rf, x_scaled, y, cv=5, scoring=scoring)
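The reasoning behind giving the tree model the non-PCA features can be checked on a small example. A decision tree splits on thresholds, so a monotone per-feature transform such as `log1p` plus standardization should leave the partition of the training rows unchanged, whereas PCA mixes the features into new axes. The sketch below uses synthetic integer-valued data (`X_raw`, `y_demo`, and `X_demo_scaled` are assumptions for illustration) and compares a single tree trained on raw and on transformed features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, integer-valued RFM-like features (illustrative only)
rng = np.random.default_rng(0)
n = 400
X_raw = np.column_stack([
    rng.integers(1, 365, n),    # Recency-like (days)
    rng.integers(1, 50, n),     # Frequency-like (invoice count)
    rng.integers(1, 5000, n),   # Monetary-like (revenue)
]).astype(float)
# Made-up target defined by two axis-aligned rules
y_demo = ((X_raw[:, 0] < 120) & (X_raw[:, 1] > 10)).astype(int)

# Same style of preprocessing as the notebook: log1p, then standardize
X_demo_scaled = StandardScaler().fit_transform(np.log1p(X_raw))

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_raw, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    X_demo_scaled, y_demo
)

# Fraction of training rows where both trees predict the same class
agreement = np.mean(
    tree_raw.predict(X_raw) == tree_scaled.predict(X_demo_scaled)
)
print(f"Training-set agreement: {agreement:.3f}")
```

So scaling is harmless for the Random Forest but also not required; the relevant change in this step is dropping PCA, not keeping the scaler.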



In [36]:
# hyperparam. tuning
from sklearn.model_selection import GridSearchCV

# Define the grid of parameters to test
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

# Search for the best combination specifically for ROC-AUC
grid_search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1
)

# Fit on the SCALED features (not PCA)
grid_search.fit(x_scaled, y)

print("Best Parameters for RF:", grid_search.best_params_)
print("Best Expected AUC:", grid_search.best_score_)

# grid_search.best_estimator_ is the tuned model used in the final comparison table


Best Parameters for RF: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 300}
Best Expected AUC: 0.7988312206916639


In [37]:
grid_search.best_estimator_

In [41]:
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Define the metrics we want to track
scoring = ['roc_auc', 'f1', 'precision', 'recall']

# ==========================================
# 1. LOGISTIC REGRESSION (Using x_pca)
# ==========================================
# We use x_pca here because linear-model coefficients become unstable
# when features are highly correlated (as Frequency and Monetary are).
log_reg = LogisticRegression(class_weight='balanced', random_state=42)
cv_lr = cross_validate(log_reg, x_pca, y, cv=5, scoring=scoring)


# ==========================================
# 2. RANDOM FOREST TUNING (Using x_scaled)
# ==========================================
# We use x_scaled here because decision trees are not affected by multicollinearity.
# Feeding PCA scores to a tree can hurt it: the rotated axes no longer line up
# with the original features, so axis-aligned splits are harder to find and interpret.

print("Tuning Random Forest... (this might take a minute)")

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

# Run the Grid Search to find the best parameters to maximize ROC-AUC
grid_search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1  # use all available CPU cores
)

grid_search.fit(x_scaled, y)  # note: fit on x_scaled, not x_pca

# Extract the winning model!
best_rf = grid_search.best_estimator_
print(f"Best RF Parameters found: {grid_search.best_params_}\n")

# Run cross-validation on the winning model so we can compare it fairly to LR
cv_rf = cross_validate(best_rf, x_scaled, y, cv=5, scoring=scoring)


# ==========================================
# 3. RESULTS TABLE
# ==========================================
results = pd.DataFrame({
    'Metric (5-Fold CV Avg)': ['ROC-AUC', 'F1-Score', 'Precision', 'Recall'],
    'Logistic Regression (with PCA)': [
        cv_lr['test_roc_auc'].mean(),
        cv_lr['test_f1'].mean(),
        cv_lr['test_precision'].mean(),
        cv_lr['test_recall'].mean()
    ],
    'Tuned Random Forest (No PCA)': [
        cv_rf['test_roc_auc'].mean(),
        cv_rf['test_f1'].mean(),
        cv_rf['test_precision'].mean(),
        cv_rf['test_recall'].mean()
    ]
}).round(3)

print("=== Final Predictive Model Comparison ===")
print(results)


Tuning Random Forest... (this might take a minute)
Best RF Parameters found: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 300}

=== Final Predictive Model Comparison ===
  Metric (5-Fold CV Avg)  Logistic Regression (with PCA)  Tuned Random Forest (No PCA)
0                ROC-AUC                           0.789                         0.799
1               F1-Score                           0.676                         0.693
2              Precision                           0.681                         0.679
3                 Recall                           0.672                         0.709
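
One caveat on this comparison: the Random Forest hyperparameters were chosen on the same 5 folds that are then used to score it, so its numbers may be slightly optimistic relative to the untuned logistic regression. A common way to check this is nested cross-validation, where the grid search runs inside each outer fold. The sketch below is a minimal version on synthetic data from `make_classification`; the smaller grid, the fold counts, and the names `inner`, `outer`, and `nested_auc` are assumptions chosen to keep the example fast, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic 3-feature binary problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Inner folds pick hyperparameters; outer folds score the whole tuning procedure
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42),
    {"max_depth": [3, 5], "min_samples_split": [2, 10]},
    scoring="roc_auc",
    cv=inner,
)

# Each outer fold reruns the grid search on its own training portion
nested_auc = cross_val_score(search, X_demo, y_demo, cv=outer, scoring="roc_auc")
print(f"Nested CV ROC-AUC: {nested_auc.mean():.3f} +/- {nested_auc.std():.3f}")
```

If the nested estimate on the real data lands close to 0.799, the tuned model's advantage over logistic regression is likely genuine; if it drops toward 0.789, the gap is mostly a tuning artifact.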


The tuned Random Forest improves on logistic regression only slightly (ROC-AUC 0.799 vs. 0.789), so we think the model can still be improved.