# 3.1 Systematic Model Comparison: Logistic Regression vs. Random Forest vs. XGBoost

## Course 3: Advanced Classification Models for Student Success

## Introduction

Throughout this course, we have built two families of classification models:

1. **Module 1**: Regularized Logistic Regression (L1, L2, ElasticNet)
2. **Module 2**: Tree-Based Models (Decision Tree, Random Forest, XGBoost)

In this module, we bring the **three most practical models** together for a head-to-head comparison:

- **Regularized Logistic Regression** — the interpretable baseline
- **Random Forest** — the robust ensemble
- **XGBoost** — the performance champion

### Why These Three?

These represent the models you will use most often in practice. They cover the full spectrum of the interpretability-performance trade-off, and each excels in different scenarios. Other models (LightGBM, CatBoost, Neural Networks) are covered in the Special Topics module.

### Learning Objectives

1. Train and evaluate all three models on the same dataset with optimized hyperparameters
2. Compare performance across multiple metrics (AUC, F1, Precision, Recall)
3. Analyze the interpretability vs. performance trade-off
4. Apply model selection criteria appropriate for higher education
5. Make a justified recommendation for deployment

## 1. Setup and Data Preparation

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, brier_score_loss, log_loss, classification_report)
from sklearn.model_selection import cross_val_score, StratifiedKFold
import time

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Load data
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

numeric_features = ['HS_GPA','HS_MATH_GPA','HS_ENGL_GPA','UNITS_ATTEMPTED_1','UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1','UNITS_COMPLETED_2','DFW_UNITS_1','DFW_UNITS_2','GPA_1','GPA_2',
    'DFW_RATE_1','DFW_RATE_2','GRADE_POINTS_1','GRADE_POINTS_2']
categorical_features = ['RACE_ETHNICITY','GENDER','FIRST_GEN_STATUS','COLLEGE']

train_enc = pd.get_dummies(train_df[numeric_features + categorical_features],
                           columns=categorical_features, drop_first=True)
test_enc = pd.get_dummies(test_df[numeric_features + categorical_features],
                          columns=categorical_features, drop_first=True)
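# Give the test set exactly the training columns (dummy columns missing from test become 0)
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)

# Impute with training-set medians only, so no test-set statistics leak into preprocessing
train_medians = train_enc.median()
train_enc = train_enc.fillna(train_medians)
test_enc = test_enc.fillna(train_medians)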

X_train, y_train = train_enc, train_df['DEPARTED']
X_test, y_test = test_enc, test_df['DEPARTED']

# Scale for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training: {X_train.shape[0]:,} | Testing: {X_test.shape[0]:,} | Features: {X_train.shape[1]}")
print(f"Departure rate: {y_train.mean():.2%} (train), {y_test.mean():.2%} (test)")
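
One caveat with encoding the training and test sets separately: `drop_first=True` drops the alphabetically first level *present in each frame*, so the baseline category can differ if a level is absent from the test set. The sketch below uses small made-up frames (not the course data) to show the effect and one possible fix, which is to pin the category levels to those seen in training. The toy values and the `*_fixed` names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frames: the test set happens to contain no "ART" rows
train = pd.DataFrame({"COLLEGE": ["ART", "ENG", "SCI", "ENG"]})
test = pd.DataFrame({"COLLEGE": ["ENG", "SCI"]})

# Separate encoding: drop_first removes "ART" in train but "ENG" in test
tr = pd.get_dummies(train, columns=["COLLEGE"], drop_first=True)
te = pd.get_dummies(test, columns=["COLLEGE"], drop_first=True)
tr, te = tr.align(te, join="left", axis=1, fill_value=0)
print(te)  # the ENG row is all zeros, i.e. it looks like the "ART" baseline

# Possible fix: fix the category levels from the training data before encoding
levels = pd.CategoricalDtype(categories=sorted(train["COLLEGE"].dropna().unique()))
tr_fixed = pd.get_dummies(train.astype({"COLLEGE": levels}), columns=["COLLEGE"], drop_first=True)
te_fixed = pd.get_dummies(test.astype({"COLLEGE": levels}), columns=["COLLEGE"], drop_first=True)
print(te_fixed)  # the ENG row now has COLLEGE_ENG set
```

With pinned levels, a category that appears only in the test set becomes missing rather than silently shifting the baseline, so it is worth checking for new levels before scoring.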

## 2. Train the Three Models

In [None]:
# === MODEL 1: Regularized Logistic Regression (L2) ===
start = time.time()
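# class_weight='balanced' mirrors the class weighting used by the two tree models below,
# so threshold-based metrics (precision, recall, F1) are compared on equal footing
lr = LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', max_iter=1000,
    class_weight='balanced', random_state=RANDOM_STATE)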
lr.fit(X_train_scaled, y_train)
lr_time = time.time() - start
lr_prob = lr.predict_proba(X_test_scaled)[:, 1]
lr_pred = lr.predict(X_test_scaled)

# === MODEL 2: Random Forest ===
start = time.time()
rf = RandomForestClassifier(n_estimators=200, max_depth=12, min_samples_leaf=5,
    class_weight='balanced', n_jobs=-1, random_state=RANDOM_STATE)
rf.fit(X_train, y_train)
rf_time = time.time() - start
rf_prob = rf.predict_proba(X_test)[:, 1]
rf_pred = rf.predict(X_test)

# === MODEL 3: XGBoost ===
start = time.time()
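# scale_pos_weight = (# negatives) / (# positives), XGBoost's analogue of class_weight='balanced'.
# use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases.
xgb = XGBClassifier(n_estimators=150, learning_rate=0.1, max_depth=5, subsample=0.8,
    colsample_bytree=0.8, scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric='logloss', random_state=RANDOM_STATE)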
xgb.fit(X_train, y_train)
xgb_time = time.time() - start
xgb_prob = xgb.predict_proba(X_test)[:, 1]
xgb_pred = xgb.predict(X_test)

print("All three models trained successfully!")
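
A single train/test split gives one estimate per model. The setup cell imports `cross_val_score` and `StratifiedKFold`, and a minimal sketch of how they could be used to get a spread of AUC values is shown below. It runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained) and wraps the scaler in a pipeline so that scaling is refit inside each fold. An `XGBClassifier` could be added to the `candidates` dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for (X_train, y_train); not the course data
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     weights=[0.8, 0.2], random_state=42)

candidates = {
    # The pipeline refits StandardScaler on each training fold, avoiding leakage
    "Regularized Logistic": make_pipeline(
        StandardScaler(), LogisticRegression(C=0.1, max_iter=1000)),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_depth=12, min_samples_leaf=5,
        class_weight="balanced", random_state=42),
}

# Stratified folds keep the departure rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = {}
for name, model in candidates.items():
    cv_auc[name] = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {cv_auc[name].mean():.3f} +/- {cv_auc[name].std():.3f}")
```

If the fold-to-fold standard deviation is as large as the gap between two models, the test-set ranking alone is weak evidence for preferring one over the other.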

## 3. Performance Comparison

In [None]:
# Comprehensive metrics
results = []
for name, pred, prob, t in [('Regularized Logistic', lr_pred, lr_prob, lr_time),
                              ('Random Forest', rf_pred, rf_prob, rf_time),
                              ('XGBoost', xgb_pred, xgb_prob, xgb_time)]:
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, pred),
        'Precision': precision_score(y_test, pred),
        'Recall': recall_score(y_test, pred),
        'F1 Score': f1_score(y_test, pred),
        'ROC-AUC': roc_auc_score(y_test, prob),
        'Avg Precision': average_precision_score(y_test, prob),
        'Brier Score': brier_score_loss(y_test, prob),
        'Train Time (s)': t
    })

results_df = pd.DataFrame(results).set_index('Model')

print("=" * 90)
print("HEAD-TO-HEAD MODEL COMPARISON")
print("=" * 90)
print(results_df.round(4).to_string())
print("=" * 90)

In [None]:
# Performance bar chart
fig = go.Figure()
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']
colors = ['#9b59b6', '#3498db', '#e74c3c']

for i, model in enumerate(results_df.index):
    fig.add_trace(go.Bar(name=model, x=metrics,
        y=[results_df.loc[model, m] for m in metrics], marker_color=colors[i]))

fig.update_layout(title='Three-Way Model Comparison', barmode='group', height=450,
    yaxis_title='Score')
fig.show()
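
Precision, recall, and F1 above all come from `predict`, which applies a fixed 0.5 probability cutoff. The sketch below shows one way to pick a cutoff that maximizes F1 using `precision_recall_curve`. The labels and probabilities are simulated stand-ins (assumptions, not course results). In practice the cutoff should be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Simulated stand-ins for y_test and one model's predicted probabilities
rng = np.random.default_rng(42)
y_true = rng.binomial(1, 0.2, size=1000)
y_prob = np.clip(0.2 + 0.35 * y_true + rng.normal(0, 0.15, size=1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision/recall have one more entry than thresholds; drop the final point
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)  # guard against 0/0
f1_by_threshold = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1_by_threshold)]

f1_default = f1_score(y_true, (y_prob >= 0.5).astype(int))
f1_tuned = f1_score(y_true, (y_prob >= best_threshold).astype(int))
print(f"Best threshold: {best_threshold:.3f}")
print(f"F1 at 0.5: {f1_default:.3f} | F1 at best threshold: {f1_tuned:.3f}")
```

For advising use cases, recall on departing students may matter more than F1. The same arrays can be searched for the highest threshold that still meets a target recall.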

## 4. ROC and Precision-Recall Curves

In [None]:
fig = make_subplots(rows=1, cols=2, subplot_titles=('ROC Curves', 'Precision-Recall Curves'))

for i, (name, prob) in enumerate([('Regularized Logistic', lr_prob),
                                    ('Random Forest', rf_prob), ('XGBoost', xgb_prob)]):
    # ROC
    fpr, tpr, _ = roc_curve(y_test, prob)
    auc = roc_auc_score(y_test, prob)
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',
        name=f'{name} (AUC={auc:.3f})', line=dict(color=colors[i], width=2)), row=1, col=1)

    # PR
    prec, rec, _ = precision_recall_curve(y_test, prob)
    ap = average_precision_score(y_test, prob)
    fig.add_trace(go.Scatter(x=rec, y=prec, mode='lines',
        name=f'{name} (AP={ap:.3f})', line=dict(color=colors[i], width=2, dash='dot'),
        showlegend=False), row=1, col=2)

fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', line=dict(color='gray', dash='dash'),
    showlegend=False), row=1, col=1)

fig.update_layout(height=450, title_text='ROC and Precision-Recall Curves')
fig.update_xaxes(title_text='FPR', row=1, col=1)
fig.update_yaxes(title_text='TPR', row=1, col=1)
fig.update_xaxes(title_text='Recall', row=1, col=2)
fig.update_yaxes(title_text='Precision', row=1, col=2)
fig.show()
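
When two ROC curves nearly overlap, it helps to ask whether the AUC difference is larger than sampling noise. Below is a minimal sketch of a paired percentile bootstrap for the difference in AUC between two models scored on the same test rows. The helper name `bootstrap_auc_diff` and the simulated inputs are assumptions for illustration. With the notebook's variables, the call would take `y_test` and two of the `*_prob` arrays.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, prob_a, prob_b, n_boot=300, seed=42):
    """Percentile bootstrap for AUC(prob_a) - AUC(prob_b) on paired predictions."""
    rng = np.random.default_rng(seed)
    y_true, prob_a, prob_b = map(np.asarray, (y_true, prob_a, prob_b))
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # only one class drawn; AUC is undefined for this resample
        diffs.append(roc_auc_score(y_true[idx], prob_a[idx])
                     - roc_auc_score(y_true[idx], prob_b[idx]))
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), float(lower), float(upper)

# Simulated stand-ins for y_test and two models' predicted probabilities
rng = np.random.default_rng(0)
y_demo = rng.binomial(1, 0.25, size=800)
prob_a = np.clip(0.25 + 0.30 * y_demo + rng.normal(0, 0.2, 800), 0, 1)
prob_b = np.clip(0.25 + 0.28 * y_demo + rng.normal(0, 0.2, 800), 0, 1)

mean_diff, ci_low, ci_high = bootstrap_auc_diff(y_demo, prob_a, prob_b)
print(f"AUC difference: {mean_diff:.3f} (95% interval {ci_low:.3f} to {ci_high:.3f})")
```

An interval that includes zero suggests the two models are not clearly separable on this test set, which shifts the decision toward interpretability and deployment cost.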

## 5. Interpretability vs. Performance Trade-off

In [None]:
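# Radar chart comparing the three models
# Only the ROC-AUC axis is measured (AUC x 10). The other five axes are
# subjective 0-10 ratings and should be adjusted to your institution's context.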
categories = ['ROC-AUC', 'Interpretability', 'Training Speed',
              'Handles Non-linearity', 'Ease of Deployment', 'Stakeholder Trust']

scores = {
    'Regularized Logistic': [results_df.loc['Regularized Logistic', 'ROC-AUC']*10, 9, 10, 3, 10, 9],
    'Random Forest': [results_df.loc['Random Forest', 'ROC-AUC']*10, 5, 6, 9, 7, 6],
    'XGBoost': [results_df.loc['XGBoost', 'ROC-AUC']*10, 4, 7, 10, 6, 4]
}

fig = go.Figure()
for i, (name, vals) in enumerate(scores.items()):
    fig.add_trace(go.Scatterpolar(r=vals + [vals[0]], theta=categories + [categories[0]],
        name=name, line=dict(color=colors[i], width=2), fill='toself', opacity=0.3))

fig.update_layout(polar=dict(radialaxis=dict(visible=True, range=[0, 10])),
    title='Model Capabilities Radar Chart', height=550)
fig.show()

## 6. Recommendations for Higher Education

### Decision Framework

| Your Priority | Recommended Model | Why |
|:-------------|:-----------------|:----|
| **Explainability to advisors** | Regularized Logistic Regression | Coefficients show clear factor contributions |
| **Institutional reports & compliance** | Regularized Logistic Regression | Auditable, transparent |
| **Reliable risk scoring** | Random Forest | Robust, handles messy data well |
| **Maximum predictive accuracy** | XGBoost | Typically highest AUC and F1 |
| **Limited IT resources** | Regularized Logistic Regression | Simplest to deploy and maintain |
| **Research publications** | XGBoost | Best metrics for academic papers |
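
To illustrate what "coefficients show clear factor contributions" can look like in practice, the sketch below converts logistic regression coefficients into odds ratios. Because the inputs are standardized, each odds ratio is the multiplicative change in the odds of departure per one-standard-deviation increase in that feature. The data is simulated (an assumption), with feature names borrowed from this notebook only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Simulated data: lower GPA and higher DFW rate raise the chance of departure
rng = np.random.default_rng(42)
n = 1000
X_demo = pd.DataFrame({
    "GPA_1": rng.normal(3.0, 0.5, n),
    "DFW_RATE_1": rng.uniform(0.0, 0.5, n),
})
logit = -1.0 - 1.5 * (X_demo["GPA_1"] - 3.0) + 4.0 * X_demo["DFW_RATE_1"]
y_demo = rng.binomial(1, (1 / (1 + np.exp(-logit))).to_numpy())

# Same model family as the notebook's logistic regression (L2 penalty, C=0.1)
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(C=0.1, max_iter=1000).fit(X_scaled, y_demo)

odds = pd.DataFrame(
    {"coefficient": model.coef_[0], "odds_ratio_per_sd": np.exp(model.coef_[0])},
    index=X_demo.columns,
).sort_values("odds_ratio_per_sd", ascending=False)
print(odds.round(3))
```

An odds ratio above 1 means the feature is associated with higher odds of departure, and below 1 with lower odds. Regularization shrinks coefficients toward zero, so these values describe the fitted model rather than unbiased effect sizes.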

### Practical Recommendation

For most higher education institutions, we recommend a **two-model approach**:

1. **Regularized Logistic Regression** for stakeholder-facing outputs (reports, advisor dashboards, compliance)
2. **Random Forest or XGBoost** for backend risk scoring where performance matters most

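This pairing puts interpretability where stakeholders need it and predictive performance where risk scores are generated. Before relying on it, confirm that the two models rank students similarly, because advisors will see explanations from one model and scores from the other.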
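
A two-model setup is easier to defend when both models flag largely the same students. The sketch below measures that with a Spearman rank correlation (computed as a Pearson correlation on ranks) and the overlap between each model's top 10% highest-risk students. The helper name `ranking_agreement` and the simulated scores are assumptions. With the notebook's variables, the inputs would be `lr_prob` and either `rf_prob` or `xgb_prob`.

```python
import numpy as np
import pandas as pd

def ranking_agreement(scores_a, scores_b, top_frac=0.10):
    """Return (Spearman correlation, share of top-k students flagged by both models)."""
    a = pd.Series(np.asarray(scores_a, dtype=float))
    b = pd.Series(np.asarray(scores_b, dtype=float))
    spearman = a.rank().corr(b.rank())  # Pearson on ranks equals Spearman
    k = max(1, int(round(top_frac * len(a))))
    top_a = set(a.nlargest(k).index)
    top_b = set(b.nlargest(k).index)
    return float(spearman), len(top_a & top_b) / k

# Simulated stand-ins for two models' predicted departure probabilities
rng = np.random.default_rng(42)
lr_scores = rng.uniform(0, 1, 500)
tree_scores = np.clip(lr_scores + rng.normal(0, 0.1, 500), 0, 1)

spearman, overlap = ranking_agreement(lr_scores, tree_scores)
print(f"Spearman correlation: {spearman:.3f} | Top-10% overlap: {overlap:.0%}")
```

Low overlap in the top group is a signal to reconsider the split, since the explanation shown to an advisor may not match the reason the backend model flagged the student.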

## 7. Summary

### Key Findings

1. All three models are strong performers on student departure prediction
2. The performance gap between models is often smaller than expected, so interpretability and deployment considerations frequently matter more
3. Regularized Logistic Regression remains highly competitive while being fully transparent
4. XGBoost typically edges ahead on raw metrics but requires more infrastructure
5. Random Forest provides an excellent middle ground

### Next Module

**Proceed to:** `Module 4: Unsupervised Learning` (coming soon)