# 2.4 Evaluating Tree-Based Models

## Course 3: Advanced Classification Models for Student Success

## Introduction

In this notebook, we perform a **thorough evaluation** of our three tuned tree-based models. We go beyond simple accuracy to examine ROC curves, precision-recall trade-offs, confusion matrices, calibration, and feature importance, all of which matter when deploying models in a higher education context.

### Learning Objectives

1. Generate and interpret ROC and Precision-Recall curves
2. Analyze confusion matrices for each model
3. Compare feature importances across models
4. Understand probability calibration and threshold selection
5. Make an informed recommendation for deployment

## 1. Setup

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, classification_report, brier_score_loss)

RANDOM_STATE = 42

# Load and prepare data
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

numeric_features = ['HS_GPA','HS_MATH_GPA','HS_ENGL_GPA','UNITS_ATTEMPTED_1','UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1','UNITS_COMPLETED_2','DFW_UNITS_1','DFW_UNITS_2','GPA_1','GPA_2',
    'DFW_RATE_1','DFW_RATE_2','GRADE_POINTS_1','GRADE_POINTS_2']
categorical_features = ['RACE_ETHNICITY','GENDER','FIRST_GEN_STATUS','COLLEGE']

train_enc = pd.get_dummies(train_df[numeric_features + categorical_features],
                           columns=categorical_features, drop_first=True)
test_enc = pd.get_dummies(test_df[numeric_features + categorical_features],
                          columns=categorical_features, drop_first=True)
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)
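# Impute with training-set medians only, so no test-set statistics leak into preprocessing
train_medians = train_enc.median()
train_enc = train_enc.fillna(train_medians)
test_enc = test_enc.fillna(train_medians)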

X_train, y_train = train_enc, train_df['DEPARTED']
X_test, y_test = test_enc, test_df['DEPARTED']

# Fit the three tuned models (hyperparameters are fixed here, not re-tuned)
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=8, min_samples_split=20,
        min_samples_leaf=10, class_weight='balanced', random_state=RANDOM_STATE),
    'Random Forest': RandomForestClassifier(n_estimators=200, max_depth=12,
        min_samples_leaf=5, class_weight='balanced', n_jobs=-1, random_state=RANDOM_STATE),
    'XGBoost': XGBClassifier(n_estimators=150, learning_rate=0.1, max_depth=5,
        subsample=0.8, colsample_bytree=0.8, eval_metric='logloss',
        # Weight the positive (departed) class by the negative/positive ratio
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
        random_state=RANDOM_STATE)
}

predictions, probabilities = {}, {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    probabilities[name] = model.predict_proba(X_test)[:, 1]

print("All models trained and predictions generated!")

## 2. ROC Curve Comparison

In [None]:
fig = go.Figure()
colors = ['#2ecc71', '#3498db', '#e74c3c']

for i, (name, prob) in enumerate(probabilities.items()):
    fpr, tpr, _ = roc_curve(y_test, prob)
    auc = roc_auc_score(y_test, prob)
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',
        name=f'{name} (AUC={auc:.3f})', line=dict(color=colors[i], width=2)))

fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', name='Random',
    line=dict(color='gray', dash='dash', width=1)))

fig.update_layout(title='ROC Curves: Tree-Based Models', height=500,
    xaxis_title='False Positive Rate', yaxis_title='True Positive Rate',
    yaxis=dict(scaleanchor='x', scaleratio=1))
fig.show()
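
One way to read the AUC values in the legend: AUC equals the probability that a randomly chosen departed student receives a higher score than a randomly chosen enrolled student, with ties counted as half. The sketch below computes that quantity directly in plain Python as an illustration. `pairwise_auc` is a hypothetical helper, not a scikit-learn function, and its nested loop is intended for small samples only.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data (made up): two enrolled (0) and two departed (1) students
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"Pairwise AUC: {toy_auc:.2f}")
```

Applied to `y_test` and one entry of `probabilities`, the result should agree with `roc_auc_score` up to floating-point error, which can serve as a quick sanity check on how we interpret the curve.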

## 3. Precision-Recall Curves

In [None]:
fig = go.Figure()
for i, (name, prob) in enumerate(probabilities.items()):
    prec, rec, _ = precision_recall_curve(y_test, prob)
    ap = average_precision_score(y_test, prob)
    fig.add_trace(go.Scatter(x=rec, y=prec, mode='lines',
        name=f'{name} (AP={ap:.3f})', line=dict(color=colors[i], width=2)))

prevalence = y_test.mean()
fig.add_hline(y=prevalence, line_dash='dash', line_color='gray',
              annotation_text=f'Baseline ({prevalence:.1%})')

fig.update_layout(title='Precision-Recall Curves: Tree-Based Models', height=500,
    xaxis_title='Recall', yaxis_title='Precision')
fig.show()
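
The precision-recall curve shows the trade-off but does not pick a threshold for us. A minimal sketch of one selection rule follows, assuming the institution sets a minimum recall (the share of departing students that must be flagged). The helper name `threshold_for_recall` and the 0.80 default are illustrative assumptions, not values from this course.

```python
def threshold_for_recall(y_true, scores, target_recall=0.80):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is at least target_recall; None if there are no positives."""
    pairs = list(zip(scores, y_true))
    n_pos = sum(1 for _, y in pairs if y == 1)
    if n_pos == 0:
        return None
    # Walk thresholds from strictest to most lenient
    for t in sorted(set(s for s, _ in pairs), reverse=True):
        tp = sum(1 for s, y in pairs if s >= t and y == 1)
        fp = sum(1 for s, y in pairs if s >= t and y == 0)
        recall = tp / n_pos
        if recall >= target_recall:
            return t, tp / (tp + fp), recall
    return None

# Toy data (made up) to show the call pattern
toy_y = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.05, 0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.9]
toy_choice = threshold_for_recall(toy_y, toy_scores, target_recall=0.75)
print(toy_choice)
```

In practice we would choose the threshold on a validation split rather than on `y_test`, so that the test set remains an unbiased check of the final decision rule.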

## 4. Confusion Matrices

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=list(models.keys()))

for col, (name, pred) in enumerate(predictions.items(), 1):
    cm = confusion_matrix(y_test, pred)
    labels = [['TN', 'FP'], ['FN', 'TP']]
    text = [[f'{labels[i][j]}<br>{cm[i][j]}' for j in range(2)] for i in range(2)]

    fig.add_trace(go.Heatmap(z=cm, text=text, texttemplate='%{text}',
        colorscale='Blues', showscale=False,
        x=['Predicted Enrolled', 'Predicted Departed'],
        y=['Actual Enrolled', 'Actual Departed']), row=1, col=col)

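fig.update_layout(title='Confusion Matrices', height=400)
# Plotly draws heatmap rows bottom-up; reverse the y-axes so 'Actual Enrolled' is the top row
fig.update_yaxes(autorange='reversed')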
fig.show()
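
The four cells of each matrix translate directly into the rates we care about. The sketch below derives them from raw counts. `rates_from_confusion` is a hypothetical helper, and the counts in the example are invented for illustration, not results from this notebook.

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Derive common rates from the four confusion-matrix cells."""
    def safe_div(num, den):
        # Return NaN instead of raising when a denominator is zero
        return num / den if den else float("nan")

    precision = safe_div(tp, tp + fp)   # flagged students who actually departed
    recall = safe_div(tp, tp + fn)      # departed students who were flagged
    return {
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "false_negative_rate": safe_div(fn, fn + tp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Hypothetical counts, not taken from the models above
toy_rates = rates_from_confusion(tn=700, fp=100, fn=50, tp=150)
for metric, value in toy_rates.items():
    print(f"{metric}: {value:.3f}")
```

For a binary problem, the counts for a given model can be unpacked with `tn, fp, fn, tp = confusion_matrix(y_test, predictions[name]).ravel()`. In a retention setting the false negative rate deserves particular attention, since each false negative is a departing student who received no outreach.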

## 5. Feature Importance Comparison

In [None]:
importance_df = pd.DataFrame({'Feature': X_train.columns})
for name, model in models.items():
    importance_df[name] = model.feature_importances_

# Top 10 by average importance
importance_df['Average'] = importance_df[list(models.keys())].mean(axis=1)
top_features = importance_df.nlargest(10, 'Average')

fig = go.Figure()
# Sort once so every model's bars share the same feature order
sorted_df = top_features.sort_values('Average')
for i, name in enumerate(models.keys()):
    fig.add_trace(go.Bar(y=sorted_df['Feature'], x=sorted_df[name],
        orientation='h', name=name, marker_color=colors[i]))

fig.update_layout(title='Top 10 Features by Importance', barmode='group',
    height=450, xaxis_title='Importance')
fig.show()

print("Top 5 features by average importance across models:")
for _, row in top_features.nlargest(5, 'Average').iterrows():
    print(f"  {row['Feature']}: avg importance = {row['Average']:.4f}")
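
A high average importance does not by itself show that the models agree. One simple check, sketched below, is the overlap between each pair of models' top-k feature sets. `top_k_overlap` is a hypothetical helper, and the importance values in the example are made up.

```python
def top_k_overlap(importances_a, importances_b, k=10):
    """Jaccard overlap of the top-k features in two {feature: importance} dicts."""
    top_a = set(sorted(importances_a, key=importances_a.get, reverse=True)[:k])
    top_b = set(sorted(importances_b, key=importances_b.get, reverse=True)[:k])
    union = top_a | top_b
    if not union:
        return float("nan")
    return len(top_a & top_b) / len(union)

# Made-up importances for two models over the same four features
toy_tree = {"GPA_2": 0.50, "GPA_1": 0.30, "HS_GPA": 0.15, "DFW_RATE_2": 0.05}
toy_forest = {"GPA_2": 0.40, "DFW_RATE_2": 0.30, "GPA_1": 0.20, "HS_GPA": 0.10}
toy_overlap = top_k_overlap(toy_tree, toy_forest, k=2)
print(f"Top-2 overlap: {toy_overlap:.2f}")
```

With the notebook's data, each dictionary can be built as `dict(zip(importance_df['Feature'], importance_df[name]))`. An overlap near 1 suggests the models rely on the same signals; a low overlap is a prompt to look more closely before presenting any single ranking to advisors.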

## 6. Summary and Recommendations

### Model Comparison Summary

| Criterion | Decision Tree | Random Forest | XGBoost |
|:----------|:-------------|:-------------|:--------|
| **Best for** | Explainability | Reliability | Performance |
| **Use when** | Advisors need to understand each prediction | You want a safe default | Predictive performance matters most |
| **Avoid when** | Performance is critical | Training speed matters | Simplicity is required |

### For Higher Education Deployment

1. **Advisor-facing tools**: Use Decision Tree (or extract rules from it)
2. **Batch risk scoring**: Use Random Forest or XGBoost
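3. **Research/grants**: Use XGBoost when it achieves the strongest test-set metrics (verify against the ROC and precision-recall results above)
4. **Early warning systems**: Use Random Forest (reliable with little tuning; this notebook imputes missing values before training, so apply the same imputation at scoring time)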
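
Risk scoring and early warning use cases consume the predicted probabilities directly, so it is worth checking whether a score of 0.30 really corresponds to roughly a 30% departure rate. The models above use class weighting (`class_weight='balanced'`, `scale_pos_weight`), which tends to push predicted probabilities for the positive class upward. The sketch below computes a Brier score and a binned reliability table by hand. Both helper names are hypothetical, and the data is a made-up toy sample.

```python
def brier_score(y_true, probs):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    pairs = list(zip(y_true, probs))
    return sum((p - y) ** 2 for y, p in pairs) / len(pairs)

def reliability_table(y_true, probs, n_bins=5):
    """Per-bin (mean predicted probability, observed rate, count), equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_p, observed, len(members)))
    return rows

# Toy sample (made up): well-separated scores
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.3, 0.7, 0.9]
toy_brier = brier_score(toy_y, toy_p)
toy_table = reliability_table(toy_y, toy_p, n_bins=2)
print(f"Brier score: {toy_brier:.3f}")
for mean_p, observed, count in toy_table:
    print(f"predicted={mean_p:.2f}  observed={observed:.2f}  n={count}")
```

A well-calibrated model has predicted and observed values close to each other in every bin. The setup cell already imports `brier_score_loss`, and `sklearn.calibration.calibration_curve` provides a comparable binned summary for use on `y_test` and `probabilities[name]`.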

### Next Module

We will compare these tree-based models against regularized logistic regression in a systematic model comparison framework.

**Proceed to:** `Module 3: Model Comparison and Selection`