# 2.3 Tuning Tree-Based Models

## Course 3: Advanced Classification Models for Student Success

## Introduction

In notebook 2.2, we built all three tree-based models using default or reasonable hyperparameters. Now we learn to **tune** these models systematically to improve performance. The approach is again the same for all three models, because scikit-learn provides a unified tuning API and XGBoost's `XGBClassifier` is compatible with it.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Use `GridSearchCV` and `RandomizedSearchCV` with any tree-based model
2. Identify the most impactful hyperparameters for each model
3. Apply early stopping with XGBoost to prevent overfitting
4. Select optimal models using cross-validated performance

## 1. Setup

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                      StratifiedKFold, cross_val_score)
from sklearn.metrics import roc_auc_score, make_scorer
import time

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Load and prepare data (same as 2.2)
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

numeric_features = ['HS_GPA','HS_MATH_GPA','HS_ENGL_GPA','UNITS_ATTEMPTED_1','UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1','UNITS_COMPLETED_2','DFW_UNITS_1','DFW_UNITS_2','GPA_1','GPA_2',
    'DFW_RATE_1','DFW_RATE_2','GRADE_POINTS_1','GRADE_POINTS_2']
categorical_features = ['RACE_ETHNICITY','GENDER','FIRST_GEN_STATUS','COLLEGE']

train_enc = pd.get_dummies(train_df[numeric_features + categorical_features],
                           columns=categorical_features, drop_first=True)
test_enc = pd.get_dummies(test_df[numeric_features + categorical_features],
                          columns=categorical_features, drop_first=True)
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)
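# Impute both sets with the TRAINING medians so no test-set statistics leak into preprocessing
train_medians = train_enc.median()
train_enc = train_enc.fillna(train_medians)
test_enc = test_enc.fillna(train_medians)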

X_train, y_train = train_enc, train_df['DEPARTED']
X_test, y_test = test_enc, test_df['DEPARTED']

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
print(f"Data loaded: {X_train.shape[0]:,} training, {X_test.shape[0]:,} testing samples")

## 2. The Universal Tuning Pattern

Just like `fit/predict`, hyperparameter tuning follows the same pattern for all models:

```python
# 1. Define the model
model = SomeClassifier()

# 2. Define the search space
param_grid = {'param1': [val1, val2], 'param2': [val3, val4]}

# 3. Run the search
search = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
search.fit(X_train, y_train)

# 4. Get the best model
best_model = search.best_estimator_
print(search.best_params_)
```
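
As a concrete instance of the template above, here is a minimal runnable sketch. It assumes a synthetic dataset from `make_classification` as a stand-in for the student data, and the names `X_demo` and `y_demo` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the student dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=42)

# 1. Define the model
model = DecisionTreeClassifier(random_state=42)

# 2. Define the search space (2 x 2 = 4 candidate settings)
param_grid = {'max_depth': [3, 5], 'min_samples_leaf': [5, 20]}

# 3. Run the search: 4 candidates x 5 folds = 20 fits, plus one final refit
search = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

# 4. Get the best model
best_model = search.best_estimator_
print(search.best_params_)
print(f"Mean CV AUC of best setting: {search.best_score_:.4f}")
```

Swapping `DecisionTreeClassifier` for another estimator and adjusting `param_grid` is all that changes between models.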

## 3. Tuning Decision Trees

In [None]:
# Decision Tree hyperparameter search
print("Tuning Decision Tree...")
start = time.time()

dt_param_grid = {
    'max_depth': [3, 5, 8, 12],
    'min_samples_split': [10, 20, 50],
    'min_samples_leaf': [5, 10, 20],
    'class_weight': ['balanced', None]
}

dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=RANDOM_STATE),
    dt_param_grid, cv=cv, scoring='roc_auc', n_jobs=-1, refit=True
)
dt_search.fit(X_train, y_train)

print(f"Completed in {time.time()-start:.1f}s")
print(f"Best AUC: {dt_search.best_score_:.4f}")
print(f"Best params: {dt_search.best_params_}")

# Evaluate on test set
dt_best = dt_search.best_estimator_
dt_prob = dt_best.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, dt_prob):.4f}")
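
The best score alone hides how close the runner-up settings were. One way to look deeper is to rank the entries of `cv_results_`. The sketch below is self-contained and uses synthetic data; `top_configs` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def top_configs(search, k=5):
    """Return the k best-ranked parameter settings from a fitted search."""
    results = pd.DataFrame(search.cv_results_)
    cols = ['rank_test_score', 'mean_test_score', 'std_test_score', 'params']
    return results.sort_values('rank_test_score')[cols].head(k)

# Synthetic stand-in data (an assumption; not the student dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=42)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': [3, 5, 8], 'min_samples_leaf': [5, 20]},
    cv=5, scoring='roc_auc'
)
demo_search.fit(X_demo, y_demo)

top = top_configs(demo_search, k=3)
print(top.to_string(index=False))
```

In the notebook itself, the same helper could be called as `top_configs(dt_search)`. If the top few settings differ by less than their `std_test_score`, the simpler tree is usually the safer choice.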

## 4. Tuning Random Forest

In [None]:
# Random Forest hyperparameter search (use RandomizedSearchCV for efficiency)
print("Tuning Random Forest...")
start = time.time()

rf_param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [8, 12, 16, None],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [3, 5, 10],
    'max_features': ['sqrt', 'log2'],
    'class_weight': ['balanced', 'balanced_subsample']
}

rf_search = RandomizedSearchCV(
    # Parallelize across search candidates only; n_jobs=-1 at both levels can oversubscribe CPUs
    RandomForestClassifier(n_jobs=1, random_state=RANDOM_STATE),
    rf_param_dist, n_iter=30, cv=cv, scoring='roc_auc', n_jobs=-1,
    random_state=RANDOM_STATE, refit=True
)
rf_search.fit(X_train, y_train)

print(f"Completed in {time.time()-start:.1f}s")
print(f"Best AUC: {rf_search.best_score_:.4f}")
print(f"Best params: {rf_search.best_params_}")

rf_best = rf_search.best_estimator_
rf_prob = rf_best.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, rf_prob):.4f}")
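
To see why a randomized search is used here, it helps to count the combinations. The sketch below copies the two search spaces defined so far and compares the number of model fits an exhaustive grid would need with the 30 sampled settings; `grid_size` is a hypothetical helper for this illustration.

```python
from math import prod

dt_param_grid = {
    'max_depth': [3, 5, 8, 12],
    'min_samples_split': [10, 20, 50],
    'min_samples_leaf': [5, 10, 20],
    'class_weight': ['balanced', None],
}
rf_param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [8, 12, 16, None],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [3, 5, 10],
    'max_features': ['sqrt', 'log2'],
    'class_weight': ['balanced', 'balanced_subsample'],
}

def grid_size(space):
    """Number of distinct settings in a search space of value lists."""
    return prod(len(values) for values in space.values())

n_folds = 5   # matches the StratifiedKFold used in this notebook
n_iter = 30   # matches the RandomizedSearchCV call above

dt_fits = grid_size(dt_param_grid) * n_folds
rf_full_fits = grid_size(rf_param_dist) * n_folds
rf_random_fits = n_iter * n_folds

print(f"Decision Tree grid: {grid_size(dt_param_grid)} settings, {dt_fits} fits")
print(f"Random Forest full grid: {grid_size(rf_param_dist)} settings, {rf_full_fits} fits")
print(f"Random Forest randomized: {n_iter} settings, {rf_random_fits} fits")
```

The randomized search therefore tries 30 of 432 possible settings (about 7%), and each of those fits is itself a forest of 100 to 300 trees.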

## 5. Tuning XGBoost

In [None]:
# XGBoost hyperparameter search
print("Tuning XGBoost...")
start = time.time()

xgb_param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

xgb_search = RandomizedSearchCV(
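    # use_label_encoder is omitted: it is deprecated in the XGBoost versions that
    # accept early_stopping_rounds in the constructor (used later in this notebook)
    XGBClassifier(eval_metric='logloss',
                  # ratio of negatives to positives, to counter class imbalance
                  scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
                  random_state=RANDOM_STATE),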
    xgb_param_dist, n_iter=30, cv=cv, scoring='roc_auc', n_jobs=-1,
    random_state=RANDOM_STATE, refit=True
)
xgb_search.fit(X_train, y_train)

print(f"Completed in {time.time()-start:.1f}s")
print(f"Best AUC: {xgb_search.best_score_:.4f}")
print(f"Best params: {xgb_search.best_params_}")

xgb_best = xgb_search.best_estimator_
xgb_prob = xgb_best.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, xgb_prob):.4f}")
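
The search above samples from fixed lists of values. `RandomizedSearchCV` also accepts `scipy.stats` distributions, which lets it try values between the listed ones. The sketch below is one possible variation, not what this notebook runs: the ranges mirror the lists above, and the name `xgb_param_dist_continuous` is an assumption. It uses `ParameterSampler` to preview what would be drawn, without fitting any model.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical continuous version of the XGBoost search space.
# randint(low, high) excludes high; uniform(loc, scale) covers [loc, loc + scale].
xgb_param_dist_continuous = {
    'n_estimators': randint(100, 301),        # integers 100..300
    'learning_rate': loguniform(0.01, 0.2),   # log-uniform on [0.01, 0.2]
    'max_depth': randint(3, 8),               # integers 3..7
    'min_child_weight': randint(1, 6),        # integers 1..5
    'subsample': uniform(0.7, 0.2),           # uniform on [0.7, 0.9]
    'colsample_bytree': uniform(0.7, 0.2),    # uniform on [0.7, 0.9]
}

# Preview five settings the randomized search could draw
samples = list(ParameterSampler(xgb_param_dist_continuous, n_iter=5,
                                random_state=42))
for params in samples:
    print(params)
```

A log-uniform distribution suits `learning_rate` because its useful values span more than one order of magnitude. The dictionary can be passed to `RandomizedSearchCV` in place of `xgb_param_dist`.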

## 6. XGBoost Early Stopping

XGBoost supports **early stopping**: training halts once performance on a validation set has not improved for a set number of rounds. This limits overfitting, and it means you only need to set `n_estimators` to a generous upper bound rather than guess the right value.
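
The stopping rule itself is simple. The pure-Python sketch below illustrates the idea on a made-up sequence of validation losses. It is a simplified model of the logic, not XGBoost's actual implementation, and `early_stop` is a hypothetical function.

```python
def early_stop(val_losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Rounds are 0-based. Training halts at the first round where the loss
    has gone `patience` rounds without improving on the best value so far.
    If that never happens, stop_round is the final round.
    """
    best_loss = float('inf')
    best_round = -1
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_losses) - 1

# Illustrative losses only (not output from any real model)
losses = [0.60, 0.52, 0.47, 0.45, 0.46, 0.455, 0.47, 0.48]
best_round, stop_round = early_stop(losses, patience=3)
print(f"Best round: {best_round}, training stopped at round: {stop_round}")
```

The best round and the round where training stops are different: with a patience of 3, training runs 3 rounds past the best one before giving up.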

In [None]:
# XGBoost with early stopping
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                              stratify=y_train, random_state=RANDOM_STATE)

xgb_early = XGBClassifier(
    n_estimators=1000,  # Upper bound; early stopping picks the actual number of rounds
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',     # Metric monitored on the validation set
    early_stopping_rounds=20,  # Stop if no improvement for 20 rounds
    random_state=RANDOM_STATE
)

xgb_early.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

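# best_iteration is 0-based, so the best model uses best_iteration + 1 trees.
# Training itself ran past this point until the patience of 20 rounds was used up.
print(f"Best iteration: {xgb_early.best_iteration}")
print(f"Trees used for prediction: {xgb_early.best_iteration + 1} of 1000 max rounds")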
print(f"Test AUC: {roc_auc_score(y_test, xgb_early.predict_proba(X_test)[:, 1]):.4f}")

## 7. Final Comparison: Tuned Models

In [None]:
# Compare all tuned models
from sklearn.metrics import f1_score, precision_score, recall_score

tuned_results = []
for name, model, prob in [('Decision Tree', dt_best, dt_prob),
                           ('Random Forest', rf_best, rf_prob),
                           ('XGBoost', xgb_best, xgb_prob)]:
    pred = (prob >= 0.5).astype(int)
    tuned_results.append({
        'Model': name,
        'ROC-AUC': roc_auc_score(y_test, prob),
        'F1': f1_score(y_test, pred),
        'Precision': precision_score(y_test, pred),
        'Recall': recall_score(y_test, pred)
    })

tuned_df = pd.DataFrame(tuned_results)
print("=" * 70)
print("TUNED MODELS: FINAL COMPARISON")
print("=" * 70)
print(tuned_df.to_string(index=False))
print("=" * 70)

## 8. Summary

### Tuning Cheat Sheet

| Model | Most Impactful Parameters | Tuning Strategy |
|:------|:------------------------|:----------------|
| **Decision Tree** | `max_depth`, `min_samples_leaf` | GridSearchCV (small space) |
| **Random Forest** | `n_estimators`, `max_depth`, `max_features` | RandomizedSearchCV |
| **XGBoost** | `learning_rate`, `max_depth`, `n_estimators` | RandomizedSearchCV + early stopping |

### Key Takeaways

1. **Same tuning API** for all three: `GridSearchCV` or `RandomizedSearchCV`
2. **Decision Trees** are fast to tune but have a performance ceiling
3. **Random Forests** are robust: defaults are often near-optimal, so tuning gains tend to be small
4. **XGBoost early stopping** is an efficient way to choose the number of boosting rounds without searching over `n_estimators`
5. Always evaluate on a **held-out test set** after tuning

### Next Steps

In the next notebook, we'll build a comprehensive evaluation comparing these tuned models with deeper analysis.

**Proceed to:** `2.4 Evaluating Tree-Based Models`