# 2.2 Building Tree-Based Models in Practice

## Course 3: Advanced Classification Models for Student Success

## Introduction

In this notebook, we apply the **instantiate → fit → predict** pattern to build all three tree-based models on our student departure dataset. The emphasis is on the **practical workflow**—you'll see how little changes between models.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Load and prepare data for tree-based models
2. Build a Decision Tree, Random Forest, and XGBoost model using the same workflow
3. Generate predictions and probability estimates
4. Extract feature importances from each model
5. Visualize a decision tree for stakeholder communication

## 1. Setup and Data Preparation

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Models: all three expose the scikit-learn estimator API (fit / predict / predict_proba)
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Evaluation
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                              f1_score, roc_auc_score, classification_report)
from sklearn.model_selection import cross_val_score, StratifiedKFold

import time

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries imported successfully!")

In [None]:
# Load the training and testing datasets
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')

# Create binary target
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

print(f"Training set: {train_df.shape[0]:,} students")
print(f"Testing set: {test_df.shape[0]:,} students")
print(f"Departure rate (Train): {train_df['DEPARTED'].mean():.2%}")
print(f"Departure rate (Test): {test_df['DEPARTED'].mean():.2%}")

In [None]:
# Define features — NO SCALING NEEDED for tree-based models!
numeric_features = [
    'HS_GPA', 'HS_MATH_GPA', 'HS_ENGL_GPA',
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1', 'UNITS_COMPLETED_2',
    'DFW_UNITS_1', 'DFW_UNITS_2',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2',
    'GRADE_POINTS_1', 'GRADE_POINTS_2'
]

categorical_features = ['RACE_ETHNICITY', 'GENDER', 'FIRST_GEN_STATUS', 'COLLEGE']
target = 'DEPARTED'

# One-hot encode categorical variables
train_encoded = pd.get_dummies(train_df[numeric_features + categorical_features],
                               columns=categorical_features, drop_first=True)
test_encoded = pd.get_dummies(test_df[numeric_features + categorical_features],
                              columns=categorical_features, drop_first=True)

# Align columns
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

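# Handle missing values: impute with medians learned from the TRAINING set only,
# so no information from the test set leaks into preprocessing
train_medians = train_encoded[numeric_features].median()
train_encoded = train_encoded.fillna(train_medians)
test_encoded = test_encoded.fillna(train_medians)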

X_train = train_encoded
y_train = train_df[target]
X_test = test_encoded
y_test = test_df[target]

print(f"Features: {X_train.shape[1]}")
print(f"\nKey advantage: No feature scaling needed for tree-based models!")
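The alignment and imputation steps above are easy to check on a toy example. The sketch below uses made-up data (the frames `toy_train` and `toy_test` and their values are hypothetical, not drawn from the student dataset). It shows that `align(join='left')` gives the test set exactly the training columns, and that a missing test value is filled with the training median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the test set lacks the 'SCI' category and has a missing GPA
toy_train = pd.DataFrame({
    'GPA_1': [2.0, 3.0, 4.0, np.nan],
    'COLLEGE': ['ARTS', 'ENGR', 'SCI', 'ENGR'],
})
toy_test = pd.DataFrame({
    'GPA_1': [np.nan, 3.5],
    'COLLEGE': ['ARTS', 'ENGR'],
})

toy_train_enc = pd.get_dummies(toy_train, columns=['COLLEGE'], drop_first=True)
toy_test_enc = pd.get_dummies(toy_test, columns=['COLLEGE'], drop_first=True)

# join='left' keeps the training columns; columns missing from test are created with 0
toy_train_enc, toy_test_enc = toy_train_enc.align(
    toy_test_enc, join='left', axis=1, fill_value=0
)

# Impute both sets with the median computed on the training data
gpa_median = toy_train_enc['GPA_1'].median()
toy_train_enc['GPA_1'] = toy_train_enc['GPA_1'].fillna(gpa_median)
toy_test_enc['GPA_1'] = toy_test_enc['GPA_1'].fillna(gpa_median)

print(list(toy_test_enc.columns))
print(toy_test_enc)
```

The test frame gains a `COLLEGE_SCI` column of zeros even though no test row belongs to that category, which is what lets a model fitted on the training columns score the test set.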

## 2. The Three Models: Same Pattern, Different Strengths

Watch how the code structure is nearly identical for all three models. The only differences are the class name and the hyperparameters.

### 2.1 Model 1: Decision Tree

In [None]:
# === DECISION TREE ===
# Step 1: Instantiate
dt = DecisionTreeClassifier(
    max_depth=8,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',
    class_weight='balanced',
    random_state=RANDOM_STATE
)

# Step 2: Fit
start = time.time()
dt.fit(X_train, y_train)
dt_time = time.time() - start

# Step 3: Predict
dt_pred = dt.predict(X_test)
dt_prob = dt.predict_proba(X_test)[:, 1]

# Step 4: Evaluate
print("=== Decision Tree Results ===")
print(f"Training time: {dt_time:.3f}s")
print(f"Accuracy:  {accuracy_score(y_test, dt_pred):.4f}")
print(f"Precision: {precision_score(y_test, dt_pred):.4f}")
print(f"Recall:    {recall_score(y_test, dt_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, dt_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, dt_prob):.4f}")
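`predict` returns hard labels, while `predict_proba` returns scores that we can cut at any threshold we like. The sketch below is a minimal illustration on synthetic data from `make_classification` (not the student dataset). The 0.3 cutoff and all `*_demo` names are assumptions chosen for the example. It shows that lowering the threshold flags more students and can only keep or raise recall.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 20% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
prob_demo = tree_demo.predict_proba(X_te)[:, 1]  # probability of the positive class

# Two decision rules built from the same probabilities
pred_default = (prob_demo >= 0.5).astype(int)
pred_lenient = (prob_demo >= 0.3).astype(int)  # hypothetical, more cautious cutoff

recall_default = recall_score(y_te, pred_default)
recall_lenient = recall_score(y_te, pred_lenient)
print(f"Flagged at 0.5: {pred_default.sum()}, recall: {recall_default:.3f}")
print(f"Flagged at 0.3: {pred_lenient.sum()}, recall: {recall_lenient:.3f}")
```

Which threshold is appropriate depends on the cost of missing a departing student versus the cost of an unnecessary intervention, so the cutoff is a policy choice rather than a model property.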

### 2.2 Model 2: Random Forest

In [None]:
# === RANDOM FOREST ===
# Step 1: Instantiate
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    class_weight='balanced',
    n_jobs=-1,
    random_state=RANDOM_STATE
)

# Step 2: Fit
start = time.time()
rf.fit(X_train, y_train)
rf_time = time.time() - start

# Step 3: Predict
rf_pred = rf.predict(X_test)
rf_prob = rf.predict_proba(X_test)[:, 1]

# Step 4: Evaluate
print("=== Random Forest Results ===")
print(f"Training time: {rf_time:.3f}s")
print(f"Accuracy:  {accuracy_score(y_test, rf_pred):.4f}")
print(f"Precision: {precision_score(y_test, rf_pred):.4f}")
print(f"Recall:    {recall_score(y_test, rf_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, rf_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, rf_prob):.4f}")

### 2.3 Model 3: XGBoost

In [None]:
# === XGBOOST ===
# Step 1: Instantiate
xgb = XGBClassifier(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=5,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    # Weight the positive (departed) class by the negative-to-positive ratio
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric='logloss',
    random_state=RANDOM_STATE
)

# Step 2: Fit
start = time.time()
xgb.fit(X_train, y_train)
xgb_time = time.time() - start

# Step 3: Predict
xgb_pred = xgb.predict(X_test)
xgb_prob = xgb.predict_proba(X_test)[:, 1]

# Step 4: Evaluate
print("=== XGBoost Results ===")
print(f"Training time: {xgb_time:.3f}s")
print(f"Accuracy:  {accuracy_score(y_test, xgb_pred):.4f}")
print(f"Precision: {precision_score(y_test, xgb_pred):.4f}")
print(f"Recall:    {recall_score(y_test, xgb_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, xgb_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, xgb_prob):.4f}")
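The `scale_pos_weight` argument plays the role that `class_weight='balanced'` plays in the scikit-learn models: it up-weights the minority (departed) class. The XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value. A minimal sketch with hypothetical labels (`y_toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical labels: 8 retained students (0) and 2 departed students (1)
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

n_negative = (y_toy == 0).sum()
n_positive = (y_toy == 1).sum()

# Each positive example counts as much as n_negative / n_positive negatives
toy_scale_pos_weight = n_negative / n_positive
print(toy_scale_pos_weight)  # 4.0
```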

## 3. Side-by-Side Comparison

Now let's see how the three models compare directly.

In [None]:
# Compile results
results = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest', 'XGBoost'],
    'Accuracy': [accuracy_score(y_test, p) for p in [dt_pred, rf_pred, xgb_pred]],
    'Precision': [precision_score(y_test, p) for p in [dt_pred, rf_pred, xgb_pred]],
    'Recall': [recall_score(y_test, p) for p in [dt_pred, rf_pred, xgb_pred]],
    'F1 Score': [f1_score(y_test, p) for p in [dt_pred, rf_pred, xgb_pred]],
    'ROC-AUC': [roc_auc_score(y_test, p) for p in [dt_prob, rf_prob, xgb_prob]],
    'Train Time (s)': [dt_time, rf_time, xgb_time]
})

print("=" * 80)
print("TREE-BASED MODELS: HEAD-TO-HEAD COMPARISON")
print("=" * 80)
print(results.to_string(index=False))
print("=" * 80)
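Because every model follows the same pattern, the comparison table can also be built in a single loop instead of repeating each metric call. The sketch below is one possible way to do it, on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it runs without the `xgboost` package, and the names `demo_models` and `demo_results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the student dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

demo_models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    # Stand-in for XGBClassifier: another boosted-tree model with the same API
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)                     # fit
    pred = model.predict(X_te)                # hard labels
    prob = model.predict_proba(X_te)[:, 1]    # positive-class probabilities
    rows.append({
        'Model': name,
        'F1 Score': f1_score(y_te, pred),
        'ROC-AUC': roc_auc_score(y_te, prob),
    })

demo_results = pd.DataFrame(rows)
print(demo_results.to_string(index=False))
```

Adding a fourth model then means adding one dictionary entry, with no change to the evaluation code.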

In [None]:
# Visual comparison
fig = go.Figure()
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']
colors = ['#2ecc71', '#3498db', '#e74c3c']

for i, model in enumerate(results['Model']):
    fig.add_trace(go.Bar(
        name=model,
        x=metrics,
        y=[results.loc[i, m] for m in metrics],
        marker_color=colors[i]
    ))

fig.update_layout(
    title='Tree-Based Models: Performance Comparison',
    yaxis_title='Score', barmode='group', height=450,
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
)
fig.show()

## 4. Feature Importance

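All three tree-based models expose **feature importance scores** through the same `feature_importances_` attribute. The scores show which student characteristics each model relied on most when predicting departure. Because scikit-learn and XGBoost compute these scores differently, compare the rankings across models rather than the raw values.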

In [None]:
# Extract feature importances from all three models
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Decision Tree': dt.feature_importances_,
    'Random Forest': rf.feature_importances_,
    'XGBoost': xgb.feature_importances_
})

# Sort by Random Forest importance (usually most stable)
importance_df = importance_df.sort_values('Random Forest', ascending=False).head(15)

# Plot
fig = make_subplots(rows=1, cols=3, subplot_titles=('Decision Tree', 'Random Forest', 'XGBoost'))

for col, model in enumerate(['Decision Tree', 'Random Forest', 'XGBoost'], 1):
    sorted_df = importance_df.sort_values(model, ascending=True)
    fig.add_trace(go.Bar(
        y=sorted_df['Feature'], x=sorted_df[model],
        orientation='h', marker_color=colors[col-1], showlegend=False
    ), row=1, col=col)

fig.update_layout(height=500, title_text='Top 15 Features by Importance (All Three Models)')
fig.show()
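When reading the plots, it helps to remember that scikit-learn's impurity-based importances are relative shares: they are non-negative and sum to 1 within a model. The sketch below checks this on synthetic data. The feature names and the `forest_demo` model are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with generic feature names
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_demo.shape[1])]

forest_demo = RandomForestClassifier(n_estimators=50, random_state=42)
forest_demo.fit(X_demo, y_demo)

# Label the importances and rank them from most to least important
importances = pd.Series(
    forest_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
print(f"Sum of importances: {importances.sum():.4f}")
```

A score of 0.30 therefore means "30% of this model's total impurity reduction", not a 30% change in departure risk.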

## 5. Visualizing the Decision Tree

One of the biggest advantages of Decision Trees is that we can **visualize the entire model**. This is invaluable for communicating with non-technical stakeholders.

In [None]:
import matplotlib.pyplot as plt

# Train a shallow tree for visualization
dt_visual = DecisionTreeClassifier(max_depth=3, class_weight='balanced',
                                   random_state=RANDOM_STATE)
dt_visual.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(20, 10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(dt_visual, feature_names=list(X_train.columns), class_names=['Enrolled', 'Departed'],
          filled=True, rounded=True, fontsize=10, ax=ax)
plt.title('Decision Tree for Student Departure Prediction (max_depth=3)', fontsize=16)
plt.tight_layout()
plt.show()

print("\nThis tree can be directly shared with academic advisors!")
print("Each path from root to leaf is an interpretable rule.")
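When a picture is not practical (for example, in an email or a report), the same root-to-leaf rules can be printed as text with scikit-learn's `export_text`. A minimal sketch on synthetic data follows. The feature names and the `small_tree` model are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with generic feature names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
demo_names = ['feature_a', 'feature_b', 'feature_c', 'feature_d']

small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(X_demo, y_demo)

# Each indented branch is one condition; each "class:" line is a leaf
rules = export_text(small_tree, feature_names=demo_names)
print(rules)
```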

## 6. Cross-Validation: More Robust Comparison

In [None]:
# 5-fold cross-validation for all three models, using the same
# hyperparameters as the models fitted above
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

models_cv = {
    'Decision Tree': DecisionTreeClassifier(max_depth=8, min_samples_split=20,
                                             min_samples_leaf=10, max_features='sqrt',
                                             class_weight='balanced',
                                             random_state=RANDOM_STATE),
    'Random Forest': RandomForestClassifier(n_estimators=200, max_depth=12,
                                             min_samples_split=10, min_samples_leaf=5,
                                             max_features='sqrt', class_weight='balanced',
                                             n_jobs=-1, random_state=RANDOM_STATE),
    'XGBoost': XGBClassifier(n_estimators=150, learning_rate=0.1, max_depth=5,
                              min_child_weight=3, subsample=0.8, colsample_bytree=0.8,
                              scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
                              eval_metric='logloss',
                              random_state=RANDOM_STATE)
}

print("5-Fold Cross-Validation Results (ROC-AUC):")
print("-" * 50)
for name, model in models_cv.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
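ROC-AUC is only one view of performance. If we also care about recall and precision for the departed class, scikit-learn's `cross_validate` can score several metrics in one pass. The sketch below uses synthetic data and a single decision tree to keep it small. The `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the student dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics_demo = ['roc_auc', 'recall', 'precision']

cv_results = cross_validate(
    DecisionTreeClassifier(max_depth=4, class_weight='balanced', random_state=42),
    X_demo, y_demo, cv=cv_demo, scoring=metrics_demo,
)

# cross_validate returns one array of fold scores per metric, keyed as 'test_<metric>'
for metric in metrics_demo:
    scores = cv_results[f'test_{metric}']
    print(f"{metric:10s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```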

## 7. Summary

### What We Learned

1. **Same workflow, three models**: `instantiate → fit → predict` works identically for all three
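2. **No feature scaling needed**: Tree-based models split on thresholds, so numeric features can be used unscaled. In this notebook we still one-hot encoded the categorical features and imputed missing values.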
3. **Feature importances**: All three provide built-in feature importance scores
4. **Decision Trees are visualizable**: Perfect for stakeholder communication
5. **Random Forests are robust**: Good default choice with minimal tuning
6. **XGBoost often wins on performance**: A strong choice when predictive performance matters most, though it usually needs more tuning

### The Practical Takeaway

```python
# Switching between models is this easy:
model = DecisionTreeClassifier(max_depth=8)    # Option A
model = RandomForestClassifier(n_estimators=200) # Option B
model = XGBClassifier(n_estimators=150)          # Option C

# Everything else stays the same!
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]
```

### Next Steps

In the next notebook, we'll dive into **hyperparameter tuning** for each model and learn strategies to get the best performance.

**Proceed to:** `2.3 Tuning Tree-Based Models`