# 2.1 Introduction to Tree-Based Models

## Course 3: Advanced Classification Models for Student Success

## Introduction

In Module 1, we extended logistic regression with regularization. Now we turn to **tree-based models**—a family of algorithms that are among the most widely used and successful in applied machine learning. This module covers three core methods that share a common foundation:

1. **Decision Trees** — the building block
2. **Random Forests** — combining many trees via bagging
3. **XGBoost** — combining trees via gradient boosting

These three models represent the practical toolkit you will use most often. The two scikit-learn estimators and XGBoost's scikit-learn-compatible `XGBClassifier` wrapper share the same workflow (`instantiate → fit → predict`), and all three are go-to models for tabular data like student records.

### Why These Three?

| Model | Strategy | Key Strength |
|:------|:---------|:-------------|
| **Decision Tree** | Single tree with learned rules | Highly interpretable, great for stakeholder communication |
| **Random Forest** | Many trees trained in parallel (bagging) | Robust, reduces overfitting, good default choice |
| **XGBoost** | Trees trained sequentially (boosting) | Often the strongest performer on tabular data; common in winning competition entries |

Together, these three models cover the spectrum from maximum interpretability (Decision Tree) to maximum predictive power (XGBoost), with Random Forest as a reliable middle ground.

### Learning Objectives

By the end of this module, you will be able to:

1. Explain how decision trees partition feature space using impurity measures
2. Explain how Random Forests reduce variance through bagging
3. Explain how XGBoost reduces bias through sequential boosting
4. **Build all three models using the same scikit-learn pattern**: `instantiate → fit → predict`
5. Tune hyperparameters for each model family
6. Compare models on the same student departure dataset

## 1. The Common Pattern: Instantiate, Fit, Predict

Before diving into theory, let's establish the most important practical insight of this module: **all three models follow the same `instantiate → fit → predict` workflow**. This is by design: scikit-learn defines a consistent estimator API, and XGBoost provides a compatible `XGBClassifier` wrapper, so once you learn the pattern, switching between models is trivial.

```python
# The universal scikit-learn pattern:

# Step 1: Instantiate the model with hyperparameters
model = ModelClass(param1=value1, param2=value2)

# Step 2: Fit the model to training data
model.fit(X_train, y_train)

# Step 3: Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Step 4: Evaluate
from sklearn.metrics import accuracy_score, roc_auc_score
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
```

The **only thing that changes** between models is Step 1—the class name and its hyperparameters.

```python
# Demonstration: The same pattern, three different models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# All three expose the same fit / predict / predict_proba interface.
# use_label_encoder is omitted: it is deprecated in recent XGBoost releases.
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1,
                             eval_metric='logloss', random_state=42)
}

# Same workflow for each (the calls are only printed here; nothing is fitted yet):
for name, model in models.items():
    print(f"{name}:")
    print("  model.fit(X_train, y_train)")
    print("  model.predict(X_test)")
    print("  model.predict_proba(X_test)")
    print()

print("That's it. Same three lines. Different model, same interface.")
```
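
The cell above only prints the method names. As a minimal runnable sketch of the same loop, the example below fits and evaluates two of the models on synthetic data. The `make_classification` data, the `demo_models` dictionary, and the `results` dictionary are illustrative assumptions rather than part of the course dataset, and only the scikit-learn models are included so the sketch runs without `xgboost` installed. An `XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for student records (assumption, not the course data)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Step 1: instantiate (the only step that differs between models)
demo_models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10,
                                            random_state=42),
}

results = {}
for name, model in demo_models.items():
    model.fit(X_train, y_train)                 # Step 2: fit
    y_pred = model.predict(X_test)              # Step 3: predict labels
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {                           # Step 4: evaluate
        'accuracy': accuracy_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_prob),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"auc={results[name]['auc']:.3f}")
```

The loop body is identical for every model, which is the point of the shared interface.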

## 2. What is a Decision Tree?

A **decision tree** is a supervised learning algorithm that makes predictions by learning a series of simple decision rules from the data. Think of it as a flowchart:

- Each **internal node** asks a yes/no question about a feature (e.g., "Is GPA ≤ 2.5?")
- Each **branch** represents the answer
- Each **leaf node** provides a prediction

**Higher Education Example**: An advisor might assess student risk by asking:
1. "Is their GPA below 2.0?" → High risk
2. "Did they fail courses in semester 1?" → Concerning
3. "Are they taking fewer than 12 units?" → May need support

A decision tree automates exactly this kind of reasoning.

```python
import plotly.graph_objects as go

# Visualize a simple decision tree structure
fig = go.Figure()

# Node positions
nodes = {
    'root': (0.5, 1.0), 'left1': (0.25, 0.7), 'right1': (0.75, 0.7),
    'left2': (0.125, 0.4), 'right2': (0.375, 0.4),
    'left3': (0.625, 0.4), 'right3': (0.875, 0.4)
}

# Draw edges
edges = [('root', 'left1'), ('root', 'right1'), ('left1', 'left2'), ('left1', 'right2'),
         ('right1', 'left3'), ('right1', 'right3')]
for s, e in edges:
    fig.add_trace(go.Scatter(x=[nodes[s][0], nodes[e][0]], y=[nodes[s][1], nodes[e][1]],
                             mode='lines', line=dict(color='gray', width=2), showlegend=False))

# Decision nodes
fig.add_trace(go.Scatter(
    x=[nodes[n][0] for n in ['root', 'left1', 'right1']],
    y=[nodes[n][1] for n in ['root', 'left1', 'right1']],
    mode='markers+text', marker=dict(size=50, color='lightblue', line=dict(color='blue', width=2)),
    text=['GPA_1 <= 2.5?', 'DFW > 0.3?', 'UNITS < 12?'],
    textposition='middle center', textfont=dict(size=10), name='Decision Nodes'))

# Leaf nodes
fig.add_trace(go.Scatter(
    x=[nodes[n][0] for n in ['left2', 'right2', 'left3', 'right3']],
    y=[nodes[n][1] for n in ['left2', 'right2', 'left3', 'right3']],
    mode='markers+text', marker=dict(size=50, color='lightgreen', symbol='square',
                                     line=dict(color='green', width=2)),
    text=['Not Enrolled', 'Enrolled', 'Not Enrolled', 'Enrolled'],
    textposition='middle center', textfont=dict(size=9), name='Leaf Nodes'))

# Label the two branches leaving the root
fig.add_annotation(x=0.35, y=0.88, text='Yes', showarrow=False, font=dict(color='green'))
fig.add_annotation(x=0.65, y=0.88, text='No', showarrow=False, font=dict(color='red'))

fig.update_layout(title='Anatomy of a Decision Tree', height=450,
                  xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                  yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
fig.show()
```
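
The illustrative tree in the figure can also be read as nested if/else rules. The sketch below is one such reading. It assumes that "Yes" goes left at every node (the figure only labels the root branches), and the function name `predict_enrollment` and its argument names are hypothetical. The thresholds come from the figure and are not learned from data.

```python
def predict_enrollment(gpa_1, dfw_rate, units):
    """Follow the illustrative tree from root to leaf and return the leaf label."""
    if gpa_1 <= 2.5:                 # root node: "GPA_1 <= 2.5?" (Yes branch)
        if dfw_rate > 0.3:           # left child: "DFW > 0.3?"
            return 'Not Enrolled'
        return 'Enrolled'
    if units < 12:                   # right child: "UNITS < 12?" (No branch of root)
        return 'Not Enrolled'
    return 'Enrolled'


# Each call visits at most two questions before reaching a leaf
print(predict_enrollment(gpa_1=2.0, dfw_rate=0.5, units=15))
print(predict_enrollment(gpa_1=3.5, dfw_rate=0.0, units=15))
```

A fitted `DecisionTreeClassifier` encodes the same kind of rule structure, except that the features and thresholds are chosen by the algorithm.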

### 2.1 How Trees Make Splits: Gini Impurity

Decision trees choose splits by finding the feature and threshold that best separates the classes. The default measure in scikit-learn is **Gini Impurity**:

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

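Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes.

- A **pure node** (all one class) has Gini = 0
- **Maximum impurity** for two classes (a 50-50 split) has Gini = 0.5; with $C$ classes the maximum is $1 - 1/C$

The tree picks the split that most reduces impurity, measured as the sample-weighted average Gini of the two child nodes compared with the parent. This is a **greedy** algorithm: it makes the locally optimal choice at each step without looking ahead.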
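
To make the split search concrete, here is a small pure-Python sketch of Gini impurity and a greedy threshold search over a single feature. The helper names (`gini`, `gini_after_split`, `best_threshold`) and the toy GPA data are assumptions for illustration. This is not scikit-learn's implementation, which is optimized and searches over all features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_after_split(left, right):
    """Sample-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_threshold(values, labels):
    """Try each midpoint between sorted unique values; keep the lowest impurity."""
    unique = sorted(set(values))
    best = (None, gini(labels))  # (threshold, impurity); start from the parent node
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = gini_after_split(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data (hypothetical): first-term GPA and whether the student departed (1) or not (0)
gpa = [1.8, 2.1, 2.4, 3.0, 3.3, 3.7]
departed = [1, 1, 1, 0, 0, 0]

print(gini(departed))                 # parent impurity for a 50-50 node: 0.5
print(best_threshold(gpa, departed))  # threshold between 2.4 and 3.0, impurity 0.0
```

In this toy case a single threshold separates the classes perfectly. On real data the best split usually leaves some impurity, and the search repeats inside each child node.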

### 2.2 Controlling Overfitting

Without constraints, a tree will keep splitting until every leaf is pure—memorizing the training data. Key hyperparameters to prevent this:

| Parameter | What It Does | Typical Range |
|:----------|:-------------|:-------------|
| `max_depth` | Maximum tree depth | 3–15 |
| `min_samples_split` | Min samples to split a node | 5–50 |
| `min_samples_leaf` | Min samples in a leaf | 3–20 |
| `max_features` | Features considered per split | `'sqrt'`, `'log2'` |
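
The effect of these constraints can be seen by comparing an unconstrained tree with a constrained one. The sketch below uses synthetic data with some label noise (`flip_y`) as a stand-in for real records. The specific hyperparameter values are illustrative picks from the ranges in the table, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption): 10% of labels are flipped at random
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# No limits: the tree keeps splitting until every leaf is pure
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limits on depth and node sizes stop the tree from memorizing noise
constrained = DecisionTreeClassifier(max_depth=4, min_samples_split=20,
                                     min_samples_leaf=10,
                                     random_state=0).fit(X_train, y_train)

for name, tree in [('unconstrained', unconstrained), ('constrained', constrained)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_train, y_train):.3f}, "
          f"test acc={tree.score(X_test, y_test):.3f}")
```

The unconstrained tree reaches perfect training accuracy by construction. The gap between its training and test accuracy is the symptom of overfitting to look for.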

## 3. From One Tree to Many: Random Forests

A single decision tree is interpretable but **unstable**—small changes in data can produce very different trees. **Random Forests** solve this by training many trees and averaging their predictions.

### The Bagging Strategy

1. Create `n_estimators` bootstrap samples (random samples with replacement)
2. Train one tree on each sample **independently**
3. Each tree also considers only a random subset of features at each split
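4. Combine predictions by majority vote (classification) or averaging (regression). Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities, which acts as a soft vote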

```
Training Data
     |
     +---> Bootstrap Sample 1 ---> Tree 1 ---+
     +---> Bootstrap Sample 2 ---> Tree 2 ---+--> Majority Vote = Final Prediction
     +---> Bootstrap Sample 3 ---> Tree 3 ---+
     ...
     +---> Bootstrap Sample N ---> Tree N ---+
```

**Why it works**: Each tree overfits in a different way. Averaging many diverse, overfitting trees produces a model that generalizes well. This is the variance-reduction power of ensembles.
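
The four bagging steps can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally. The variable names (`trees`, `votes`, `majority`) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption, stands in for the student dataset)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=1)
rng = np.random.default_rng(1)
n_trees = 25  # odd, so a binary majority vote cannot tie

trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample (draw row indices with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: fit one tree independently, with a random feature subset per split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: hard majority vote across trees
votes = np.array([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print(f"Training accuracy of the voted ensemble: {(majority == y).mean():.3f}")
```

Because each tree sees a different bootstrap sample and different feature subsets, their individual errors differ, and the vote smooths them out.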

### Key Hyperparameters

| Parameter | What It Does | Typical Range |
|:----------|:-------------|:-------------|
| `n_estimators` | Number of trees | 100–500 |
| `max_depth` | Max depth per tree | 8–20, or None |
| `max_features` | Features per split | `'sqrt'` (default) |
| `min_samples_leaf` | Min samples in leaf | 1–10 |
| `class_weight` | Handle class imbalance | `'balanced'` |

## 4. Sequential Improvement: XGBoost

While Random Forests train trees **in parallel** and average them, **XGBoost** (eXtreme Gradient Boosting) trains trees **sequentially**, where each new tree corrects the errors of the previous ones.

### The Boosting Strategy

1. Start with a simple prediction (e.g., the overall departure rate)
2. Calculate the errors (residuals; more generally, the negative gradient of the loss with respect to the current predictions)
3. Train a small tree to predict those errors
4. Update predictions by adding the new tree's output (scaled by a learning rate)
5. Repeat

```
Data --> Tree 1 --> Errors --> Tree 2 --> Errors --> Tree 3 --> ...
           |                     |                     |
        Predicts              Predicts              Predicts
        the data          Tree 1's errors      remaining errors
```

**Key equation**: $F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$

Here $F_m$ is the ensemble's prediction after $m$ trees, $\eta$ is the learning rate (typically 0.01–0.3), and $h_m$ is the new tree.
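
The update rule can be traced in a few lines of plain gradient boosting for a binary target. This sketch assumes log loss, works in log-odds space, and fits small regression trees to the residuals $y - p$. XGBoost itself additionally uses second-order gradient information and regularization, so treat this as an illustration of the idea rather than of XGBoost's exact algorithm. The data is synthetic and the helper names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=2)
eta = 0.1       # learning rate
n_rounds = 50   # number of boosting rounds

# Step 1: start from the overall positive rate, expressed in log-odds
base_rate = y.mean()
F = np.full(len(y), np.log(base_rate / (1 - base_rate)))
losses = [log_loss(y, sigmoid(F))]

for m in range(n_rounds):
    residuals = y - sigmoid(F)                        # Step 2: current errors
    h_m = DecisionTreeRegressor(max_depth=2, random_state=m)
    h_m.fit(X, residuals)                             # Step 3: small tree on the errors
    F = F + eta * h_m.predict(X)                      # Step 4: F_m = F_{m-1} + eta * h_m
    losses.append(log_loss(y, sigmoid(F)))

print(f"log loss: start={losses[0]:.3f}, after {n_rounds} rounds={losses[-1]:.3f}")
```

A smaller `eta` makes each step more cautious, which is why a low learning rate is usually paired with more boosting rounds.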

### Why XGBoost Dominates Tabular Data

XGBoost adds several innovations beyond basic gradient boosting:

- **Built-in regularization** (L1 + L2 on leaf weights)
- **Handles missing values** automatically
- **Column subsampling** (like Random Forest) for diversity
- **Early stopping** to prevent overfitting

### Key Hyperparameters

| Parameter | What It Does | Typical Range |
|:----------|:-------------|:-------------|
| `n_estimators` | Number of boosting rounds | 100–1000 |
| `learning_rate` | Step size shrinkage | 0.01–0.3 |
| `max_depth` | Max depth per tree (keep shallow!) | 3–8 |
| `subsample` | Row sampling ratio | 0.7–1.0 |
| `colsample_bytree` | Column sampling ratio | 0.7–1.0 |
| `scale_pos_weight` | Handle class imbalance | ratio of neg/pos |
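
As a sketch of how these hyperparameters fit together, the example below computes `scale_pos_weight` from a hypothetical imbalanced label vector and collects the settings in a dictionary. The label counts and the specific values are assumptions chosen from the ranges in the table, not tuned results. The import is guarded so the snippet still runs where `xgboost` is not installed.

```python
import numpy as np

# Hypothetical training labels: 1 = departed (minority class), 0 = retained
y_train = np.array([0] * 80 + [1] * 20)

n_neg = int((y_train == 0).sum())
n_pos = int((y_train == 1).sum())

xgb_params = {
    'n_estimators': 300,        # boosting rounds
    'learning_rate': 0.05,      # step size shrinkage (eta)
    'max_depth': 4,             # keep individual trees shallow
    'subsample': 0.8,           # fraction of rows sampled per tree
    'colsample_bytree': 0.8,    # fraction of columns sampled per tree
    'scale_pos_weight': n_neg / n_pos,  # ratio of negatives to positives
    'eval_metric': 'logloss',
    'random_state': 42,
}

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(**xgb_params)  # Step 1 of the usual pattern
except ImportError:
    model = None  # xgboost not installed; the parameter dictionary is still valid

print(f"scale_pos_weight = {xgb_params['scale_pos_weight']}")
```

With 80 negatives and 20 positives the ratio is 4, so errors on the minority class are weighted four times as heavily during training.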

## 5. Comparing the Three Approaches

| Aspect | Decision Tree | Random Forest | XGBoost |
|:-------|:-------------|:-------------|:--------|
| **Strategy** | Single tree | Many trees in parallel | Trees in sequence |
| **Reduces** | — | Variance (overfitting) | Bias (underfitting) |
| **Interpretability** | Excellent | Moderate | Lower |
| **Overfitting risk** | High | Low | Moderate (use early stopping) |
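| **Preprocessing needed** | No scaling; encode categoricals | No scaling; encode categoricals | No scaling; encode categoricals |
| **Handles missing data** | Only in recent scikit-learn versions | Only in recent scikit-learn versions | Yes (native) |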
| **Training speed** | Fast | Moderate | Moderate |
| **Typical performance** | Good baseline | Strong | Often best |

### When to Use Each

- **Decision Tree**: When you need to explain predictions to non-technical stakeholders (advisors, administrators)
- **Random Forest**: When you want a reliable, robust model with minimal tuning
- **XGBoost**: When you want the best possible predictive performance on tabular data

## 6. Summary

### Key Takeaways

1. **All three models follow the same scikit-learn API**: `instantiate → fit → predict`
2. **Decision Trees** are the building block—interpretable but prone to overfitting
3. **Random Forests** reduce variance by averaging many independent trees (bagging)
4. **XGBoost** reduces bias by sequentially correcting errors (boosting)
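5. **Minimal preprocessing needed**: tree-based models do not require feature scaling, although categorical features still need numeric encoding and missing values must be handled where the library does not support them natively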
6. For higher education: start with Random Forest (reliable), use Decision Trees for stakeholder communication, and XGBoost when performance matters most

### Connection to the ML Cycle

| ML Cycle Step | Tree-Based Models |
|:-------------|:-----------------|
| **Build** | Choose model class and hyperparameters |
| **Train** | `model.fit(X_train, y_train)` |
| **Predict** | `model.predict(X_test)` / `model.predict_proba(X_test)` |
| **Evaluate** | Classification metrics (AUC, F1, etc.) |
| **Improve** | Tune hyperparameters, try different model |

### Next Steps

In the next notebook, we will put all three models to work on our student departure dataset using the consistent scikit-learn pattern.

**Proceed to:** `2.2 Building Tree-Based Models in Practice`