# 2.1 Introduction to Decision Trees

## Introduction

In Module 1, we explored regularized logistic regression models. Now we turn to a fundamentally different approach to classification: **Decision Trees**. Unlike logistic regression, which models the probability of class membership using a smooth mathematical function, decision trees partition the feature space into rectangular regions using a series of simple rules.

Decision trees are particularly valuable in higher education contexts because a shallow tree reduces to a short list of readable rules. Administrators and advisors can trace exactly why a student was flagged as at-risk, which is crucial for designing effective interventions.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the structure and components of a decision tree
2. Understand how decision trees make splitting decisions using impurity measures
3. Describe the recursive partitioning algorithm
4. Compare decision trees to logistic regression
5. Identify when decision trees are appropriate for higher education problems

## 1. What is a Decision Tree?

A **decision tree** is a supervised learning algorithm that makes predictions by learning a series of simple decision rules from the data. Think of it as a flowchart where each internal node asks a question about a feature, each branch represents an answer to that question, and each leaf node provides a prediction.

**Intuition**: Imagine you're an academic advisor trying to determine if a student is at risk of not returning for their third semester. You might ask:
1. "Is their GPA below 2.0?" If yes, they might be at risk.
2. "Did they fail any courses in the first semester?" If yes, more concern.
3. "Are they a first-generation student?" This might interact with other factors.

A decision tree automates this process by finding the best questions to ask and in what order.
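
As a rough illustration, the advisor's questions above can be written as nested conditionals. The function name `advisor_flag`, the thresholds, and the returned labels are hypothetical and chosen by hand; a decision tree would instead learn the questions, their order, and the thresholds from data.

```python
def advisor_flag(gpa, failed_first_semester_course, first_generation):
    """Hand-written rules mimicking the advisor's questions (hypothetical thresholds)."""
    if gpa < 2.0:
        return "at risk"
    if failed_first_semester_course:
        # In this toy rule set, first-generation status only matters after a failed course
        return "at risk" if first_generation else "monitor"
    return "on track"


print(advisor_flag(gpa=1.8, failed_first_semester_course=False, first_generation=False))
print(advisor_flag(gpa=2.9, failed_first_semester_course=True, first_generation=True))
print(advisor_flag(gpa=3.4, failed_first_semester_course=False, first_generation=True))
```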

### 1.1 The Anatomy of a Decision Tree

| Component | Description | Example |
|:----------|:------------|:--------|
| **Root Node** | The topmost node; first decision point | "Is GPA_1 <= 2.5?" |
| **Internal Nodes** | Decision points that split the data | "Is DFW_RATE_1 > 0.3?" |
| **Branches** | Outcomes of decisions (True/False) | Left: Yes, Right: No |
| **Leaf Nodes** | Terminal nodes with predictions | "Class: Not Enrolled" |
| **Depth** | Longest path from root to leaf | Depth = 3 means at most 3 decisions per prediction |
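
To connect the table to code, the following minimal sketch fits a very small tree with scikit-learn and reads off its components. The toy arrays and the variable names (`X_toy`, `toy_tree`, and so on) are made up for illustration and are not the course dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): columns are GPA_1 and DFW_RATE_1
X_toy = np.array([[3.5, 0.0], [2.8, 0.2], [1.9, 0.5], [3.2, 0.1],
                  [2.1, 0.4], [3.8, 0.0], [2.5, 0.3], [1.5, 0.6]])
y_toy = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # 1 = Enrolled, 0 = Not Enrolled

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

n_nodes = toy_tree.tree_.node_count    # root + internal nodes + leaves
n_leaves = toy_tree.get_n_leaves()     # terminal nodes that hold predictions
tree_depth = toy_tree.get_depth()      # longest root-to-leaf path

print(f"Total nodes: {n_nodes}, leaves: {n_leaves}, depth: {tree_depth}")
print(f"Decision (non-leaf) nodes: {n_nodes - n_leaves}")
```

Because every decision node has exactly two branches, the total node count is always one less than twice the number of leaves.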

In [None]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create a visual representation of a decision tree structure
fig = go.Figure()

# Node positions (x, y)
nodes = {
    'root': (0.5, 1.0),
    'left1': (0.25, 0.7),
    'right1': (0.75, 0.7),
    'left2': (0.125, 0.4),
    'right2': (0.375, 0.4),
    'left3': (0.625, 0.4),
    'right3': (0.875, 0.4)
}

# Draw edges
edges = [
    ('root', 'left1'), ('root', 'right1'),
    ('left1', 'left2'), ('left1', 'right2'),
    ('right1', 'left3'), ('right1', 'right3')
]

for start, end in edges:
    fig.add_trace(go.Scatter(
        x=[nodes[start][0], nodes[end][0]],
        y=[nodes[start][1], nodes[end][1]],
        mode='lines',
        line=dict(color='gray', width=2),
        showlegend=False
    ))

# Draw nodes
# Internal nodes (decision nodes)
internal = ['root', 'left1', 'right1']
leaf = ['left2', 'right2', 'left3', 'right3']

fig.add_trace(go.Scatter(
    x=[nodes[n][0] for n in internal],
    y=[nodes[n][1] for n in internal],
    mode='markers+text',
    marker=dict(size=50, color='lightblue', line=dict(color='blue', width=2)),
    text=['GPA_1 <= 2.5?', 'DFW > 0.3?', 'UNITS < 12?'],
    textposition='middle center',
    textfont=dict(size=10),
    name='Decision Nodes'
))

fig.add_trace(go.Scatter(
    x=[nodes[n][0] for n in leaf],
    y=[nodes[n][1] for n in leaf],
    mode='markers+text',
    marker=dict(size=50, color='lightgreen', symbol='square', line=dict(color='green', width=2)),
    text=['Not Enrolled', 'Enrolled', 'Not Enrolled', 'Enrolled'],
    textposition='middle center',
    textfont=dict(size=9),
    name='Leaf Nodes'
))

# Add branch labels
fig.add_annotation(x=0.35, y=0.88, text='Yes', showarrow=False, font=dict(color='green'))
fig.add_annotation(x=0.65, y=0.88, text='No', showarrow=False, font=dict(color='red'))

fig.update_layout(
    title='Anatomy of a Decision Tree',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    height=500,
    showlegend=True
)

fig.show()

### 1.2 A Simple Example

Let's visualize how a decision tree partitions the feature space. Consider a simplified scenario with just two features: GPA and DFW Rate.

In [None]:
# Generate synthetic student data for visualization
np.random.seed(42)

# Enrolled students (higher GPA, lower DFW rate)
n_enrolled = 100
gpa_enrolled = np.random.normal(3.2, 0.5, n_enrolled)
dfw_enrolled = np.random.normal(0.1, 0.1, n_enrolled)

# Not enrolled students (lower GPA, higher DFW rate)
n_not_enrolled = 40
gpa_not_enrolled = np.random.normal(2.0, 0.6, n_not_enrolled)
dfw_not_enrolled = np.random.normal(0.4, 0.15, n_not_enrolled)

# Clip values to realistic ranges
gpa_enrolled = np.clip(gpa_enrolled, 0, 4)
gpa_not_enrolled = np.clip(gpa_not_enrolled, 0, 4)
dfw_enrolled = np.clip(dfw_enrolled, 0, 1)
dfw_not_enrolled = np.clip(dfw_not_enrolled, 0, 1)

# Create visualization
fig = go.Figure()

# Plot enrolled students
fig.add_trace(go.Scatter(
    x=gpa_enrolled, y=dfw_enrolled,
    mode='markers',
    marker=dict(color='blue', size=8, opacity=0.7),
    name='Enrolled (E)'
))

# Plot not enrolled students
fig.add_trace(go.Scatter(
    x=gpa_not_enrolled, y=dfw_not_enrolled,
    mode='markers',
    marker=dict(color='red', size=8, opacity=0.7),
    name='Not Enrolled (N)'
))

# Add decision boundaries (simulating tree splits)
# First split: GPA <= 2.5
fig.add_vline(x=2.5, line_dash="dash", line_color="black", 
              annotation_text="Split 1: GPA <= 2.5")

# Second split: DFW > 0.25 (for GPA > 2.5 region)
fig.add_shape(type="line", x0=2.5, y0=0.25, x1=4.0, y1=0.25,
              line=dict(color="black", dash="dash"))
fig.add_annotation(x=3.25, y=0.28, text="Split 2: DFW > 0.25", showarrow=False)

fig.update_layout(
    title='How Decision Trees Partition Feature Space',
    xaxis_title='First Semester GPA',
    yaxis_title='DFW Rate',
    height=500,
    xaxis=dict(range=[0, 4]),
    yaxis=dict(range=[0, 0.8])
)

fig.show()

**Interpretation**: The decision tree creates rectangular regions by making axis-parallel splits. Students falling in different regions receive different predictions:
- **Region 1** (GPA <= 2.5): Mostly "Not Enrolled" students
- **Region 2** (GPA > 2.5 AND DFW > 0.25): Mixed, but concerning
- **Region 3** (GPA > 2.5 AND DFW <= 0.25): Mostly "Enrolled" students
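
The three regions above can be expressed as a small lookup function. This is only a sketch of the two illustrative splits drawn in the plot; `assign_region` is a hypothetical helper, not something a fitted model exposes.

```python
def assign_region(gpa, dfw_rate):
    """Return the region (1, 2, or 3) defined by the two illustrative splits."""
    if gpa <= 2.5:         # Split 1
        return 1
    if dfw_rate > 0.25:    # Split 2, applied only when GPA > 2.5
        return 2
    return 3


for gpa, dfw in [(2.0, 0.40), (3.0, 0.30), (3.5, 0.05)]:
    print(f"GPA={gpa}, DFW={dfw} -> Region {assign_region(gpa, dfw)}")
```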

## 2. How Decision Trees Make Splits

### 2.1 Impurity Measures

Decision trees choose splits by finding the feature and threshold that best separate the classes. "Best" is defined by an **impurity measure**. A node is "pure" if it contains samples from only one class.

The two most common impurity measures for classification are:
1. **Gini Impurity** (default in scikit-learn)
2. **Entropy** (Information Gain)

### 2.2 Gini Impurity

**Gini Impurity** measures the probability of misclassifying a randomly chosen sample from a node if that sample were labeled at random according to the node's class distribution.

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

Where $p_i$ is the proportion of samples belonging to class $i$, and $C$ is the number of classes.

**For binary classification (Enrolled vs. Not Enrolled):**
$$Gini = 1 - p_{Enrolled}^2 - p_{NotEnrolled}^2$$

| Scenario | $p_{Enrolled}$ | $p_{NotEnrolled}$ | Gini Impurity |
|:---------|:---------------|:------------------|:--------------|
| Pure node (all Enrolled) | 1.0 | 0.0 | 0.0 |
| Pure node (all Not Enrolled) | 0.0 | 1.0 | 0.0 |
| Maximum impurity (50-50) | 0.5 | 0.5 | 0.5 |
| Typical imbalanced | 0.87 | 0.13 | 0.23 |
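
The table values can be checked with a few lines of NumPy. The helper name `gini_impurity` is an assumption for this sketch; it takes class proportions that sum to 1 and applies the formula above directly.

```python
import numpy as np


def gini_impurity(proportions):
    """Gini impurity 1 - sum(p_i^2) for class proportions that sum to 1."""
    p = np.asarray(proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)


for label, props in [("all Enrolled", [1.0, 0.0]),
                     ("50-50", [0.5, 0.5]),
                     ("87-13", [0.87, 0.13])]:
    print(f"{label}: {gini_impurity(props):.2f}")
```

The same function works for more than two classes, for example `gini_impurity([1/3, 1/3, 1/3])`.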

In [None]:
# Visualize Gini Impurity
p = np.linspace(0, 1, 100)
gini = 1 - p**2 - (1-p)**2

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=p, y=gini,
    mode='lines',
    line=dict(color='blue', width=3),
    name='Gini Impurity'
))

# Add annotations for key points
fig.add_trace(go.Scatter(
    x=[0, 0.5, 1, 0.13],
    y=[0, 0.5, 0, 1 - 0.13**2 - 0.87**2],
    mode='markers+text',
    marker=dict(size=12, color='red'),
    text=['Pure<br>(all E)', 'Max Impurity<br>(50-50)', 'Pure<br>(all N)', 'Our Data<br>(87-13)'],
    textposition=['bottom center', 'top center', 'bottom center', 'top right'],
    showlegend=False
))

fig.update_layout(
    title='Gini Impurity vs. Class Proportion',
    xaxis_title='Proportion of Not Enrolled (Class N)',
    yaxis_title='Gini Impurity',
    height=400,
    xaxis=dict(range=[-0.05, 1.05]),
    yaxis=dict(range=[-0.05, 0.55])
)

fig.show()

### 2.3 Entropy and Information Gain

**Entropy** comes from information theory and measures the amount of "uncertainty" or "disorder" in a node.

$$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

**Information Gain** is the reduction in entropy after a split:
$$Information\ Gain = Entropy_{parent} - \sum_{children} \frac{n_{child}}{n_{parent}} \times Entropy_{child}$$

The split that maximizes information gain (or equivalently, minimizes weighted child entropy) is chosen.
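
A minimal sketch of both formulas, assuming class labels are passed as plain lists. The helper names `node_entropy` and `information_gain` and the toy splits are illustrative only.

```python
import numpy as np


def node_entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # only observed classes, so no log2(0)
    return float(-np.sum(p * np.log2(p)))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n_parent = len(parent)
    weighted_child_entropy = sum(len(c) / n_parent * node_entropy(c) for c in children)
    return node_entropy(parent) - weighted_child_entropy


parent = ["E", "E", "E", "E", "N", "N", "N", "N"]
perfect_split = [["E", "E", "E", "E"], ["N", "N", "N", "N"]]   # both children pure
useless_split = [["E", "E", "N", "N"], ["E", "E", "N", "N"]]   # children mirror the parent

print(f"Perfect split gain: {information_gain(parent, perfect_split):.3f}")
print(f"Useless split gain: {information_gain(parent, useless_split):.3f}")
```

A split that yields pure children recovers all of the parent's entropy, while a split whose children have the same class mix as the parent gains nothing.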

In [None]:
# Compare Gini and Entropy
p = np.linspace(0.001, 0.999, 100)  # Avoid log(0)
gini = 1 - p**2 - (1-p)**2
entropy = -p * np.log2(p) - (1-p) * np.log2(1-p)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=p, y=gini,
    mode='lines',
    line=dict(color='blue', width=3),
    name='Gini Impurity'
))

fig.add_trace(go.Scatter(
    x=p, y=entropy,
    mode='lines',
    line=dict(color='orange', width=3),
    name='Entropy'
))

fig.update_layout(
    title='Gini Impurity vs. Entropy',
    xaxis_title='Proportion of Class N',
    yaxis_title='Impurity Measure',
    height=400,
    showlegend=True
)

fig.show()

### 2.4 Comparing Gini and Entropy

| Property | Gini Impurity | Entropy |
|:---------|:--------------|:--------|
| Range | [0, 0.5] for binary | [0, 1] for binary |
| Computation | Faster (no logarithm) | Slower |
| Interpretation | Probability of misclassification | Information content |
| In practice | Often produces similar trees | Often produces similar trees |
| Default in sklearn | Yes (`criterion='gini'`) | No (`criterion='entropy'`) |

**Bottom line**: Both measures usually produce very similar trees. Gini is scikit-learn's default and is slightly cheaper to compute because it avoids the logarithm.
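
One way to check this claim on a given dataset is to fit one tree per criterion and count how often their predictions agree. The sketch below uses synthetic data from `make_classification` as a stand-in (not the course data), so the exact agreement will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the course dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=42)

tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=3,
                                   random_state=42).fit(X_syn, y_syn)
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                      random_state=42).fit(X_syn, y_syn)

# Share of samples on which the two trees make the same prediction
agreement = (tree_gini.predict(X_syn) == tree_entropy.predict(X_syn)).mean()
print(f"Prediction agreement on the training data: {agreement:.1%}")
```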

## 3. The Tree Building Algorithm

### 3.1 Recursive Partitioning

Decision trees are built using a **greedy, recursive algorithm**. The version described here is CART (Classification and Regression Trees), which scikit-learn's implementation is based on:

1. **Start at the root** with all training samples
2. **For each feature**:
   - Consider all possible split thresholds
   - Calculate the impurity reduction for each split
3. **Select the best split** (feature + threshold with maximum impurity reduction)
4. **Create two child nodes** based on the split
5. **Recursively repeat** steps 2-4 for each child node
6. **Stop** when a stopping criterion is met

**Key insight**: The algorithm is "greedy" because it makes the locally optimal choice at each step, without considering future splits. This makes it fast but doesn't guarantee the globally optimal tree.
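
The steps above can be sketched in plain NumPy. This is a simplified teaching version, not scikit-learn's implementation: it uses Gini impurity, tries midpoints between sorted feature values as thresholds, and stops on pure nodes, a depth limit, or when no split reduces impurity. All function names and the toy data are hypothetical.

```python
import numpy as np


def gini_of_labels(y):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Steps 2-3: scan every feature and candidate threshold, keep the lowest weighted Gini."""
    best = None  # (weighted_gini, feature_index, threshold)
    n_samples, n_features = X.shape
    for j in range(n_features):
        values = np.unique(X[:, j])                        # sorted unique values
        for threshold in (values[:-1] + values[1:]) / 2:   # midpoints between neighbors
            left = X[:, j] <= threshold
            weighted = (left.sum() * gini_of_labels(y[left])
                        + (~left).sum() * gini_of_labels(y[~left])) / n_samples
            if best is None or weighted < best[0]:
                best = (weighted, j, threshold)
    return best


def build_tree(X, y, depth=0, max_depth=3):
    """Steps 1 and 4-6: recursively partition the data; the tree is a nested dict."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    if len(classes) == 1 or depth == max_depth:            # stop: pure node or depth limit
        return {"leaf": True, "prediction": majority}
    split = best_split(X, y)
    if split is None or split[0] >= gini_of_labels(y):     # stop: no impurity reduction
        return {"leaf": True, "prediction": majority}
    _, j, threshold = split
    left = X[:, j] <= threshold
    return {
        "leaf": False, "feature": j, "threshold": threshold,
        "left": build_tree(X[left], y[left], depth + 1, max_depth),
        "right": build_tree(X[~left], y[~left], depth + 1, max_depth),
    }


def predict_one(node, x):
    """Follow the decision rules from the root down to a leaf."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]


# Toy data (made up): columns are GPA and DFW rate
X_toy = np.array([[3.5, 0.0], [2.8, 0.2], [1.9, 0.5], [3.2, 0.1],
                  [2.1, 0.4], [3.8, 0.0], [2.5, 0.3], [1.5, 0.6]])
y_toy = np.array(["E", "E", "N", "E", "N", "E", "E", "N"])

grown_tree = build_tree(X_toy, y_toy)
print(grown_tree)
print(predict_one(grown_tree, [1.7, 0.5]), predict_one(grown_tree, [3.0, 0.1]))
```

Each call to `best_split` only looks at the current node's samples, which is exactly what makes the procedure greedy.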

In [None]:
# Demonstrate the splitting process
import pandas as pd

# Sample data
sample_data = pd.DataFrame({
    'GPA': [3.5, 2.8, 1.9, 3.2, 2.1, 3.8, 2.5, 1.5],
    'DFW_Rate': [0.0, 0.2, 0.5, 0.1, 0.4, 0.0, 0.3, 0.6],
    'Enrolled': ['E', 'E', 'N', 'E', 'N', 'E', 'E', 'N']
})

print("Sample Student Data:")
print(sample_data.to_string(index=False))

# Calculate Gini impurity for original data
p_enrolled = (sample_data['Enrolled'] == 'E').mean()
p_not_enrolled = 1 - p_enrolled
gini_original = 1 - p_enrolled**2 - p_not_enrolled**2

print(f"\nOriginal Distribution: {p_enrolled:.0%} Enrolled, {p_not_enrolled:.0%} Not Enrolled")
print(f"Original Gini Impurity: {gini_original:.3f}")

In [None]:
# Evaluate different split points for GPA
def calculate_weighted_gini(left_enrolled, left_total, right_enrolled, right_total):
    """Calculate weighted Gini impurity after a split."""
    total = left_total + right_total
    
    # Left node Gini
    if left_total > 0:
        p_left = left_enrolled / left_total
        gini_left = 1 - p_left**2 - (1-p_left)**2
    else:
        gini_left = 0
    
    # Right node Gini
    if right_total > 0:
        p_right = right_enrolled / right_total
        gini_right = 1 - p_right**2 - (1-p_right)**2
    else:
        gini_right = 0
    
    # Weighted average
    weighted_gini = (left_total/total) * gini_left + (right_total/total) * gini_right
    return weighted_gini, gini_left, gini_right

# Test different GPA thresholds
thresholds = [2.0, 2.3, 2.5, 3.0]
results = []

for thresh in thresholds:
    left_mask = sample_data['GPA'] <= thresh
    left_enrolled = (sample_data.loc[left_mask, 'Enrolled'] == 'E').sum()
    left_total = left_mask.sum()
    right_enrolled = (sample_data.loc[~left_mask, 'Enrolled'] == 'E').sum()
    right_total = (~left_mask).sum()
    
    weighted, gini_l, gini_r = calculate_weighted_gini(left_enrolled, left_total, right_enrolled, right_total)
    gini_decrease = gini_original - weighted  # reduction in Gini impurity achieved by this split
    
    results.append({
        'Threshold': thresh,
        'Left (<=)': f"{left_enrolled}E, {left_total-left_enrolled}N",
        'Right (>)': f"{right_enrolled}E, {right_total-right_enrolled}N",
        'Weighted Gini': round(weighted, 3),
        'Gini Decrease': round(gini_decrease, 3)
    })

results_df = pd.DataFrame(results)
print("\nEvaluating GPA Split Thresholds:")
print(results_df.to_string(index=False))
best_threshold = results_df.loc[results_df['Gini Decrease'].idxmax(), 'Threshold']
print(f"\nBest split: GPA <= {best_threshold} (largest decrease in Gini impurity)")

### 3.2 Stopping Criteria

Without constraints, a decision tree keeps splitting until every leaf is pure (contains only one class) or no further split is possible. Such a tree tends to memorize the training data, which leads to **overfitting**. Several stopping criteria prevent this:

| Criterion | Description | scikit-learn Parameter |
|:----------|:------------|:-----------------------|
| **Maximum Depth** | Limit tree depth | `max_depth` |
| **Minimum Samples Split** | Minimum samples a node must contain to be split | `min_samples_split` |
| **Minimum Samples Leaf** | Minimum samples required in each leaf node | `min_samples_leaf` |
| **Maximum Leaf Nodes** | Maximum number of leaf nodes | `max_leaf_nodes` |
| **Minimum Impurity Decrease** | Minimum impurity reduction required for a split | `min_impurity_decrease` |

These hyperparameters are crucial for controlling model complexity and preventing overfitting. We will explore them in detail in notebook 2.4.
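
As a quick preview, the sketch below compares an unconstrained tree with a constrained one on synthetic, deliberately noisy data. The dataset and the specific parameter values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y randomly flips a share of the labels to add noise
X_noisy, y_noisy = make_classification(n_samples=400, n_features=6, n_informative=3,
                                       n_redundant=0, flip_y=0.1, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_noisy, y_noisy)
constrained = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                     random_state=0).fit(X_noisy, y_noisy)

for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"train accuracy={model.score(X_noisy, y_noisy):.3f}")
```

Training accuracy alone will favor the unconstrained tree; overfitting only becomes visible when performance is compared on data the tree has not seen.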

In [None]:
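# Summarize how max_depth bounds tree size (a binary tree of depth d has at most 2**d leaves)
# The Complexity and Risk labels are rough rules of thumb; the real tradeoff depends on the data
descriptions = ['Very simple', 'Simple', 'Moderate', 'Complex', 'More complex', 'Very complex', 'Unlimited']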

complexity_data = pd.DataFrame({
    'max_depth': ['1', '2', '3', '4', '5', '10', 'None'],
    'Max Leaf Nodes': [2, 4, 8, 16, 32, 1024, 'Unlimited'],
    'Complexity': descriptions,
    'Risk': ['High Bias', 'Moderate Bias', 'Balanced', 'Moderate Variance', 'High Variance', 'Very High Variance', 'Extreme Variance']
})

print("Effect of max_depth on Tree Complexity:")
print(complexity_data.to_string(index=False))

## 4. Decision Trees vs. Logistic Regression

Now that we understand decision trees, let's compare them to the logistic regression models we built in Course 2 and Module 1.

| Aspect | Decision Trees | Logistic Regression |
|:-------|:---------------|:--------------------|
| **Decision Boundary** | Rectangular (axis-parallel) | Linear (smooth) |
| **Feature Interactions** | Captured automatically | Must be engineered |
| **Preprocessing** | Minimal (no scaling needed) | Important (scaling, encoding) |
| **Interpretability** | Visual rules | Coefficients/odds ratios |
| **Handling Missing Data** | Supported natively by some implementations | Requires imputation |
| **Outliers** | Robust | Can be sensitive |
| **Overfitting Risk** | High (without constraints) | Lower (with regularization) |
| **Ensemble Potential** | Foundation for Random Forest, XGBoost | Limited |

In [None]:
# Visualize decision boundary differences
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Create sample data
np.random.seed(42)
n_samples = 200

# Generate two overlapping classes
X1_class0 = np.random.multivariate_normal([3, 0.15], [[0.3, 0], [0, 0.01]], n_samples//2)
X1_class1 = np.random.multivariate_normal([2, 0.35], [[0.5, 0], [0, 0.015]], n_samples//2)

X_demo = np.vstack([X1_class0, X1_class1])
y_demo = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))  # 0 = Enrolled, 1 = Not Enrolled

# Clip to realistic ranges
X_demo[:, 0] = np.clip(X_demo[:, 0], 0, 4)  # GPA
X_demo[:, 1] = np.clip(X_demo[:, 1], 0, 1)  # DFW Rate

# Train models
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
lr = LogisticRegression(random_state=42)

dt.fit(X_demo, y_demo)
lr.fit(X_demo, y_demo)

# Create mesh for decision boundaries
xx, yy = np.meshgrid(np.linspace(0, 4, 100), np.linspace(0, 0.7, 100))
mesh_points = np.c_[xx.ravel(), yy.ravel()]

# Get predictions
Z_dt = dt.predict_proba(mesh_points)[:, 1].reshape(xx.shape)
Z_lr = lr.predict_proba(mesh_points)[:, 1].reshape(xx.shape)

# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=('Decision Tree', 'Logistic Regression'))

# Decision Tree
fig.add_trace(go.Contour(x=np.linspace(0, 4, 100), y=np.linspace(0, 0.7, 100), z=Z_dt,
                         colorscale='RdBu', showscale=False, opacity=0.6,
                         contours=dict(start=0, end=1, size=0.1)), row=1, col=1)
fig.add_trace(go.Scatter(x=X_demo[y_demo==0, 0], y=X_demo[y_demo==0, 1],
                         mode='markers', marker=dict(color='blue', size=6), 
                         name='Enrolled', showlegend=True), row=1, col=1)
fig.add_trace(go.Scatter(x=X_demo[y_demo==1, 0], y=X_demo[y_demo==1, 1],
                         mode='markers', marker=dict(color='red', size=6), 
                         name='Not Enrolled', showlegend=True), row=1, col=1)

# Logistic Regression
fig.add_trace(go.Contour(x=np.linspace(0, 4, 100), y=np.linspace(0, 0.7, 100), z=Z_lr,
                         colorscale='RdBu', showscale=False, opacity=0.6,
                         contours=dict(start=0, end=1, size=0.1)), row=1, col=2)
fig.add_trace(go.Scatter(x=X_demo[y_demo==0, 0], y=X_demo[y_demo==0, 1],
                         mode='markers', marker=dict(color='blue', size=6), 
                         showlegend=False), row=1, col=2)
fig.add_trace(go.Scatter(x=X_demo[y_demo==1, 0], y=X_demo[y_demo==1, 1],
                         mode='markers', marker=dict(color='red', size=6), 
                         showlegend=False), row=1, col=2)

fig.update_xaxes(title_text='GPA', row=1, col=1)
fig.update_xaxes(title_text='GPA', row=1, col=2)
fig.update_yaxes(title_text='DFW Rate', row=1, col=1)
fig.update_yaxes(title_text='DFW Rate', row=1, col=2)

fig.update_layout(height=450, title_text='Decision Boundaries: Decision Tree vs. Logistic Regression')

fig.show()

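**Key Observation**: Notice how the decision tree creates rectangular regions with sharp, axis-parallel boundaries, while logistic regression produces a single straight boundary with probabilities that change smoothly on either side. Because each split is conditioned on the splits above it, the tree can represent interactions, such as students with low GPA AND high DFW rate being at highest risk, without any feature engineering.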

## 5. Advantages and Disadvantages

### Advantages of Decision Trees

1. **Interpretability**: Easy to visualize and explain to non-technical stakeholders
2. **No Feature Scaling**: Works with raw feature values
3. **Handles Mixed Data**: Can work with both numerical and categorical features, although scikit-learn requires categorical features to be encoded as numbers first
4. **Captures Non-linearity**: Automatically finds non-linear relationships
5. **Feature Interactions**: Naturally captures interactions between features
6. **Feature Importance**: Built-in feature importance scores
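
Several of these advantages can be seen in one small sketch. The data below is simulated with a made-up labeling rule that contains an interaction, the features are left unscaled, and a pure-noise column is added to show how the built-in `feature_importances_` attribute behaves. All variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
gpa = rng.uniform(0, 4, n)
dfw = rng.uniform(0, 1, n)
noise = rng.normal(size=n)    # unrelated to the label by construction

# Made-up rule with an interaction: low GPA AND high DFW rate -> not enrolled (1)
not_enrolled = ((gpa < 2.5) & (dfw > 0.3)).astype(int)

X_raw = np.column_stack([gpa, dfw, noise])    # raw, unscaled features
imp_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, not_enrolled)

for name, importance in zip(["GPA", "DFW_RATE", "NOISE"], imp_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The importances are normalized to sum to 1, so they describe each feature's relative share of the total impurity reduction rather than an absolute effect size.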

### Disadvantages of Decision Trees

1. **Overfitting**: Prone to overfitting without proper constraints
2. **Instability**: Small changes in data can produce very different trees
3. **Greedy Algorithm**: May not find globally optimal tree
4. **Axis-Parallel Splits**: Cannot capture diagonal boundaries efficiently
5. **Imbalanced Data**: Can be biased toward majority class

In [None]:
# Summary comparison table
comparison = pd.DataFrame({
    'Criterion': ['Interpretability', 'Preprocessing Required', 'Handles Non-linearity',
                  'Feature Interactions', 'Overfitting Risk', 'Ensemble Foundation',
                  'Computational Cost', 'Probability Calibration'],
    'Decision Tree': ['Excellent', 'Minimal', 'Yes (automatic)', 'Yes (automatic)',
                      'High', 'Excellent', 'Low', 'Poor'],
    'Logistic Regression': ['Good (coefficients)', 'Important', 'No (needs engineering)',
                            'No (needs engineering)', 'Lower', 'Limited', 'Low', 'Good']
})

print("Decision Trees vs. Logistic Regression:")
print(comparison.to_string(index=False))

## 6. Decision Trees in Higher Education

Decision trees are particularly well-suited for higher education analytics for several reasons:

### Use Cases

1. **Student Risk Identification**
   - Identify at-risk students using interpretable rules
   - "If GPA < 2.0 AND DFW Rate > 0.3, then high risk"

2. **Intervention Targeting**
   - Rules translate directly to actionable interventions
   - Advisors can understand and trust the model's logic

3. **Policy Development**
   - Tree structure can inform academic policies
   - Identify thresholds for early warning systems

4. **Stakeholder Communication**
   - Easy to explain to administrators, faculty, and students
   - Visual representation aids understanding

### Example Rules from Student Departure Prediction

A decision tree might learn rules like:

```
Rule 1: IF GPA_1 <= 1.8 THEN Predict "Not Enrolled" (High Risk)
Rule 2: IF GPA_1 > 1.8 AND DFW_RATE_1 > 0.5 THEN Predict "Not Enrolled" (Moderate Risk)
Rule 3: IF GPA_1 > 2.5 AND DFW_RATE_1 <= 0.2 THEN Predict "Enrolled" (Low Risk)
```

These rules can be directly communicated to advisors and integrated into early alert systems.
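
With scikit-learn, rules in this style can be printed from a fitted tree using `export_text`. The sketch below uses simulated data with a made-up outcome rule, so the thresholds it prints are not findings about real students.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 200
gpa_1 = rng.uniform(0, 4, n)
dfw_rate_1 = rng.uniform(0, 1, n)

# Made-up outcome: 1 = Not Enrolled when GPA is low or DFW rate is high
y_rules = ((gpa_1 <= 1.8) | (dfw_rate_1 > 0.5)).astype(int)

rule_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
rule_tree.fit(np.column_stack([gpa_1, dfw_rate_1]), y_rules)

# Plain-text rendering of the learned rules, one line per node
rules_text = export_text(rule_tree, feature_names=["GPA_1", "DFW_RATE_1"])
print(rules_text)
```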

## 7. Summary

In this notebook, we covered:

1. **Decision Tree Structure**: Root nodes, internal nodes, branches, and leaf nodes

2. **Splitting Criteria**: Gini impurity and entropy measure node purity

3. **Tree Building**: Recursive partitioning with greedy optimization

4. **Stopping Criteria**: max_depth, min_samples_split, etc. prevent overfitting

5. **Comparison with Logistic Regression**: Different decision boundaries and tradeoffs

### Key Takeaways

| Concept | Remember |
|:--------|:---------|
| Decision Trees | Partition feature space with simple rules |
| Gini Impurity | Probability of misclassification; lower is purer |
| Entropy | Information uncertainty; lower is purer |
| Overfitting | Trees without constraints memorize training data |
| Interpretability | Major advantage for stakeholder communication |

### Connection to ML Cycle

| ML Cycle Step | Decision Tree Context |
|:--------------|:----------------------|
| **Build** | Choose tree structure and hyperparameters |
| **Train** | Fit tree using recursive partitioning |
| **Predict** | Follow decision rules to leaf nodes |
| **Evaluate** | Assess using classification metrics |
| **Improve** | Tune hyperparameters to balance bias/variance |
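
The table can be read as a loop. The following sketch walks through the steps on synthetic stand-in data, with comments marking each step of the cycle; the dataset, split sizes, and candidate `max_depth` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the student departure data)
X_cycle, y_cycle = make_classification(n_samples=600, n_features=8, n_informative=4,
                                       random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X_cycle, y_cycle, test_size=0.25, stratify=y_cycle, random_state=7)

scores = {}
for depth_limit in [2, 4, None]:                                # Improve: vary a hyperparameter
    model = DecisionTreeClassifier(max_depth=depth_limit,
                                   random_state=7)              # Build
    model.fit(X_train, y_train)                                 # Train
    y_pred = model.predict(X_val)                               # Predict
    scores[depth_limit] = accuracy_score(y_val, y_pred)         # Evaluate

for depth_limit, acc in scores.items():
    print(f"max_depth={depth_limit}: validation accuracy={acc:.3f}")
```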

### Next Steps

In the next notebook, we will build decision tree classification models in scikit-learn, creating pipelines that can be trained on our student departure data.

**Proceed to:** `2.2 Build a Decision Tree Classification Model`