# 2.2 **Build** a Decision Tree Classification Model - Predict Student Departure with Decision Trees

## Model Cycle: The 5 Key Steps

### **1. Build the Model: Create the pipeline with a decision tree classifier.**  
### 2. Train the Model: Fit the model on the training data.  
### 3. Generate Predictions: Use the trained model to make predictions.  
### 4. Evaluate the Model: Assess performance using evaluation metrics.  
### 5. Improve the Model: Tune hyperparameters for optimal performance.

## Introduction

In the previous notebook, we learned the theory behind decision trees. Now we will put that knowledge into practice by building decision tree classification models using scikit-learn. We will create pipelines that can be trained on our student departure prediction data.

Unlike logistic regression, decision trees have minimal preprocessing requirements. However, we will still use a pipeline approach to maintain consistency and ensure reproducibility.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Build decision tree classification pipelines in scikit-learn
2. Understand key DecisionTreeClassifier parameters
3. Configure trees with different complexity constraints
4. Handle class imbalance using class weights
5. Save models for training and evaluation

## 1. Load Dependencies and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import pickle
import os

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

pd.options.display.max_columns = None

In [None]:
# Set up file paths - using Course 2 data
root_filepath = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/'
data_filepath = f'{root_filepath}data/'
course3_filepath = f'{root_filepath}course_3/'

In [None]:
# Load training data
df_training = pd.read_csv(f'{data_filepath}training.csv')

print(f"Training data shape: {df_training.shape}")
print(f"\nTarget distribution:")
print(df_training['SEM_3_STATUS'].value_counts(normalize=True))

In [None]:
# Define feature matrix and target
# Drop the target column so it can never be used as a feature
X_train = df_training.drop(columns=['SEM_3_STATUS'])
y_train = df_training['SEM_3_STATUS']

print(f"Features available: {list(X_train.columns)}")

## 2. Review: Preprocessing for Decision Trees

### 2.1 What's Different from Logistic Regression?

One of the key advantages of decision trees is their minimal preprocessing requirements:

| Preprocessing Step | Logistic Regression | Decision Trees |
|:-------------------|:--------------------|:---------------|
| Feature Scaling | Required | Not Required |
| Categorical Encoding | Required | Required (in sklearn) |
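| Handling Missing Values | Required | Version-dependent* |
| Feature Normalization | Recommended | Not Needed |

*Note: Decision tree algorithms can handle missing values in principle, but support depends on the implementation. scikit-learn's DecisionTreeClassifier accepts missing values (`NaN`) natively only from version 1.3 onward; with earlier versions we would need to impute or use another library such as XGBoost.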
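
When imputation is needed, it can live inside the preprocessing step so that it is applied identically at training and prediction time. The sketch below is an assumption rather than part of the course pipeline: it uses `SimpleImputer` with a median strategy on a small made-up DataFrame that borrows two column names from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing HS_GPA
toy = pd.DataFrame({
    'HS_GPA': [3.0, np.nan, 2.0, 4.0],
    'GPA_1':  [2.5, 3.1, 1.9, 3.8],
})

# Replace missing numerical values with the column median
imputing_preprocessor = ColumnTransformer(
    transformers=[('num', SimpleImputer(strategy='median'), ['HS_GPA', 'GPA_1'])],
    remainder='drop'
)

imputed = imputing_preprocessor.fit_transform(toy)
print(imputed)
```

The missing `HS_GPA` is filled with the median of the observed values in that column. Median imputation is only one option; whether it suits the student data is a modeling decision this sketch does not settle.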

**Why no scaling?** Decision trees make decisions based on threshold comparisons (e.g., "Is GPA <= 2.5?"). The actual scale of the feature doesn't matter - only the relative ordering of values. A GPA of 2.5 on a 0-4 scale works the same as 62.5 on a 0-100 scale.

In [None]:
# Demonstrate that scaling doesn't affect decision trees
from sklearn.tree import DecisionTreeClassifier

# Simple example
np.random.seed(42)
X_example = np.random.randn(100, 2)
y_example = (X_example[:, 0] + X_example[:, 1] > 0).astype(int)

# Unscaled tree
tree_unscaled = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_unscaled.fit(X_example, y_example)

# Scaled tree (multiply features by different amounts)
X_scaled = X_example.copy()
X_scaled[:, 0] *= 1000  # Scale first feature by 1000
X_scaled[:, 1] *= 0.001  # Scale second feature by 0.001

tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_scaled.fit(X_scaled, y_example)

# Compare predictions
print("Predictions match:", np.all(tree_unscaled.predict(X_example) == tree_scaled.predict(X_scaled)))
print("\nThis demonstrates that decision trees are scale-invariant!")

### 2.2 Setting Up Feature Groups

Even though decision trees don't require scaling, we still need to handle categorical variables. We'll use the same feature groupings from Course 2 but with a simpler preprocessing approach.

In [None]:
# Feature groupings from Course 2

# Numerical columns (no scaling needed for decision trees)
numerical_columns = [
    'HS_GPA',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2',
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2'
]

# Categorical columns for one-hot encoding
categorical_columns = [
    'GENDER',
    'RACE_ETHNICITY',
    'FIRST_GEN_STATUS',
]

print(f"Numerical features: {len(numerical_columns)}")
print(f"Categorical features: {len(categorical_columns)}")

In [None]:
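# Simplified preprocessor for decision trees
# We pass through numerical columns and only encode categoricals.
# The `drop` list is positional: one reference category per entry in
# categorical_columns ('Female' for GENDER, 'Other' for RACE_ETHNICITY,
# 'Unknown' for FIRST_GEN_STATUS).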

preprocessor_dt = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_columns),  # No scaling!
        ('cat', OneHotEncoder(handle_unknown='ignore', 
                              drop=['Female', 'Other', 'Unknown'], 
                              sparse_output=False), categorical_columns)
    ],
    remainder='drop'
)

print("Decision Tree Preprocessor configured:")
print("- Numerical features: passed through unchanged")
print("- Categorical features: one-hot encoded with dropped categories")

## 3. Building Decision Tree Models

We will create three decision tree variants:

1. **Basic Tree**: Default settings (prone to overfitting)
2. **Constrained Tree**: Limited depth to prevent overfitting
3. **Balanced Tree**: With class weights to handle imbalanced data

### 3.1 Basic Decision Tree

This is a baseline decision tree with minimal constraints. It will likely overfit, but serves as a useful comparison point.

In [None]:
# Basic Decision Tree (default settings)
basic_dt_model = Pipeline([
    ('preprocessing', preprocessor_dt),
    ('classifier', DecisionTreeClassifier(
        criterion='gini',       # Impurity measure
        max_depth=None,         # No depth limit (will overfit!)
        min_samples_split=2,    # Default: minimum samples to split
        min_samples_leaf=1,     # Default: minimum samples in leaf
        random_state=42
    ))
])

print("Basic Decision Tree Model:")
basic_dt_model
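
The `criterion` argument selects the impurity measure the tree tries to reduce at each split. As a rough illustration, the two common measures can be computed by hand for a single node. The helper names below are hypothetical, and the 87%/13% class mix is the one this notebook reports for the training data.

```python
import math

def gini_impurity(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

root = [0.87, 0.13]   # class mix reported in this notebook
even = [0.50, 0.50]   # most impure two-class node
pure = [1.00, 0.00]   # perfectly pure node

for name, node in [('root', root), ('even', even), ('pure', pure)]:
    print(f"{name}: gini={gini_impurity(node):.4f}, entropy={entropy(node):.4f}")
```

Both measures are zero for a pure node and largest for an even split, which is why they usually lead to similar trees.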

### 3.2 Decision Tree with Depth Constraint

Limiting the tree depth is one of the most effective ways to prevent overfitting. A shallower tree is also more interpretable.

In [None]:
# Constrained Decision Tree (limited depth)
constrained_dt_model = Pipeline([
    ('preprocessing', preprocessor_dt),
    ('classifier', DecisionTreeClassifier(
        criterion='gini',
        max_depth=5,            # Limit depth to prevent overfitting
        min_samples_split=20,   # Require more samples to split
        min_samples_leaf=10,    # Require more samples in leaves
        random_state=42
    ))
])

print("Constrained Decision Tree Model:")
constrained_dt_model
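
The effect of these constraints can be previewed on synthetic data before we train on the student records. The sketch below is illustrative only: it uses `make_classification` with some label noise (an assumption, not the course dataset) and fits bare classifiers without the preprocessing step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (not the student dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

unconstrained = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
constrained = DecisionTreeClassifier(
    max_depth=5, min_samples_split=20, min_samples_leaf=10, random_state=42
).fit(X_tr, y_tr)

for name, tree in [('unconstrained', unconstrained), ('constrained', constrained)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, test acc={tree.score(X_te, y_te):.3f}")
```

The unconstrained tree memorizes the training set, noise included, while the constrained tree is capped at depth 5 and therefore at most 32 leaves. A large gap between training and test accuracy is the usual symptom of overfitting; how large it is on the student data is something the later notebooks measure.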

### 3.3 Decision Tree with Balanced Class Weights

Our dataset is imbalanced (87% Enrolled, 13% Not Enrolled). Setting `class_weight='balanced'` weights each class in inverse proportion to its frequency, so misclassifying a minority-class student counts more when the tree chooses its splits.

In [None]:
# Calculate what balanced weights would look like
class_counts = y_train.value_counts()
n_samples = len(y_train)
n_classes = y_train.nunique()  # 2 classes: E and N

# Balanced weights formula: n_samples / (n_classes * n_class_samples)
weight_E = n_samples / (n_classes * class_counts['E'])
weight_N = n_samples / (n_classes * class_counts['N'])

print("Class Distribution:")
print(f"  Enrolled (E): {class_counts['E']} samples ({class_counts['E']/n_samples:.1%})")
print(f"  Not Enrolled (N): {class_counts['N']} samples ({class_counts['N']/n_samples:.1%})")
print(f"\nBalanced Class Weights:")
print(f"  Weight for E: {weight_E:.3f}")
print(f"  Weight for N: {weight_N:.3f}")
print(f"\nRatio (N/E): {weight_N/weight_E:.2f}x more weight on minority class")
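
As a cross-check on the formula, scikit-learn exposes the same heuristic through `compute_class_weight`. The sketch below uses hypothetical labels that reproduce an 87%/13% split rather than the real training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels reproducing an 87% / 13% split
y_toy = np.array(['E'] * 87 + ['N'] * 13)
classes = np.array(['E', 'N'])

# scikit-learn's implementation of the 'balanced' heuristic
sk_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# Same formula by hand: n_samples / (n_classes * n_class_samples)
manual_weights = len(y_toy) / (len(classes) * np.array([87, 13]))

print(dict(zip(classes, sk_weights)))
print(dict(zip(classes, manual_weights)))
```

The two results agree, and the minority class receives a weight 87/13 times (roughly 6.7 times) that of the majority class.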

In [None]:
# Balanced Decision Tree (handles class imbalance)
balanced_dt_model = Pipeline([
    ('preprocessing', preprocessor_dt),
    ('classifier', DecisionTreeClassifier(
        criterion='gini',
        max_depth=5,
        min_samples_split=20,
        min_samples_leaf=10,
        class_weight='balanced',  # Handle class imbalance
        random_state=42
    ))
])

print("Balanced Decision Tree Model:")
balanced_dt_model
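
One way to see what `class_weight='balanced'` changes is to inspect the class mix the tree sees at its root node. The sketch below uses synthetic labels and a single made-up feature (both assumptions). The helper `root_class_share` is hypothetical; it normalizes the root values so the result does not depend on whether the installed scikit-learn version stores counts or proportions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic imbalanced data: 870 'E' rows and 130 'N' rows, one made-up feature
y_imb = np.array(['E'] * 870 + ['N'] * 130)
X_imb = rng.normal(loc=(y_imb == 'E') * 1.0, scale=1.0).reshape(-1, 1)

plain = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_imb, y_imb)
weighted = DecisionTreeClassifier(max_depth=1, class_weight='balanced',
                                  random_state=42).fit(X_imb, y_imb)

def root_class_share(tree):
    """Share of total (weighted) samples per class at the root node."""
    root_value = tree.tree_.value[0, 0]
    return root_value / root_value.sum()

print("classes:", plain.classes_)
print("unweighted root share:", root_class_share(plain))
print("balanced root share:  ", root_class_share(weighted))
```

With balanced weights, both classes carry equal total weight at the root, so impurity calculations no longer favor the majority class by default.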

## 4. Understanding DecisionTreeClassifier Parameters

Let's review the key hyperparameters available in scikit-learn's DecisionTreeClassifier.

In [None]:
# Key hyperparameters for DecisionTreeClassifier
params_df = pd.DataFrame({
    'Parameter': ['criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 
                  'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 
                  'class_weight', 'random_state'],
    'Description': [
        'Impurity measure (gini or entropy)',
        'Maximum depth of tree (None = unlimited)',
        'Minimum samples required to split a node',
        'Minimum samples required in a leaf node',
        'Number of features to consider for best split',
        'Maximum number of leaf nodes',
        'Minimum impurity decrease required for split',
        'Weights for handling class imbalance',
        'Random seed for reproducibility'
    ],
    'Default': ['gini', 'None', '2', '1', 'None', 'None', '0.0', 'None', 'None'],
    'Effect': [
        'Usually similar results',
        'Lower = simpler, less overfit',
        'Higher = simpler, less overfit',
        'Higher = simpler, less overfit',
        'Lower = more randomness',
        'Lower = simpler tree',
        'Higher = fewer splits',
        'balanced = equal class importance',
        'Set for reproducibility'
    ]
})

print("DecisionTreeClassifier Hyperparameters:")
print(params_df.to_string(index=False))
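
Rather than relying on a hand-typed table, the defaults can also be read from the estimator itself. This short sketch calls `get_params()` on a freshly constructed classifier; the values printed reflect whichever scikit-learn version is installed.

```python
from sklearn.tree import DecisionTreeClassifier

# Parameters listed in the table above
listed = ['criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf',
          'max_features', 'max_leaf_nodes', 'min_impurity_decrease',
          'class_weight', 'random_state']

# get_params() on a freshly constructed estimator returns its defaults
defaults = DecisionTreeClassifier().get_params()
for name in listed:
    print(f"{name:>22}: {defaults[name]!r}")
```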

In [None]:
# Visualize the relationship between parameters and model complexity
import plotly.graph_objects as go

fig = go.Figure()

# Create a conceptual diagram of complexity control
params = ['max_depth', 'min_samples_split', 'min_samples_leaf', 'max_leaf_nodes', 'min_impurity_decrease']
effects = [
    'Controls tree height',
    'Prevents splitting small nodes',
    'Ensures leaves have enough samples',
    'Limits total number of predictions',
    'Requires meaningful splits'
]

# Complexity reduction potential (conceptual)
reduction = [5, 4, 4, 5, 3]

fig.add_trace(go.Bar(
    y=params,
    x=reduction,
    orientation='h',
    marker=dict(color='steelblue'),
    text=effects,
    textposition='inside',
    insidetextanchor='start'
))

fig.update_layout(
    title='Decision Tree Complexity Control Parameters',
    xaxis_title='Complexity Reduction Potential (Conceptual)',
    yaxis_title='Parameter',
    height=400,
    xaxis=dict(range=[0, 6])
)

fig.show()

### Model Comparison Summary

In [None]:
# Summary of models we've built
models = {
    'Basic (Unconstrained)': basic_dt_model,
    'Constrained (max_depth=5)': constrained_dt_model,
    'Balanced (class_weight)': balanced_dt_model
}

print("Decision Tree Models Built:")
print("="*70)

for name, model in models.items():
    classifier = model.named_steps['classifier']
    print(f"\n{name}:")
    print(f"  - criterion: {classifier.criterion}")
    print(f"  - max_depth: {classifier.max_depth}")
    print(f"  - min_samples_split: {classifier.min_samples_split}")
    print(f"  - min_samples_leaf: {classifier.min_samples_leaf}")
    print(f"  - class_weight: {classifier.class_weight}")

In [None]:
# Create comparison table
comparison_df = pd.DataFrame({
    'Model': ['Basic (Unconstrained)', 'Constrained', 'Balanced'],
    'max_depth': ['None', '5', '5'],
    'min_samples_split': ['2', '20', '20'],
    'min_samples_leaf': ['1', '10', '10'],
    'class_weight': ['None', 'None', 'balanced'],
    'Expected Behavior': [
        'High variance, overfitting',
        'Better generalization',
        'Better minority class recall'
    ]
})

print("\nModel Comparison:")
print(comparison_df.to_string(index=False))

## 5. Save Models for Future Use

We save these untrained model pipelines so they can be loaded and trained in subsequent notebooks.

In [None]:
# Create models directory for Course 3 Module 2 if it doesn't exist
models_path = f'{course3_filepath}models/'
os.makedirs(models_path, exist_ok=True)

# Define model names and pipelines
models_to_save = {
    'basic_decision_tree_model': basic_dt_model,
    'constrained_decision_tree_model': constrained_dt_model,
    'balanced_decision_tree_model': balanced_dt_model
}

# Save each model pipeline (the context manager closes the file handle)
for name, model in models_to_save.items():
    filepath = f'{models_path}{name}.pkl'
    with open(filepath, 'wb') as f:
        pickle.dump(model, f)
    print(f"Saved: {filepath}")

In [None]:
# Verify saved models by loading each one back
print("\nVerifying saved models:")
for name in models_to_save.keys():
    filepath = f'{models_path}{name}.pkl'
    with open(filepath, 'rb') as f:
        loaded_model = pickle.load(f)
    print(f"  {name}: {type(loaded_model.named_steps['classifier']).__name__}")
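
The save-and-reload round trip can also be tried without Google Drive. The sketch below is a stand-alone illustration: it writes a stand-in pipeline (hypothetical, classifier step only) to a temporary directory, reloads it, and checks that the hyperparameters survive and that the model is still unfitted.

```python
import os
import pickle
import tempfile

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline (the real ones also include the preprocessing step)
demo_model = Pipeline([
    ('classifier', DecisionTreeClassifier(max_depth=5, random_state=42))
])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_decision_tree_model.pkl')

    # Context managers close the file handles even if an error occurs
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(demo_path, 'rb') as f:
        restored = pickle.load(f)

restored_clf = restored.named_steps['classifier']
is_fitted = hasattr(restored_clf, 'tree_')
print(restored_clf.get_params()['max_depth'], is_fitted)
```

Pickle files are tied to the library versions that wrote them, so it is safest to load these models with the same scikit-learn version used to save them.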

## 6. Summary

In this notebook, we built three decision tree classification models:

| Model | Configuration | Purpose |
|:------|:--------------|:--------|
| **Basic** | No constraints | Baseline (will overfit) |
| **Constrained** | max_depth=5, min_samples_split=20, min_samples_leaf=10 | Prevent overfitting |
| **Balanced** | + class_weight='balanced' | Handle class imbalance |

### Key Points

1. **Minimal Preprocessing**: Decision trees don't require feature scaling
2. **Categorical Encoding**: Still needed for scikit-learn implementation
3. **Complexity Control**: max_depth, min_samples_split, min_samples_leaf
4. **Class Imbalance**: Use class_weight='balanced' for imbalanced datasets
5. **Pipeline Approach**: Maintains consistency with our logistic regression models

### Connection to ML Cycle

We are in **Step 1: Build the Model**. We have:
- Created preprocessing pipeline for decision trees
- Built three model variants with different configurations
- Saved models for training

### Next Steps

In the next notebook, we will:
1. Train these models on our student departure data
2. Visualize the learned decision trees
3. Extract and interpret decision rules

**Proceed to:** `2.3 Train and Visualize Decision Trees`