# 3.2 **Build** a Random Forest Classification Model - Predict Student Departure

## Model Cycle: The 5 Key Steps

### **1. Build the Model : Create the Random Forest pipeline.**  
### 2. Train the Model : Fit the model on the training data.  
### 3. Generate Predictions : Use the trained model to make predictions.  
### 4. Evaluate the Model : Assess performance using evaluation metrics.  
### 5. Improve the Model : Tune hyperparameters for optimal performance.

## Introduction

In the previous notebook, we learned about ensemble learning, bagging, and how Random Forests work. Now we put theory into practice by building a Random Forest classification pipeline for predicting student departure.

One key advantage of Random Forests: **they don't require feature scaling**. Decision trees split on thresholds, not distances, so rescaling a feature changes the threshold values but not which rows fall on each side of a split. However, we'll keep our preprocessing pipeline for handling categorical variables.
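
To make the scaling claim concrete, here is a minimal sketch on synthetic data (not the course dataset). It fits two forests with the same seed, one on raw features and one on standardized features, and compares their test predictions. The data, sizes, and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): five continuous features on a 0-100 scale
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(400, 5))
y = (X[:, 0] + X[:, 1] > 100).astype(int)

X_train, X_test = X[:300], X[300:]
y_train = y[:300]

scaler = StandardScaler().fit(X_train)

# Same hyperparameters and seed; only the feature scale differs
rf_raw = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_train, y_train)
rf_scaled = RandomForestClassifier(n_estimators=25, random_state=42).fit(
    scaler.transform(X_train), y_train
)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))

# Fraction of test rows where both forests predict the same class
agreement = np.mean(pred_raw == pred_scaled)
print(f"Share of identical test predictions: {agreement:.3f}")
```

On data like this the two forests should agree on essentially every row, because standardization preserves the ordering of values within each feature.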

### Learning Objectives

By the end of this notebook, you will be able to:

1. Create Random Forest models using scikit-learn's `RandomForestClassifier`
2. Understand the key hyperparameters and their effects
3. Build a complete pipeline combining preprocessing and classification
4. Configure alternative Random Forest setups for comparison

## 1. Load Dependencies and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import pickle
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

pd.options.display.max_columns = None

In [None]:
# Set up file paths
root_filepath = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/'
data_filepath = f'{root_filepath}data/'
course3_filepath = f'{root_filepath}course_3/'

In [None]:
# Load training data
df_training = pd.read_csv(f'{data_filepath}training.csv')

print(f"Training data shape: {df_training.shape}")
print("\nTarget distribution:")
print(df_training['SEM_3_STATUS'].value_counts(normalize=True))

# value_counts() sorts by frequency, so position 0 is the majority class
class_counts = df_training['SEM_3_STATUS'].value_counts()
print(f"\nClass imbalance ratio (majority:minority): {class_counts.iloc[0] / class_counts.iloc[1]:.2f}:1")

In [None]:
# View available features
print("Available columns:")
print(df_training.columns.tolist())

In [None]:
# Preview the data
df_training.head()

## 2. Random Forests in Scikit-learn

### 2.1 The RandomForestClassifier Class

Scikit-learn provides `RandomForestClassifier` for classification tasks. It implements the Random Forest algorithm we discussed in the previous notebook.

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=None,        # Maximum tree depth (None = unlimited)
    max_features='sqrt',   # Features to consider at each split
    random_state=42        # For reproducibility
)
```

### 2.2 Key Hyperparameters

Random Forests have several important hyperparameters that control the ensemble and individual trees:

#### Ensemble-Level Parameters

| Parameter | Description | Default | Effect |
|:----------|:------------|:--------|:-------|
| `n_estimators` | Number of trees in the forest | 100 | More trees = more stable but slower |
| `bootstrap` | Whether to use bootstrap sampling | True | False = each tree sees all data |
| `oob_score` | Calculate out-of-bag score | False | Validation estimate from rows left out of each bootstrap sample; requires `bootstrap=True` |
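
As a small illustration of the `oob_score` row, the sketch below fits a forest on synthetic data (an assumption, not the course dataset) and reads the fitted `oob_score_` attribute. For a classifier this is the accuracy measured on the rows each tree did not see in its bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,   # OOB scoring is only available with bootstrap sampling
    oob_score=True,   # Compute the out-of-bag estimate during fit
    random_state=42,
)
rf.fit(X, y)

# Accuracy estimated from out-of-bag rows, available after fitting
oob_accuracy = rf.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```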

#### Tree-Level Parameters

| Parameter | Description | Default | Effect |
|:----------|:------------|:--------|:-------|
| `max_depth` | Maximum depth of each tree | None (unlimited) | Limits tree complexity |
| `max_features` | Features considered at each split | 'sqrt' | Controls randomness |
| `min_samples_split` | Min samples to split a node | 2 | Prevents overly specific splits |
| `min_samples_leaf` | Min samples in a leaf node | 1 | Controls leaf size |

#### Class Imbalance Parameters

| Parameter | Description | Default | Effect |
|:----------|:------------|:--------|:-------|
| `class_weight` | Weight for each class | None | Handle imbalanced data |
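
To see what `class_weight='balanced'` does, the sketch below computes the balanced weights for a made-up 80/20 label vector using scikit-learn's `compute_class_weight` helper and compares them with the documented formula `n_samples / (n_classes * np.bincount(y))`. The label vector is hypothetical and does not reflect the actual `SEM_3_STATUS` distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 80 rows of class 0 and 20 rows of class 1
y = np.array([0] * 80 + [1] * 20)
classes = np.unique(y)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights from the formula: n_samples / (n_classes * count per class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so errors on it count for more during training.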

In [None]:
# Visualize the effect of n_estimators
n_estimators_values = [1, 10, 50, 100, 200, 500]
stability = [0.3, 0.55, 0.75, 0.85, 0.90, 0.92]  # Simulated stability scores
training_time = [0.1, 0.5, 2, 4, 8, 20]  # Simulated relative training time

fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Model Stability vs. n_estimators',
    'Training Time vs. n_estimators'
))

fig.add_trace(go.Scatter(
    x=n_estimators_values, y=stability,
    mode='lines+markers',
    line=dict(color='darkblue', width=3),
    marker=dict(size=10)
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=n_estimators_values, y=training_time,
    mode='lines+markers',
    line=dict(color='darkred', width=3),
    marker=dict(size=10)
), row=1, col=2)

fig.update_xaxes(title='Number of Trees (n_estimators)')
fig.update_yaxes(title='Stability Score', row=1, col=1)
fig.update_yaxes(title='Relative Training Time', row=1, col=2)

fig.update_layout(
    title='Trade-off: More Trees = More Stable but Slower',
    height=400,
    showlegend=False
)

fig.show()

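**Key Insight**: The values plotted above are simulated for illustration, but the pattern is typical. Stability rises quickly with the first trees and then levels off, while training time grows roughly linearly with the number of trees. 100 to 500 trees is usually a good balance.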
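
The stability scores in this section are simulated. One rough way to measure the same idea is sketched below: fit forests of different sizes with several random seeds and check how much the predicted probabilities vary between seeds. The synthetic data, the choice of five seeds, and the spread metric are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

spread_by_size = {}
for n_trees in [5, 25, 100]:
    # Predicted probability of class 1 for each test row, one column per seed
    probas = np.column_stack([
        RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        .fit(X_train, y_train)
        .predict_proba(X_test)[:, 1]
        for seed in range(5)
    ])
    # Between-seed standard deviation per row, averaged over all test rows
    spread_by_size[n_trees] = probas.std(axis=1).mean()

for n_trees, spread in spread_by_size.items():
    print(f"{n_trees:>3} trees: mean between-seed std = {spread:.4f}")
```

A smaller spread means the forest's predictions depend less on the random seed, which is the sense of "stability" used here.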

In [None]:
# Visualize the effect of max_features
max_features_options = ['sqrt', 'log2', 0.5, 1.0]
descriptions = ['sqrt(p)', 'log2(p)', '50% of features', 'All features']
diversity = [0.9, 0.85, 0.7, 0.3]  # Higher = more diverse trees
individual_accuracy = [0.75, 0.78, 0.82, 0.88]  # Individual tree accuracy

fig = go.Figure()

fig.add_trace(go.Bar(
    x=descriptions,
    y=diversity,
    name='Tree Diversity',
    marker_color='darkblue'
))

fig.add_trace(go.Bar(
    x=descriptions,
    y=individual_accuracy,
    name='Individual Tree Accuracy',
    marker_color='lightblue'
))

fig.update_layout(
    title='max_features Trade-off: Diversity vs. Individual Accuracy',
    xaxis_title='max_features Setting',
    yaxis_title='Score',
    barmode='group',
    height=400
)

fig.show()

**Interpretation** (the bar heights are simulated to show the direction of the trade-off):
- Lower `max_features` creates more diverse (less correlated) trees, but each tree is less accurate
- Higher `max_features` creates more accurate individual trees, but they're more correlated
- The best setting usually lies in between; `'sqrt'` and `'log2'` are common choices for classification
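
To see how many features each `max_features` setting actually considers, the sketch below fits small forests on synthetic data with 30 features (an assumed count, not the course dataset's) and reads the resolved integer from a fitted tree's `max_features_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 30 features, for illustration only
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

features_per_split = {}
for setting in ["sqrt", "log2", 0.5, 1.0]:
    rf = RandomForestClassifier(n_estimators=5, max_features=setting, random_state=42)
    rf.fit(X, y)
    # Each fitted tree records the resolved feature count in max_features_
    features_per_split[setting] = rf.estimators_[0].max_features_

print(features_per_split)
```

With 30 features, these settings resolve to 5, 4, 15, and 30 candidate features per split. In the pipeline built in this notebook, the count is taken after one-hot encoding, so it depends on how many encoded columns the preprocessor produces.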

## 3. Build the Preprocessing Pipeline

### 3.1 Feature Groupings

Although Random Forests don't require scaling, we still need to:
1. Handle categorical variables (one-hot encoding)
2. Define which features to use

We'll use the same feature groupings from our logistic regression models for fair comparison.

In [None]:
# Define feature groups

# Numeric features (no scaling needed for Random Forests, but included for consistency)
numeric_columns = [
    'HS_GPA',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2',
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2'
]

# Categorical columns for one-hot encoding
categorical_columns = [
    'GENDER',
    'RACE_ETHNICITY',
    'FIRST_GEN_STATUS'
]

# All features
all_features = numeric_columns + categorical_columns
print(f"Total features: {len(all_features)}")
print(f"Numeric: {len(numeric_columns)}")
print(f"Categorical: {len(categorical_columns)}")

### 3.2 Preprocessor Configuration

For Random Forests, we have two options:

**Option A**: Minimal preprocessing (just handle categoricals)
```python
preprocessor = ColumnTransformer([
    ('passthrough', 'passthrough', numeric_columns),
    ('onehot', OneHotEncoder(...), categorical_columns)
])
```

**Option B**: Use the same preprocessing as logistic regression (for fair comparison)
```python
preprocessor = ColumnTransformer([
    ('minmax', MinMaxScaler(), minmax_columns),
    ('standard', StandardScaler(), standard_columns),
    ('onehot', OneHotEncoder(...), categorical_columns)
])
```

We'll use Option B for consistency, but note that **scaling has essentially no effect on Random Forest performance**, because it preserves the ordering of values within each feature.

In [None]:
# For fair comparison with logistic regression, use same preprocessing
# (But know that scaling is optional for Random Forests)

minmax_columns = [
    'HS_GPA',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2'
]

standard_columns = [
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2'
]

categorical_columns = [
    'GENDER',
    'RACE_ETHNICITY',
    'FIRST_GEN_STATUS'
]

In [None]:
# Build the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(), minmax_columns),
        ('standard', StandardScaler(), standard_columns),
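        # drop lists one reference category per categorical column, in the same
        # order as categorical_columns (GENDER, RACE_ETHNICITY, FIRST_GEN_STATUS)
        ('onehot', OneHotEncoder(handle_unknown='ignore',
                                  drop=['Female', 'Other', 'Unknown'],
                                  sparse_output=False), categorical_columns)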
    ],
    remainder='drop'
)

print("Preprocessor configured successfully.")

## 4. Build the Random Forest Pipeline

### 4.1 Basic Random Forest Model

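Our baseline Random Forest keeps scikit-learn's defaults for the tree settings and changes two options from their defaults:
- 100 trees (the default and a standard starting point)
- `max_features='sqrt'` (default for classification)
- `class_weight='balanced'` (not the default; reweights classes to handle imbalance)
- `oob_score=True` (not the default; gives a validation estimate without a separate hold-out set)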

In [None]:
# Build the baseline Random Forest pipeline
rf_baseline_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,           # Number of trees
        max_depth=None,             # No depth limit (trees grow fully)
        max_features='sqrt',        # sqrt(n_features) at each split
        min_samples_split=2,        # Minimum samples to split
        min_samples_leaf=1,         # Minimum samples in leaf
        bootstrap=True,             # Use bootstrap sampling
        oob_score=True,             # Calculate out-of-bag score
        class_weight='balanced',    # Handle class imbalance
        random_state=42,            # For reproducibility
        n_jobs=-1                   # Use all CPU cores
    ))
])

print("Baseline Random Forest Model:")
rf_baseline_model

In [None]:
# Display model configuration
classifier = rf_baseline_model.named_steps['classifier']
print("Model Configuration:")
print("="*50)
print(f"n_estimators: {classifier.n_estimators}")
print(f"max_depth: {classifier.max_depth}")
print(f"max_features: {classifier.max_features}")
print(f"min_samples_split: {classifier.min_samples_split}")
print(f"min_samples_leaf: {classifier.min_samples_leaf}")
print(f"bootstrap: {classifier.bootstrap}")
print(f"oob_score: {classifier.oob_score}")
print(f"class_weight: {classifier.class_weight}")

### 4.2 Alternative Configurations

Let's create a few alternative Random Forest configurations for comparison:

1. **More Trees**: 500 trees for increased stability
2. **Constrained Depth**: Limited tree depth to reduce overfitting
3. **Different max_features**: Using log2 instead of sqrt

In [None]:
# Model 2: More trees (500)
rf_large_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=500,           # More trees for stability
        max_depth=None,
        max_features='sqrt',
        bootstrap=True,
        oob_score=True,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

print("Large Random Forest Model (500 trees):")
rf_large_model

In [None]:
# Model 3: Constrained depth (prevent overfitting)
rf_constrained_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=10,               # Limit tree depth
        max_features='sqrt',
        min_samples_split=5,        # Require more samples to split
        min_samples_leaf=2,         # Require at least 2 samples in leaves
        bootstrap=True,
        oob_score=True,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

print("Constrained Random Forest Model (max_depth=10):")
rf_constrained_model

In [None]:
# Model 4: Different max_features (log2)
rf_log2_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=None,
        max_features='log2',        # log2 of features at each split
        bootstrap=True,
        oob_score=True,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

print("Random Forest Model with log2 features:")
rf_log2_model
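
The four pipelines above differ in only a few classifier settings. An alternative way to define such variants is sketched below, using `clone` and `set_params` with the `step__parameter` naming convention. The stand-in pipeline uses `StandardScaler` in place of this notebook's preprocessor, and the variant names are hypothetical.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; StandardScaler substitutes for this notebook's preprocessor
base = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# clone() copies the unfitted pipeline, so each variant owns its own steps
variants = {
    "large": clone(base).set_params(classifier__n_estimators=500),
    "constrained": clone(base).set_params(
        classifier__max_depth=10,
        classifier__min_samples_split=5,
        classifier__min_samples_leaf=2,
    ),
    "log2": clone(base).set_params(classifier__max_features="log2"),
}

for name, pipe in variants.items():
    clf = pipe.named_steps["classifier"]
    print(name, clf.n_estimators, clf.max_depth, clf.max_features)
```

One design difference from the cells above: pipelines that share a single `preprocessor` object will each refit that same object when trained, whereas cloned pipelines keep independent copies. When every pipeline is trained on the same data the result is the same either way.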

## 5. Compare Pipeline Structures

In [None]:
# Summary of all Random Forest models
models = {
    'RF Baseline': rf_baseline_model,
    'RF Large (500)': rf_large_model,
    'RF Constrained': rf_constrained_model,
    'RF Log2': rf_log2_model
}

print("Random Forest Models Configured:")
print("="*70)
for name, model in models.items():
    clf = model.named_steps['classifier']
    print(f"\n{name}:")
    print(f"  n_estimators: {clf.n_estimators}")
    print(f"  max_depth: {clf.max_depth}")
    print(f"  max_features: {clf.max_features}")
    print(f"  min_samples_split: {clf.min_samples_split}")
    print(f"  min_samples_leaf: {clf.min_samples_leaf}")

In [None]:
# Create a comparison table
comparison_data = []
for name, model in models.items():
    clf = model.named_steps['classifier']
    comparison_data.append({
        'Model': name,
        'n_estimators': clf.n_estimators,
        'max_depth': str(clf.max_depth),
        'max_features': clf.max_features,
        'min_samples_split': clf.min_samples_split,
        'min_samples_leaf': clf.min_samples_leaf
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df

In [None]:
# Visualize model configurations
fig = go.Figure()

model_names = list(models.keys())
n_trees = [models[m].named_steps['classifier'].n_estimators for m in model_names]
max_depths = [models[m].named_steps['classifier'].max_depth for m in model_names]
max_depths_display = [d if d is not None else 50 for d in max_depths]  # Display None as high value

fig.add_trace(go.Bar(
    name='n_estimators',
    x=model_names,
    y=n_trees,
    marker_color='darkblue'
))

fig.add_trace(go.Bar(
    name='max_depth (None=50)',
    x=model_names,
    y=max_depths_display,
    marker_color='lightblue'
))

fig.update_layout(
    title='Random Forest Model Configurations',
    xaxis_title='Model',
    yaxis_title='Value',
    barmode='group',
    height=400
)

fig.show()

## 6. Save Models for Future Use

In [None]:
# Create models directory for Course 3 Module 3 if it doesn't exist
import os
models_path = f'{course3_filepath}models/'
os.makedirs(models_path, exist_ok=True)

# Save each model pipeline (untrained)
for name, model in models.items():
    # Create a clean filename
    filename = name.lower().replace(' ', '_').replace('(', '').replace(')', '')
    filepath = f'{models_path}{filename}_model.pkl'
    # Use a context manager so the file is flushed and closed after writing
    with open(filepath, 'wb') as f:
        pickle.dump(model, f)
    print(f"Saved: {filepath}")

In [None]:
# Verify saved models
print("\nVerifying saved models:")
for name, model in models.items():
    filename = name.lower().replace(' ', '_').replace('(', '').replace(')', '')
    filepath = f'{models_path}{filename}_model.pkl'
    with open(filepath, 'rb') as f:
        loaded_model = pickle.load(f)
    print(f"{name}: {type(loaded_model.named_steps['classifier']).__name__}")
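
Since the pipelines are saved before training, a quick sanity check is that a restored pipeline keeps its settings but is still unfitted. The sketch below does an in-memory pickle round trip on a stand-in pipeline (an assumption; it does not touch the files saved above) and uses scikit-learn's `check_is_fitted` to test the classifier.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

# Stand-in pipeline for illustration
pipe = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Round trip through pickle without writing to disk
restored = pickle.loads(pickle.dumps(pipe))

try:
    check_is_fitted(restored.named_steps["classifier"])
    is_fitted = True
except NotFittedError:
    is_fitted = False

# Hyperparameters survive the round trip even though nothing has been trained
same_settings = restored.get_params()["classifier__n_estimators"] == 10

print(f"Restored classifier is fitted: {is_fitted}")
print(f"Settings preserved: {same_settings}")
```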

## 7. Summary

In this notebook, we built four Random Forest classification models for predicting student departure:

### Models Built

| Model | Description | Key Settings |
|:------|:------------|:-------------|
| **RF Baseline** | Standard configuration | 100 trees, sqrt features, no depth limit |
| **RF Large** | More trees for stability | 500 trees, sqrt features |
| **RF Constrained** | Limited depth to reduce overfitting | 100 trees, max_depth=10, min_samples_split=5, min_samples_leaf=2 |
| **RF Log2** | Alternative feature sampling | 100 trees, log2 features |

### Key Points

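1. **No scaling required**: Random Forests split on thresholds, so numeric features can be used unscaled; categorical features still need encoding
2. **class_weight='balanced'**: Weights each class inversely to its frequency to counter class imbalance
3. **oob_score=True**: Provides a validation estimate from out-of-bag samples without a separate hold-out set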
4. **n_jobs=-1**: Uses all CPU cores for faster training

### Random Forest vs. Logistic Regression Pipelines

| Aspect | Logistic Regression | Random Forest |
|:-------|:--------------------|:--------------|
| Scaling | Required | Optional |
| Training speed | Fast | Slower (many trees) |
| Key hyperparameters | C, penalty | n_estimators, max_depth, max_features |
| Built-in validation | No | Yes (OOB score) |

### Next Steps

In the next notebook, we will:
1. Train these Random Forest models on our data
2. Evaluate using Out-of-Bag (OOB) scores and cross-validation
3. Examine feature importance to understand what drives predictions

**Proceed to:** `3.3 Train and Evaluate Random Forests`