# 3.3 **Train and Evaluate** Random Forests - Predict Student Departure

## Model Cycle: The 5 Key Steps

### 1. Build the Model: Create the Random Forest pipeline.
### **2. Train the Model: Fit the model on the training data.**
### **3. Generate Predictions: Use the trained model to make predictions.**
### **4. Evaluate the Model: Assess performance using evaluation metrics.**
### 5. Improve the Model: Tune hyperparameters for optimal performance.

## Introduction

In the previous notebook, we built several Random Forest classification pipelines. Now we train these models and evaluate their performance using multiple techniques:

1. **Out-of-Bag (OOB) Error**: A "free" validation metric unique to bagging methods
2. **Cross-Validation**: Standard technique for assessing generalization
3. **Feature Importance**: Understanding which features drive predictions
4. **Test Set Evaluation**: Final performance assessment

### Learning Objectives

By the end of this notebook, you will be able to:

1. Train Random Forest models on the student departure dataset
2. Understand and use Out-of-Bag (OOB) error for model evaluation
3. Extract and interpret feature importance from Random Forests
4. Evaluate model performance using confusion matrices, ROC curves, and various metrics

## 1. Load Dependencies and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import pickle
import time
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)

pd.options.display.max_columns = None

In [None]:
# Set up file paths
root_filepath = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/'
data_filepath = f'{root_filepath}data/'
course3_filepath = f'{root_filepath}course_3/'
models_path = f'{course3_filepath}models/'

In [None]:
# Load training and testing data
df_training = pd.read_csv(f'{data_filepath}training.csv')
df_testing = pd.read_csv(f'{data_filepath}testing.csv')

print(f"Training data: {df_training.shape}")
print(f"Testing data: {df_testing.shape}")

In [None]:
# Define features and target
# Drop the target from the feature matrix so it cannot leak into the model
X_train = df_training.drop(columns=['SEM_3_STATUS'])
y_train = df_training['SEM_3_STATUS']

X_test = df_testing.drop(columns=['SEM_3_STATUS'])
y_test = df_testing['SEM_3_STATUS']

print("Training target distribution:")
print(y_train.value_counts())
print("\nTesting target distribution:")
print(y_test.value_counts())

In [None]:
# Load the saved models
model_files = {
    'RF Baseline': 'rf_baseline_model.pkl',
    'RF Large (500)': 'rf_large_500_model.pkl',
    'RF Constrained': 'rf_constrained_model.pkl',
    'RF Log2': 'rf_log2_model.pkl'
}

models = {}
for name, filename in model_files.items():
    filepath = f'{models_path}{filename}'
    # Use a context manager so the file handle is closed after loading
    with open(filepath, 'rb') as f:
        models[name] = pickle.load(f)
    print(f"Loaded: {name}")

## 2. Train the Random Forest Models

### 2.1 Training the Baseline Model

Let's first train and examine the baseline Random Forest model in detail.

In [None]:
# Train the baseline model
print("Training RF Baseline model...")
start_time = time.time()

models['RF Baseline'].fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

In [None]:
# Examine the trained model
rf_classifier = models['RF Baseline'].named_steps['classifier']

print("Trained Random Forest Properties:")
print("="*50)
print(f"Number of trees: {rf_classifier.n_estimators}")
print(f"Number of features seen: {rf_classifier.n_features_in_}")
print(f"Classes: {rf_classifier.classes_}")
print(f"OOB Score available: {hasattr(rf_classifier, 'oob_score_')}")

### 2.2 Training All Models

Now let's train all our Random Forest models and record their training times.

In [None]:
# Train all models and record times
training_times = {}

for name, model in models.items():
    print(f"Training {name}...")
    start_time = time.time()
    
    model.fit(X_train, y_train)
    
    training_times[name] = time.time() - start_time
    print(f"  Completed in {training_times[name]:.2f} seconds")

print("\nAll models trained successfully!")

In [None]:
# Visualize training times
fig = go.Figure()

fig.add_trace(go.Bar(
    x=list(training_times.keys()),
    y=list(training_times.values()),
    marker_color=['darkblue', 'blue', 'lightblue', 'steelblue']
))

fig.update_layout(
    title='Random Forest Training Times',
    xaxis_title='Model',
    yaxis_title='Training Time (seconds)',
    height=400
)

fig.show()

## 3. Out-of-Bag (OOB) Evaluation

### 3.1 What is OOB Error?

**Out-of-Bag (OOB) error** is a unique validation technique for bagging methods:

1. Each tree in the forest is trained on a bootstrap sample: n rows drawn with replacement, which contain about 63% of the unique training rows
2. The remaining ~37% of rows (the out-of-bag samples for that tree) can be used for validation
3. For each sample, the OOB prediction averages the predictions of only those trees that did not see it during training

**Benefits of OOB:**
- "Free" validation without a separate holdout set
- Uses all available training data
- Generally a good estimate of test error

```
For sample i:
  - Trees trained with sample i: Cannot evaluate
  - Trees trained WITHOUT sample i: Can predict and average
  - OOB prediction = average of predictions from trees that didn't see i
```
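
The ~63% figure follows from sampling with replacement: the chance that a given row is never picked in n draws is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. The sketch below checks this with a single simulated bootstrap sample. The sample size `n = 10_000` is an arbitrary choice for illustration, not the size of the student dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # hypothetical training-set size, chosen only for illustration

# One bootstrap sample: n row indices drawn with replacement
bootstrap_idx = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(bootstrap_idx).size / n
theoretical = 1 - (1 - 1 / n) ** n  # approaches 1 - 1/e as n grows

print(f"Simulated in-bag fraction:   {in_bag_fraction:.3f}")
print(f"Theoretical in-bag fraction: {theoretical:.3f}")
print(f"Out-of-bag fraction:         {1 - in_bag_fraction:.3f}")
```

The simulated and theoretical fractions should agree to within a percentage point or so; the gap shrinks as `n` increases.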

In [None]:
# Visualize OOB concept
np.random.seed(42)

n_trees = 5
n_samples = 10

# Simulate which samples are in each tree's bootstrap
in_bootstrap = np.random.binomial(1, 0.63, (n_trees, n_samples))

fig = go.Figure()

# Heatmap: 1 = in bootstrap, 0 = out-of-bag
fig.add_trace(go.Heatmap(
    z=in_bootstrap,
    x=[f'Sample {i+1}' for i in range(n_samples)],
    y=[f'Tree {i+1}' for i in range(n_trees)],
    colorscale=[[0, 'lightcoral'], [1, 'lightgreen']],
    showscale=False
))

# Add text annotations
for i in range(n_trees):
    for j in range(n_samples):
        text = 'Train' if in_bootstrap[i, j] else 'OOB'
        fig.add_annotation(
            x=j, y=i, text=text,
            showarrow=False, font=dict(size=10)
        )

fig.update_layout(
    title='Out-of-Bag (OOB) Samples: Green=Training, Red=Out-of-Bag',
    xaxis_title='Samples',
    yaxis_title='Trees',
    height=350
)

fig.show()

# Show OOB ratio per sample
oob_ratio = 1 - in_bootstrap.mean(axis=0)
print(f"\nProportion of trees where each sample is OOB:")
for i in range(n_samples):
    print(f"  Sample {i+1}: {oob_ratio[i]*100:.0f}%")

### 3.2 OOB Scores for All Models

In [None]:
# Get OOB scores for all models
oob_scores = {}

print("Out-of-Bag Scores:")
print("="*50)
for name, model in models.items():
    rf = model.named_steps['classifier']
    if hasattr(rf, 'oob_score_'):
        oob_scores[name] = rf.oob_score_
        print(f"{name}: {rf.oob_score_:.4f}")
    else:
        print(f"{name}: OOB score not available (oob_score=False)")
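
To see where `oob_score_` comes from, the sketch below fits a forest on synthetic data (a stand-in, not the student departure dataset) and recomputes the OOB accuracy by hand from `oob_decision_function_`. The names `X_demo`, `y_demo`, and `rf_demo` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the student departure dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

rf_demo = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # without this, oob_score_ is never set
    bootstrap=True,   # OOB requires bootstrap sampling (the default)
    random_state=42,
)
rf_demo.fit(X_demo, y_demo)

# Each row holds class probabilities averaged over the trees that did NOT see that sample
oob_proba = rf_demo.oob_decision_function_
oob_pred = rf_demo.classes_[np.argmax(oob_proba, axis=1)]
manual_oob_accuracy = np.mean(oob_pred == y_demo)

print(f"oob_score_:          {rf_demo.oob_score_:.4f}")
print(f"Manual OOB accuracy: {manual_oob_accuracy:.4f}")
```

With only a handful of trees, some samples may never be out-of-bag and scikit-learn warns about it; using a few hundred trees makes that case very unlikely.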

In [None]:
# Visualize OOB scores
fig = go.Figure()

fig.add_trace(go.Bar(
    x=list(oob_scores.keys()),
    y=list(oob_scores.values()),
    marker_color='darkgreen',
    text=[f'{v:.4f}' for v in oob_scores.values()],
    textposition='outside'
))

fig.update_layout(
    title='Out-of-Bag (OOB) Accuracy Scores',
    xaxis_title='Model',
    yaxis_title='OOB Accuracy',
    yaxis=dict(range=[0.5, 1.0]),
    height=400
)

fig.show()

## 4. Cross-Validation Evaluation

### 4.1 Stratified K-Fold Cross-Validation

While OOB provides a free validation metric, we'll also use cross-validation for comparison with our logistic regression models.
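
Stratification means each fold keeps roughly the same class balance as the full training set, which matters when one class is less common. A minimal sketch on a hypothetical target with 20% positives (the arrays `X_demo` and `y_demo` are made up for this example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 20 positives out of 100
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_positive_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    rate = y_demo[val_idx].mean()
    fold_positive_rates.append(rate)
    print(f"Fold {fold}: {len(val_idx)} validation rows, positive rate = {rate:.2f}")

print(f"Overall positive rate = {y_demo.mean():.2f}")
```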

In [None]:
# Set up cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate all models with multiple metrics
cv_results = {}

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}

print("Running cross-validation for all models...")
print("="*50)

for name, model in models.items():
    print(f"\nCross-validating {name}...")
    
    results = cross_validate(
        model, X_train, y_train,
        cv=cv,
        scoring=scoring,
        return_train_score=True
    )
    
    cv_results[name] = results
    
    print(f"  Accuracy: {results['test_accuracy'].mean():.4f} (+/- {results['test_accuracy'].std()*2:.4f})")
    print(f"  ROC-AUC: {results['test_roc_auc'].mean():.4f} (+/- {results['test_roc_auc'].std()*2:.4f})")

### 4.2 Multiple Metrics

In [None]:
# Create a comprehensive comparison table
metrics_summary = []

for name in models.keys():
    results = cv_results[name]
    metrics_summary.append({
        'Model': name,
        'Accuracy': f"{results['test_accuracy'].mean():.4f} (+/- {results['test_accuracy'].std()*2:.4f})",
        'Precision': f"{results['test_precision'].mean():.4f} (+/- {results['test_precision'].std()*2:.4f})",
        'Recall': f"{results['test_recall'].mean():.4f} (+/- {results['test_recall'].std()*2:.4f})",
        'F1-Score': f"{results['test_f1'].mean():.4f} (+/- {results['test_f1'].std()*2:.4f})",
        'ROC-AUC': f"{results['test_roc_auc'].mean():.4f} (+/- {results['test_roc_auc'].std()*2:.4f})"
    })

metrics_df = pd.DataFrame(metrics_summary)
metrics_df

In [None]:
# Visualize cross-validation results
metric_names = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
model_names = list(models.keys())

fig = go.Figure()

colors = px.colors.qualitative.Set2

for i, metric in enumerate(metric_names):
    means = [cv_results[name][f'test_{metric}'].mean() for name in model_names]
    fig.add_trace(go.Bar(
        name=metric.upper(),
        x=model_names,
        y=means,
        marker_color=colors[i]
    ))

fig.update_layout(
    title='Cross-Validation Metrics Comparison',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    height=450,
    yaxis=dict(range=[0, 1])
)

fig.show()

In [None]:
# Compare train vs test scores to check for overfitting
fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Accuracy: Train vs Test',
    'ROC-AUC: Train vs Test'
))

for i, metric in enumerate(['accuracy', 'roc_auc']):
    train_scores = [cv_results[name][f'train_{metric}'].mean() for name in model_names]
    test_scores = [cv_results[name][f'test_{metric}'].mean() for name in model_names]
    
    fig.add_trace(go.Bar(name='Train', x=model_names, y=train_scores, 
                         marker_color='lightblue', showlegend=(i==0)), row=1, col=i+1)
    fig.add_trace(go.Bar(name='Test', x=model_names, y=test_scores, 
                         marker_color='darkblue', showlegend=(i==0)), row=1, col=i+1)

fig.update_layout(
    title='Checking for Overfitting: Train vs Test Scores',
    height=400,
    barmode='group'
)

fig.show()

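**Interpretation**: If train scores are much higher than test scores, the model may be overfitting. Random Forests with fully grown trees often score close to 1.0 on their own training data, so some gap is expected. The constrained model (with limited depth) typically shows a smaller gap.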
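
The train-test gap can also be computed directly rather than read off the chart. The sketch below does this on synthetic data for two hypothetical forests, one fully grown and one limited to `max_depth=3`; these settings are assumptions for illustration and are not the saved course models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data with some label noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=42
)

# Hypothetical settings, not the course's saved models
candidates = {
    "fully grown": RandomForestClassifier(n_estimators=100, random_state=42),
    "max_depth=3": RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42),
}

gaps = {}
for label, forest in candidates.items():
    res = cross_validate(forest, X_demo, y_demo, cv=5,
                         scoring="accuracy", return_train_score=True)
    train_mean = res["train_score"].mean()
    test_mean = res["test_score"].mean()
    gaps[label] = train_mean - test_mean
    print(f"{label:12s} train={train_mean:.3f} test={test_mean:.3f} gap={gaps[label]:.3f}")
```

A large gap by itself does not mean the fully grown forest generalizes worse; compare the test scores as well before preferring the constrained model.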

## 5. Feature Importance

### 5.1 Understanding Feature Importance

Random Forests provide **feature importance** scores based on how much each feature contributes to reducing impurity (Gini or entropy) across all trees.

**Mean Decrease in Impurity (MDI)**:
- For each feature, sum the impurity decrease at every split that uses that feature, weighting each split by the share of training samples that reach it
- Average across all trees
- Normalize so all importances sum to 1

**Interpretation**:
- Higher importance = feature is more useful for predictions
- Does NOT indicate direction of effect (unlike logistic regression coefficients)
- Can be biased toward high-cardinality features
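
The sketch below illustrates two of these points on synthetic data: the forest-level MDI sums to 1 and equals the average of the per-tree importances. It also computes permutation importance on held-out rows as a complementary check; that is a different technique from MDI and is included here only as a suggested comparison, not as part of the course pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# MDI: the forest-level importance is the average of the per-tree importances
mdi = rf_demo.feature_importances_
per_tree_mean = np.mean([t.feature_importances_ for t in rf_demo.estimators_], axis=0)
print(f"MDI sums to: {mdi.sum():.4f}")
print(f"Matches per-tree average: {np.allclose(mdi, per_tree_mean)}")

# Permutation importance: drop in validation accuracy when one feature is shuffled
perm = permutation_importance(rf_demo, X_val, y_val, n_repeats=10, random_state=42)
for i in np.argsort(mdi)[::-1]:
    print(f"feature_{i}: MDI={mdi[i]:.3f}  permutation={perm.importances_mean[i]:.3f}")
```

When the two rankings disagree sharply for a feature, that is a hint that its MDI score may reflect how often it can be split on rather than how much it helps on unseen data.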

In [None]:
# Get feature names from the preprocessor
preprocessor = models['RF Baseline'].named_steps['preprocessing']

# Get feature names from each transformer
minmax_features = preprocessor.transformers_[0][2]  # minmax columns
standard_features = preprocessor.transformers_[1][2]  # standard columns
onehot_features = list(preprocessor.transformers_[2][1].get_feature_names_out(
    preprocessor.transformers_[2][2]
))

# Combine all feature names
all_feature_names = list(minmax_features) + list(standard_features) + onehot_features
print(f"Total features after preprocessing: {len(all_feature_names)}")
print(f"Feature names: {all_feature_names}")

In [None]:
# Get feature importances from baseline model
rf_baseline = models['RF Baseline'].named_steps['classifier']
importances = rf_baseline.feature_importances_

# Create importance dataframe
importance_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("Feature Importances (RF Baseline):")
importance_df

### 5.2 Visualizing Feature Importance

In [None]:
# Plot feature importances
fig = go.Figure()

# Sort for visualization
sorted_df = importance_df.sort_values('Importance', ascending=True)

fig.add_trace(go.Bar(
    y=sorted_df['Feature'],
    x=sorted_df['Importance'],
    orientation='h',
    marker_color='darkgreen'
))

fig.update_layout(
    title='Random Forest Feature Importances (Baseline Model)',
    xaxis_title='Importance (Mean Decrease in Impurity)',
    yaxis_title='Feature',
    height=500
)

fig.show()

In [None]:
# Compare feature importances across models
importance_comparison = pd.DataFrame({'Feature': all_feature_names})

for name, model in models.items():
    rf = model.named_steps['classifier']
    importance_comparison[name] = rf.feature_importances_

# Display top 5 features for each model
print("Top 5 Features by Model:")
print("="*60)
for name in models.keys():
    print(f"\n{name}:")
    top5 = importance_comparison.nlargest(5, name)[['Feature', name]]
    for _, row in top5.iterrows():
        print(f"  {row['Feature']}: {row[name]:.4f}")

In [None]:
# Visualize importance comparison for top features
top_features = importance_df.head(10)['Feature'].tolist()
comparison_subset = importance_comparison[importance_comparison['Feature'].isin(top_features)]

fig = go.Figure()

colors = ['darkblue', 'blue', 'lightblue', 'steelblue']
for i, name in enumerate(models.keys()):
    subset = comparison_subset.sort_values('Feature')
    fig.add_trace(go.Bar(
        name=name,
        x=subset['Feature'],
        y=subset[name],
        marker_color=colors[i]
    ))

fig.update_layout(
    title='Feature Importance Comparison Across Models (Top 10 Features)',
    xaxis_title='Feature',
    yaxis_title='Importance',
    barmode='group',
    height=450,
    xaxis_tickangle=45
)

fig.show()

### 5.3 Comparing to Logistic Regression

Random Forest importance tells us which features are useful, but not the *direction* of their effect. For that interpretation, logistic regression coefficients are more informative.
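
A small synthetic example makes the contrast concrete. Here a hypothetical feature `gpa` lowers the probability of the positive class and `noise` is unrelated to it. Both names and the data-generating rule are invented for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical features: higher "gpa" lowers departure risk, "noise" is unrelated
gpa = rng.normal(size=n)
noise = rng.normal(size=n)
X_demo = np.column_stack([gpa, noise])
y_demo = (-2.0 * gpa + rng.normal(size=n) > 0).astype(int)  # 1 = departed

logit = LogisticRegression().fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

for i, name in enumerate(["gpa", "noise"]):
    print(f"{name:6s} logistic coef={logit.coef_[0][i]:+.3f}  "
          f"RF importance={forest.feature_importances_[i]:.3f}")
```

The logistic coefficient for `gpa` is negative, signalling that higher values reduce the predicted risk, while the Random Forest importance is a non-negative magnitude that only says the feature matters.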

In [None]:
# Create interpretation summary
print("Feature Importance Interpretation:")
print("="*60)
print("\nTop 5 Most Important Features for Predicting Student Departure:")
print()

for i, (_, row) in enumerate(importance_df.head(5).iterrows(), 1):
    feature = row['Feature']
    importance = row['Importance']
    
    # Add interpretation
    if 'GPA' in feature:
        interpretation = "Academic performance indicator"
    elif 'DFW' in feature:
        interpretation = "Course failure/withdrawal rate"
    elif 'UNITS' in feature:
        interpretation = "Course load measure"
    else:
        interpretation = "Demographic characteristic"
    
    print(f"{i}. {feature}")
    print(f"   Importance: {importance:.4f} ({importance*100:.1f}% of total)")
    print(f"   Interpretation: {interpretation}")
    print()

## 6. Generate and Evaluate Predictions

### 6.1 Predictions on Test Data

In [None]:
# Generate predictions for all models on test data
predictions = {}
probabilities = {}

for name, model in models.items():
    predictions[name] = model.predict(X_test)
    probabilities[name] = model.predict_proba(X_test)[:, 1]  # Probability of class 1
    
print("Predictions generated for all models.")

In [None]:
# Calculate test metrics for all models
test_metrics = []

for name in models.keys():
    y_pred = predictions[name]
    y_prob = probabilities[name]
    
    test_metrics.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_prob)
    })

test_metrics_df = pd.DataFrame(test_metrics)
test_metrics_df

### 6.2 Confusion Matrix and Classification Report

In [None]:
# Generate confusion matrices for all models
fig = make_subplots(rows=2, cols=2, subplot_titles=list(models.keys()))

for idx, (name, y_pred) in enumerate(predictions.items()):
    cm = confusion_matrix(y_test, y_pred)
    row = idx // 2 + 1
    col = idx % 2 + 1
    
    # Add heatmap
    fig.add_trace(
        go.Heatmap(
            z=cm,
            x=['Predicted Retained', 'Predicted Departed'],
            y=['Actual Retained', 'Actual Departed'],
            colorscale='Blues',
            showscale=False,
            text=cm,
            texttemplate='%{text}',
            textfont=dict(size=14)
        ),
        row=row, col=col
    )

fig.update_layout(
    title='Confusion Matrices for All Models',
    height=600
)

fig.show()

In [None]:
# Detailed classification report for baseline model
print("Classification Report - RF Baseline:")
print("="*60)
print(classification_report(y_test, predictions['RF Baseline'], 
                           target_names=['Retained', 'Departed']))
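
The numbers in the classification report can be derived from the four cells of the confusion matrix. A worked sketch on ten made-up labels (the arrays below are hypothetical, with 1 standing for departed and 0 for retained):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = departed, 0 = retained
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary 0/1 labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of those flagged as departed, how many actually departed
recall = tp / (tp + fn)     # of those who departed, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here 3 of the 5 predicted departures are correct (precision 0.6) and 3 of the 4 actual departures are caught (recall 0.75), matching what `precision_score` and `recall_score` return for the same arrays.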

### 6.3 ROC Curve and AUC

In [None]:
# Plot ROC curves for all models
fig = go.Figure()

colors = ['darkblue', 'blue', 'lightblue', 'steelblue']

for idx, (name, y_prob) in enumerate(probabilities.items()):
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        mode='lines',
        name=f'{name} (AUC={auc:.4f})',
        line=dict(color=colors[idx], width=2)
    ))

# Add diagonal reference line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random (AUC=0.5)',
    line=dict(color='gray', width=1, dash='dash')
))

fig.update_layout(
    title='ROC Curves - Random Forest Models',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    height=500
)

fig.show()
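
The AUC reported in the legend is the area under the plotted curve. The sketch below, on eight made-up scores, shows that integrating the `(fpr, tpr)` points with `sklearn.metrics.auc` gives the same value as `roc_auc_score`. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class
y_true_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob_demo = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true_demo, y_prob_demo)

# Area under the (fpr, tpr) curve via the trapezoidal rule
auc_trapezoid = auc(fpr, tpr)
auc_direct = roc_auc_score(y_true_demo, y_prob_demo)

print(f"AUC from curve:         {auc_trapezoid:.4f}")
print(f"AUC from roc_auc_score: {auc_direct:.4f}")
```

AUC can also be read as the share of positive-negative pairs in which the positive case receives the higher score; in this example that is 12 of the 16 possible pairs.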

## 7. Model Comparison Summary

In [None]:
# Create comprehensive comparison table
final_comparison = []

for name in models.keys():
    final_comparison.append({
        'Model': name,
        'OOB Score': f"{oob_scores[name]:.4f}" if name in oob_scores else 'N/A',
        'CV Accuracy': f"{cv_results[name]['test_accuracy'].mean():.4f}",
        'CV ROC-AUC': f"{cv_results[name]['test_roc_auc'].mean():.4f}",
        'Test Accuracy': f"{test_metrics_df[test_metrics_df['Model']==name]['Accuracy'].values[0]:.4f}",
        'Test ROC-AUC': f"{test_metrics_df[test_metrics_df['Model']==name]['ROC-AUC'].values[0]:.4f}",
        'Training Time (s)': f"{training_times[name]:.2f}"
    })

final_comparison_df = pd.DataFrame(final_comparison)
print("Model Comparison Summary:")
final_comparison_df

In [None]:
# Visualize final comparison
fig = go.Figure()

metrics_to_plot = ['OOB Score', 'CV Accuracy', 'CV ROC-AUC', 'Test ROC-AUC']
model_names = final_comparison_df['Model'].tolist()

for metric in metrics_to_plot:
    if metric == 'OOB Score':
        values = [oob_scores.get(name, 0) for name in model_names]
    elif metric == 'CV Accuracy':
        values = [cv_results[name]['test_accuracy'].mean() for name in model_names]
    elif metric == 'CV ROC-AUC':
        values = [cv_results[name]['test_roc_auc'].mean() for name in model_names]
    else:
        values = test_metrics_df['ROC-AUC'].tolist()
    
    fig.add_trace(go.Bar(
        name=metric,
        x=model_names,
        y=values
    ))

fig.update_layout(
    title='Random Forest Model Performance Comparison',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    yaxis=dict(range=[0.5, 1.0]),
    height=450
)

fig.show()

In [None]:
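# Select the best model by test ROC-AUC. Because the test set is used for this choice,
# its score is an optimistic estimate for the selected model.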
best_model_name = test_metrics_df.loc[test_metrics_df['ROC-AUC'].idxmax(), 'Model']
best_model = models[best_model_name]

print(f"Best model based on test ROC-AUC: {best_model_name}")

# Save the trained best model (the context manager closes the file after writing)
best_model_path = f'{models_path}rf_best_trained_model.pkl'
with open(best_model_path, 'wb') as f:
    pickle.dump(best_model, f)
print(f"Saved to: {best_model_path}")
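
Before relying on a saved model in a later notebook, it can help to confirm that it reloads and reproduces the same predictions. A minimal round-trip sketch on synthetic data, writing to a temporary directory; the file name `rf_demo_model.pkl` is hypothetical and unrelated to the course files.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "rf_demo_model.pkl")  # hypothetical file name

    with open(demo_path, "wb") as f:
        pickle.dump(model_demo, f)

    with open(demo_path, "rb") as f:
        reloaded = pickle.load(f)

same_predictions = (reloaded.predict(X_demo) == model_demo.predict(X_demo)).all()
print(f"Reloaded model reproduces predictions: {same_predictions}")
```

Pickled scikit-learn models are best reloaded with the same library version that saved them, and pickle files should only be loaded from trusted sources.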

## 8. Summary

In this notebook, we trained and evaluated Random Forest models for predicting student departure.

### Key Findings

1. **OOB Scores**: Provided quick validation without a holdout set
2. **Cross-Validation**: Confirmed model generalization across folds
3. **Feature Importance**: Academic performance features (GPA, DFW rates) were most important
4. **Test Performance**: Models achieved competitive accuracy and ROC-AUC scores

### Evaluation Methods Summary

| Method | Description | When to Use |
|:-------|:------------|:------------|
| OOB Score | Free validation from bootstrap | Quick initial assessment |
| Cross-Validation | K-fold with stratification | Robust generalization estimate |
| Test Set | Final holdout evaluation | Unbiased final performance |

### Feature Importance Insights

The most important features for predicting student departure were:
1. GPA measures (first and second semester)
2. DFW rates (course failure/withdrawal)
3. Units attempted

This aligns with domain knowledge: academic performance is the strongest predictor of student persistence.

### Next Steps

In the next notebook, we will tune Random Forest hyperparameters to optimize model performance.

**Proceed to:** `3.4 Tune Random Forest Hyperparameters`