# 6.2 Final Model Selection and Deployment

## Course 3: Advanced Classification Models for Student Success

## Introduction

In the previous notebook (6.1), we systematically compared all models from Course 3. Now it's time to make the final selection and prepare the chosen model for deployment in a real higher education setting.

Deploying a machine learning model involves more than just selecting the highest-performing algorithm. We must consider:

- **Institutional constraints**: IT infrastructure, data pipelines, staff expertise
- **Stakeholder needs**: Advisors, administrators, students, and regulators
- **Ethical implications**: Fairness, transparency, and potential unintended consequences
- **Maintenance requirements**: Model monitoring, retraining schedules, and drift detection

### Learning Objectives

By the end of this notebook, you will be able to:

1. Apply a structured decision framework for final model selection
2. Optimize classification thresholds for institutional priorities
3. Prepare a model for production deployment with proper serialization
4. Create stakeholder-friendly model documentation
5. Understand deployment considerations specific to higher education

## 1. Setup and Review

### 1.1 Import Libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Preprocessing and modeling
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, classification_report, brier_score_loss
)

# Calibration
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

# Model persistence
import joblib
import pickle
import json
from datetime import datetime
import os

# Set random seed
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("All libraries imported successfully!")

### 1.2 Load Data and Recap Comparison Results

In [None]:
# Load the datasets
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')

# Create binary target
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

print(f"Training set: {train_df.shape[0]:,} students")
print(f"Testing set: {test_df.shape[0]:,} students")
print(f"Departure rate: {train_df['DEPARTED'].mean():.2%}")

In [None]:
# Define feature sets
numeric_features = [
    'HS_GPA', 'HS_MATH_GPA', 'HS_ENGL_GPA',
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1', 'UNITS_COMPLETED_2',
    'DFW_UNITS_1', 'DFW_UNITS_2',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2',
    'GRADE_POINTS_1', 'GRADE_POINTS_2'
]

categorical_features = ['RACE_ETHNICITY', 'GENDER', 'FIRST_GEN_STATUS', 'COLLEGE']
target = 'DEPARTED'

# Prepare data
train_encoded = pd.get_dummies(train_df[numeric_features + categorical_features], 
                               columns=categorical_features, drop_first=True)
test_encoded = pd.get_dummies(test_df[numeric_features + categorical_features], 
                              columns=categorical_features, drop_first=True)

# Align columns
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

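# Handle missing values with medians learned from the training set only,
# so that no statistics from the test set leak into preprocessing
train_medians = train_encoded.median()
train_encoded = train_encoded.fillna(train_medians)
test_encoded = test_encoded.fillna(train_medians)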

X_train = train_encoded
y_train = train_df[target]
X_test = test_encoded
y_test = test_df[target]

# Store feature names for later use
feature_names = X_train.columns.tolist()

print(f"Number of features: {len(feature_names)}")
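
The imports above include `Pipeline` and `ColumnTransformer`, which offer one way to keep preprocessing statistics tied to the training data. The following is a minimal sketch on tiny synthetic frames, not the notebook's actual preprocessing: `toy_train`, `toy_test`, and the single numeric and categorical column are illustrative assumptions, and `OneHotEncoder` is used here instead of `pd.get_dummies`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic frames standing in for train_df / test_df (illustrative only)
toy_train = pd.DataFrame({
    'GPA_1': [3.2, 2.1, np.nan, 3.8],
    'COLLEGE': ['Arts', 'Science', 'Arts', 'Business'],
})
toy_test = pd.DataFrame({
    'GPA_1': [np.nan, 2.9],
    'COLLEGE': ['Science', 'Engineering'],  # 'Engineering' never appears in training
})

preprocessor = ColumnTransformer(
    [
        # The median is learned from the training rows only
        ('num', SimpleImputer(strategy='median'), ['GPA_1']),
        # Unseen categories become all-zero rows instead of raising an error
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['COLLEGE']),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_toy_train = preprocessor.fit_transform(toy_train)
X_toy_test = preprocessor.transform(toy_test)  # reuses the training statistics

print(X_toy_train.shape, X_toy_test.shape)
print(X_toy_test)
```

Because the fitted transformer can be saved together with the model, the same imputation values and category columns are applied at prediction time without re-deriving them from each incoming batch.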

In [None]:
# Recap: Model comparison results from Notebook 6.1
# These are representative results - actual values will vary based on data
comparison_results = pd.DataFrame({
    'Model': ['Logistic Regression (L2)', 'Decision Tree', 'Random Forest', 
              'Gradient Boosting', 'Neural Network'],
    'ROC-AUC': [0.82, 0.75, 0.85, 0.86, 0.84],
    'F1 Score': [0.58, 0.52, 0.62, 0.64, 0.61],
    'Recall': [0.65, 0.58, 0.68, 0.70, 0.67],
    'Precision': [0.52, 0.47, 0.57, 0.59, 0.56],
    'Interpretability': [9, 10, 5, 4, 2],
    'Training Time (s)': [0.5, 0.2, 3.5, 8.2, 15.3]
})

print("RECAP: Model Comparison Results from Notebook 6.1")
print("="*80)
print(comparison_results.to_string(index=False))
print("="*80)

## 2. Final Model Selection

### 2.1 Selection Criteria for Higher Education

Before selecting the final model, let's establish clear criteria weighted for a typical higher education deployment.

In [None]:
# Define weighted selection criteria
selection_criteria = {
    'Predictive Performance (ROC-AUC)': {
        'weight': 0.30,
        'description': 'Overall ability to distinguish at-risk from not-at-risk students'
    },
    'Recall (Sensitivity)': {
        'weight': 0.25,
        'description': 'Ability to identify students who will actually depart (minimize false negatives)'
    },
    'Interpretability': {
        'weight': 0.25,
        'description': 'Can advisors and administrators understand and explain predictions?'
    },
    'Deployment Ease': {
        'weight': 0.15,
        'description': 'Training time, prediction speed, maintenance requirements'
    },
    'Precision': {
        'weight': 0.05,
        'description': 'Accuracy of positive predictions (minimize false alarms)'
    }
}

print("MODEL SELECTION CRITERIA FOR HIGHER EDUCATION")
print("="*70)
total_weight = 0
for criterion, details in selection_criteria.items():
    print(f"\n{criterion} (Weight: {details['weight']:.0%})")
    print(f"  {details['description']}")
    total_weight += details['weight']
print(f"\nTotal Weight: {total_weight:.0%}")
print("="*70)

In [None]:
# Visualize criteria weights
criteria_names = list(selection_criteria.keys())
criteria_weights = [selection_criteria[c]['weight'] for c in criteria_names]

fig = go.Figure(data=[go.Pie(
    labels=criteria_names,
    values=criteria_weights,
    hole=0.4,
    textinfo='label+percent',
    textposition='outside'
)])

fig.update_layout(
    title='Model Selection Criteria Weights for Higher Education',
    height=500,
    showlegend=False
)

fig.show()

### 2.2 Selecting the Production Model

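Based on our comparison and the criteria above, we select **Random Forest** as our production model. This is a judgment call rather than a mechanical result: with the illustrative scores in the next cell, the weighted total ranks Logistic Regression and Decision Tree ahead of Random Forest because of their higher interpretability and deployment-ease scores. We accept that trade-off for the following reasons:

1. **Strong predictive performance**: ROC-AUC of 0.85 and recall of 0.68 in the representative comparison, second only to Gradient Boosting
2. **Reasonable interpretability**: Feature importances are available, although individual predictions are harder to explain than with logistic regression or a single tree
3. **Robust**: Averaging many trees makes it less prone to overfitting than a single decision tree
4. **Good balance**: Nearly matches Gradient Boosting's performance at lower training cost (3.5 s vs. 8.2 s in the comparison)
5. **Practical**: Reasonable training time and easy to maintain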

In [None]:
# Score each model on criteria (normalized 0-1 scale)
model_scores = {
    'Logistic Regression (L2)': {
        'ROC-AUC': 0.82,
        'Recall': 0.65,
        'Interpretability': 0.9,
        'Deployment Ease': 1.0,
        'Precision': 0.52
    },
    'Decision Tree': {
        'ROC-AUC': 0.75,
        'Recall': 0.58,
        'Interpretability': 1.0,
        'Deployment Ease': 0.95,
        'Precision': 0.47
    },
    'Random Forest': {
        'ROC-AUC': 0.85,
        'Recall': 0.68,
        'Interpretability': 0.5,
        'Deployment Ease': 0.7,
        'Precision': 0.57
    },
    'Gradient Boosting': {
        'ROC-AUC': 0.86,
        'Recall': 0.70,
        'Interpretability': 0.4,
        'Deployment Ease': 0.5,
        'Precision': 0.59
    },
    'Neural Network': {
        'ROC-AUC': 0.84,
        'Recall': 0.67,
        'Interpretability': 0.2,
        'Deployment Ease': 0.3,
        'Precision': 0.56
    }
}

# Calculate weighted scores
weights = [0.30, 0.25, 0.25, 0.15, 0.05]
criteria_map = ['ROC-AUC', 'Recall', 'Interpretability', 'Deployment Ease', 'Precision']

weighted_scores = {}
for model, scores in model_scores.items():
    total = sum(scores[c] * w for c, w in zip(criteria_map, weights))
    weighted_scores[model] = total

# Display results
print("WEIGHTED MODEL SCORES")
print("="*50)
for model, score in sorted(weighted_scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{model:<30} {score:.4f}")
print("="*50)
print(f"\nHighest weighted score: {max(weighted_scores, key=weighted_scores.get)}")
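
A weighted ranking is only as stable as its weights, so it can help to check how the ordering changes under a different set of priorities. The sketch below reuses the illustrative scores from this notebook; the `performance_heavy` weights and the `rank_models` helper are hypothetical and are included only to show the sensitivity check.

```python
# Illustrative scores copied from the notebook's model_scores dictionary
scores = {
    'Logistic Regression (L2)': {'ROC-AUC': 0.82, 'Recall': 0.65, 'Interpretability': 0.9,
                                 'Deployment Ease': 1.0, 'Precision': 0.52},
    'Decision Tree': {'ROC-AUC': 0.75, 'Recall': 0.58, 'Interpretability': 1.0,
                      'Deployment Ease': 0.95, 'Precision': 0.47},
    'Random Forest': {'ROC-AUC': 0.85, 'Recall': 0.68, 'Interpretability': 0.5,
                      'Deployment Ease': 0.7, 'Precision': 0.57},
    'Gradient Boosting': {'ROC-AUC': 0.86, 'Recall': 0.70, 'Interpretability': 0.4,
                          'Deployment Ease': 0.5, 'Precision': 0.59},
    'Neural Network': {'ROC-AUC': 0.84, 'Recall': 0.67, 'Interpretability': 0.2,
                       'Deployment Ease': 0.3, 'Precision': 0.56},
}

# Weights used in the notebook
notebook_weights = {'ROC-AUC': 0.30, 'Recall': 0.25, 'Interpretability': 0.25,
                    'Deployment Ease': 0.15, 'Precision': 0.05}

# Hypothetical alternative: an institution that cares almost only about performance
performance_heavy = {'ROC-AUC': 0.50, 'Recall': 0.40, 'Interpretability': 0.05,
                     'Deployment Ease': 0.00, 'Precision': 0.05}


def rank_models(scores, weights):
    """Return (model, weighted_total) pairs sorted from best to worst."""
    totals = {
        model: sum(model_scores[criterion] * weight for criterion, weight in weights.items())
        for model, model_scores in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for label, weights in [('Notebook weights', notebook_weights),
                       ('Performance-heavy weights', performance_heavy)]:
    print(label)
    for model, total in rank_models(scores, weights):
        print(f"  {model:<28} {total:.4f}")
```

If the top-ranked model changes between plausible weightings, the final choice depends on institutional priorities more than on the scores themselves, and that is worth stating in the model documentation.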

In [None]:
# Visualize weighted scores
score_df = pd.DataFrame({
    'Model': list(weighted_scores.keys()),
    'Weighted Score': list(weighted_scores.values())
}).sort_values('Weighted Score', ascending=True)

fig = go.Figure()

colors = ['lightgray'] * len(score_df)
colors[-1] = 'green'  # Highlight the best model

fig.add_trace(go.Bar(
    x=score_df['Weighted Score'],
    y=score_df['Model'],
    orientation='h',
    marker_color=colors,
    text=score_df['Weighted Score'].round(3),
    textposition='outside'
))

fig.update_layout(
    title='Weighted Model Scores for Production Deployment',
    xaxis_title='Weighted Score',
    yaxis_title='Model',
    height=400,
    xaxis=dict(range=[0, 1])
)

fig.show()

### 2.3 Final Model Training

Now we train the final Random Forest model with optimized hyperparameters.

In [None]:
# Define the final model with optimized parameters
final_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    class_weight='balanced',
    n_jobs=-1,
    random_state=RANDOM_STATE,
    oob_score=True  # Enable out-of-bag scoring
)

print("Final Model Configuration:")
print("="*50)
print(f"Model: Random Forest Classifier")
print(f"Number of trees: 200")
print(f"Max depth: 12")
print(f"Min samples split: 10")
print(f"Min samples leaf: 5")
print(f"Max features: sqrt({len(feature_names)}) = {int(np.sqrt(len(feature_names)))}")
print(f"Class weight: balanced")
print("="*50)

In [None]:
# Train the final model
print("Training final model...")
import time
start_time = time.time()

final_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")
print(f"Out-of-bag accuracy: {final_model.oob_score_:.4f}")
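
For a classifier, the out-of-bag score is an accuracy estimate, which can look flattering when most students are retained. A stratified cross-validated ROC-AUC is one complementary check. The sketch below runs on synthetic data from `make_classification` (an assumption, so that it is self-contained) and uses a smaller forest than the final model to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data (about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42
)

demo_model = RandomForestClassifier(
    n_estimators=50, class_weight='balanced', random_state=42
)

# Stratification keeps the departure rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring='roc_auc')

print(f"ROC-AUC per fold: {np.round(cv_auc, 3)}")
print(f"Mean +/- std:     {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

In the notebook itself, the same call could be made with `final_model`, `X_train`, and `y_train` in place of the synthetic objects.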

## 3. Model Validation and Testing

### 3.1 Final Performance Evaluation

In [None]:
# Generate predictions
y_pred = final_model.predict(X_test)
y_prob = final_model.predict_proba(X_test)[:, 1]

# Calculate comprehensive metrics
metrics = {
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1 Score': f1_score(y_test, y_pred),
    'ROC-AUC': roc_auc_score(y_test, y_prob),
    'Average Precision': average_precision_score(y_test, y_prob),
    'Brier Score': brier_score_loss(y_test, y_prob)
}

print("FINAL MODEL PERFORMANCE ON TEST SET")
print("="*50)
for metric, value in metrics.items():
    print(f"{metric:<20} {value:.4f}")
print("="*50)
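
A single test-set number hides how much it would move with a different sample of students. One common way to convey that uncertainty is a bootstrap percentile interval. The sketch below is a minimal version on synthetic labels and scores; `bootstrap_auc_ci` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=500, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for ROC-AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample students with replacement
        if len(np.unique(y_true[idx])) < 2:   # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)


# Synthetic labels and scores standing in for y_test and y_prob
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
score_demo = 0.3 * y_demo + 0.7 * rng.random(300)

auc_point = roc_auc_score(y_demo, score_demo)
auc_low, auc_high = bootstrap_auc_ci(y_demo, score_demo)
print(f"ROC-AUC: {auc_point:.3f} (95% bootstrap CI: {auc_low:.3f} to {auc_high:.3f})")
```

Applied to the notebook's data, the call would be `bootstrap_auc_ci(y_test, y_prob)`.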

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['Predicted: Retained', 'Predicted: Departed'],
    y=['Actual: Retained', 'Actual: Departed'],
    colorscale='Blues',
    text=cm,
    texttemplate='%{text}',
    textfont={'size': 16},
    hovertemplate='Actual: %{y}<br>Predicted: %{x}<br>Count: %{z}<extra></extra>'
))

fig.update_layout(
    title='Confusion Matrix - Final Random Forest Model',
    xaxis_title='Predicted',
    yaxis_title='Actual',
    height=450
)

fig.show()

# Calculate and display key counts
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives (correctly identified retained): {tn}")
print(f"  False Positives (false alarms): {fp}")
print(f"  False Negatives (missed departures): {fn}")
print(f"  True Positives (correctly identified departures): {tp}")

In [None]:
# ROC and Precision-Recall curves
fpr, tpr, roc_thresholds = roc_curve(y_test, y_prob)
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_test, y_prob)

fig = make_subplots(rows=1, cols=2, subplot_titles=(
    f'ROC Curve (AUC = {metrics["ROC-AUC"]:.3f})',
    f'Precision-Recall Curve (AP = {metrics["Average Precision"]:.3f})'
))

# ROC Curve
fig.add_trace(go.Scatter(
    x=fpr, y=tpr,
    mode='lines',
    name='ROC',
    line=dict(color='blue', width=2)
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random',
    line=dict(color='gray', dash='dash')
), row=1, col=1)

# Precision-Recall Curve
fig.add_trace(go.Scatter(
    x=recall_curve, y=precision_curve,
    mode='lines',
    name='PR Curve',
    line=dict(color='green', width=2)
), row=1, col=2)

prevalence = y_test.mean()
fig.add_hline(y=prevalence, line_dash='dash', line_color='gray', row=1, col=2)

fig.update_layout(
    height=450,
    showlegend=False
)

fig.update_xaxes(title_text='False Positive Rate', row=1, col=1)
fig.update_yaxes(title_text='True Positive Rate', row=1, col=1)
fig.update_xaxes(title_text='Recall', row=1, col=2)
fig.update_yaxes(title_text='Precision', row=1, col=2)

fig.show()

### 3.2 Threshold Optimization

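The default threshold of 0.5 may not be optimal for higher education contexts where identifying at-risk students is critical. Let's explore different thresholds. Keep in mind that a threshold chosen on the test set makes the reported test metrics optimistic; for a production deployment, tune the threshold on a separate validation split or with cross-validation and reserve the test set for the final check.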

In [None]:
# Calculate metrics at different thresholds
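# linspace avoids the floating-point endpoint ambiguity of np.arange with a 0.05 step
thresholds = np.linspace(0.10, 0.85, 16).round(2)  # 0.10, 0.15, ..., 0.85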
threshold_metrics = []

for thresh in thresholds:
    y_pred_thresh = (y_prob >= thresh).astype(int)
    
    threshold_metrics.append({
        'Threshold': thresh,
        'Precision': precision_score(y_test, y_pred_thresh, zero_division=0),
        'Recall': recall_score(y_test, y_pred_thresh, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred_thresh, zero_division=0),
        'Accuracy': accuracy_score(y_test, y_pred_thresh)
    })

thresh_df = pd.DataFrame(threshold_metrics)

In [None]:
# Plot threshold vs metrics
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=thresh_df['Threshold'], y=thresh_df['Precision'],
    mode='lines', name='Precision',
    line=dict(color='blue', width=2)
))

fig.add_trace(go.Scatter(
    x=thresh_df['Threshold'], y=thresh_df['Recall'],
    mode='lines', name='Recall',
    line=dict(color='red', width=2)
))

fig.add_trace(go.Scatter(
    x=thresh_df['Threshold'], y=thresh_df['F1 Score'],
    mode='lines', name='F1 Score',
    line=dict(color='green', width=2)
))

# Add vertical lines for key thresholds
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', 
              annotation_text='Default (0.5)')

# Find optimal threshold for F1
optimal_f1_idx = thresh_df['F1 Score'].idxmax()
optimal_f1_thresh = thresh_df.loc[optimal_f1_idx, 'Threshold']
fig.add_vline(x=optimal_f1_thresh, line_dash='dash', line_color='green',
              annotation_text=f'Optimal F1 ({optimal_f1_thresh:.2f})')

fig.update_layout(
    title='Precision, Recall, and F1 Score vs. Classification Threshold',
    xaxis_title='Classification Threshold',
    yaxis_title='Score',
    height=500,
    legend=dict(x=0.8, y=0.95)
)

fig.show()

In [None]:
# Recommend thresholds for different institutional priorities
print("THRESHOLD RECOMMENDATIONS FOR DIFFERENT PRIORITIES")
print("="*70)

# High Recall (catch more at-risk students)
# Recall falls as the threshold rises, so take the HIGHEST threshold that still
# meets the recall target; the lowest one would only add false alarms
high_recall_thresh = thresh_df[thresh_df['Recall'] >= 0.80]['Threshold'].max()
if pd.notna(high_recall_thresh):
    hr_row = thresh_df[thresh_df['Threshold'] == high_recall_thresh].iloc[0]
    print(f"\n1. HIGH RECALL PRIORITY (catch most at-risk students)")
    print(f"   Threshold: {high_recall_thresh:.2f}")
    print(f"   Recall: {hr_row['Recall']:.2%} | Precision: {hr_row['Precision']:.2%}")
    print(f"   Trade-off: More false alarms, but fewer missed students")

# Balanced (optimal F1)
bal_row = thresh_df.loc[optimal_f1_idx]
print(f"\n2. BALANCED PRIORITY (best F1 score)")
print(f"   Threshold: {optimal_f1_thresh:.2f}")
print(f"   Recall: {bal_row['Recall']:.2%} | Precision: {bal_row['Precision']:.2%}")
print(f"   Trade-off: Good balance of precision and recall")

# High Precision (fewer false alarms)
high_prec_idx = thresh_df[thresh_df['Precision'] >= 0.65]['Threshold'].min()
if pd.notna(high_prec_idx):
    hp_row = thresh_df[thresh_df['Threshold'] == high_prec_idx].iloc[0]
    print(f"\n3. HIGH PRECISION PRIORITY (minimize false alarms)")
    print(f"   Threshold: {high_prec_idx:.2f}")
    print(f"   Recall: {hp_row['Recall']:.2%} | Precision: {hp_row['Precision']:.2%}")
    print(f"   Trade-off: Fewer false alarms, but may miss some students")

print("\n" + "="*70)
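
Institutional priorities are often constrained by advising capacity rather than by a target metric. One option is to set the threshold so that the number of flagged students matches the number of students advisors can realistically contact. The sketch below assumes a hypothetical helper, `threshold_for_capacity`, and made-up probabilities; ties at the cut-off score could flag slightly more students than the stated capacity.

```python
import numpy as np


def threshold_for_capacity(probabilities, capacity):
    """Return the score of the last student who fits within advising capacity.

    Flagging every student with probability >= the returned value selects
    the `capacity` highest-risk students (more if scores are tied).
    """
    if capacity <= 0:
        raise ValueError("capacity must be a positive integer")
    probs = np.sort(np.asarray(probabilities, dtype=float))[::-1]  # descending
    if capacity >= len(probs):
        return float(probs[-1])  # everyone can be contacted
    return float(probs[capacity - 1])


# Made-up predicted departure probabilities for eight students
demo_probs = np.array([0.91, 0.15, 0.62, 0.48, 0.77, 0.05, 0.33, 0.58])

capacity_threshold = threshold_for_capacity(demo_probs, capacity=3)
flagged = demo_probs >= capacity_threshold

print(f"Threshold for 3 outreach slots: {capacity_threshold:.2f}")
print(f"Students flagged: {int(flagged.sum())}")
```

The resulting threshold can then be compared against the recall and precision table above to see what that capacity implies for missed students.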

### 3.3 Calibration Analysis

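Calibration measures how well the predicted probabilities match observed outcomes: among students assigned a departure probability near 0.7, roughly 70% should actually depart. Because this model was trained with `class_weight='balanced'`, its probabilities may overstate risk relative to the true departure rate, so check the curve before presenting scores to advisors as literal probabilities.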

In [None]:
# Calculate calibration curve
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10, strategy='uniform')

fig = go.Figure()

# Perfect calibration line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Perfectly Calibrated',
    line=dict(color='gray', dash='dash')
))

# Model calibration
fig.add_trace(go.Scatter(
    x=prob_pred, y=prob_true,
    mode='lines+markers',
    name='Random Forest',
    line=dict(color='blue', width=2),
    marker=dict(size=10)
))

fig.update_layout(
    title='Calibration Curve - Random Forest Model',
    xaxis_title='Mean Predicted Probability',
    yaxis_title='Fraction of Positives',
    height=500,
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1])
)

fig.show()

print(f"Brier Score: {metrics['Brier Score']:.4f}")
print("(Lower is better: 0 is perfect; always predicting 0.5 scores 0.25)")
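
If the calibration curve departs noticeably from the diagonal, the probabilities can be recalibrated. The sketch below uses `CalibratedClassifierCV` (imported at the top of this notebook) with isotonic regression on synthetic data. It is an illustration under those assumptions, and whether recalibration actually lowers the Brier score has to be checked on the real data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the student data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Uncalibrated forest with balanced class weights, as in the final model
raw_rf = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
raw_rf.fit(X_tr, y_tr)

# Same configuration wrapped in cross-validated isotonic calibration
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    method='isotonic',
    cv=5,
)
calibrated_rf.fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw_rf.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, calibrated_rf.predict_proba(X_te)[:, 1])

print(f"Brier score, raw forest:        {brier_raw:.4f}")
print(f"Brier score, calibrated forest: {brier_cal:.4f}")
```

Note that recalibration changes the probability scale, so any thresholds or risk-level cut-offs chosen on the raw scores would need to be revisited afterwards.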

In [None]:
# Distribution of predicted probabilities
fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Probability Distribution by Actual Outcome',
    'Histogram of All Predictions'
))

# By actual outcome
fig.add_trace(go.Histogram(
    x=y_prob[y_test == 0],
    name='Actually Retained',
    opacity=0.7,
    marker_color='blue',
    nbinsx=30
), row=1, col=1)

fig.add_trace(go.Histogram(
    x=y_prob[y_test == 1],
    name='Actually Departed',
    opacity=0.7,
    marker_color='red',
    nbinsx=30
), row=1, col=1)

# All predictions
fig.add_trace(go.Histogram(
    x=y_prob,
    name='All Predictions',
    marker_color='green',
    nbinsx=30,
    showlegend=False
), row=1, col=2)

fig.update_layout(
    height=400,
    barmode='overlay'
)

fig.update_xaxes(title_text='Predicted Probability', row=1, col=1)
fig.update_xaxes(title_text='Predicted Probability', row=1, col=2)
fig.update_yaxes(title_text='Count', row=1, col=1)
fig.update_yaxes(title_text='Count', row=1, col=2)

fig.show()

## 4. Model Interpretation for Stakeholders

### 4.1 Feature Importance Analysis

In [None]:
# Get feature importances
importances = final_model.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

# Display top 15 features
print("TOP 15 MOST IMPORTANT FEATURES")
print("="*50)
for i, (_, row) in enumerate(importance_df.head(15).iterrows(), 1):
    print(f"{i:2}. {row['Feature']:<30} {row['Importance']:.4f}")
print("="*50)

In [None]:
# Visualize feature importances
top_n = 15
top_features = importance_df.head(top_n).sort_values('Importance', ascending=True)

fig = go.Figure()

fig.add_trace(go.Bar(
    x=top_features['Importance'],
    y=top_features['Feature'],
    orientation='h',
    marker_color='steelblue',
    text=top_features['Importance'].round(3),
    textposition='outside'
))

fig.update_layout(
    title=f'Top {top_n} Feature Importances - Random Forest Model',
    xaxis_title='Importance (Mean Decrease in Impurity)',
    yaxis_title='Feature',
    height=550,
    margin=dict(l=200)
)

fig.show()
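
The importances plotted above are mean decrease in impurity, which is computed on the training data and tends to favor continuous or high-cardinality features. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and generic feature names as assumptions; in the notebook, the call would take `final_model`, `X_test`, and `y_test`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features out of 8
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
demo_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in held-out ROC-AUC
perm = permutation_importance(
    rf, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=42
)

order = np.argsort(perm.importances_mean)[::-1]
print(f"{'Feature':<12} {'Permutation':>12} {'Impurity':>10}")
for i in order:
    print(f"{demo_names[i]:<12} {perm.importances_mean[i]:>12.4f} "
          f"{rf.feature_importances_[i]:>10.4f}")
```

Features that rank high on impurity importance but near zero on permutation importance deserve a second look before they are described to stakeholders as key drivers.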

In [None]:
# Group features by category
category_importance = {
    'Academic Performance (GPA)': ['GPA_1', 'GPA_2', 'HS_GPA', 'HS_MATH_GPA', 'HS_ENGL_GPA'],
    'Course Completion': ['UNITS_COMPLETED_1', 'UNITS_COMPLETED_2', 'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2'],
    'Academic Difficulty (DFW)': ['DFW_RATE_1', 'DFW_RATE_2', 'DFW_UNITS_1', 'DFW_UNITS_2'],
    'Grade Points': ['GRADE_POINTS_1', 'GRADE_POINTS_2'],
    'Demographics': [f for f in feature_names if any(x in f for x in ['RACE', 'GENDER', 'FIRST_GEN', 'COLLEGE'])]
}

category_totals = {}
for category, features in category_importance.items():
    total = importance_df[importance_df['Feature'].isin(features)]['Importance'].sum()
    category_totals[category] = total

cat_df = pd.DataFrame({
    'Category': list(category_totals.keys()),
    'Total Importance': list(category_totals.values())
}).sort_values('Total Importance', ascending=True)

fig = go.Figure()

fig.add_trace(go.Bar(
    x=cat_df['Total Importance'],
    y=cat_df['Category'],
    orientation='h',
    marker_color=['#2ecc71', '#3498db', '#e74c3c', '#f39c12', '#9b59b6'],
    text=cat_df['Total Importance'].round(3),
    textposition='outside'
))

fig.update_layout(
    title='Feature Importance by Category',
    xaxis_title='Total Importance',
    yaxis_title='Feature Category',
    height=400,
    margin=dict(l=200)
)

fig.show()

### 4.2 Creating Stakeholder-Friendly Explanations

In [None]:
# Create stakeholder-friendly feature descriptions
feature_descriptions = {
    'GPA_1': 'First semester GPA',
    'GPA_2': 'Second semester GPA',
    'DFW_RATE_1': 'Proportion of D/F/W grades in first semester',
    'DFW_RATE_2': 'Proportion of D/F/W grades in second semester',
    'HS_GPA': 'High school GPA',
    'UNITS_COMPLETED_1': 'Credit hours completed in first semester',
    'UNITS_COMPLETED_2': 'Credit hours completed in second semester',
    'GRADE_POINTS_1': 'Total grade points earned in first semester',
    'GRADE_POINTS_2': 'Total grade points earned in second semester'
}

print("STAKEHOLDER-FRIENDLY MODEL SUMMARY")
print("="*70)
print("\nWhat the model considers most important when predicting student departure:\n")

for i, (_, row) in enumerate(importance_df.head(10).iterrows(), 1):
    feature = row['Feature']
    importance = row['Importance']
    description = feature_descriptions.get(feature, feature)
    
    # Convert importance to percentage
    pct = importance * 100 / importance_df['Importance'].sum()
    
    print(f"{i}. {description}")
    print(f"   Relative importance: {pct:.1f}%\n")

print("="*70)

In [None]:
# Create a sample prediction explanation
def explain_prediction(model, X, feature_names, student_idx=0):
    """
    Create a human-readable explanation for a single prediction.
    """
    student_data = X.iloc[student_idx]
    prob = model.predict_proba(X.iloc[[student_idx]])[0, 1]
    
    print(f"PREDICTION EXPLANATION FOR STUDENT {student_idx}")
    print("="*60)
    print(f"\nPredicted Departure Probability: {prob:.1%}")
    
    if prob >= 0.7:
        risk_level = "HIGH RISK"
    elif prob >= 0.4:
        risk_level = "MODERATE RISK"
    else:
        risk_level = "LOW RISK"
    
    print(f"Risk Level: {risk_level}")
    
    print("\nKey Student Characteristics:")
    key_features = ['GPA_1', 'GPA_2', 'DFW_RATE_1', 'DFW_RATE_2', 'HS_GPA']
    for feature in key_features:
        if feature in student_data.index:
            value = student_data[feature]
            desc = feature_descriptions.get(feature, feature)
            print(f"  - {desc}: {value:.2f}")
    
    print("\n" + "="*60)

# Show example
explain_prediction(final_model, X_test, feature_names, student_idx=0)

## 5. Preparing for Deployment

### 5.1 Model Serialization

In [None]:
# Create model artifacts directory
model_dir = '../../models/'
os.makedirs(model_dir, exist_ok=True)

# Save the model using joblib (recommended for scikit-learn)
model_filename = os.path.join(model_dir, 'student_departure_rf_model.joblib')
joblib.dump(final_model, model_filename)
print(f"Model saved to: {model_filename}")

# Verify the saved model
loaded_model = joblib.load(model_filename)
test_pred = loaded_model.predict_proba(X_test[:5])[:, 1]
original_pred = final_model.predict_proba(X_test[:5])[:, 1]

print(f"\nModel verification:")
print(f"Predictions match: {np.allclose(test_pred, original_pred)}")
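
Files written by `joblib` are pickle-based: they are sensitive to the scikit-learn version used to create them, and loading a file from an untrusted source can execute arbitrary code. Recording a checksum and library versions next to the artifact is one lightweight safeguard. The sketch below writes a throwaway model to a temporary directory; `sha256_of_file` and the `provenance` layout are hypothetical, not an established convention.

```python
import hashlib
import json
import os
import tempfile

import joblib
import sklearn
from sklearn.dummy import DummyClassifier


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp_dir:
    # Throwaway model standing in for final_model
    demo_path = os.path.join(tmp_dir, 'demo_model.joblib')
    demo_model = DummyClassifier(strategy='prior').fit([[0], [1]], [0, 1])
    joblib.dump(demo_model, demo_path)

    provenance = {
        'artifact': os.path.basename(demo_path),
        'sha256': sha256_of_file(demo_path),
        'sklearn_version': sklearn.__version__,
        'joblib_version': joblib.__version__,
    }

print(json.dumps(provenance, indent=2))
```

At load time, the deployment code could recompute the digest and compare library versions before calling `joblib.load`, refusing to proceed on a mismatch.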

In [None]:
# Save feature names for consistency
feature_file = os.path.join(model_dir, 'feature_names.json')
with open(feature_file, 'w') as f:
    json.dump(feature_names, f, indent=2)
print(f"Feature names saved to: {feature_file}")

In [None]:
# Save model metadata
metadata = {
    'model_name': 'Student Departure Prediction Model',
    'model_type': 'RandomForestClassifier',
    'version': '1.0.0',
    'created_date': datetime.now().isoformat(),
    'training_samples': len(X_train),
    'n_features': len(feature_names),
    'hyperparameters': {
        'n_estimators': 200,
        'max_depth': 12,
        'min_samples_split': 10,
        'min_samples_leaf': 5,
        'max_features': 'sqrt',
        'class_weight': 'balanced'
    },
    'performance_metrics': {
        'roc_auc': round(metrics['ROC-AUC'], 4),
        'f1_score': round(metrics['F1 Score'], 4),
        'recall': round(metrics['Recall'], 4),
        'precision': round(metrics['Precision'], 4),
        'accuracy': round(metrics['Accuracy'], 4)
    },
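    'recommended_threshold': {
        'default': 0.5,
        'high_recall': 0.35,      # illustrative; replace with the value from the threshold analysis
        'balanced': float(optimal_f1_thresh),
        'high_precision': 0.65    # illustrative; replace with the value from the threshold analysis
    },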
    'feature_importance_top_5': importance_df.head(5).to_dict('records')
}

metadata_file = os.path.join(model_dir, 'model_metadata.json')
with open(metadata_file, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Model metadata saved to: {metadata_file}")
print("\nMetadata contents:")
print(json.dumps(metadata, indent=2))

### 5.2 Creating a Prediction Pipeline

In [None]:
class StudentDeparturePredictionPipeline:
    """
    A complete prediction pipeline for student departure risk assessment.
    """
    
    def __init__(self, model_path, feature_names_path, threshold=0.5):
        """
        Initialize the pipeline.
        
        Parameters:
        -----------
        model_path : str
            Path to the saved model file
        feature_names_path : str
            Path to the feature names JSON file
        threshold : float
            Classification threshold (default: 0.5)
        """
        self.model = joblib.load(model_path)
        with open(feature_names_path, 'r') as f:
            self.feature_names = json.load(f)
        self.threshold = threshold
        
    def preprocess(self, df):
        """
        Preprocess input data to match training format.
        """
        # Define expected features
        numeric_features = [
            'HS_GPA', 'HS_MATH_GPA', 'HS_ENGL_GPA',
            'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2',
            'UNITS_COMPLETED_1', 'UNITS_COMPLETED_2',
            'DFW_UNITS_1', 'DFW_UNITS_2',
            'GPA_1', 'GPA_2',
            'DFW_RATE_1', 'DFW_RATE_2',
            'GRADE_POINTS_1', 'GRADE_POINTS_2'
        ]
        
        categorical_features = ['RACE_ETHNICITY', 'GENDER', 'FIRST_GEN_STATUS', 'COLLEGE']
        
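        # One-hot encode WITHOUT drop_first: on a small batch, drop_first would
        # drop whichever category sorts first in that batch, which may differ
        # from the reference category dropped during training
        df_encoded = pd.get_dummies(df[numeric_features + categorical_features],
                                    columns=categorical_features)
        
        # Align with training features: add missing dummy columns as 0, drop
        # columns unseen in training (including reference categories), and
        # restore the training column order
        df_encoded = df_encoded.reindex(columns=self.feature_names, fill_value=0)
        
        # Handle missing values
        # NOTE: batch medians are a stopgap and are undefined for a single-row
        # batch; in production, persist the training medians and reuse them here
        df_encoded = df_encoded.fillna(df_encoded.median())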
        
        return df_encoded
    
    def predict(self, df):
        """
        Generate predictions for input data.
        
        Returns:
        --------
        dict with 'probabilities', 'predictions', and 'risk_levels'
        """
        X = self.preprocess(df)
        
        probabilities = self.model.predict_proba(X)[:, 1]
        predictions = (probabilities >= self.threshold).astype(int)
        
        # Assign risk levels
        risk_levels = []
        for prob in probabilities:
            if prob >= 0.7:
                risk_levels.append('High Risk')
            elif prob >= 0.4:
                risk_levels.append('Moderate Risk')
            else:
                risk_levels.append('Low Risk')
        
        return {
            'probabilities': probabilities,
            'predictions': predictions,
            'risk_levels': risk_levels
        }
    
    def generate_report(self, df, output_file=None):
        """
        Generate a student risk assessment report.
        """
        results = self.predict(df)
        
        report_df = df[['SID']].copy() if 'SID' in df.columns else df.iloc[:, :1].copy()
        report_df['Departure_Probability'] = results['probabilities']
        report_df['Predicted_Departure'] = results['predictions']
        report_df['Risk_Level'] = results['risk_levels']
        
        if output_file:
            report_df.to_csv(output_file, index=False)
        
        return report_df

print("StudentDeparturePredictionPipeline class defined successfully!")
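
A production pipeline usually benefits from rejecting malformed input before it reaches the model. The sketch below shows one possible validation step. The required-column subset, the value ranges (a 4.0 GPA scale is assumed), and the `validate_input` helper are all hypothetical and would need to match the institution's actual data definitions.

```python
import pandas as pd

# Subset of the notebook's numeric features, used here for illustration
REQUIRED_COLUMNS = ['GPA_1', 'GPA_2', 'DFW_RATE_1', 'DFW_RATE_2']

# Assumed valid ranges: GPA on a 0-4 scale, DFW rate as a proportion
VALID_RANGES = {
    'GPA_1': (0.0, 4.0),
    'GPA_2': (0.0, 4.0),
    'DFW_RATE_1': (0.0, 1.0),
    'DFW_RATE_2': (0.0, 1.0),
}


def validate_input(df):
    """Return a list of human-readable problems; an empty list means the input passed."""
    problems = []

    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"Missing columns: {missing}")

    for col, (low, high) in VALID_RANGES.items():
        if col in df.columns:
            # Missing values are left for the imputation step; only flag real values
            out_of_range = df[col].notna() & ~df[col].between(low, high)
            if out_of_range.any():
                problems.append(
                    f"{col}: {int(out_of_range.sum())} value(s) outside [{low}, {high}]"
                )
    return problems


bad_batch = pd.DataFrame({
    'GPA_1': [3.1, 4.7],          # 4.7 is outside the assumed scale
    'GPA_2': [2.8, 3.0],
    'DFW_RATE_1': [0.0, 0.25],    # DFW_RATE_2 is missing entirely
})
clean_batch = pd.DataFrame({
    'GPA_1': [3.1], 'GPA_2': [2.8], 'DFW_RATE_1': [0.0], 'DFW_RATE_2': [0.1],
})

problems = validate_input(bad_batch)
for problem in problems:
    print(problem)
print(f"Clean batch problems: {validate_input(clean_batch)}")
```

A check like this could be called at the start of `preprocess`, raising an error or logging a warning depending on how strict the deployment needs to be.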

In [None]:
# Test the pipeline
pipeline = StudentDeparturePredictionPipeline(
    model_path=model_filename,
    feature_names_path=feature_file,
    threshold=0.5
)

# Generate predictions for a few students
sample_students = test_df.head(10)
results = pipeline.predict(sample_students)

print("SAMPLE PREDICTIONS")
print("="*60)
for i in range(min(5, len(sample_students))):
    sid = sample_students.iloc[i]['SID'] if 'SID' in sample_students.columns else f"Student {i}"
    prob = results['probabilities'][i]
    risk = results['risk_levels'][i]
    actual = sample_students.iloc[i]['DEPARTED']
    
    print(f"{sid}: Prob={prob:.2%}, Risk={risk}, Actual={'Departed' if actual else 'Retained'}")
print("="*60)

### 5.3 Model Documentation

In [None]:
# Generate comprehensive model documentation
documentation = f"""
================================================================================
                    STUDENT DEPARTURE PREDICTION MODEL
                           Model Documentation
================================================================================

VERSION: 1.0.0
DATE: {datetime.now().strftime('%Y-%m-%d')}

--------------------------------------------------------------------------------
1. MODEL OVERVIEW
--------------------------------------------------------------------------------

Purpose: Predict the probability that a student will depart (not return) by 
         their third semester based on academic and demographic factors.

Model Type: Random Forest Classifier
Algorithm: Ensemble of {final_model.n_estimators} decision trees; predicted probabilities are averaged across trees

Target Variable: Student departure status (1 = Departed, 0 = Retained)

--------------------------------------------------------------------------------
2. INPUT FEATURES
--------------------------------------------------------------------------------

The model uses {len(feature_names)} features across these categories:

Academic Performance:
  - GPA_1, GPA_2: First and second semester GPA
  - HS_GPA, HS_MATH_GPA, HS_ENGL_GPA: High school performance
  
Course Completion:
  - UNITS_ATTEMPTED_1, UNITS_ATTEMPTED_2: Credit hours attempted
  - UNITS_COMPLETED_1, UNITS_COMPLETED_2: Credit hours completed
  
Academic Difficulty:
  - DFW_RATE_1, DFW_RATE_2: Proportion of D/F/W grades
  - DFW_UNITS_1, DFW_UNITS_2: Number of D/F/W credit hours
  
Demographics:
  - RACE_ETHNICITY, GENDER, FIRST_GEN_STATUS, COLLEGE

--------------------------------------------------------------------------------
3. PERFORMANCE METRICS
--------------------------------------------------------------------------------

Evaluated on held-out test set ({len(y_test):,} students):

  ROC-AUC:      {metrics['ROC-AUC']:.4f}  (Discrimination ability)
  F1 Score:     {metrics['F1 Score']:.4f}  (Balance of precision and recall)
  Recall:       {metrics['Recall']:.4f}  (Sensitivity - at-risk students identified)
  Precision:    {metrics['Precision']:.4f}  (Positive predictive value)
  Accuracy:     {metrics['Accuracy']:.4f}  (Overall correctness)

--------------------------------------------------------------------------------
4. RECOMMENDED THRESHOLDS
--------------------------------------------------------------------------------

Choose threshold based on institutional priorities:

  High Recall (0.35):      Catch as many at-risk students as possible
                           Trade-off: More false positives
                           
  Balanced ({optimal_f1_thresh:.2f}):        Optimize F1 score
                           Trade-off: Moderate false positives and misses
                           
  High Precision (0.65):   Minimize false alarms
                           Trade-off: Misses more at-risk students

--------------------------------------------------------------------------------
5. RISK LEVEL DEFINITIONS
--------------------------------------------------------------------------------

  HIGH RISK (>= 70%):             Immediate intervention recommended
  MODERATE RISK (>= 40%, < 70%):  Proactive outreach recommended
  LOW RISK (< 40%):               Standard support sufficient

--------------------------------------------------------------------------------
6. LIMITATIONS AND CONSIDERATIONS
--------------------------------------------------------------------------------

- Model is trained on historical data and may not capture recent changes
- Predictions are probabilities, not certainties
- Should be used as one input among many in advising decisions
- Regular retraining recommended (annually at minimum)
- Monitor for demographic disparities in predictions

--------------------------------------------------------------------------------
7. USAGE
--------------------------------------------------------------------------------

from pipeline import StudentDeparturePredictionPipeline

pipeline = StudentDeparturePredictionPipeline(
    model_path='models/student_departure_rf_model.joblib',
    feature_names_path='models/feature_names.json',
    threshold=0.5
)

results = pipeline.predict(student_data)

================================================================================
"""

print(documentation)

# Save documentation
doc_file = os.path.join(model_dir, 'model_documentation.txt')
with open(doc_file, 'w', encoding='utf-8') as f:
    f.write(documentation)
print(f"Documentation saved to: {doc_file}")
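
The risk bands listed in section 5 of the documentation can be expressed as a small helper. The following is a minimal sketch, assuming the 40% and 70% cut points stated in the documentation; the function name `assign_risk_levels` is hypothetical and is not part of the pipeline class defined earlier.

```python
import numpy as np

def assign_risk_levels(probabilities, moderate_cutoff=0.40, high_cutoff=0.70):
    """Map departure probabilities to LOW / MODERATE / HIGH labels.

    HIGH is >= high_cutoff, MODERATE is >= moderate_cutoff and < high_cutoff,
    LOW is anything below moderate_cutoff.
    """
    probs = np.asarray(probabilities, dtype=float)
    # Reject NaN and out-of-range values instead of silently labeling them
    if np.any(~((probs >= 0) & (probs <= 1))):
        raise ValueError("probabilities must be numbers in [0, 1]")
    # np.digitize returns 0, 1, or 2 depending on the band each value falls in
    bands = np.digitize(probs, [moderate_cutoff, high_cutoff])
    labels = np.array(["LOW", "MODERATE", "HIGH"])
    return labels[bands]

example_levels = assign_risk_levels([0.12, 0.40, 0.69, 0.70, 0.93])
print(example_levels.tolist())
```

Making the boundaries explicit (a probability of exactly 0.40 is MODERATE, exactly 0.70 is HIGH) avoids ambiguity when advisors compare a student's score against the documented bands.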

## 6. Deployment Considerations for Higher Education

### 6.1 Integration Strategies

In [None]:
integration_strategies = {
    'Strategy': [
        'Batch Processing',
        'Real-time API',
        'Dashboard Integration',
        'SIS Integration',
        'Early Alert System'
    ],
    'Description': [
        'Run predictions weekly/monthly on all students',
        'REST API for on-demand predictions',
        'Integrate into advising dashboards (Tableau, Power BI)',
        'Connect to Student Information System',
        'Trigger alerts for high-risk students'
    ],
    'Complexity': ['Low', 'Medium', 'Medium', 'High', 'Medium'],
    'Best For': [
        'Small institutions, limited IT',
        'Custom applications, flexibility',
        'Visual analytics, reporting',
        'Seamless workflow integration',
        'Proactive intervention'
    ]
}

integration_df = pd.DataFrame(integration_strategies)

print("INTEGRATION STRATEGIES FOR HIGHER EDUCATION")
print("="*90)
for _, row in integration_df.iterrows():
    print(f"\n{row['Strategy']} (Complexity: {row['Complexity']})")
    print(f"  Description: {row['Description']}")
    print(f"  Best for: {row['Best For']}")
print("\n" + "="*90)
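
As an illustration of the lowest-complexity option, batch processing, the sketch below scores students in fixed-size chunks and attaches a run date to each record. It is a sketch under stated assumptions: `score_in_batches` and `toy_predict_proba` are hypothetical names, and the toy scoring function stands in for a fitted model. With a scikit-learn classifier, one could pass `lambda x: model.predict_proba(x)[:, 1]` instead.

```python
import numpy as np
from datetime import date

def score_in_batches(student_ids, features, predict_proba, threshold=0.5, batch_size=1000):
    """Score all students in chunks and return one record per student."""
    features = np.asarray(features, dtype=float)
    if len(student_ids) != len(features):
        raise ValueError("student_ids and features must have the same length")
    run_date = date.today().isoformat()
    records = []
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        # predict_proba must return one departure probability per row
        probs = predict_proba(features[start:stop])
        for sid, prob in zip(student_ids[start:stop], probs):
            records.append({
                "student_id": sid,
                "run_date": run_date,
                "probability": float(prob),
                "flagged": bool(prob >= threshold),
            })
    return records

# Hypothetical stand-in for a fitted model: lower GPA -> higher probability
def toy_predict_proba(x):
    return 1.0 / (1.0 + np.exp(2.0 * (x[:, 0] - 2.5)))

ids = ["S001", "S002", "S003"]
gpas = [[3.8], [2.5], [1.2]]
batch_records = score_in_batches(ids, gpas, toy_predict_proba, threshold=0.5, batch_size=2)
for rec in batch_records:
    print(rec)
```

Storing the run date with every prediction makes it possible to compare predictions against actual outcomes later, which the quarterly validation step depends on.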

### 6.2 Monitoring and Maintenance

In [None]:
monitoring_checklist = """
MODEL MONITORING AND MAINTENANCE CHECKLIST
================================================================================

REGULAR MONITORING (Monthly)
[ ] Check prediction distribution - has it shifted significantly?
[ ] Review feature distributions for input drift
[ ] Monitor API response times (if applicable)
[ ] Track number of predictions made

PERFORMANCE VALIDATION (Quarterly)
[ ] Calculate actual vs predicted outcomes for past predictions
[ ] Update performance metrics (AUC, precision, recall)
[ ] Check for performance degradation
[ ] Analyze performance by demographic group

MODEL RETRAINING (Annually or as needed)
[ ] Collect new training data from recent cohorts
[ ] Retrain model with updated data
[ ] Validate new model against holdout set
[ ] Compare performance to previous version
[ ] Document changes and update version number

ALERT CONDITIONS (Investigate immediately)
- Performance drops more than 5% from baseline
- Prediction distribution shifts significantly
- Feature values outside expected ranges
- Significant disparities across demographic groups

================================================================================
"""

print(monitoring_checklist)
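
One way to make "has the prediction distribution shifted significantly?" concrete is the Population Stability Index (PSI), which compares binned proportions between a baseline sample and a current sample. The sketch below is an assumption about how this check might be implemented, not part of the pipeline above; the function name and the synthetic beta-distributed scores are illustrative only. A commonly cited rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a substantial shift, but those cut-offs are conventions, not guarantees.

```python
import numpy as np

def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """Compare two sets of predicted probabilities on fixed bins over [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    base_counts, _ = np.histogram(baseline, bins=edges)
    curr_counts, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; eps avoids log(0) for empty bins
    base_pct = np.clip(base_counts / base_counts.sum(), eps, None)
    curr_pct = np.clip(curr_counts / curr_counts.sum(), eps, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(42)
baseline_probs = rng.beta(2, 5, size=5000)  # synthetic "deployment-time" predictions
same_probs = rng.beta(2, 5, size=5000)      # new sample from the same distribution
shifted_probs = rng.beta(4, 3, size=5000)   # new sample shifted toward higher risk

psi_same = population_stability_index(baseline_probs, same_probs)
psi_shifted = population_stability_index(baseline_probs, shifted_probs)
print(f"PSI (no shift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

The same function can be applied to individual numeric input features (after rescaling or choosing suitable bin edges) to check for input drift.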

### 6.3 Ethical Considerations

In [None]:
ethical_considerations = """
ETHICAL CONSIDERATIONS FOR STUDENT DEPARTURE PREDICTION
================================================================================

1. TRANSPARENCY
   - Students should be informed that predictive analytics are used
   - Explain what data is collected and how it's used
   - Provide opt-out options where feasible

2. FAIRNESS
   - Regularly audit model for demographic disparities
   - Ensure interventions don't disadvantage any group
   - Consider removing demographic features if they cause bias, keeping in mind
     that correlated features can still act as proxies for them

3. PRIVACY
   - Limit access to predictions on a need-to-know basis
   - Secure storage and transmission of student data
   - Comply with FERPA and institutional policies

4. HUMAN OVERSIGHT
   - Predictions should inform, not replace, human judgment
   - Advisors should have final decision-making authority
   - Regular review by diverse stakeholders

5. AVOIDING HARM
   - Risk labels should not become self-fulfilling prophecies
   - Interventions should be supportive, not punitive
   - Consider unintended consequences of classification

6. ACCOUNTABILITY
   - Clear ownership of model decisions
   - Process for students to appeal or dispute predictions
   - Document all model changes and their rationale

================================================================================
"""

print(ethical_considerations)

In [None]:
# Analyze model fairness across demographic groups
print("MODEL FAIRNESS ANALYSIS")
print("="*70)

# Add predictions to test data for analysis
test_df_analysis = test_df.copy()
test_df_analysis['predicted_prob'] = y_prob
test_df_analysis['predicted_class'] = y_pred

# Check performance by demographic groups
for group_col in ['GENDER', 'FIRST_GEN_STATUS']:
    if group_col in test_df_analysis.columns:
        print(f"\nPerformance by {group_col}:")
        print("-"*50)
        
        for group in test_df_analysis[group_col].unique():
            mask = test_df_analysis[group_col] == group
            group_y_true = test_df_analysis.loc[mask, 'DEPARTED']
            group_y_prob = test_df_analysis.loc[mask, 'predicted_prob']
            group_y_pred = test_df_analysis.loc[mask, 'predicted_class']
            
            if len(group_y_true) > 0 and group_y_true.nunique() > 1:
                auc = roc_auc_score(group_y_true, group_y_prob)
                recall = recall_score(group_y_true, group_y_pred)
                
                print(f"  {group}: n={len(group_y_true):,}, AUC={auc:.3f}, Recall={recall:.3f}")

print("\n" + "="*70)
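
AUC and recall per group show how well the model ranks and catches departures, but an audit usually also compares how often each group is flagged and how often actual departures are missed. The sketch below is one possible way to compute those two rates; `group_rates` is a hypothetical helper, and the labels, predictions, and group names are tiny synthetic placeholders rather than results from the test set.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Return flag rate and false negative rate for each group."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        departed = mask & (y_true == 1)
        # Flag rate: share of the group predicted as likely to depart
        flag_rate = float(y_pred[mask].mean())
        # False negative rate: share of actual departures the model missed
        fnr = float((y_pred[departed] == 0).mean()) if departed.any() else float("nan")
        out[str(g)] = {"n": int(mask.sum()), "flag_rate": flag_rate, "fnr": fnr}
    return out

# Synthetic example; group labels are placeholders, not real data
y_true_demo = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0]
groups_demo = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true_demo, y_pred_demo, groups_demo)
for name, r in rates.items():
    print(f"{name}: n={r['n']}, flag rate={r['flag_rate']:.2f}, FNR={r['fnr']:.2f}")
fnr_gap = abs(rates["A"]["fnr"] - rates["B"]["fnr"])
print(f"FNR gap between groups: {fnr_gap:.2f}")
```

A large gap in false negative rate means one group's at-risk students are missed more often, which matters here because a missed student receives no outreach.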

## 7. Summary and Course Conclusion

In [None]:
# Final summary
print("="*80)
print("                    COURSE 3 SUMMARY: MODEL COMPARISON AND SELECTION")
print("="*80)

print("""
MODELS EXPLORED:
----------------
Module 1: Regularized Logistic Regression (L1, L2, ElasticNet)
Module 2: Decision Trees
Module 3: Random Forests
Module 4: Gradient Boosting (XGBoost, LightGBM, CatBoost)
Module 5: Neural Networks (MLPClassifier)
Module 6: Model Comparison and Final Selection

SELECTED MODEL:
---------------
Random Forest Classifier
- Best balance of performance and interpretability
- Less prone to overfitting than a single decision tree
- Feature importance for stakeholder explanations
""")

print(f"""
FINAL MODEL PERFORMANCE:
------------------------
ROC-AUC:      {metrics['ROC-AUC']:.4f}
F1 Score:     {metrics['F1 Score']:.4f}
Recall:       {metrics['Recall']:.4f}
Precision:    {metrics['Precision']:.4f}
""")

print("""
DEPLOYMENT ARTIFACTS CREATED:
-----------------------------
1. Trained model (joblib format)
2. Feature names (JSON format)
3. Model metadata (JSON format)
4. Model documentation (text format)
5. Prediction pipeline class
""")

print(f"""
KEY RECOMMENDATIONS:
--------------------
1. Use the High Recall threshold (0.35) if identifying as many at-risk students as possible is the priority
2. Use the Balanced threshold ({optimal_f1_thresh:.2f}) for general use
3. Monitor model performance quarterly
4. Retrain annually with new cohort data
5. Regularly audit for fairness across demographic groups
""")

print("="*80)
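
The threshold recommendations above rest on the precision/recall trade-off. The sketch below shows how that trade-off can be tabulated for a few candidate thresholds. It is a minimal illustration, assuming synthetic labels and scores; `sweep_thresholds` is a hypothetical helper, and the numbers it prints say nothing about the actual model.

```python
import numpy as np

def sweep_thresholds(y_true, y_prob, thresholds):
    """Compute precision, recall, and F1 at each candidate threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    rows = []
    for t in thresholds:
        pred = y_prob >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"threshold": float(t), "precision": precision,
                     "recall": recall, "f1": f1})
    return rows

# Synthetic labels and scores for illustration only
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob_demo = [0.05, 0.15, 0.30, 0.38, 0.42, 0.50, 0.60, 0.72, 0.66, 0.90]

sweep = sweep_thresholds(y_true_demo, y_prob_demo, [0.35, 0.50, 0.65])
for row in sweep:
    print(f"t={row['threshold']:.2f}  precision={row['precision']:.2f}  "
          f"recall={row['recall']:.2f}  f1={row['f1']:.2f}")
best = max(sweep, key=lambda r: r["f1"])
print(f"Best F1 in this toy example at threshold {best['threshold']:.2f}")
```

In this toy data, lowering the threshold raises recall and lowers precision, which is the pattern an institution weighs when deciding how many false alarms its advising capacity can absorb.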

### Key Takeaways from Course 3

| Topic | Key Learning |
|:------|:-------------|
| **Model Selection** | No single model is best for all situations - consider interpretability, performance, and deployment requirements |
| **Regularization** | L1 and L2 regularization reduce overfitting; L1 can also perform feature selection by shrinking some coefficients to exactly zero |
| **Ensemble Methods** | Random Forests and Gradient Boosting often achieve the best performance by combining multiple models |
| **Neural Networks** | Powerful for complex patterns but require more data and tuning |
| **Threshold Tuning** | Adjust classification threshold based on institutional priorities (recall vs precision) |
| **Deployment** | Production models need monitoring, documentation, and maintenance plans |
| **Ethics** | Consider fairness, transparency, and potential unintended consequences |

### Next Steps

With your model trained, validated, and documented, you are ready to:

1. **Deploy** the model in your institution's systems
2. **Monitor** performance and fairness over time
3. **Iterate** by collecting feedback and improving the model
4. **Expand** to other prediction tasks (e.g., graduation, major selection)

**Congratulations on completing Course 3!**

You now have the skills to build, compare, and deploy machine learning models for student success prediction in higher education contexts.

In [None]:
# List all saved artifacts
print("\nSAVED MODEL ARTIFACTS:")
print("="*50)
for filename in sorted(os.listdir(model_dir)):  # sorted for a stable listing
    filepath = os.path.join(model_dir, filename)
    size = os.path.getsize(filepath) / 1024  # Size in KB
    print(f"  {filename}: {size:.1f} KB")
print("="*50)
print("\nCourse 3, Module 6 complete!")