# 3.2 Final Model Selection and Deployment Considerations

## Course 3: Advanced Classification Models for Student Success

## Introduction

In notebook 3.1, we compared Regularized Logistic Regression, Random Forest, and XGBoost on multiple metrics. In this notebook, we make a **final model selection** and discuss practical deployment considerations for higher education contexts.

### Learning Objectives

1. Select a final model based on institutional priorities
2. Prepare a model for deployment (saving, loading, scoring)
3. Understand deployment considerations specific to higher education
4. Create a model card documenting the selected model

## 1. Model Selection Criteria

When selecting a model for deployment, consider:

| Factor | Question to Ask |
|:-------|:---------------|
| **Performance** | Does it meet our minimum accuracy/recall thresholds? |
| **Interpretability** | Can advisors understand and trust the predictions? |
| **Fairness** | Does it perform equitably across student subgroups? |
| **Maintenance** | Can our team retrain and monitor it? |
| **Integration** | Does it fit into our existing systems? |
| **Compliance** | Can we explain individual predictions if audited? |
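
One way to make these trade-offs explicit is a weighted decision matrix. The sketch below is illustrative only: the weights and the 1-5 ratings are hypothetical placeholders, not results from notebook 3.1, and `weighted_score` is a helper name introduced here. Replace both with values your institution agrees on.

```python
# Hypothetical weights for each selection factor (chosen to sum to 1.0).
weights = {
    "performance": 0.30,
    "interpretability": 0.20,
    "fairness": 0.20,
    "maintenance": 0.10,
    "integration": 0.10,
    "compliance": 0.10,
}

# Placeholder 1-5 ratings per model. These are NOT measured results.
scores = {
    "Logistic Regression": {"performance": 3, "interpretability": 5, "fairness": 4,
                            "maintenance": 5, "integration": 5, "compliance": 5},
    "Random Forest": {"performance": 4, "interpretability": 3, "fairness": 4,
                      "maintenance": 4, "integration": 4, "compliance": 3},
    "XGBoost": {"performance": 5, "interpretability": 2, "fairness": 4,
                "maintenance": 3, "integration": 3, "compliance": 2},
}


def weighted_score(model_scores, weights):
    """Return the weighted sum of factor ratings for one model."""
    return sum(weights[factor] * model_scores[factor] for factor in weights)


# Guard against weights that do not add up to 1.0.
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Rank the candidate models from highest to lowest weighted score.
ranking = sorted(scores, key=lambda name: weighted_score(scores[name], weights),
                 reverse=True)
for name in ranking:
    print(f"{name}: {weighted_score(scores[name], weights):.2f}")
```

Changing the weights changes the ranking, which is the point: the "best" model depends on which factors the institution prioritizes.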

In [None]:
import numpy as np
import pandas as pd
import joblib
import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

RANDOM_STATE = 42

# Load and prepare data
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

numeric_features = ['HS_GPA','HS_MATH_GPA','HS_ENGL_GPA','UNITS_ATTEMPTED_1','UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1','UNITS_COMPLETED_2','DFW_UNITS_1','DFW_UNITS_2','GPA_1','GPA_2',
    'DFW_RATE_1','DFW_RATE_2','GRADE_POINTS_1','GRADE_POINTS_2']
categorical_features = ['RACE_ETHNICITY','GENDER','FIRST_GEN_STATUS','COLLEGE']

train_enc = pd.get_dummies(train_df[numeric_features + categorical_features],
                           columns=categorical_features, drop_first=True)
test_enc = pd.get_dummies(test_df[numeric_features + categorical_features],
                          columns=categorical_features, drop_first=True)
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)
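# Impute with training-set medians so test data does not influence preprocessing
train_medians = train_enc.median()
train_enc = train_enc.fillna(train_medians)
test_enc = test_enc.fillna(train_medians)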

X_train, y_train = train_enc, train_df['DEPARTED']
X_test, y_test = test_enc, test_df['DEPARTED']

print("Data prepared for final model selection.")

## 2. Saving and Loading Models

Models that follow the scikit-learn estimator API, including XGBoost's `XGBClassifier`, can be saved with `joblib`:

```python
import joblib

# Save
joblib.dump(model, 'model_filename.pkl')

# Load
loaded_model = joblib.load('model_filename.pkl')
predictions = loaded_model.predict(new_data)
```

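This keeps deployment simple: a Python application can load the saved file, provided it runs compatible versions of scikit-learn (and XGBoost, if used) and applies the same preprocessing and feature columns as training. Because `joblib` files are pickle-based, load them only from trusted sources.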
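
Scoring new students also requires the exact feature columns used in training, because `pd.get_dummies` creates columns only for the categories present in the data it sees. The sketch below is one possible approach, not a prescribed one: it saves the model together with its column list and training medians, then reindexes new data before scoring. The toy data, the `bundle` dictionary, the file name, and the `score_new_students` helper are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical values, for illustration only).
train = pd.DataFrame({
    "GPA_1": [3.5, 2.1, 3.0, 1.8, 3.8, 2.4],
    "COLLEGE": ["CLA", "COE", "CLA", "CBA", "COE", "CBA"],
    "DEPARTED": [0, 1, 0, 1, 0, 1],
})
X = pd.get_dummies(train[["GPA_1", "COLLEGE"]], columns=["COLLEGE"], drop_first=True)
X = X.astype(float)
medians = X.median()
model = LogisticRegression(max_iter=1000).fit(X, train["DEPARTED"])

# Save the model together with the preprocessing details it depends on.
bundle = {"model": model, "columns": list(X.columns), "medians": medians}
bundle_path = os.path.join(tempfile.mkdtemp(), "departure_model_bundle.pkl")
joblib.dump(bundle, bundle_path)


def score_new_students(raw_df, path):
    """Encode new records like the training data and return departure probabilities."""
    saved = joblib.load(path)
    enc = pd.get_dummies(raw_df, columns=["COLLEGE"])
    # Align to the training columns: unseen categories are dropped (so they fall
    # into the reference category) and missing dummy columns are filled with 0.
    enc = enc.reindex(columns=saved["columns"], fill_value=0).astype(float)
    # Fill missing numeric values with the medians learned from training data.
    enc = enc.fillna(saved["medians"])
    return saved["model"].predict_proba(enc)[:, 1]


# Two hypothetical new students: one with a missing GPA and an unseen college code.
new_students = pd.DataFrame({"GPA_1": [2.0, None], "COLLEGE": ["COE", "CNSM"]})
risk = score_new_students(new_students, bundle_path)
print(risk)
```

Keeping the column list and medians in the same file as the model reduces the chance that the scoring code drifts away from the training pipeline.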

In [None]:
# Example: Save and load a model
rf = RandomForestClassifier(n_estimators=200, max_depth=12, min_samples_leaf=5,
    class_weight='balanced', n_jobs=-1, random_state=RANDOM_STATE)
rf.fit(X_train, y_train)

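# Save (create the models directory first in case it does not exist yet)
import os
MODEL_PATH = '../../models/random_forest_final.pkl'
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
joblib.dump(rf, MODEL_PATH)
print(f"Model saved to {MODEL_PATH}")

# Load and verify
loaded_rf = joblib.load(MODEL_PATH)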
prob_original = rf.predict_proba(X_test)[:, 1]
prob_loaded = loaded_rf.predict_proba(X_test)[:, 1]
print(f"Predictions match: {np.allclose(prob_original, prob_loaded)}")
print(f"Test AUC: {roc_auc_score(y_test, prob_loaded):.4f}")

## 3. Model Card Template

A **model card** documents a deployed model for transparency and accountability.

| Field | Description |
|:------|:-----------|
| **Model Name** | Student Departure Risk Predictor v1.0 |
| **Model Type** | Random Forest / XGBoost / Logistic Regression |
| **Task** | Binary classification: predict 3rd semester departure |
| **Training Data** | CSULB first-time freshmen cohorts |
| **Features** | HS GPA, college GPA, DFW rates, demographics |
| **Performance (AUC)** | [Insert from evaluation] |
| **Intended Use** | Early warning system for academic advisors |
| **Limitations** | Trained on CSULB data; may not generalize to other institutions |
| **Ethical Considerations** | Monitor for bias across demographic groups |
| **Retraining Schedule** | Annually, with each new cohort |
| **Owner** | Institutional Research and Analytics Department |
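
A model card is easier to keep current when it is stored next to the model artifact in a machine-readable form. The following sketch writes the template above to a JSON file. The key names, the `model_card.json` file name, and the `missing_fields` helper are assumptions for illustration; fields that are not known yet are left as `None` rather than filled in.

```python
import json
import os
import tempfile

# Keys mirror the template table; values not known here are left as None.
model_card = {
    "model_name": "Student Departure Risk Predictor v1.0",
    "model_type": None,  # fill in the selected model
    "task": "Binary classification: predict 3rd semester departure",
    "training_data": "CSULB first-time freshmen cohorts",
    "features": ["HS GPA", "college GPA", "DFW rates", "demographics"],
    "performance_auc": None,  # insert from evaluation
    "intended_use": "Early warning system for academic advisors",
    "limitations": "Trained on CSULB data; may not generalize to other institutions",
    "ethical_considerations": "Monitor for bias across demographic groups",
    "retraining_schedule": "Annually, with each new cohort",
    "owner": "Institutional Research and Analytics Department",
}


def missing_fields(card):
    """Return the keys that still need a value before the card is complete."""
    return [key for key, value in card.items() if value is None]


# Write the card to a temporary directory; in practice, store it beside the model file.
card_path = os.path.join(tempfile.mkdtemp(), "model_card.json")
with open(card_path, "w", encoding="utf-8") as f:
    json.dump(model_card, f, indent=2)

print("Fields still to complete:", missing_fields(model_card))
```

Checking `missing_fields` before deployment is a simple way to catch a card that was saved with placeholders still in it.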

## 4. Summary

### Key Decisions

1. **Choose Logistic Regression** when interpretability and compliance are paramount
2. **Choose Random Forest** for a reliable, robust default with good performance
3. **Choose XGBoost** when maximizing predictive performance (for example, AUC) is the top priority
4. **Document your choice** with a model card for institutional transparency
5. **Plan for retraining** as student populations and institutional contexts evolve

### Course Progress

You have now completed the core supervised learning modules of Course 3. Next:
- **Module 4**: Unsupervised Learning (coming soon)
- **Module 5**: AI-Assisted Coding
- **Module 6**: Special Topics (additional algorithms)
- **Capstones**: End-to-end applied projects