# Capstone Project 3: Early Warning System - Combining Multiple Models

# Student Workbook

Welcome to your Capstone Project workbook! This notebook provides a structured outline for building an early warning system. Your task is to fill in the missing code where indicated (replace `...` with appropriate code) to complete the analysis. Good luck!

# Understand

## Building a Production-Ready Early Warning System

In this project, you will:
1. Build an ensemble of models from different model families
2. Implement a stacking classifier to combine model predictions
3. Create risk scores and early warning tiers (High/Medium/Low risk)
4. Develop student intervention recommendations
5. Create deployment documentation

### Learning Objectives

By the end of this capstone, you will be able to:
1. Implement ensemble methods including stacking classifiers
2. Create calibrated risk scores and probability estimates
3. Design tiered intervention frameworks based on risk levels
4. Prepare deployment documentation for institutional use

# Prepare

#### **Step 1: Import Libraries and Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Base Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Ensemble Methods
from sklearn.ensemble import StackingClassifier, VotingClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, classification_report, brier_score_loss
)

# Model persistence
import joblib
import json
from datetime import datetime

# Set random seed
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("All libraries imported successfully!")

In [None]:
# Load data
data_location = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/'
df = ...                    # Load the student academics data
print(f"Dataset shape: {df.shape}")
...                         # Display first few rows

#### **Step 2: Data Cleaning**

In [None]:
# Standard data cleaning
df['RACE_ETHNICITY'] = ...     # Replace rare categories with 'Other'

df = ...                       # Remove Nonbinary rows
df['GENDER'] = ...             # Clean and standardize

...                            # Drop SEM_1_STATUS and SEM_2_STATUS
...                            # Remove duplicates

# Drop columns with >50% missing
missing_pct = df.isnull().sum() / len(df)
cols_to_drop = missing_pct[missing_pct > 0.5].index.tolist()
df.drop(columns=cols_to_drop, inplace=True)

# Create target
df['DEPARTED'] = ...           # Create binary target

print(f"Cleaned dataset shape: {df.shape}")
print(f"Departure rate: {df['DEPARTED'].mean():.2%}")

#### **Step 3: Feature Engineering**

In [None]:
# Feature engineering
def create_features(df):
    df = df.copy()
    
    # DFW Rates
    df['DFW_RATE_1'] = ...     # Calculate DFW rate semester 1
    df['DFW_RATE_2'] = ...     # Calculate DFW rate semester 2
    
    # Grade Points
    df['GRADE_POINTS_1'] = ... # Calculate grade points semester 1
    df['GRADE_POINTS_2'] = ... # Calculate grade points semester 2
    
    # GPA Trend (change from sem 1 to sem 2)
    df['GPA_TREND'] = ...      # Calculate GPA trend
    
    # Cumulative metrics
    df['TOTAL_UNITS_ATTEMPTED'] = ...  # Sum of attempted units
    df['TOTAL_UNITS_COMPLETED'] = ...  # Sum of completed units
    df['OVERALL_COMPLETION_RATE'] = ... # Overall completion rate
    
    return df

df = create_features(df)
print("Features created successfully.")

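Rate features divide one column by another, so a student with zero attempted units can produce a division-by-zero result. The following is a minimal sketch on made-up numbers, not the workbook's solution; it assumes you prefer a missing value (`NaN`) over `inf` in that case, and the `toy` DataFrame is purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); the second student attempted zero units.
toy = pd.DataFrame({
    "TOTAL_UNITS_ATTEMPTED": [30, 0, 24],
    "TOTAL_UNITS_COMPLETED": [27, 0, 12],
})

# Replace a zero denominator with NaN so the rate becomes missing, not inf.
attempted = toy["TOTAL_UNITS_ATTEMPTED"].replace(0, np.nan)
toy["OVERALL_COMPLETION_RATE"] = toy["TOTAL_UNITS_COMPLETED"] / attempted

print(toy)
```

Missing rates created this way can then be handled by whatever imputation step you apply to the other numeric features.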
#### **Step 4: Prepare Data for Modeling**

In [None]:
# Define features
numeric_features = ...         # List of numeric feature column names

categorical_features = ...     # List of categorical feature column names

target = 'DEPARTED'

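# Handle missing values in numeric features
# Note: these medians are computed on the full dataset, before the split below.
# To avoid leaking validation/test information, consider computing the medians
# on the training split only and reusing them for the other splits.
for col in numeric_features:
    if col in df.columns and df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())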

In [None]:
# Train/Validation/Test split (60/20/20)
# First split off test set
train_val_df, test_df = ...    # Split to get test set (20%)

# Then split training into train and validation
train_df, val_df = ...         # Split remaining into train and validation

print(f"Training set: {len(train_df):,} students")
print(f"Validation set: {len(val_df):,} students")
print(f"Test set: {len(test_df):,} students")

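The 60/20/20 arithmetic is easy to get wrong: after holding out 20% for testing, the validation set must be 25% of the remaining 80% to equal 20% of the full data. Here is a sketch on synthetic data, assuming a stratified split on the target; the `toy_*` names and the `GPA` column are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned student table (hypothetical columns).
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "GPA": rng.normal(3.0, 0.5, size=1000),
    "DEPARTED": rng.binomial(1, 0.2, size=1000),
})

# Step 1: hold out 20% as the test set, preserving the departure rate.
toy_train_val, toy_test = train_test_split(
    toy_df, test_size=0.20, stratify=toy_df["DEPARTED"], random_state=42
)

# Step 2: 25% of the remaining 80% equals 20% of the full dataset.
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.25,
    stratify=toy_train_val["DEPARTED"], random_state=42
)

split_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(split_sizes)
```

Stratifying keeps the departure rate roughly equal across the three sets, which matters when the positive class is a minority.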
In [None]:
# Prepare feature matrices
all_features = numeric_features + categorical_features
available_features = [f for f in all_features if f in train_df.columns]

train_encoded = ...            # One-hot encode training features
val_encoded = ...              # One-hot encode validation features
test_encoded = ...             # One-hot encode test features

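# Align validation/test columns to the training columns
# (join='left' keeps only training columns; categories unseen in training are dropped)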
train_encoded, val_encoded = train_encoded.align(val_encoded, join='left', axis=1, fill_value=0)
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

# Fill NaN
train_encoded = train_encoded.fillna(0)
val_encoded = val_encoded.fillna(0)
test_encoded = test_encoded.fillna(0)

# Prepare X and y
X_train = ...                  # Training features
y_train = ...                  # Training target
X_val = ...                    # Validation features
y_val = ...                    # Validation target
X_test = ...                   # Test features
y_test = ...                   # Test target

# Scale features
scaler = StandardScaler()
X_train_scaled = ...           # Fit and transform
X_val_scaled = ...             # Transform only
X_test_scaled = ...            # Transform only

print(f"Feature matrix shape: {X_train.shape}")

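To see what the `align(..., join='left')` calls do, consider a toy case where the validation data contains a category that never appears in training. This is an illustrative sketch with made-up columns (`UNITS`, `COLLEGE`), not the workbook's feature set.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames: 'Business' appears only in validation, 'Arts' only in training.
toy_train = pd.DataFrame({"UNITS": [12.0, 15.0, 9.0],
                          "COLLEGE": ["Arts", "Science", "Arts"]})
toy_val = pd.DataFrame({"UNITS": [14.0, 10.0],
                        "COLLEGE": ["Business", "Science"]})

train_enc = pd.get_dummies(toy_train, columns=["COLLEGE"], dtype=float)
val_enc = pd.get_dummies(toy_val, columns=["COLLEGE"], dtype=float)

# join='left' keeps the training columns: COLLEGE_Business is dropped from
# validation, and the missing COLLEGE_Arts column is added and filled with 0.
train_enc, val_enc = train_enc.align(val_enc, join="left", axis=1, fill_value=0)

# Fit the scaler on training data only, then reuse it for validation.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_enc)
val_scaled = scaler.transform(val_enc)

print(list(val_enc.columns))
```

Fitting the scaler on the training set alone keeps validation and test statistics out of the preprocessing step.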
# Analyze

## Part 1: Build Base Models

#### **Step 5: Train Individual Base Models**

In [None]:
# Define base models
base_models = {
    'Logistic Regression': ...,   # Create LogisticRegression
    'Random Forest': ...,         # Create RandomForestClassifier
    'Gradient Boosting': ...,     # Create GradientBoostingClassifier
    'Neural Network': ...,        # Create MLPClassifier
}

# Models that need scaled data
scaled_models = ['Logistic Regression', 'Neural Network']

In [None]:
# Train and evaluate each base model
base_results = []
trained_models = {}
base_predictions = {}
base_probabilities = {}

print("Training Base Models...")
print("="*80)

for name, model in base_models.items():
    print(f"\nTraining {name}...")
    
    # Select appropriate data
    if name in scaled_models:
        X_tr, X_v = X_train_scaled, X_val_scaled
    else:
        X_tr, X_v = X_train, X_val
    
    # Train
    ...                        # Fit model
    trained_models[name] = model
    
    # Predict on validation
    y_pred = ...               # Get predictions
    y_prob = ...               # Get probabilities
    
    base_predictions[name] = y_pred
    base_probabilities[name] = y_prob
    
    # Evaluate
    results = {
        'Model': name,
        'Accuracy': ...,       # Calculate accuracy
        'F1 Score': ...,       # Calculate F1 score
        'ROC-AUC': ...,        # Calculate ROC-AUC
        'Brier Score': ...,    # Calculate Brier score
    }
    base_results.append(results)
    
    print(f"  ROC-AUC: {results['ROC-AUC']:.4f}")

base_results_df = pd.DataFrame(base_results)
print("\n" + "="*80)
print("\nBase Model Results:")
print(base_results_df.round(4).to_string(index=False))

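ROC-AUC and the Brier score measure different things: ROC-AUC depends only on how the probabilities rank students, while the Brier score is the mean squared difference between predicted probability and outcome. The sketch below uses four hand-picked probabilities (not real model output) to show two sets of predictions with identical ranking but different Brier scores.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 1, 1])
probs_confident = np.array([0.10, 0.20, 0.80, 0.90])
probs_hedged = np.array([0.40, 0.45, 0.55, 0.60])

# Brier score by hand: mean squared error between probability and outcome.
manual_brier = np.mean((probs_confident - y_true) ** 2)

print("Manual Brier (confident):", manual_brier)
print("sklearn Brier (confident):", brier_score_loss(y_true, probs_confident))
print("sklearn Brier (hedged):   ", brier_score_loss(y_true, probs_hedged))
print("ROC-AUC (confident):", roc_auc_score(y_true, probs_confident))
print("ROC-AUC (hedged):   ", roc_auc_score(y_true, probs_hedged))
```

Both sets rank every departed student above every enrolled one, so ROC-AUC cannot tell them apart; only the Brier score (lower is better) rewards the more decisive probabilities.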
## Part 2: Build Ensemble Models

#### **Step 6: Implement Voting Classifier**

In [None]:
# Create Voting Classifier
print("Building Voting Classifier...")

voting_estimators = ...        # List of (name, estimator) tuples

voting_clf = VotingClassifier(
    estimators=voting_estimators,
    voting='soft',
    n_jobs=-1
)

# Train
...                            # Fit voting classifier

# Evaluate on validation
voting_pred = ...              # Get predictions
voting_prob = ...              # Get probabilities

print(f"\nVoting Classifier (Validation):")
print(f"  ROC-AUC: {roc_auc_score(y_val, voting_prob):.4f}")
print(f"  F1 Score: {f1_score(y_val, voting_pred):.4f}")

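With `voting='soft'` and no weights, the ensemble probability is the plain average of each member's `predict_proba` output. The sketch below checks this on synthetic data from `make_classification`; the two estimators and their settings are arbitrary choices for illustration, not the models this workbook asks you to build.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=42)),
    ],
    voting="soft",
)
voter.fit(X, y)

# Average the fitted members' probabilities by hand and compare.
member_probs = [est.predict_proba(X) for est in voter.estimators_]
manual_avg = np.mean(member_probs, axis=0)
matches = np.allclose(manual_avg, voter.predict_proba(X))

print("Manual average equals VotingClassifier output:", matches)
```

Because every member gets equal weight by default, one poorly calibrated model can pull the averaged probability around; the `weights` parameter of `VotingClassifier` is one way to adjust for that.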
#### **Step 7: Implement Stacking Classifier**

In [None]:
# Create Stacking Classifier
print("Building Stacking Classifier...")

stacking_estimators = ...      # List of (name, estimator) tuples for base models

meta_learner = ...             # Create meta-learner (e.g., LogisticRegression)

stacking_clf = StackingClassifier(
    estimators=stacking_estimators,
    final_estimator=meta_learner,
    cv=5,
    stack_method='predict_proba',
    n_jobs=-1
)

# Train
...                            # Fit stacking classifier

# Evaluate on validation
stacking_pred = ...            # Get predictions
stacking_prob = ...            # Get probabilities

print(f"\nStacking Classifier (Validation):")
print(f"  ROC-AUC: {roc_auc_score(y_val, stacking_prob):.4f}")
print(f"  F1 Score: {f1_score(y_val, stacking_pred):.4f}")

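A `StackingClassifier` receives a single feature matrix, so base models that need scaled inputs cannot be handed a separate `X_train_scaled`. One possible approach, sketched here on synthetic data, is to wrap those models in a `Pipeline` with their own `StandardScaler`. The estimator names (`lr`, `rf`) and hyperparameters are illustrative assumptions, not the intended solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=42)
X_tr, X_v, y_tr, y_v = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

stack = StackingClassifier(
    estimators=[
        # The pipeline scales inputs for the linear model only.
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                        # out-of-fold predictions train the meta-learner
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)

stack_prob = stack.predict_proba(X_v)[:, 1]
stack_auc = roc_auc_score(y_v, stack_prob)
print(f"Validation ROC-AUC: {stack_auc:.4f}")

# For a binary target, each base model contributes one probability column.
print("Meta-learner coefficients:", stack.final_estimator_.coef_)
```

The meta-learner's coefficients give a rough indication of how much weight the ensemble places on each base model's predicted probability.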
#### **Step 8: Compare All Models**

In [None]:
# Add ensemble results to comparison and display
# Your code here
...

In [None]:
# Visualize model comparison
# Your code here
...

## Part 3: Create Risk Scores and Early Warning Tiers

#### **Step 9: Select Best Model and Calibrate Probabilities**

In [None]:
# Select best model
best_model = ...               # Assign best performing model
best_model_name = ...          # Name of best model

print(f"Selected model for production: {best_model_name}")

In [None]:
# Calibrate probabilities
print("Calibrating probabilities...")

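# cv='prefit' assumes best_model has already been fitted.
# Note: recent scikit-learn releases (1.6+) deprecate cv='prefit' in favor of
# wrapping the fitted model in sklearn.frozen.FrozenEstimator; check your version.
calibrated_model = CalibratedClassifierCV(
    best_model, method='isotonic', cv='prefit'
)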

# Fit calibration on validation set
...                            # Fit calibrated model

# Get calibrated probabilities on test set
calibrated_prob = ...          # Get calibrated probabilities

print(f"\nCalibration Results (Test Set):")
print(f"Calibrated Brier Score: {brier_score_loss(y_test, calibrated_prob):.4f}")

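Isotonic calibration learns a non-decreasing mapping from raw model probabilities to observed outcome rates on data the model was not trained on. The sketch below reproduces that idea by hand with `IsotonicRegression` on synthetic data. It is a swapped-in illustration of the mechanism, not the `CalibratedClassifierCV` call used in the cell above, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=10, random_state=42)
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
X_cal, X_hold, y_cal, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

# 1) Fit the classifier on its own training portion.
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# 2) Learn the probability mapping on a separate calibration portion.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(clf.predict_proba(X_cal)[:, 1], y_cal)

# 3) Apply the mapping to held-out data.
raw_prob = clf.predict_proba(X_hold)[:, 1]
cal_prob = iso.predict(raw_prob)

print(f"Raw Brier:        {brier_score_loss(y_hold, raw_prob):.4f}")
print(f"Calibrated Brier: {brier_score_loss(y_hold, cal_prob):.4f}")
```

Calibration does not always lower the Brier score, particularly when the calibration set is small, so it is worth comparing the uncalibrated and calibrated scores on the same held-out data.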
In [None]:
# Visualize calibration curves
# Your code here
...

#### **Step 10: Define Risk Tiers**

In [None]:
# Define risk tiers
def assign_risk_tier(probability):
    """
    Assign risk tier based on departure probability.
    """
    if probability >= 0.5:
        return 'High Risk'
    elif probability >= 0.25:
        return 'Medium Risk'
    else:
        return 'Low Risk'

# Apply to test set
test_results = test_df.copy()
test_results['Risk_Score'] = calibrated_prob
test_results['Risk_Tier'] = ...  # Apply risk tier function

# Summarize risk tier distribution
tier_summary = test_results.groupby('Risk_Tier').agg({
    'SID': 'count',
    'DEPARTED': ['sum', 'mean'],
    'Risk_Score': 'mean'
}).round(4)

print("RISK TIER SUMMARY")
print("="*80)
print(tier_summary)

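The tier boundaries are inclusive at the lower edge: a score of exactly 0.25 is Medium Risk and exactly 0.50 is High Risk. The sketch below applies the same cutoffs in vectorized form with `np.select` on made-up scores, then summarizes with named aggregation, which yields flat column names. The `toy` data and the summary column names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up scores chosen to sit on and around the tier boundaries.
toy = pd.DataFrame({
    "Risk_Score": [0.05, 0.25, 0.49, 0.50, 0.93],
    "DEPARTED":   [0,    0,    1,    0,    1],
})

# Conditions are checked in order, mirroring the if/elif chain.
toy["Risk_Tier"] = np.select(
    [toy["Risk_Score"] >= 0.5, toy["Risk_Score"] >= 0.25],
    ["High Risk", "Medium Risk"],
    default="Low Risk",
)

summary = toy.groupby("Risk_Tier").agg(
    students=("Risk_Score", "size"),
    departures=("DEPARTED", "sum"),
    departure_rate=("DEPARTED", "mean"),
)
print(summary)
```

A useful sanity check on real data is that the observed departure rate rises from Low to Medium to High; if it does not, the thresholds or the calibration deserve a second look.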
In [None]:
# Visualize risk tiers
# Your code here
...

## Part 4: Develop Intervention Recommendations

#### **Step 11: Define Intervention Framework**

In [None]:
# Define intervention recommendations based on risk tier
intervention_framework = {
    'High Risk': {
        'Priority': 'Immediate',
        'Interventions': [
            # List of interventions for high risk students
            ...
        ],
        'Frequency': 'Weekly monitoring',
        'Escalation': ...
    },
    'Medium Risk': {
        'Priority': 'Proactive',
        'Interventions': [
            # List of interventions for medium risk students
            ...
        ],
        'Frequency': 'Bi-weekly monitoring',
        'Escalation': ...
    },
    'Low Risk': {
        'Priority': 'Standard',
        'Interventions': [
            # List of interventions for low risk students
            ...
        ],
        'Frequency': 'Monthly monitoring',
        'Escalation': ...
    }
}

# Display intervention framework
print("INTERVENTION FRAMEWORK")
print("="*80)
for tier, details in intervention_framework.items():
    print(f"\n{tier.upper()}")
    print(f"Priority: {details['Priority']}")
    print(f"Interventions: {details['Interventions']}")

#### **Step 12: Generate Student-Level Recommendations**

In [None]:
def generate_student_recommendations(row):
    """
    Generate personalized intervention recommendations based on student profile.
    """
    recommendations = []
    risk_factors = []
    
    # Check academic performance and add relevant recommendations
    # Your code to identify risk factors and add recommendations
    ...
    
    return {
        'risk_factors': risk_factors if risk_factors else ['No specific risk factors identified'],
        'recommendations': recommendations if recommendations else ['Continue standard monitoring']
    }

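One way to keep student-level rules readable is a small rule table that pairs each feature with a condition, a risk-factor label, and a suggested action. The sketch below shows the pattern only: the thresholds, labels, and actions are hypothetical placeholders rather than recommended values, and the helper name `flag_risk_factors` is invented for illustration.

```python
# Hypothetical rule table: (column, condition, risk-factor label, action).
# Thresholds are placeholders; choose your own from the data and policy.
RULES = [
    ("DFW_RATE_2", lambda v: v > 0.25,
     "High DFW rate in semester 2", "Refer to tutoring"),
    ("GPA_TREND", lambda v: v < -0.5,
     "Declining GPA", "Schedule advising meeting"),
    ("OVERALL_COMPLETION_RATE", lambda v: v < 0.67,
     "Low unit completion", "Review course load"),
]


def flag_risk_factors(row):
    """Return (risk_factors, actions) for one student record.

    `row` can be a dict or a pandas Series; both support `.get`.
    """
    factors, actions = [], []
    for column, is_flagged, label, action in RULES:
        value = row.get(column)
        # Skip missing columns and NaN values (NaN is never equal to itself).
        if value is not None and value == value and is_flagged(value):
            factors.append(label)
            actions.append(action)
    return factors, actions


example_row = {"DFW_RATE_2": 0.40, "GPA_TREND": -0.1,
               "OVERALL_COMPLETION_RATE": 0.60}
factors, actions = flag_risk_factors(example_row)
print(factors)
print(actions)
```

Keeping the rules in data rather than in nested `if` statements makes it easier for advisors to review and adjust them without touching the function body.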
In [None]:
# Generate recommendations for a sample of up to 10 high-risk students
high_risk_students = test_results[test_results['Risk_Tier'] == 'High Risk'].head(10)

print("SAMPLE STUDENT INTERVENTION REPORTS")
print("="*80)

for idx, row in high_risk_students.iterrows():
    rec = generate_student_recommendations(row)
    print(f"\nStudent ID: {row['SID']}")
    print(f"Risk Score: {row['Risk_Score']:.2%}")
    print(f"Risk Factors: {rec['risk_factors']}")
    print(f"Recommendations: {rec['recommendations']}")
    print("-"*40)

# Deploy

#### **Step 13: Final Model Evaluation on Test Set**

In [None]:
# Final evaluation on test set
# The 0.5 cutoff matches the 'High Risk' threshold defined in Step 10
final_pred = (calibrated_prob >= 0.5).astype(int)

print("="*80)
print("FINAL MODEL EVALUATION (Test Set)")
print("="*80)
print(f"\nPerformance Metrics:")
print(f"  Accuracy: {accuracy_score(y_test, final_pred):.4f}")
print(f"  Precision: {precision_score(y_test, final_pred):.4f}")
print(f"  Recall: {recall_score(y_test, final_pred):.4f}")
print(f"  F1 Score: {f1_score(y_test, final_pred):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, calibrated_prob):.4f}")

In [None]:
# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, final_pred, target_names=['Enrolled', 'Departed']))

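Precision and recall depend on the decision threshold, which matters for an early warning system where missing an at-risk student may cost more than a false alarm. The sketch below uses eight hand-picked probabilities (not model output) to compare the 0.5 cutoff with the 0.25 Medium Risk cutoff.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.05, 0.10, 0.30, 0.35, 0.45, 0.55, 0.70, 0.90])

threshold_metrics = {}
for threshold in (0.5, 0.25):
    preds = (probs >= threshold).astype(int)
    threshold_metrics[threshold] = (
        precision_score(y_true, preds),
        recall_score(y_true, preds),
    )
    precision, recall = threshold_metrics[threshold]
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

In this toy case, lowering the threshold catches every departing student at the cost of two false alarms, which is the trade-off the tiered framework is meant to manage.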
#### **Step 14: Create Deployment Documentation**

In [None]:
# Generate deployment documentation
deployment_doc = {
    'model_info': {
        'name': 'Student Departure Early Warning System',
        'version': '1.0',
        'created_date': datetime.now().strftime('%Y-%m-%d'),
        'model_type': best_model_name,
        # Add more model information
    },
    'performance': {
        # Add performance metrics
    },
    'risk_tiers': {
        # Add risk tier definitions
    },
    'usage': {
        # Add usage instructions
    },
    'limitations': [
        # Add limitations
    ]
}

print("DEPLOYMENT DOCUMENTATION")
print("="*80)
print(json.dumps(deployment_doc, indent=2))

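When you add performance metrics to the documentation dictionary, note that `json.dumps` cannot serialize some NumPy scalar types (for example `np.int64` or `np.float32`). The sketch below shows one way to convert them to built-in Python numbers first; the metric values are made up and the helper `to_builtin` is a hypothetical name.

```python
import json

import numpy as np

# Made-up metric values stored as NumPy scalars.
raw_metrics = {"roc_auc": np.float32(0.8125), "n_test": np.int64(200)}


def to_builtin(value):
    """Convert NumPy scalars to plain Python numbers; leave other values unchanged."""
    return value.item() if isinstance(value, np.generic) else value


clean_metrics = {key: to_builtin(val) for key, val in raw_metrics.items()}

doc_json = json.dumps({"performance": clean_metrics}, indent=2)
round_trip = json.loads(doc_json)
print(doc_json)
```

Round-tripping the JSON through `json.loads` is a quick way to confirm the documentation can be read back by another system.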
#### **Step 15: Produce Comprehensive Deployment Report**

### Deliverable: Early Warning System Deployment Documentation

Using the analyses above, write a comprehensive deployment report that addresses the following:

1. **System Overview**: Describe the early warning system, including purpose, goals, and model architecture.

2. **Technical Specifications**: Document required input features, model components, and output interpretation.

3. **Risk Tier Framework**: Explain threshold definitions and intervention tiers.

4. **Intervention Recommendations**: Provide guidance on interventions for each risk tier.

5. **Operational Considerations**: Address data pipeline requirements and monitoring needs.

6. **Limitations and Ethical Guidelines**: Discuss known limitations and appropriate use.

> **Rubric**: Your report should be 4-5 pages and include:
> - Complete technical documentation
> - Risk tier visualization and validation
> - Intervention framework with specific recommendations
> - Operational deployment plan
> - Ethical guidelines and limitations

---

## Your Report (Write Below)

*[Write your comprehensive deployment documentation here]*

---