# Capstone Project 1: Predicting Third Semester Retention - Model Tournament

# Student Workbook

Welcome to your Capstone Project workbook! This notebook provides a structured outline for your project. Your task is to fill in the missing code where indicated (replace `...` with appropriate code) to complete the steps and analysis. Good luck!

# Understand

A university's Student Success Center is seeking to implement an early warning system to identify students at risk of not returning for their third semester. Your task is to build and compare five different machine learning models to determine which approach will be most effective.

### Learning Objectives

By the end of this capstone, you will be able to:
1. Build and tune five different model families for classification
2. Compare models across multiple performance metrics
3. Consider trade-offs between accuracy, interpretability, and computational cost
4. Communicate findings to non-technical stakeholders

# Prepare

## Data Wrangling

#### **Step 1: Import Libraries and Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models - All five families
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    brier_score_loss, log_loss, make_scorer
)

# Timing
import time

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("All libraries imported successfully!")

In [None]:
# Load data
data_location = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/'
df = ...                    # Load the student academics data into a pandas DataFrame
print(f"Dataset shape: {df.shape}")
...                         # Display the first few rows

#### **Step 2: Data Quality - Handle Rare Categories and Missing Values**

In [None]:
# Address Rare Classes in RACE_ETHNICITY
df['RACE_ETHNICITY'] = ...     # Replace rare categories with 'Other'
print("Race/Ethnicity categories:")
print(df['RACE_ETHNICITY'].value_counts())

In [None]:
# Address Rare Classes in GENDER
df = ...                       # Remove rows where gender is 'Nonbinary'
df['GENDER'] = ...             # Clean and standardize the 'GENDER' column
print("Gender categories:")
print(df['GENDER'].value_counts())
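
One possible way to collapse rare categories is sketched below on toy data. The helper name `collapse_rare`, the 5% cutoff, and the category labels are illustrative assumptions, not values taken from the course dataset; choose a cutoff that suits your own category counts.

```python
import pandas as pd

def collapse_rare(series, min_share=0.05, other_label="Other"):
    """Replace categories whose share of rows falls below min_share (hypothetical helper)."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; replace the rest with other_label
    return series.where(~series.isin(rare), other_label)

# Toy data standing in for a column such as RACE_ETHNICITY
toy_categories = pd.Series(["A"] * 60 + ["B"] * 37 + ["C"] * 2 + ["D"] * 1)
toy_collapsed = collapse_rare(toy_categories, min_share=0.05)
print(toy_collapsed.value_counts())
```

Here "C" and "D" together make up 3% of rows, so both are merged into "Other" while "A" and "B" are left alone.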

In [None]:
# Drop noninformative features
df.drop(['SEM_1_STATUS', 'SEM_2_STATUS'], axis=1, inplace=True)

# Remove duplicates
...                            # Drop duplicate rows

# Drop columns with >50% missing values
missing_values_count = ...     # Count missing values per column
total_rows = ...               # Get total number of rows
columns_to_drop = ...          # Identify columns with >50% missing
df.drop(columns=columns_to_drop, inplace=True)

print(f"Cleaned dataset shape: {df.shape}")
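
The missing-value rule above can be sketched on a toy DataFrame. This version uses `isna().mean()` to get the missing share directly, which is equivalent to dividing the per-column missing count by the number of rows; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    "half_missing":   [np.nan, np.nan, 2.0, 3.0],      # exactly 50% missing
    "complete":       [1, 2, 3, 4],
})

# Fraction of missing values per column
toy_missing_share = toy_df.isna().mean()

# Strictly greater than 50%, so a column that is exactly half missing is kept
toy_to_drop = toy_missing_share[toy_missing_share > 0.5].index.tolist()
toy_cleaned = toy_df.drop(columns=toy_to_drop)
print(toy_to_drop)
```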

#### **Step 3: Create Target Variable and Train/Test Split**

In [None]:
# Create binary target: 1 = Departed (not enrolled), 0 = Enrolled
df['DEPARTED'] = ...           # Create the binary target variable

print("Target variable distribution:")
print(df['DEPARTED'].value_counts())
print(f"\nDeparture rate: {df['DEPARTED'].mean():.2%}")

In [None]:
# Split into train and test sets
train_df, test_df = ...        # Use train_test_split with 80/20 split and stratification

print(f"Training set: {train_df.shape[0]:,} students")
print(f"Testing set: {test_df.shape[0]:,} students")
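
A minimal sketch of a stratified 80/20 split, using a toy DataFrame with a 20% departure rate. Passing the target column to `stratify` asks `train_test_split` to preserve the class proportions in both pieces; the toy column `x` is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 departed students out of 100
toy_students = pd.DataFrame({"x": range(100), "DEPARTED": [1] * 20 + [0] * 80})

toy_train, toy_test = train_test_split(
    toy_students,
    test_size=0.2,                      # 80/20 split
    stratify=toy_students["DEPARTED"],  # keep the departure rate equal in both sets
    random_state=42,
)
print(toy_train["DEPARTED"].mean(), toy_test["DEPARTED"].mean())
```

Without stratification, a small test set can end up with noticeably more or fewer departed students than the training set, which makes metrics such as recall harder to interpret.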

#### **Step 4: Handle Missing Values and Feature Engineering**

In [None]:
def impute_missing_values(df_train, df_test):
    """Impute missing values using train statistics to prevent data leakage."""
    df_train = df_train.copy()
    df_test = df_test.copy()
    
    for col in df_train.columns:
        if df_train[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df_train[col]):
                median_val = df_train[col].median()
                df_train[col] = df_train[col].fillna(median_val)
                df_test[col] = df_test[col].fillna(median_val)
            else:
                mode_val = df_train[col].mode()[0]
                df_train[col] = df_train[col].fillna(mode_val)
                df_test[col] = df_test[col].fillna(mode_val)
    
    return df_train, df_test

train_df, test_df = ...        # Call the impute function
print("Missing values imputed successfully.")

In [None]:
# Feature Engineering: Create DFW rates and grade points
def create_features(df):
    df = df.copy()
    
    # DFW Rate (proportion of attempted units not completed)
    df['DFW_RATE_1'] = ...     # Calculate DFW rate for semester 1
    df['DFW_RATE_2'] = ...     # Calculate DFW rate for semester 2
    
    # Grade Points
    df['GRADE_POINTS_1'] = ... # Calculate grade points for semester 1
    df['GRADE_POINTS_2'] = ... # Calculate grade points for semester 2
    
    return df

train_df = create_features(train_df)
test_df = create_features(test_df)
print("Features created successfully.")
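
The sketch below illustrates the arithmetic behind these features on toy data. The column names `UNITS_ATTEMPTED_1`, `UNITS_COMPLETED_1`, and `GPA_1` are hypothetical and may not match the dataset, and treating grade points as GPA multiplied by attempted units is an assumption; check the data dictionary before adapting it.

```python
import numpy as np
import pandas as pd

toy_terms = pd.DataFrame({
    "UNITS_ATTEMPTED_1": [15, 12, 0],     # hypothetical column names
    "UNITS_COMPLETED_1": [15, 9, 0],
    "GPA_1": [3.5, 2.0, np.nan],
})

# Replace zero attempted units with NaN so the division cannot produce inf
toy_attempted = toy_terms["UNITS_ATTEMPTED_1"].replace(0, np.nan)

# DFW rate: share of attempted units that were not completed
toy_terms["DFW_RATE_1"] = 1 - toy_terms["UNITS_COMPLETED_1"] / toy_attempted

# Grade points (assumed definition): GPA weighted by attempted units
toy_terms["GRADE_POINTS_1"] = toy_terms["GPA_1"] * toy_terms["UNITS_ATTEMPTED_1"]
print(toy_terms)
```

Students with zero attempted units end up with a missing DFW rate rather than a misleading 0 or an infinite value, so they would need to be imputed or handled explicitly afterwards.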

#### **Step 5: Define Feature Sets and Prepare Data for Modeling**

In [None]:
# Define feature categories
numeric_features = ...         # List of numeric feature column names

categorical_features = ...     # List of categorical feature column names

target = 'DEPARTED'

print(f"Numeric features ({len(numeric_features)}): {numeric_features[:5]}...")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

In [None]:
# One-hot encode categorical variables
train_encoded = ...            # Use pd.get_dummies to encode training data
test_encoded = ...             # Use pd.get_dummies to encode test data

# Align columns between train and test
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

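# Handle any remaining missing values using training-set medians only,
# so that no test-set statistics leak into preprocessing
train_medians = train_encoded.median()
train_encoded = train_encoded.fillna(train_medians)
test_encoded = test_encoded.fillna(train_medians)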

# Prepare X and y
X_train = ...                  # Features for training
y_train = ...                  # Target for training
X_test = ...                   # Features for testing
y_test = ...                   # Target for testing

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
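
To see why the `align` call with `join='left'` matters, consider this toy sketch in which the test set contains a category that never appears in training. The column names and values are made up for illustration.

```python
import pandas as pd

toy_train_raw = pd.DataFrame({"GENDER": ["F", "M", "F"], "UNITS": [15, 12, 9]})
toy_test_raw = pd.DataFrame({"GENDER": ["F", "X"], "UNITS": [14, 10]})

toy_train_enc = pd.get_dummies(toy_train_raw, columns=["GENDER"], dtype=int)
toy_test_enc = pd.get_dummies(toy_test_raw, columns=["GENDER"], dtype=int)

# join="left" keeps exactly the training columns: GENDER_M is added to the
# test set as all zeros, and the unseen GENDER_X column is dropped.
toy_train_enc, toy_test_enc = toy_train_enc.align(
    toy_test_enc, join="left", axis=1, fill_value=0
)
print(list(toy_test_enc.columns))
```

After alignment both frames have identical columns in identical order, which is what a fitted model expects at prediction time.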

In [None]:
# Scale features for models that require it
scaler = ...                   # Create a StandardScaler
X_train_scaled = ...           # Fit and transform training data
X_test_scaled = ...            # Transform test data

print("Features scaled successfully!")
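
A small sketch of the fit-on-train, transform-on-test pattern, using toy arrays. The scaler learns its mean and standard deviation from the training data only, and the test data is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_X_train = np.array([[1.0], [2.0], [3.0]])
toy_X_test = np.array([[2.0], [5.0]])

toy_scaler = StandardScaler()
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)  # learns mean and std from train
toy_X_test_scaled = toy_scaler.transform(toy_X_test)        # reuses the training mean and std

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_X_test_scaled.ravel())
```

Note that the scaled test values are not centered on zero. That is expected: the test set is described relative to the training distribution.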

#### **Step 6: Exploratory Data Analysis**

In [None]:
# Create at least one visualization comparing departure rates
# across different student groups

# Your code here
...
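
Before plotting, it often helps to build the table that the chart will show. The sketch below computes departure rates per group on toy data; `FIRST_GEN` is a hypothetical grouping column, so substitute whichever categorical column you want to compare.

```python
import pandas as pd

toy_eda = pd.DataFrame({
    "FIRST_GEN": ["Yes", "Yes", "No", "No", "No", "Yes"],   # hypothetical grouping column
    "DEPARTED":  [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 target is the departure rate; size gives the group count
toy_rates = toy_eda.groupby("FIRST_GEN")["DEPARTED"].agg(
    departure_rate="mean", n_students="size"
)
print(toy_rates)
```

Reporting the group size alongside the rate guards against over-reading differences that come from very small groups. The resulting table can be passed to a bar chart in Plotly, seaborn, or matplotlib.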

# Analyze

## Model Tournament: Building and Comparing All Five Model Families

In [None]:
# Dictionary to store all models and results
models = {}
training_times = {}
all_results = []

#### **Step 7: Build Regularized Logistic Regression Model**

In [None]:
# L2 Regularized Logistic Regression
print("Training L2 Regularized Logistic Regression...")
start_time = time.time()

lr_l2 = ...                    # Create LogisticRegression with L2 penalty
...                            # Fit on scaled training data

training_times['Logistic Regression (L2)'] = time.time() - start_time
models['Logistic Regression (L2)'] = ('scaled', lr_l2)

print(f"Training completed in {training_times['Logistic Regression (L2)']:.2f} seconds")
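
If you want to tune the regularization strength rather than pick it by hand, one option is a cross-validated grid search, sketched here on synthetic data. The grid values, the `roc_auc` scoring choice, and the synthetic dataset are illustrative assumptions. L2 is scikit-learn's default penalty for `LogisticRegression`, so it is not set explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the student data
X_toy_lr, y_toy_lr = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),   # default penalty is L2
])

toy_search = GridSearchCV(
    toy_pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},   # smaller C = stronger regularization
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
toy_search.fit(X_toy_lr, y_toy_lr)
print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

Putting the scaler inside the pipeline means each cross-validation fold is scaled with its own training statistics, which avoids leaking validation data into preprocessing.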

#### **Step 8: Build Decision Tree Model**

In [None]:
# Decision Tree with tuned hyperparameters
print("Training Decision Tree Classifier...")
start_time = time.time()

dt = ...                       # Create DecisionTreeClassifier
...                            # Fit on unscaled training data

training_times['Decision Tree'] = time.time() - start_time
models['Decision Tree'] = ('unscaled', dt)

print(f"Training completed in {training_times['Decision Tree']:.2f} seconds")

#### **Step 9: Build Random Forest Model**

In [None]:
# Random Forest with tuned hyperparameters
print("Training Random Forest Classifier...")
start_time = time.time()

rf = ...                       # Create RandomForestClassifier
...                            # Fit on unscaled training data

training_times['Random Forest'] = time.time() - start_time
models['Random Forest'] = ('unscaled', rf)

print(f"Training completed in {training_times['Random Forest']:.2f} seconds")

#### **Step 10: Build Gradient Boosting Model**

In [None]:
# Gradient Boosting Classifier
print("Training Gradient Boosting Classifier...")
start_time = time.time()

gb = ...                       # Create GradientBoostingClassifier
...                            # Fit on unscaled training data

training_times['Gradient Boosting'] = time.time() - start_time
models['Gradient Boosting'] = ('unscaled', gb)

print(f"Training completed in {training_times['Gradient Boosting']:.2f} seconds")

#### **Step 11: Build Neural Network Model**

In [None]:
# Neural Network (MLP)
print("Training Neural Network (MLP) Classifier...")
start_time = time.time()

nn = ...                       # Create MLPClassifier
...                            # Fit on scaled training data

training_times['Neural Network'] = time.time() - start_time
models['Neural Network'] = ('scaled', nn)

print(f"Training completed in {training_times['Neural Network']:.2f} seconds")

In [None]:
# Summary of trained models
print("="*60)
print("MODEL TRAINING SUMMARY")
print("="*60)
for model_name, train_time in sorted(training_times.items(), key=lambda x: x[1]):
    print(f"{model_name:<30} {train_time:>15.3f}s")
print("="*60)
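
The per-model timing pattern used in Steps 7 to 11 can also be written as a loop. The sketch below does this for the three tree-based families on synthetic data, using `time.perf_counter()`, which is generally better suited to measuring short durations than `time.time()`. The hyperparameter values are arbitrary placeholders, not recommended settings.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy_tree, y_toy_tree = make_classification(n_samples=300, n_features=8, random_state=42)

# Placeholder hyperparameters, chosen only to keep the example fast
toy_models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

toy_times = {}
for name, model in toy_models.items():
    start = time.perf_counter()
    model.fit(X_toy_tree, y_toy_tree)     # tree-based models do not need scaled inputs
    toy_times[name] = time.perf_counter() - start

for name, seconds in sorted(toy_times.items(), key=lambda item: item[1]):
    print(f"{name:<20} {seconds:.3f}s")
```

Timings from a single fit are noisy, so treat small differences between models as indicative rather than exact.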

#### **Step 12: Evaluate All Models**

In [None]:
def evaluate_model(model, X_test, y_test, model_name, scaled=False, X_test_scaled=None):
    """Comprehensive model evaluation returning multiple metrics."""
    # Select appropriate test set
    X_eval = X_test_scaled if scaled else X_test
    
    # Get predictions
    y_pred = ...               # Get class predictions
    y_prob = ...               # Get probability of positive class
    
    # Calculate metrics
    metrics = {
        'Model': model_name,
        'Accuracy': ...,       # Calculate accuracy
        'Precision': ...,      # Calculate precision
        'Recall': ...,         # Calculate recall
        'F1 Score': ...,       # Calculate F1 score
        'ROC-AUC': ...,        # Calculate ROC-AUC
        'Avg Precision': ...,  # Calculate average precision
    }
    
    return metrics, y_pred, y_prob
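
As a reference for which inputs each metric expects, here is a sketch on a tiny hand-made example. The labels and probabilities are made up. The threshold-based metrics take hard class predictions, while ROC-AUC and average precision take the predicted probability of the positive class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, average_precision_score, f1_score,
    precision_score, recall_score, roc_auc_score,
)

toy_y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_y_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05])
toy_y_pred = (toy_y_prob >= 0.5).astype(int)   # default 0.5 decision threshold

toy_metrics = {
    # Metrics computed from hard class predictions
    "Accuracy": accuracy_score(toy_y_true, toy_y_pred),
    "Precision": precision_score(toy_y_true, toy_y_pred),
    "Recall": recall_score(toy_y_true, toy_y_pred),
    "F1 Score": f1_score(toy_y_true, toy_y_pred),
    # Metrics computed from predicted probabilities (threshold-free)
    "ROC-AUC": roc_auc_score(toy_y_true, toy_y_prob),
    "Avg Precision": average_precision_score(toy_y_true, toy_y_prob),
}
print(toy_metrics)
```

In this example two of the three departed students are flagged and one enrolled student is flagged by mistake, so precision and recall are both 2/3 while accuracy is 0.75.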

In [None]:
# Evaluate all models
all_results = []
predictions = {}
probabilities = {}

for model_name, (scale_type, model) in models.items():
    scaled = (scale_type == 'scaled')
    metrics, y_pred, y_prob = evaluate_model(
        model, X_test, y_test, model_name, 
        scaled=scaled, 
        X_test_scaled=X_test_scaled
    )
    all_results.append(metrics)
    predictions[model_name] = y_pred
    probabilities[model_name] = y_prob

# Create results DataFrame
results_df = pd.DataFrame(all_results)
results_df = results_df.set_index('Model')
results_df['Training Time (s)'] = results_df.index.map(training_times)

print("Model evaluation complete!")

In [None]:
# Display comprehensive results table
print("="*100)
print("MODEL TOURNAMENT RESULTS")
print("="*100)
print(results_df.round(4).to_string())
print("="*100)

#### **Step 13: Visualize Model Comparison**

In [None]:
# Create a bar chart comparing model performance across metrics
# Your code here
...

In [None]:
# Create ROC Curve Comparison
# Your code here
...

In [None]:
# Create Precision-Recall Curve Comparison
# Your code here
...
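
Both curve types are built from arrays that scikit-learn computes from the true labels and the predicted probabilities. The sketch below shows those arrays for a made-up example and leaves the plotting itself to whichever library you prefer.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

toy_curve_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_curve_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05])

# ROC curve: false positive rate (x axis) against true positive rate (y axis)
toy_fpr, toy_tpr, _ = roc_curve(toy_curve_true, toy_curve_prob)
toy_roc_auc = auc(toy_fpr, toy_tpr)

# Precision-recall curve: recall (x axis) against precision (y axis)
toy_precision, toy_recall, _ = precision_recall_curve(toy_curve_true, toy_curve_prob)

print(f"ROC-AUC: {toy_roc_auc:.3f}")
print(list(zip(toy_fpr.round(2), toy_tpr.round(2))))
```

To compare models, compute these arrays once per entry in the `probabilities` dictionary and draw each as a separate line on shared axes. For the ROC plot, a diagonal from (0, 0) to (1, 1) is a useful reference for a model with no discriminating power.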

#### **Step 14: Confusion Matrices for All Models**

In [None]:
# Create confusion matrices for all models
# Your code here
...
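
A short sketch of how to read a scikit-learn confusion matrix, using made-up labels and predictions. With labels 0 (enrolled) and 1 (departed), rows are the true classes and columns are the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_cm_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_cm_pred = np.array([0, 0, 0, 1, 0, 1, 1, 0])

toy_cm = confusion_matrix(toy_cm_true, toy_cm_pred)

# For binary labels the flattened order is TN, FP, FN, TP
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()
print(toy_cm)
print(f"Missed departures (FN): {toy_fn}, false alarms (FP): {toy_fp}")
```

For an early warning system the two error types carry different costs: a false negative is a student who departs without being flagged, while a false positive is outreach to a student who would have returned anyway.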

#### **Step 15: Feature Importance Comparison**

In [None]:
# Extract and compare feature importances from different models
# Your code here
...
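
Different model families expose importance in different ways: linear models through coefficients (`coef_`) and tree ensembles through `feature_importances_`. The sketch below puts both side by side on synthetic data in which only `f0` drives the outcome. The feature names and data are made up, and coefficient magnitudes are only comparable across features when the features share a common scale.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_toy_imp = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
# Only f0 carries signal; the remaining columns are pure noise
y_toy_imp = (X_toy_imp["f0"] + 0.3 * rng.normal(size=500) > 0).astype(int)

toy_lr_imp = LogisticRegression(max_iter=1000).fit(X_toy_imp, y_toy_imp)
toy_rf_imp = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy_imp, y_toy_imp)

toy_importance = pd.DataFrame(
    {
        "LR |coef|": np.abs(toy_lr_imp.coef_[0]),           # magnitude of each coefficient
        "RF importance": toy_rf_imp.feature_importances_,   # impurity-based, sums to 1
    },
    index=X_toy_imp.columns,
)
print(toy_importance.sort_values("RF importance", ascending=False))
```

Because the two columns are on different scales, compare the rankings they produce rather than the raw numbers.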

#### **Step 16: Model Selection - Determine Tournament Winner**

In [None]:
# Rank models and determine the winner
print("="*80)
print("MODEL TOURNAMENT - FINAL RANKINGS")
print("="*80)

# Your code to rank models by different criteria
...

# Determine and print the tournament winner
tournament_winner = ...        # Determine the best model
print(f"\n*** TOURNAMENT WINNER: {tournament_winner} ***")
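
One simple ranking scheme, among many, is to rank the models on each criterion and average the ranks. The sketch below uses placeholder numbers for three hypothetical models; none of the values are results from this project, and weighting every criterion equally is itself an assumption you may want to revisit.

```python
import pandas as pd

# Placeholder values for illustration only
toy_results = pd.DataFrame(
    {
        "ROC-AUC": [0.80, 0.85, 0.83],
        "Recall": [0.60, 0.55, 0.70],
        "Training Time (s)": [0.1, 5.0, 1.0],
    },
    index=["Model A", "Model B", "Model C"],
)

toy_ranks = pd.DataFrame({
    # Higher is better for performance metrics, so rank in descending order
    "ROC-AUC": toy_results["ROC-AUC"].rank(ascending=False),
    "Recall": toy_results["Recall"].rank(ascending=False),
    # Lower is better for training time
    "Training Time (s)": toy_results["Training Time (s)"].rank(ascending=True),
})
toy_ranks["Mean Rank"] = toy_ranks.mean(axis=1)

toy_winner = toy_ranks["Mean Rank"].idxmin()
print(toy_ranks.sort_values("Mean Rank"))
print(f"Lowest mean rank: {toy_winner}")
```

Here no model wins every criterion, so the outcome depends on the weighting. If missed departures matter most to the Student Success Center, ranking on recall alone could be a more defensible choice than an equal-weight average.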

# Deploy

#### **Step 17: Create Stakeholder Report**

In [None]:
# Generate summary statistics for the report
print("="*80)
print("EXECUTIVE SUMMARY: EARLY WARNING SYSTEM MODEL SELECTION")
print("="*80)

# Your code to generate the executive summary
...

#### **Step 18: Produce a Comprehensive Report on Your Findings**

### Deliverable: Written Report for Stakeholders

Using the analyses above, write a comprehensive report that addresses the following:

1. **Model Comparison Summary**: Create a table comparing all five models across key metrics. Which model performed best overall? Were there trade-offs between different metrics?

2. **Feature Importance Analysis**: What factors are most predictive of student departure? Do different models agree on the most important features?

3. **Model Selection Rationale**: Beyond raw performance, discuss why you would recommend a particular model. Consider interpretability, training time, and maintenance burden.

4. **Implementation Recommendations**: How should the selected model be deployed? Consider threshold selection, intervention strategies, and monitoring.

5. **Limitations and Ethical Considerations**: What are the limitations of this analysis? What ethical concerns should be considered?
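
For the threshold selection mentioned in item 4, one possible approach is to pick the highest probability threshold that still meets a target recall. The sketch below uses made-up labels and probabilities, and the 0.9 recall target is a hypothetical choice that should come from a conversation with stakeholders about outreach capacity.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

toy_thr_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_thr_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05])

toy_prec, toy_rec, toy_thresholds = precision_recall_curve(toy_thr_true, toy_thr_prob)

target_recall = 0.9   # hypothetical requirement: flag at least 90% of departing students

# precision and recall have one more entry than thresholds, so drop the final point
meets_target = toy_rec[:-1] >= target_recall
chosen_threshold = toy_thresholds[meets_target].max()

toy_flagged = (toy_thr_prob >= chosen_threshold).astype(int)
print(f"Chosen threshold: {chosen_threshold}, students flagged: {toy_flagged.sum()}")
```

Lowering the threshold raises recall at the cost of flagging more students who would have returned anyway, so the number of flagged students should be checked against the staff time available for intervention.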

> **Rubric**: Your report should be 2-3 pages and include:
> - Clear summary table of model performance
> - At least 2 visualizations from your analysis
> - Specific recommendations for implementation
> - Discussion of limitations and ethical considerations

---

## Your Report (Write Below)

*[Write your comprehensive stakeholder report here]*

---