# Capstone Project 1: Predicting Third Semester Retention - Model Tournament

## Objective
This notebook walks you through the UPAD cycle (Understand, Prepare, Analyze, Deploy) to:
- Compare ALL classification models from Course 3 on the student departure prediction task
- Build regularized logistic regression, decision tree, random forest, boosted model, and neural network classifiers
- Create comprehensive comparison tables and select the best model
- Write an interpretation report for university stakeholders

# Understand

A university's Student Success Center is seeking to implement an early warning system to identify students at risk of not returning for their third semester. The center has tasked the institutional research team with developing and comparing multiple machine learning models to determine which approach will be most effective for their campus.

This "model tournament" approach is important because different models have different strengths:
- **Regularized Logistic Regression**: Highly interpretable; coefficients show the direction and relative strength of each factor's association with departure
- **Decision Trees**: Create visual, rule-based explanations that advisors can easily understand
- **Random Forests**: Ensemble methods that reduce overfitting and provide robust predictions
- **Gradient Boosting**: Often among the strongest performers on tabular data, though typically slower to train and more sensitive to tuning
- **Neural Networks**: Flexible models that can capture complex patterns

The goal is to systematically evaluate each model family and provide a recommendation for which model should be deployed in the early warning system.

### Learning Objectives

By the end of this capstone, you will be able to:
1. Build and tune five different model families for classification
2. Compare models across multiple performance metrics
3. Consider trade-offs between accuracy, interpretability, and computational cost
4. Communicate findings to non-technical stakeholders

# Prepare

## Data Wrangling

#### **Step 1: Import Libraries and Data**

We need to import all the libraries required for our five model families plus evaluation metrics.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models - All five families
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    brier_score_loss, log_loss, make_scorer
)

# Timing
import time

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("All libraries imported successfully!")

In [None]:
# Load data
data_location = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/'
df = pd.read_csv(f'{data_location}student_academics_data.csv')
print(f"Dataset shape: {df.shape}")
df.head()

#### **Step 2: Data Quality - Handle Rare Categories and Missing Values**

We follow the same data preparation steps used in the tutorial notebooks.

In [None]:
# Address Rare Classes in RACE_ETHNICITY
df['RACE_ETHNICITY'] = df['RACE_ETHNICITY'].replace(
    ['Unknown', 'Native Hawaiian or Other Pacific Islander', 'American Indian or Alaska Native'], 
    'Other'
)
print("Race/Ethnicity categories:")
print(df['RACE_ETHNICITY'].value_counts())
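
The cell above names the rare categories by hand. As a minimal sketch of a more general approach, the helper below collapses any category whose share of rows falls under a cutoff. The function name `collapse_rare_categories`, the 5% cutoff, and the toy data are illustrative assumptions, not part of the course dataset or tutorials.

```python
import pandas as pd

def collapse_rare_categories(series, min_share=0.05, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; substitute other_label for the rest
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data: C and D each make up less than 5% of rows
toy = pd.Series(["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2)
collapsed = collapse_rare_categories(toy, min_share=0.05)
print(collapsed.value_counts())
```

To avoid leakage, a cutoff like this would normally be learned from the training data only and then applied to the test data.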

In [None]:
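# Address Rare Classes in GENDER
# Normalize labels first so variants such as ' nonbinary' are caught by the filter,
# then copy the filtered frame to avoid pandas' SettingWithCopyWarning
df['GENDER'] = df['GENDER'].str.strip().str.capitalize()
df = df[df['GENDER'] != 'Nonbinary'].copy()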
print("Gender categories:")
print(df['GENDER'].value_counts())

In [None]:
# Drop noninformative features
df.drop(['SEM_1_STATUS', 'SEM_2_STATUS'], axis=1, inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Drop columns with >50% missing values
missing_values_count = df.isnull().sum()
total_rows = len(df)
columns_to_drop = missing_values_count[missing_values_count / total_rows > 0.5].index.tolist()
df.drop(columns=columns_to_drop, inplace=True)

print(f"Cleaned dataset shape: {df.shape}")
print(f"Columns dropped due to missingness: {columns_to_drop}")

#### **Step 3: Create Target Variable and Train/Test Split**

In [None]:
# Create binary target: 1 = Departed (not enrolled), 0 = Enrolled
df['DEPARTED'] = (df['SEM_3_STATUS'] != 'E').astype(int)

print("Target variable distribution:")
print(df['DEPARTED'].value_counts())
print(f"\nDeparture rate: {df['DEPARTED'].mean():.2%}")

In [None]:
# Split into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=RANDOM_STATE, stratify=df['DEPARTED'])

print(f"Training set: {train_df.shape[0]:,} students")
print(f"Testing set: {test_df.shape[0]:,} students")
print(f"\nDeparture rate (Train): {train_df['DEPARTED'].mean():.2%}")
print(f"Departure rate (Test): {test_df['DEPARTED'].mean():.2%}")
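
The `stratify` argument is what keeps the departure rate nearly identical in the training and testing sets. The sketch below illustrates the effect on synthetic labels; the 20% positive rate and array names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.2).astype(int)   # hypothetical labels, roughly 20% positives
X_demo = rng.normal(size=(1000, 3))

# Same split size and seed, with and without stratification
_, _, _, y_test_plain = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"Overall rate:      {y_demo.mean():.3f}")
print(f"Unstratified test: {y_test_plain.mean():.3f}")
print(f"Stratified test:   {y_test_strat.mean():.3f}")
```

The stratified test rate tracks the overall rate closely, while the unstratified rate can drift by chance, which matters more as the positive class gets rarer.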

#### **Step 4: Handle Missing Values and Feature Engineering**

In [None]:
def impute_missing_values(df_train, df_test):
    """Impute missing values using train statistics to prevent data leakage."""
    df_train = df_train.copy()
    df_test = df_test.copy()
    
    for col in df_train.columns:
        if df_train[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df_train[col]):  # covers all int/float widths
                median_val = df_train[col].median()
                df_train[col] = df_train[col].fillna(median_val)
                df_test[col] = df_test[col].fillna(median_val)
            else:
                mode_val = df_train[col].mode()[0]
                df_train[col] = df_train[col].fillna(mode_val)
                df_test[col] = df_test[col].fillna(mode_val)
    
    return df_train, df_test

train_df, test_df = impute_missing_values(train_df, test_df)
print("Missing values imputed successfully.")
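
The same "fit on train, apply to test" pattern is available through scikit-learn's `SimpleImputer`. This is an alternative sketch, not the method used in this notebook, and the two small frames are made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames with one numeric column
train_toy = pd.DataFrame({"GPA_1": [3.0, np.nan, 2.0, 4.0]})
test_toy = pd.DataFrame({"GPA_1": [np.nan, 1.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_toy)  # median is learned from train only
test_filled = imputer.transform(test_toy)        # the same train median is reused on test

print("Learned median:", imputer.statistics_)
print("Test after imputation:", test_filled.ravel())
```

The missing test value is filled with the training median (3.0), not with anything computed from the test set.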

In [None]:
# Feature Engineering: Create DFW rates and grade points
def create_features(df):
    df = df.copy()
    
    # DFW Rate (proportion of attempted units not completed)
    df['DFW_RATE_1'] = ((df['UNITS_ATTEMPTED_1'] - df['UNITS_COMPLETED_1']).clip(lower=0) 
                        / df['UNITS_ATTEMPTED_1'].replace(0, 1))
    df['DFW_RATE_2'] = ((df['UNITS_ATTEMPTED_2'] - df['UNITS_COMPLETED_2']).clip(lower=0) 
                        / df['UNITS_ATTEMPTED_2'].replace(0, 1))
    
    # Grade Points
    df['GRADE_POINTS_1'] = df['UNITS_ATTEMPTED_1'] * df['GPA_1']
    df['GRADE_POINTS_2'] = df['UNITS_ATTEMPTED_2'] * df['GPA_2']
    
    return df

train_df = create_features(train_df)
test_df = create_features(test_df)
print("Features created successfully.")
print(f"New columns: DFW_RATE_1, DFW_RATE_2, GRADE_POINTS_1, GRADE_POINTS_2")
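
To see how the DFW rate formula treats edge cases, the sketch below applies the same expression to four made-up students, including one with zero attempted units and one whose completed units exceed attempted units.

```python
import pandas as pd

# Hypothetical students; values are invented to exercise the edge cases
toy = pd.DataFrame({
    "UNITS_ATTEMPTED_1": [15, 12, 0, 12],
    "UNITS_COMPLETED_1": [12, 12, 0, 15],
})

# Units attempted but not completed; negative shortfalls are clipped to 0
shortfall = (toy["UNITS_ATTEMPTED_1"] - toy["UNITS_COMPLETED_1"]).clip(lower=0)
# Replace a zero denominator with 1 so the rate is 0 rather than NaN
toy["DFW_RATE_1"] = shortfall / toy["UNITS_ATTEMPTED_1"].replace(0, 1)
print(toy)
```

Only the first student has a nonzero rate (3 of 15 units); the zero-attempt and over-completion rows both resolve to 0.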

#### **Step 5: Define Feature Sets and Prepare Data for Modeling**

In [None]:
# Define feature categories
numeric_features = [
    'HS_GPA', 'HS_MATH_GPA', 'HS_ENGL_GPA',
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2',
    'UNITS_COMPLETED_1', 'UNITS_COMPLETED_2',
    'DFW_UNITS_1', 'DFW_UNITS_2',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2',
    'GRADE_POINTS_1', 'GRADE_POINTS_2'
]

categorical_features = ['RACE_ETHNICITY', 'GENDER', 'FIRST_GEN_STATUS', 'COLLEGE']

target = 'DEPARTED'

print(f"Numeric features ({len(numeric_features)}): {numeric_features[:5]}...")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

In [None]:
# One-hot encode categorical variables
train_encoded = pd.get_dummies(train_df[numeric_features + categorical_features], 
                               columns=categorical_features, drop_first=True)
test_encoded = pd.get_dummies(test_df[numeric_features + categorical_features], 
                              columns=categorical_features, drop_first=True)

# Align columns between train and test
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

# Handle any remaining missing values, using training medians for both sets to avoid leakage
train_medians = train_encoded.median()
train_encoded = train_encoded.fillna(train_medians)
test_encoded = test_encoded.fillna(train_medians)

# Prepare X and y
X_train = train_encoded
y_train = train_df[target]
X_test = test_encoded
y_test = test_df[target]

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"\nFeature columns ({len(X_train.columns)} total):")
print(X_train.columns.tolist()[:10], "...")
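
The `align(..., join='left')` call matters when the test set contains a category the training set never saw, or lacks one it did. A small sketch with invented college names shows what happens to the test columns; `dtype=int` is used here only to make the printed output easier to read.

```python
import pandas as pd

# Hypothetical data: 'Engineering' appears only in test, 'Business' and 'Science' only in train
train_toy = pd.DataFrame({"COLLEGE": ["Arts", "Business", "Science"]})
test_toy = pd.DataFrame({"COLLEGE": ["Arts", "Engineering"]})

train_enc = pd.get_dummies(train_toy, columns=["COLLEGE"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["COLLEGE"], dtype=int)

# join='left' keeps exactly the training columns; missing ones are filled with 0
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)
print(test_enc)
```

The unseen `COLLEGE_Engineering` column is dropped and the missing training columns are added as zeros, so the model always receives the feature layout it was trained on.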

In [None]:
# Scale features for models that require it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled successfully!")
print(f"Scaled mean: {X_train_scaled.mean():.6f}")
print(f"Scaled std: {X_train_scaled.std():.6f}")

#### **Step 6: Exploratory Data Analysis**

In [None]:
# Visualize departure rates by key factors
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'Departure Rate by First Gen Status',
    'Departure Rate by College',
    'GPA Distribution by Departure Status',
    'DFW Rate Distribution by Departure Status'
))

# Departure by First Gen
fg_rates = train_df.groupby('FIRST_GEN_STATUS')['DEPARTED'].mean().reset_index()
fig.add_trace(
    go.Bar(x=fg_rates['FIRST_GEN_STATUS'], y=fg_rates['DEPARTED'], 
           marker_color=['steelblue', 'coral', 'gray'][:len(fg_rates)]),
    row=1, col=1
)

# Departure by College
college_rates = train_df.groupby('COLLEGE')['DEPARTED'].mean().reset_index().sort_values('DEPARTED', ascending=False)
fig.add_trace(
    go.Bar(x=college_rates['COLLEGE'], y=college_rates['DEPARTED'], marker_color='teal'),
    row=1, col=2
)

# GPA by Departure
fig.add_trace(
    go.Histogram(x=train_df[train_df['DEPARTED']==0]['GPA_1'], name='Enrolled', opacity=0.7),
    row=2, col=1
)
fig.add_trace(
    go.Histogram(x=train_df[train_df['DEPARTED']==1]['GPA_1'], name='Departed', opacity=0.7),
    row=2, col=1
)

# DFW Rate by Departure
fig.add_trace(
    go.Histogram(x=train_df[train_df['DEPARTED']==0]['DFW_RATE_1'], name='Enrolled', opacity=0.7, showlegend=False),
    row=2, col=2
)
fig.add_trace(
    go.Histogram(x=train_df[train_df['DEPARTED']==1]['DFW_RATE_1'], name='Departed', opacity=0.7, showlegend=False),
    row=2, col=2
)

fig.update_layout(height=700, title_text="Exploratory Data Analysis: Factors Related to Student Departure")
fig.update_xaxes(tickangle=-45, row=1, col=2)
fig.show()

# Analyze

## Model Tournament: Building and Comparing All Five Model Families

We will now build one model from each of the five families covered in Course 3:
1. Regularized Logistic Regression (L2)
2. Decision Tree
3. Random Forest
4. Gradient Boosting
5. Neural Network (MLP)

In [None]:
# Dictionary to store all models and results
models = {}
training_times = {}
all_results = []

#### **Step 7: Build Regularized Logistic Regression Model**

In [None]:
# L2 Regularized Logistic Regression
print("Training L2 Regularized Logistic Regression...")
start_time = time.time()

lr_l2 = LogisticRegression(
    penalty='l2',
    C=0.1,  # Inverse of regularization strength
    solver='lbfgs',
    max_iter=1000,
    class_weight='balanced',
    random_state=RANDOM_STATE
)
lr_l2.fit(X_train_scaled, y_train)

training_times['Logistic Regression (L2)'] = time.time() - start_time
models['Logistic Regression (L2)'] = ('scaled', lr_l2)

print(f"Training completed in {training_times['Logistic Regression (L2)']:.2f} seconds")
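
The value `C=0.1` above is fixed by hand. One way to choose it from data is a cross-validated grid search, sketched below on synthetic data from `make_classification`. The grid values, fold count, and scoring metric are assumptions for illustration; putting the scaler inside the `Pipeline` means it is refit on each training fold, so no validation rows influence the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the student data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical candidate values
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print("Best C:", search.best_params_["clf__C"])
print(f"Mean cross-validated ROC-AUC: {search.best_score_:.3f}")
```

The same pattern extends to the other model families by swapping the estimator and the parameter grid.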

#### **Step 8: Build Decision Tree Model**

In [None]:
# Decision Tree with preset hyperparameters (hand-picked; no search is run in this notebook)
print("Training Decision Tree Classifier...")
start_time = time.time()

dt = DecisionTreeClassifier(
    max_depth=8,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',
    class_weight='balanced',
    random_state=RANDOM_STATE
)
dt.fit(X_train, y_train)  # No scaling needed for tree-based models

training_times['Decision Tree'] = time.time() - start_time
models['Decision Tree'] = ('unscaled', dt)

print(f"Training completed in {training_times['Decision Tree']:.2f} seconds")

#### **Step 9: Build Random Forest Model**

In [None]:
# Random Forest with preset hyperparameters (hand-picked; no search is run in this notebook)
print("Training Random Forest Classifier...")
start_time = time.time()

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    class_weight='balanced',
    n_jobs=-1,
    random_state=RANDOM_STATE
)
rf.fit(X_train, y_train)

training_times['Random Forest'] = time.time() - start_time
models['Random Forest'] = ('unscaled', rf)

print(f"Training completed in {training_times['Random Forest']:.2f} seconds")

#### **Step 10: Build Gradient Boosting Model**

In [None]:
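# Gradient Boosting Classifier
# Note: GradientBoostingClassifier has no class_weight option, so unlike the three
# models above it is trained here without class balancing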
print("Training Gradient Boosting Classifier...")
start_time = time.time()

gb = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    subsample=0.8,
    random_state=RANDOM_STATE
)
gb.fit(X_train, y_train)

training_times['Gradient Boosting'] = time.time() - start_time
models['Gradient Boosting'] = ('unscaled', gb)

print(f"Training completed in {training_times['Gradient Boosting']:.2f} seconds")
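
`GradientBoostingClassifier` does not accept `class_weight`, but its `fit` method does accept `sample_weight`. The sketch below shows one possible way to give it balancing comparable to `class_weight='balanced'`; the synthetic data and the name `gb_weighted` are assumptions, and this step is not part of the tournament as written.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced stand-in for the student data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# One weight per row, inversely proportional to the row's class frequency
weights = compute_sample_weight(class_weight="balanced", y=y_demo)

gb_weighted = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_weighted.fit(X_demo, y_demo, sample_weight=weights)

print("Total weight, class 0:", weights[y_demo == 0].sum())
print("Total weight, class 1:", weights[y_demo == 1].sum())
```

After weighting, both classes carry the same total weight, which tends to raise recall on the minority class at some cost to precision.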

#### **Step 11: Build Neural Network Model**

In [None]:
# Neural Network (MLP)
# Note: MLPClassifier has no class_weight option, so it is trained without class balancing
print("Training Neural Network (MLP) Classifier...")
start_time = time.time()

nn = MLPClassifier(
    hidden_layer_sizes=(64, 32, 16),
    activation='relu',
    solver='adam',
    alpha=0.001,  # L2 regularization
    batch_size=32,
    learning_rate='adaptive',
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=RANDOM_STATE
)
nn.fit(X_train_scaled, y_train)

training_times['Neural Network'] = time.time() - start_time
models['Neural Network'] = ('scaled', nn)

print(f"Training completed in {training_times['Neural Network']:.2f} seconds")

In [None]:
# Summary of trained models
print("="*60)
print("MODEL TRAINING SUMMARY")
print("="*60)
print(f"{'Model':<30} {'Training Time (s)':<20}")
print("-"*60)
for model_name, train_time in sorted(training_times.items(), key=lambda x: x[1]):
    print(f"{model_name:<30} {train_time:>15.3f}")
print("="*60)
print(f"Total models trained: {len(models)}")

#### **Step 12: Evaluate All Models**

In [None]:
def evaluate_model(model, X_test, y_test, model_name, scaled=False, X_test_scaled=None):
    """Comprehensive model evaluation returning multiple metrics."""
    # Select appropriate test set
    X_eval = X_test_scaled if scaled else X_test
    
    # Get predictions
    y_pred = model.predict(X_eval)
    y_prob = model.predict_proba(X_eval)[:, 1]
    
    # Calculate metrics
    metrics = {
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
        'ROC-AUC': roc_auc_score(y_test, y_prob),
        'Avg Precision': average_precision_score(y_test, y_prob),
        'Brier Score': brier_score_loss(y_test, y_prob),
        'Log Loss': log_loss(y_test, y_prob)
    }
    
    return metrics, y_pred, y_prob

In [None]:
# Evaluate all models
all_results = []
predictions = {}
probabilities = {}

for model_name, (scale_type, model) in models.items():
    scaled = (scale_type == 'scaled')
    metrics, y_pred, y_prob = evaluate_model(
        model, X_test, y_test, model_name, 
        scaled=scaled, 
        X_test_scaled=X_test_scaled
    )
    all_results.append(metrics)
    predictions[model_name] = y_pred
    probabilities[model_name] = y_prob

# Create results DataFrame
results_df = pd.DataFrame(all_results)
results_df = results_df.set_index('Model')
results_df['Training Time (s)'] = results_df.index.map(training_times)

print("Model evaluation complete!")

In [None]:
# Display comprehensive results table
print("="*100)
print("MODEL TOURNAMENT RESULTS - PERFORMANCE METRICS")
print("="*100)
display_cols = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC', 'Avg Precision', 'Training Time (s)']
print(results_df[display_cols].round(4).to_string())
print("="*100)

#### **Step 13: Visualize Model Comparison**

In [None]:
# Performance comparison bar chart
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']

fig = go.Figure()
colors = px.colors.qualitative.Set2

for i, metric in enumerate(metrics_to_plot):
    fig.add_trace(go.Bar(
        name=metric,
        x=results_df.index,
        y=results_df[metric],
        marker_color=colors[i % len(colors)]
    ))

fig.update_layout(
    title='Model Tournament: Performance Comparison Across Metrics',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    height=500,
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1),
    xaxis_tickangle=-30
)

fig.show()

In [None]:
# ROC Curve Comparison
fig = go.Figure()
colors = px.colors.qualitative.Plotly

for i, (model_name, y_prob) in enumerate(probabilities.items()):
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        mode='lines',
        name=f'{model_name} (AUC={auc:.3f})',
        line=dict(color=colors[i % len(colors)], width=2)
    ))

# Add diagonal reference line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random Classifier',
    line=dict(color='gray', dash='dash', width=1)
))

fig.update_layout(
    title='ROC Curve Comparison - All Models',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    height=600,
    legend=dict(x=0.55, y=0.05),
    xaxis=dict(constrain='domain'),
    yaxis=dict(scaleanchor='x', scaleratio=1)
)

fig.show()

In [None]:
# Precision-Recall Curve Comparison
fig = go.Figure()
colors = px.colors.qualitative.Plotly

for i, (model_name, y_prob) in enumerate(probabilities.items()):
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    ap = average_precision_score(y_test, y_prob)
    
    fig.add_trace(go.Scatter(
        x=recall, y=precision,
        mode='lines',
        name=f'{model_name} (AP={ap:.3f})',
        line=dict(color=colors[i % len(colors)], width=2)
    ))

# Add baseline
prevalence = y_test.mean()
fig.add_hline(y=prevalence, line_dash='dash', line_color='gray',
              annotation_text=f'Baseline (prevalence={prevalence:.2%})')

fig.update_layout(
    title='Precision-Recall Curve Comparison - All Models',
    xaxis_title='Recall (True Positive Rate)',
    yaxis_title='Precision (Positive Predictive Value)',
    height=600,
    legend=dict(x=0.55, y=0.95)
)

fig.show()
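
The precision-recall curve can also be used to choose a flagging threshold. As a rough sketch, the code below picks the highest threshold that still reaches a target recall. The labels, scores, and the 75% target are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted departure probabilities
y_true_toy = np.array([0, 0, 1, 0, 1, 0, 1, 1, 0, 0])
y_prob_toy = np.array([0.05, 0.10, 0.35, 0.40, 0.55, 0.20, 0.80, 0.90, 0.30, 0.15])

precision_toy, recall_toy, thresholds_toy = precision_recall_curve(y_true_toy, y_prob_toy)

target_recall = 0.75  # assumed goal: catch at least 75% of departing students
# precision/recall have one more entry than thresholds; ignore the final point
meets_target = recall_toy[:-1] >= target_recall
best_idx = np.where(meets_target)[0][-1]  # thresholds are ascending, so last = highest
chosen_threshold = thresholds_toy[best_idx]

print(f"Chosen threshold: {chosen_threshold:.2f}")
print(f"Recall at threshold: {recall_toy[best_idx]:.2f}")
print(f"Precision at threshold: {precision_toy[best_idx]:.2f}")
```

In practice the threshold would be chosen on validation data rather than the test set, and the recall target would come from the Student Success Center's capacity to follow up on flagged students.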

#### **Step 14: Confusion Matrices for All Models**

In [None]:
# Create confusion matrices for all models
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, (model_name, y_pred) in enumerate(predictions.items()):
    if i < len(axes):
        cm = confusion_matrix(y_test, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i],
                   xticklabels=['Enrolled', 'Departed'],
                   yticklabels=['Enrolled', 'Departed'])
        axes[i].set_title(f'{model_name}')
        axes[i].set_xlabel('Predicted')
        axes[i].set_ylabel('Actual')

# Hide any unused subplots
for ax in axes[len(predictions):]:
    ax.axis('off')

plt.suptitle('Confusion Matrices: All Models', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
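
Each confusion matrix contains everything needed to recompute precision and recall by hand. The sketch below does this for ten invented students and checks the results against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = Departed, 0 = Enrolled
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

recall_manual = tp / (tp + fn)      # share of actual departures that were flagged
precision_manual = tp / (tp + fp)   # share of flagged students who actually departed

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall: {recall_manual:.2f}  (sklearn: {recall_score(y_true_toy, y_pred_toy):.2f})")
print(f"Precision: {precision_manual:.2f}  (sklearn: {precision_score(y_true_toy, y_pred_toy):.2f})")
```

For an early warning system, false negatives (departing students who were not flagged) are usually the costlier error, which is why recall gets particular attention.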

#### **Step 15: Feature Importance Comparison**

In [None]:
# Get feature importances from tree-based models
feature_names = X_train.columns.tolist()

# Random Forest feature importance
rf_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': models['Random Forest'][1].feature_importances_
}).sort_values('Importance', ascending=False).head(15)

# Gradient Boosting feature importance
gb_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': models['Gradient Boosting'][1].feature_importances_
}).sort_values('Importance', ascending=False).head(15)

# Logistic Regression coefficients (absolute values)
lr_coefs = pd.DataFrame({
    'Feature': feature_names,
    'Importance': np.abs(models['Logistic Regression (L2)'][1].coef_[0])
}).sort_values('Importance', ascending=False).head(15)

In [None]:
# Plot feature importances side by side
fig = make_subplots(rows=1, cols=3, 
                    subplot_titles=('Logistic Regression', 'Random Forest', 'Gradient Boosting'))

fig.add_trace(
    go.Bar(x=lr_coefs['Importance'], y=lr_coefs['Feature'], orientation='h', marker_color='steelblue'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=rf_importance['Importance'], y=rf_importance['Feature'], orientation='h', marker_color='forestgreen'),
    row=1, col=2
)

fig.add_trace(
    go.Bar(x=gb_importance['Importance'], y=gb_importance['Feature'], orientation='h', marker_color='coral'),
    row=1, col=3
)

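fig.update_layout(
    title='Feature Importance Comparison Across Models',
    height=500,
    showlegend=False
)
# Horizontal bars are drawn bottom-up; reverse so the most important feature is on top
fig.update_yaxes(autorange='reversed')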

fig.show()
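
Absolute logistic regression coefficients and impurity-based tree importances are measured on different scales, so the three panels are not directly comparable. Permutation importance is one model-agnostic alternative: it measures how much a score drops when a feature's values are shuffled. The sketch below runs it on synthetic data; `rf_demo`, the feature labels, and the settings are assumptions, not results from the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative and 3 noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the drop in ROC-AUC
perm = permutation_importance(
    rf_demo, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.4f} +/- {perm.importances_std[idx]:.4f}")
```

Because the same procedure works for any fitted classifier, it could be applied to all five tournament models to compare importance on a common scale.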

#### **Step 16: Model Selection - Determine Tournament Winner**

In [None]:
# Rank models by different criteria
print("="*80)
print("MODEL TOURNAMENT - FINAL RANKINGS")
print("="*80)

# Best by each metric
print("\nBest Model by Each Metric:")
print("-"*60)
print(f"Best ROC-AUC: {results_df['ROC-AUC'].idxmax()} ({results_df['ROC-AUC'].max():.4f})")
print(f"Best F1 Score: {results_df['F1 Score'].idxmax()} ({results_df['F1 Score'].max():.4f})")
print(f"Best Recall: {results_df['Recall'].idxmax()} ({results_df['Recall'].max():.4f})")
print(f"Best Precision: {results_df['Precision'].idxmax()} ({results_df['Precision'].max():.4f})")
print(f"Fastest Training: {min(training_times, key=training_times.get)} ({min(training_times.values()):.3f}s)")

print("\n" + "="*80)

In [None]:
# Create comprehensive ranking table
ranking_df = results_df[['ROC-AUC', 'F1 Score', 'Recall', 'Precision']].copy()

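# Rank each metric (1 = best); method='min' gives tied models the same whole-number rank
# instead of a fractional average that astype(int) would silently truncate
for col in list(ranking_df.columns):
    ranking_df[f'{col} Rank'] = ranking_df[col].rank(ascending=False, method='min').astype(int)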

# Calculate average rank
rank_cols = [c for c in ranking_df.columns if 'Rank' in c]
ranking_df['Average Rank'] = ranking_df[rank_cols].mean(axis=1)

# Sort by average rank
ranking_df = ranking_df.sort_values('Average Rank')

print("\nOVERALL MODEL RANKINGS (Lower Average Rank = Better):")
print("="*80)
print(ranking_df[rank_cols + ['Average Rank']].to_string())
print("="*80)

tournament_winner = ranking_df['Average Rank'].idxmin()
print(f"\n*** TOURNAMENT WINNER: {tournament_winner} ***")
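
Tied scores deserve care when ranking. With the default `method='average'`, pandas assigns tied models a fractional rank, which loses information if it is later cast to an integer. The sketch below compares the two behaviours on three invented scores.

```python
import pandas as pd

# Hypothetical metric values with a tie between Model A and Model B
scores = pd.Series({"Model A": 0.80, "Model B": 0.80, "Model C": 0.75})

rank_average = scores.rank(ascending=False)               # default: ties share the mean rank
rank_min = scores.rank(ascending=False, method="min")     # ties share the best rank

print(pd.DataFrame({"score": scores, "average": rank_average, "min": rank_min}))
```

An average-rank tournament is also sensitive to which metrics are included and weights them equally, so the resulting winner is a starting point for discussion rather than a final verdict.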

# Deploy

#### **Step 17: Create Stakeholder Report**

In [None]:
# Generate summary statistics for the report
best_model_name = tournament_winner
best_model_metrics = results_df.loc[best_model_name]

print("="*80)
print("EXECUTIVE SUMMARY: EARLY WARNING SYSTEM MODEL SELECTION")
print("="*80)
print(f"""
OBJECTIVE:
Develop a predictive model to identify students at risk of not returning 
for their third semester.

APPROACH:
Evaluated five different machine learning model families:
- Regularized Logistic Regression (interpretable, coefficient-based)
- Decision Tree (visual, rule-based)
- Random Forest (ensemble of trees)
- Gradient Boosting (advanced ensemble)
- Neural Network (multilayer perceptron)

DATASET:
- Total students: {len(df):,}
- Training set: {len(train_df):,} students
- Testing set: {len(test_df):,} students
- Departure rate: {df['DEPARTED'].mean():.1%}

RECOMMENDED MODEL: {best_model_name}

PERFORMANCE METRICS:
- ROC-AUC: {best_model_metrics['ROC-AUC']:.2%}
- F1 Score: {best_model_metrics['F1 Score']:.2%}
- Recall: {best_model_metrics['Recall']:.2%}
- Precision: {best_model_metrics['Precision']:.2%}

INTERPRETATION:
- ROC-AUC of {best_model_metrics['ROC-AUC']:.3f} means that, given one randomly chosen 
  student who departed and one who stayed, the model assigns the higher risk score 
  to the departing student about {best_model_metrics['ROC-AUC']*100:.0f}% of the time.
- Recall of {best_model_metrics['Recall']:.1%} means the model correctly identifies 
  {best_model_metrics['Recall']*100:.0f}% of students who actually depart.
""")
print("="*80)
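
ROC-AUC can be read as the probability that a randomly chosen departing student receives a higher risk score than a randomly chosen retained student. The sketch below checks this by counting pairs directly on five invented students and comparing against `roc_auc_score`.

```python
from itertools import product

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = Departed) and predicted risk scores
y_true_toy = np.array([0, 0, 0, 1, 1])
y_prob_toy = np.array([0.1, 0.4, 0.6, 0.5, 0.9])

pos_scores = y_prob_toy[y_true_toy == 1]
neg_scores = y_prob_toy[y_true_toy == 0]

# Score every (departed, retained) pair: 1 if ranked correctly, 0.5 for a tie, else 0
pair_results = [
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos_scores, neg_scores)
]
pairwise_auc = np.mean(pair_results)

print(f"Pairwise estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:     {roc_auc_score(y_true_toy, y_prob_toy):.4f}")
```

Five of the six pairs are ordered correctly, so both numbers equal 5/6. Note that this is a statement about ranking, not about the share of individual students classified correctly.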

#### **Step 18: Produce a Comprehensive Report on Your Findings**

### Deliverable: Written Report for Stakeholders

Using the analyses above, write a comprehensive report that addresses the following:

1. **Model Comparison Summary**: Create a table comparing all five models across key metrics. Which model performed best overall? Were there trade-offs between different metrics?

2. **Feature Importance Analysis**: What factors are most predictive of student departure? Do different models agree on the most important features? What does this tell us about the drivers of student attrition?

3. **Model Selection Rationale**: Beyond raw performance, discuss why you would recommend a particular model. Consider:
   - Interpretability (can advisors understand and explain predictions?)
   - Training time and computational resources
   - Maintenance burden
   - Regulatory compliance (can decisions be audited?)

4. **Implementation Recommendations**: How should the selected model be deployed? Consider:
   - Threshold selection (when should a student be flagged as "at-risk"?)
   - Intervention strategies based on risk scores
   - Monitoring and retraining schedule

5. **Limitations and Ethical Considerations**: What are the limitations of this analysis? What ethical concerns should be considered when deploying predictive models for student success?

> **Rubric**: Your report should be 2-3 pages and include:
> - Clear summary table of model performance
> - At least 2 visualizations from your analysis
> - Specific recommendations for implementation
> - Discussion of limitations and ethical considerations

---

## Your Report (Write Below)

*[Write your comprehensive stakeholder report here]*

---