# Capstone Project 2: Equity Analysis - Model Performance by Student Subgroups

# Student Workbook

Welcome to your Capstone Project workbook! This notebook provides a structured outline for your equity analysis. Your task is to fill in the missing code where indicated (replace `...` with appropriate code) to complete the analysis. Good luck!

# Understand

## The Importance of Equity in Predictive Analytics

As universities increasingly adopt machine learning models for student success initiatives, it is critical to ensure these models do not perpetuate or amplify existing inequities.

This capstone project examines model performance across three key demographic dimensions:
1. **Race/Ethnicity**
2. **First Generation Status**
3. **Gender**

### Learning Objectives

By the end of this capstone, you will be able to:
1. Evaluate model performance separately for demographic subgroups
2. Identify potential sources of algorithmic bias
3. Apply fairness metrics to machine learning models
4. Develop policy recommendations for equitable model deployment

# Prepare

#### **Step 1: Import Libraries and Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report,
    average_precision_score
)

# Set random seed
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("All libraries imported successfully!")

In [None]:
# Load data
data_location = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/'
df = ...                    # Load the student academics data
print(f"Dataset shape: {df.shape}")
...                         # Display the first few rows

#### **Step 2: Data Cleaning and Preparation**

In [None]:
# Address Rare Classes in RACE_ETHNICITY
df['RACE_ETHNICITY'] = ...     # Replace rare categories with 'Other'

# Address Rare Classes in GENDER
df = ...                       # Remove Nonbinary rows
df['GENDER'] = ...             # Clean and standardize

# Drop noninformative features
...                            # Drop SEM_1_STATUS and SEM_2_STATUS

# Remove duplicates and handle missing values
...                            # Remove duplicate rows
...                            # Drop columns with >50% missing

# Create binary target
df['DEPARTED'] = ...           # Create target variable

print(f"Cleaned dataset shape: {df.shape}")
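
The rare-class step above can be approached in several ways. The following is a minimal sketch on toy data, assuming a 5% share threshold; the helper name `collapse_rare`, the threshold, and the category values are hypothetical and are not taken from the course dataset.

```python
import pandas as pd

def collapse_rare(series, min_share=0.05, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; replace the rest with other_label
    return series.where(~series.isin(rare), other_label)

# Toy data (hypothetical values, not the course dataset)
toy = pd.Series(["A"] * 60 + ["B"] * 37 + ["C"] * 2 + ["D"] * 1)
collapsed = collapse_rare(toy, min_share=0.05)
print(collapsed.value_counts())
```

Whatever threshold you choose, report it: collapsing groups hides differences among the students placed in 'Other'.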

#### **Step 3: Examine Demographic Distributions**

In [None]:
# Display demographic distributions
print("DEMOGRAPHIC DISTRIBUTIONS")
print("="*60)

print("\nRace/Ethnicity:")
print(df['RACE_ETHNICITY'].value_counts())

print("\nFirst Generation Status:")
print(df['FIRST_GEN_STATUS'].value_counts())

print("\nGender:")
print(df['GENDER'].value_counts())

In [None]:
# Visualize departure rates by demographic groups
# Your code to create a visualization showing departure rates across groups
...
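
Before plotting, it can help to tabulate the rates. With a binary target, the departure rate of a group is the mean of the target within that group. A small sketch on toy data follows; the column name `GROUP` and all values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 1 = departed, 0 = retained
toy = pd.DataFrame({
    "GROUP": ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y"],
    "DEPARTED": [1, 0, 0, 0, 1, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the share of 1s; size gives the group count
rates = toy.groupby("GROUP")["DEPARTED"].agg(rate="mean", n="size")
print(rates)
```

Keeping `n` next to each rate makes it easier to spot groups whose rate rests on very few students.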

#### **Step 4: Prepare Features and Train/Test Split**

In [None]:
# Feature engineering - create DFW rates and grade points
def create_features(df):
    df = df.copy()
    df['DFW_RATE_1'] = ...     # Calculate DFW rate for semester 1
    df['DFW_RATE_2'] = ...     # Calculate DFW rate for semester 2
    df['GRADE_POINTS_1'] = ... # Calculate grade points for semester 1
    df['GRADE_POINTS_2'] = ... # Calculate grade points for semester 2
    return df

df = create_features(df)

In [None]:
# Define features (excluding demographic features from model inputs)
numeric_features = ...         # List of numeric feature columns

categorical_features = ...     # List of non-demographic categorical features (e.g., COLLEGE)

# Store demographic columns for later analysis
demographic_cols = ['RACE_ETHNICITY', 'FIRST_GEN_STATUS', 'GENDER']

target = 'DEPARTED'

In [None]:
# Handle missing values in numeric features
for col in numeric_features:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())

# Train/Test split
train_df, test_df = ...        # Split with 80/20 and stratification

print(f"Training set: {len(train_df):,} students")
print(f"Testing set: {len(test_df):,} students")
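
Stratification keeps the share of departed students roughly equal in the training and test sets. The sketch below illustrates this on toy data with `train_test_split`; the toy frame and the names `toy_train` and `toy_test` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20% of rows belong to the positive class
toy = pd.DataFrame({
    "x": range(100),
    "DEPARTED": [1] * 20 + [0] * 80,
})

# stratify= preserves the class proportions in both partitions
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["DEPARTED"], random_state=42
)

print(f"Train departure rate: {toy_train['DEPARTED'].mean():.2f}")
print(f"Test departure rate:  {toy_test['DEPARTED'].mean():.2f}")
```

Note that stratifying on the target does not guarantee balanced demographic subgroups in the test set, which matters for the subgroup analysis later.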

In [None]:
# Prepare feature matrices (one-hot encode categorical features)
train_encoded = ...            # One-hot encode training features
test_encoded = ...             # One-hot encode test features

# Align columns: keep exactly the training columns; dummies missing from the test set are filled with 0
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

# Prepare X and y
X_train = ...                  # Training features
y_train = ...                  # Training target
X_test = ...                   # Test features
y_test = ...                   # Test target

# Scale features
scaler = StandardScaler()
X_train_scaled = ...           # Fit and transform
X_test_scaled = ...            # Transform only

print(f"X_train shape: {X_train.shape}")
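
The alignment and scaling steps above can be sketched on toy data. This example assumes one-hot encoding with `pd.get_dummies`; the toy columns `GPA` and `COLLEGE` and their values are hypothetical. It shows that `align(..., join='left')` gives the test matrix exactly the training columns, and that the scaler is fitted on training data only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 'Business' appears only in the test set
toy_train = pd.DataFrame({"GPA": [2.0, 3.0, 4.0], "COLLEGE": ["Arts", "Science", "Arts"]})
toy_test = pd.DataFrame({"GPA": [3.0, 3.5], "COLLEGE": ["Business", "Arts"]})

train_enc = pd.get_dummies(toy_train, columns=["COLLEGE"], dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["COLLEGE"], dtype=int)

# join='left' keeps the training columns: unseen test categories are dropped,
# and training categories absent from the test set are added as all-zero columns
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

# Fit on training data only, then reuse the same means and scales for the test set
toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_enc)
test_scaled = toy_scaler.transform(test_enc)

print(list(test_enc.columns))
```

Fitting the scaler on the test set as well would leak test-set statistics into preprocessing.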

# Analyze

## Part 1: Train the Best Model

#### **Step 5: Train Best-Performing Model**

In [None]:
# Train a Random Forest model
print("Training Random Forest model...")

rf_model = ...                 # Create RandomForestClassifier with balanced class weights
...                            # Fit on training data

# Get predictions
y_pred = ...                   # Get class predictions
y_prob = ...                   # Get probability of positive class

print("\nOverall Model Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")

## Part 2: Fairness Analysis by Subgroup

#### **Step 6: Define Fairness Metrics Function**

In [None]:
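def calculate_subgroup_metrics(y_true, y_pred, y_prob, group_labels, group_name):
    """
    Calculate performance metrics for each subgroup.

    y_true, y_pred, y_prob: NumPy arrays for the test set, in the same row order.
    group_labels: pandas Series of subgroup labels, in the same row order.
    group_name: name of the demographic dimension (not used in the code shown here).
    Returns a DataFrame with one row per subgroup; groups with fewer than 10 rows are skipped.
    """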
    results = []
    
    for group in group_labels.unique():
        mask = group_labels == group
        n = mask.sum()
        
        if n < 10:  # Skip groups with too few samples
            continue
            
        y_true_group = y_true[mask]
        y_pred_group = y_pred[mask]
        y_prob_group = y_prob[mask]
        
        # Calculate confusion matrix components
        tn, fp, fn, tp = confusion_matrix(y_true_group, y_pred_group, labels=[0, 1]).ravel()
        
        metrics = {
            'Group': group,
            'N': n,
            'Base Rate': ...,           # Calculate actual departure rate
            'Positive Rate': ...,       # Calculate predicted departure rate
            'Accuracy': ...,            # Calculate accuracy
            'Precision': ...,           # Calculate precision
            'Recall': ...,              # Calculate recall
            'F1 Score': ...,            # Calculate F1 score
            'ROC-AUC': ...,             # Calculate ROC-AUC (handle single-class case)
            'FPR': ...,                 # Calculate False Positive Rate
            'FNR': ...,                 # Calculate False Negative Rate
        }
        results.append(metrics)
    
    return pd.DataFrame(results)
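
The error rates requested in the function above follow directly from the confusion matrix: FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The sketch below works through both on toy arrays and shows one way to guard ROC-AUC when a group contains a single class. The helper names `safe_div` and `safe_auc` and all toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels, predictions, and probabilities (hypothetical)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 0, 0])
y_prob_toy = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.4, 0.35])

# labels=[0, 1] forces a 2x2 matrix even if one class is absent
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den > 0 else np.nan

fpr = safe_div(fp, fp + tn)  # share of actual non-departures flagged as departures
fnr = safe_div(fn, fn + tp)  # share of actual departures the model missed

def safe_auc(y_true, y_prob):
    """ROC-AUC is undefined when only one class is present; return NaN in that case."""
    return roc_auc_score(y_true, y_prob) if len(np.unique(y_true)) > 1 else np.nan

auc = safe_auc(y_true_toy, y_prob_toy)
print(f"FPR={fpr:.2f}  FNR={fnr:.2f}  ROC-AUC={auc:.3f}")
```

In this setting a false positive is a student flagged as likely to depart who stays, and a false negative is a student who departs without being flagged, so the two errors carry different costs.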

#### **Step 7: Analyze Performance by Race/Ethnicity**

In [None]:
# Get test set demographic labels
test_race = ...                # Extract race/ethnicity from test_df

# Calculate metrics by race/ethnicity
race_metrics = calculate_subgroup_metrics(
    y_test.values, y_pred, y_prob, 
    pd.Series(test_race), 'Race/Ethnicity'
)

print("MODEL PERFORMANCE BY RACE/ETHNICITY")
print("="*100)
print(race_metrics.round(4).to_string(index=False))

In [None]:
# Visualize racial equity metrics
# Your code to create visualizations
...

#### **Step 8: Analyze Performance by First Generation Status**

In [None]:
# Get test set first gen labels
test_firstgen = ...            # Extract first gen status from test_df

# Calculate metrics by first gen status
fg_metrics = calculate_subgroup_metrics(
    y_test.values, y_pred, y_prob, 
    pd.Series(test_firstgen), 'First Gen Status'
)

print("MODEL PERFORMANCE BY FIRST GENERATION STATUS")
print("="*100)
print(fg_metrics.round(4).to_string(index=False))

In [None]:
# Visualize first gen equity metrics
# Your code here
...

#### **Step 9: Analyze Performance by Gender**

In [None]:
# Get test set gender labels
test_gender = ...              # Extract gender from test_df

# Calculate metrics by gender
gender_metrics = calculate_subgroup_metrics(
    y_test.values, y_pred, y_prob, 
    pd.Series(test_gender), 'Gender'
)

print("MODEL PERFORMANCE BY GENDER")
print("="*100)
print(gender_metrics.round(4).to_string(index=False))

In [None]:
# Visualize gender equity metrics
# Your code here
...

## Part 3: Fairness Disparity Analysis

#### **Step 10: Calculate Disparity Ratios**

In [None]:
def calculate_disparity_ratios(metrics_df, metric_name, reference_group=None):
    """
    Calculate disparity ratios relative to a reference group.
    A ratio of 1.0 indicates parity.
    """
    if reference_group is None:
        # Use group with largest N as reference
        reference_group = metrics_df.loc[metrics_df['N'].idxmax(), 'Group']
    
    reference_value = ...      # Get reference group's metric value
    
    disparity = metrics_df.copy()
    disparity[f'{metric_name} Ratio'] = ...  # Calculate ratio for each group
    
    return disparity[['Group', 'N', metric_name, f'{metric_name} Ratio']], reference_group
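
As described in the docstring above, a disparity ratio divides each group's metric by the reference group's value, so the reference group itself always has a ratio of 1.0. The following sketch shows the arithmetic on a toy metrics table; the group names and numbers are invented for illustration.

```python
import pandas as pd

# Toy metrics table (hypothetical values)
toy_metrics = pd.DataFrame({
    "Group": ["A", "B", "C"],
    "N": [500, 200, 100],
    "FPR": [0.10, 0.15, 0.08],
})

# Reference group: the one with the largest N
ref_group = toy_metrics.loc[toy_metrics["N"].idxmax(), "Group"]
ref_value = toy_metrics.loc[toy_metrics["Group"] == ref_group, "FPR"].iloc[0]

# Ratio > 1 means the group's FPR is higher than the reference group's
toy_metrics["FPR Ratio"] = toy_metrics["FPR"] / ref_value
print(toy_metrics)
```

If the reference value is zero, the ratio is undefined, so check the reference group's metric before dividing.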

In [None]:
# Calculate disparity ratios for key metrics
print("FAIRNESS DISPARITY ANALYSIS")
print("="*80)

# ROC-AUC disparity by race
auc_disparity, ref = calculate_disparity_ratios(race_metrics, 'ROC-AUC')
print(f"\nROC-AUC Disparity by Race/Ethnicity (Reference: {ref})")
print(auc_disparity.round(4).to_string(index=False))

# FPR disparity by race
fpr_disparity, ref = calculate_disparity_ratios(race_metrics, 'FPR')
print(f"\nFalse Positive Rate Disparity by Race/Ethnicity (Reference: {ref})")
print(fpr_disparity.round(4).to_string(index=False))

#### **Step 11: Create Comprehensive Fairness Dashboard**

In [None]:
# Create comprehensive fairness dashboard visualization
# Your code to create a multi-panel visualization showing:
# - ROC-AUC by each demographic dimension
# - Error rates (FPR, FNR) by each demographic dimension
...

## Part 4: Subgroup-Specific Models

#### **Step 12: Train Separate Models for Key Subgroups**

In [None]:
def train_subgroup_model(train_df, test_df, subgroup_col, subgroup_val, X_cols, target):
    """
    Train a model specifically for a demographic subgroup.
    """
    # Filter to subgroup
    train_sub = ...            # Filter training data to subgroup
    test_sub = ...             # Filter test data to subgroup
    
    if len(train_sub) < 100 or len(test_sub) < 20:
        return None, None, None
    
    # Prepare features
    X_train_sub = ...          # Prepare training features
    y_train_sub = ...          # Prepare training target
    X_test_sub = ...           # Prepare test features
    y_test_sub = ...           # Prepare test target
    
    # Train model
    model = ...                # Create and train RandomForestClassifier
    
    # Evaluate
    y_pred_sub = ...           # Get predictions
    y_prob_sub = ...           # Get probabilities
    
    metrics = {
        'Subgroup': subgroup_val,
        'N_train': len(train_sub),
        'N_test': len(test_sub),
        'Accuracy': accuracy_score(y_test_sub, y_pred_sub),
        'F1 Score': f1_score(y_test_sub, y_pred_sub),
        'ROC-AUC': roc_auc_score(y_test_sub, y_prob_sub) if len(np.unique(y_test_sub)) > 1 else np.nan
    }
    
    return model, metrics, (y_test_sub, y_pred_sub, y_prob_sub)

In [None]:
# Train models for first gen vs continuing gen
print("Training subgroup-specific models...")
print("="*80)

X_cols = numeric_features
subgroup_results = []

for fg_status in ['First Generation', 'Continuing Generation']:
    model, metrics, _ = train_subgroup_model(
        train_df, test_df, 'FIRST_GEN_STATUS', fg_status, X_cols, target
    )
    if metrics:
        subgroup_results.append(metrics)
        print(f"\n{fg_status}:")
        print(f"  Training samples: {metrics['N_train']:,}")
        print(f"  Test samples: {metrics['N_test']:,}")
        print(f"  ROC-AUC: {metrics['ROC-AUC']:.4f}")

In [None]:
# Compare subgroup-specific models vs global model
print("\nCOMPARISON: SUBGROUP MODELS vs GLOBAL MODEL")
print("="*80)

# Your code to compare performance
...

# Deploy

#### **Step 13: Generate Equity Report Summary**

In [None]:
# Calculate and display summary statistics
print("="*80)
print("EQUITY ANALYSIS EXECUTIVE SUMMARY")
print("="*80)

# Your code to generate the executive summary
...

#### **Step 14: Produce Comprehensive Policy Report**

### Deliverable: Equity Analysis and Policy Recommendations Report

Using the analyses above, write a comprehensive report that addresses the following:

1. **Fairness Metrics Summary**: Create tables showing model performance metrics for each demographic subgroup.

2. **Disparity Analysis**: Calculate and interpret disparity ratios for key metrics.

3. **Bias Identification**: Identify which groups experience higher false positive or false negative rates.

4. **Root Cause Analysis**: Discuss potential reasons for performance disparities.

5. **Policy Recommendations**: Provide specific recommendations for equitable model deployment.

6. **Ethical Considerations**: Discuss the ethical implications of using predictive models for student success.

> **Rubric**: Your report should be 3-4 pages and include:
> - Summary tables of performance metrics by demographic group
> - At least 3 visualizations from your fairness analysis
> - Specific, actionable policy recommendations
> - Discussion of ethical considerations and limitations

---

## Your Report (Write Below)

*[Write your comprehensive equity analysis and policy recommendations report here]*

---