# 1.3 **Train** and Compare Regularized Logistic Regression Models

## Model Cycle: The 5 Key Steps

### 1. Build the Model: Create the pipeline with regularization.
### **2. Train the Model: Fit the model on the training data.**
### 3. Generate Predictions: Use the trained model to make predictions.
### 4. Evaluate the Model: Assess performance using evaluation metrics.
### 5. Improve the Model: Tune hyperparameters for optimal performance.

### **Table of Contents**

<div style="overflow-x: auto;">

- [Introduction](#scrollTo=intro)
- [1. Load Dependencies and Data](#scrollTo=section1)
- [2. Load the Regularized Models](#scrollTo=section2)
- [3. Train All Models](#scrollTo=section3)
- [4. Compare Coefficients](#scrollTo=section4)
  - [4.1 Coefficient Magnitudes](#scrollTo=section4_1)
  - [4.2 Feature Selection with L1](#scrollTo=section4_2)
- [5. Cross-Validation Comparison](#scrollTo=section5)
- [6. Save Trained Models](#scrollTo=section6)
- [7. Summary](#scrollTo=section7)

</div>

## Introduction

In this notebook, we train the three regularized logistic regression models we built in the previous notebook and compare their behavior:

1. How do coefficients differ across regularization types?
2. Which features does L1 regularization select?
3. How do the models perform in cross-validation?

We also compare them against the unregularized baseline from Course 2.

## 1. Load Dependencies and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import pickle
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score

pd.options.display.max_columns = None

In [None]:
# Set up file paths
root_filepath = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/'
data_filepath = f'{root_filepath}data/'
course3_models = f'{root_filepath}course_3/models/'
course2_models = f'{root_filepath}models/'

In [None]:
# Load training data
df_training = pd.read_csv(f'{data_filepath}training.csv')

# Keep the target column out of the feature matrix to avoid leakage
X_train = df_training.drop(columns=['SEM_3_STATUS'])
y_train = df_training['SEM_3_STATUS']

print(f"Training data: {X_train.shape[0]} samples")

## 2. Load the Regularized Models

In [None]:
# Load pickled models with a context manager so file handles are closed
def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

# Regularized models from Course 3
l2_model = load_model(f'{course3_models}l2_ridge_logistic_model.pkl')
l1_model = load_model(f'{course3_models}l1_lasso_logistic_model.pkl')
elasticnet_model = load_model(f'{course3_models}elasticnet_logistic_model.pkl')

# Baseline from Course 2 for comparison
baseline_model = load_model(f'{course2_models}baseline_logistic_model.pkl')

print("All models loaded successfully!")

In [None]:
# Create dictionary of all models for easy iteration
models = {
    'Baseline (No Penalty)': baseline_model,
    'L2 (Ridge)': l2_model,
    'L1 (Lasso)': l1_model,
    'ElasticNet': elasticnet_model
}

## 3. Train All Models

In [None]:
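# Train all models. fit() updates each pipeline in place, so `models`
# and `trained_models` end up referring to the same fitted objects.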
trained_models = {}

for name, model in models.items():
    print(f"Training {name}...", end=" ")
    model.fit(X_train, y_train)
    trained_models[name] = model
    print("Done!")

print("\nAll models trained successfully!")
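
A side note on the training loop: in scikit-learn, `fit` modifies the estimator in place and returns that same object. If an untouched, unfitted copy of a pipeline is needed later, `sklearn.base.clone` is one way to get it. The sketch below uses synthetic data; `template`, `X_demo`, and `y_demo` are illustrative names and are not part of this notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the course dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

template = LogisticRegression(max_iter=1000)

# clone() copies the hyperparameters only, so the template stays unfitted
fitted = clone(template).fit(X_demo, y_demo)
template_was_unfitted = not hasattr(template, 'coef_')

# Calling fit() directly returns the very same object, now fitted
same = template.fit(X_demo, y_demo)

print("Template unfitted after clone().fit():", template_was_unfitted)
print("fit() returned the same object:", same is template)
```
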

## 4. Compare Coefficients

One of the key differences between regularization types is how they affect coefficient values. Let's examine this.

### 4.1 Coefficient Magnitudes

In [None]:
# Get feature names from the preprocessor
preprocessor = trained_models['Baseline (No Penalty)'].named_steps['preprocessing']
feature_names = preprocessor.get_feature_names_out()

# Clean up feature names for display
feature_names_clean = [name.split('__')[-1] for name in feature_names]
print(f"Number of features: {len(feature_names_clean)}")
print(f"\nFeatures: {feature_names_clean}")
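
To make the name cleanup concrete: a `ColumnTransformer` prefixes each output feature with the transformer's name followed by `__`. The sketch below uses made-up feature names (the real ones depend on the pipeline built in the previous notebook). It also shows one caveat: `split('__')[-1]` truncates a feature whose own name contains `__`, while `split('__', 1)[-1]` only removes the prefix.

```python
# Hypothetical names in the style returned by get_feature_names_out()
raw_names = ['num__HS_GPA', 'cat__GENDER_Female', 'cat__FIRST__GEN_Y']

# Approach used in this notebook: keep whatever follows the last '__'
last_part = [name.split('__')[-1] for name in raw_names]

# Alternative: strip only the transformer prefix (split once)
prefix_removed = [name.split('__', 1)[-1] for name in raw_names]

print(last_part)
print(prefix_removed)
```

The two approaches agree unless a column name itself contains a double underscore.
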

In [None]:
# Extract coefficients from each model
coef_data = []

for name, model in trained_models.items():
    classifier = model.named_steps['classifier']
    coefficients = classifier.coef_[0]
    
    for feat, coef in zip(feature_names_clean, coefficients):
        coef_data.append({
            'Model': name,
            'Feature': feat,
            'Coefficient': coef
        })

coef_df = pd.DataFrame(coef_data)
coef_df.head(10)
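
The long format above (one row per model and feature) suits Plotly. For reading values side by side, a wide table can be easier. The sketch below pivots a small made-up frame with the same three columns; the numbers are illustrative only and are not results from this notebook.

```python
import pandas as pd

# Made-up coefficients in the same long layout as coef_df
coef_long = pd.DataFrame({
    'Model': ['Baseline', 'Baseline', 'L1', 'L1'],
    'Feature': ['feature_a', 'feature_b', 'feature_a', 'feature_b'],
    'Coefficient': [1.20, -0.40, 0.90, 0.00],
})

# One row per feature, one column per model
coef_wide = coef_long.pivot(index='Feature', columns='Model', values='Coefficient')
print(coef_wide)
```

The same call should work on `coef_df` as long as each (Model, Feature) pair appears only once.
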

In [None]:
# Create coefficient comparison visualization
fig = px.bar(
    coef_df, 
    x='Feature', 
    y='Coefficient', 
    color='Model',
    barmode='group',
    title='Coefficient Comparison Across Regularization Types',
    height=500
)

fig.update_xaxes(tickangle=45)
fig.update_layout(legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig.show()

In [None]:
# Compare coefficient magnitudes (L2 norm per model)
coef_norms = {}

for name, model in trained_models.items():
    classifier = model.named_steps['classifier']
    coefficients = classifier.coef_[0]
    l2_norm = np.sqrt(np.sum(coefficients**2))
    l1_norm = np.sum(np.abs(coefficients))
    coef_norms[name] = {'L2 Norm': l2_norm, 'L1 Norm': l1_norm}

norms_df = pd.DataFrame(coef_norms).T
print("Coefficient Norms (measure of model complexity):")
display(norms_df)
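
As a quick check on the two formulas used above, the sketch below computes both norms for a small made-up vector and compares them with `np.linalg.norm`. The vector is illustrative and is not taken from any of the trained models.

```python
import numpy as np

# Made-up coefficient vector
coefs = np.array([3.0, -4.0, 0.0])

# Manual formulas, as in the cell above
l2_manual = np.sqrt(np.sum(coefs**2))   # square root of the sum of squares
l1_manual = np.sum(np.abs(coefs))       # sum of absolute values

# Equivalent NumPy helpers
l2_np = np.linalg.norm(coefs)           # default is the L2 norm for vectors
l1_np = np.linalg.norm(coefs, ord=1)

print(f"L2 norm: {l2_manual} (manual) vs {l2_np} (numpy)")
print(f"L1 norm: {l1_manual} (manual) vs {l1_np} (numpy)")
```
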

**Interpretation:**
- The **baseline model** (no regularization) typically has the largest coefficient norms.
- **L2 regularization** shrinks all coefficients toward zero but rarely sets any of them to exactly zero.
- **L1 regularization** sets some coefficients to exactly zero, which also reduces the overall norm.
- **ElasticNet** combines both behaviors: shrinkage across all coefficients plus some exact zeros.
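
Why does L1 produce exact zeros while L2 does not? Logistic regression has no closed-form solution, but a simplified one-dimensional quadratic problem shows the mechanism. The sketch below is an analogy under that simplifying assumption, not what scikit-learn's solvers actually compute. The function names and the strength `lam` are illustrative.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: rescales z."""
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small values are set to exactly zero

def elastic_shrink(z, lam, l1_ratio):
    """Minimizer with both penalties: threshold first, then rescale."""
    return lasso_shrink(z, lam * l1_ratio) / (1.0 + lam * (1.0 - l1_ratio))

unpenalized = [2.0, 0.5, -0.25]  # made-up unpenalized solutions
lam = 0.5

ridge = [ridge_shrink(z, lam) for z in unpenalized]
lasso = [lasso_shrink(z, lam) for z in unpenalized]
elastic = [elastic_shrink(z, lam, 0.5) for z in unpenalized]

print("ridge:  ", [round(w, 3) for w in ridge])
print("lasso:  ", [round(w, 3) for w in lasso])
print("elastic:", [round(w, 3) for w in elastic])
```

In this toy setting, the ridge values are all smaller but none is zero, the lasso zeroes every value whose magnitude is at most `lam`, and the elastic net does some of each.
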

### 4.2 Feature Selection with L1

In [None]:
# Examine which features L1 keeps vs zeros out
l1_classifier = trained_models['L1 (Lasso)'].named_steps['classifier']
l1_coefficients = l1_classifier.coef_[0]

l1_selection = pd.DataFrame({
    'Feature': feature_names_clean,
    'Coefficient': l1_coefficients,
    'Selected': l1_coefficients != 0
}).sort_values('Coefficient', key=abs, ascending=False)

print("L1 (Lasso) Feature Selection:")
print(f"Total features: {len(l1_selection)}")
print(f"Selected (non-zero): {l1_selection['Selected'].sum()}")
print(f"Eliminated (zero): {(~l1_selection['Selected']).sum()}")
print("\nFeatures by importance:")
display(l1_selection)

In [None]:
# Visualize L1 feature selection
fig = go.Figure()

colors = ['green' if s else 'lightgray' for s in l1_selection['Selected']]

fig.add_trace(go.Bar(
    x=l1_selection['Feature'],
    y=l1_selection['Coefficient'],
    marker_color=colors
))

fig.update_layout(
    title='L1 (Lasso) Coefficients: Green = Selected, Gray = Eliminated',
    xaxis_tickangle=45,
    height=400
)

fig.show()

## 5. Cross-Validation Comparison

Let's compare model performance using cross-validation, similar to what we did in Course 2.

In [None]:
# Define cross-validation scorer for minority class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

f1_scorer = make_scorer(f1_score, pos_label='N')
precision_scorer = make_scorer(precision_score, pos_label='N')
recall_scorer = make_scorer(recall_score, pos_label='N')
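
To see what `pos_label='N'` changes, the sketch below computes precision, recall, and F1 by hand for a chosen positive label. It is a hand-rolled illustration with made-up labels, not scikit-learn's implementation. `prf_for_label` is a hypothetical helper name.

```python
def prf_for_label(y_true, y_pred, pos_label):
    """Precision, recall, and F1, treating `pos_label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 'N' = student leaves, 'Y' = student stays
y_true_demo = ['N', 'Y', 'N', 'Y', 'Y']
y_pred_demo = ['N', 'N', 'Y', 'Y', 'Y']

scores_n = prf_for_label(y_true_demo, y_pred_demo, pos_label='N')
scores_y = prf_for_label(y_true_demo, y_pred_demo, pos_label='Y')

print("Positive class 'N':", scores_n)
print("Positive class 'Y':", scores_y)
```

The same predictions score differently depending on which class is treated as positive, which is why the scorers name the minority class explicitly.
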

In [None]:
# Run cross-validation for all models
cv_results = []

for name, model in models.items():  # cross_val_score clones the model and refits it on each fold
    print(f"Cross-validating {name}...", end=" ")
    
    f1_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=f1_scorer)
    precision_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=precision_scorer)
    recall_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=recall_scorer)
    
    cv_results.append({
        'Model': name,
        'F1 Mean': f1_scores.mean(),
        'F1 Std': f1_scores.std(),
        'Precision Mean': precision_scores.mean(),
        'Precision Std': precision_scores.std(),
        'Recall Mean': recall_scores.mean(),
        'Recall Std': recall_scores.std()
    })
    print("Done!")

cv_df = pd.DataFrame(cv_results)
print("\nCross-Validation Results (Positive class: 'N' - Students who leave):")
display(cv_df)
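
The loop above calls `cross_val_score` three times per model, so every fold is fitted three times. One possible alternative is `cross_validate`, which accepts a dictionary of scorers and fits each fold once. The sketch below uses synthetic data and a plain `LogisticRegression` as stand-ins; `X_demo`, `y_demo`, and `cv_demo` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption: not the course dataset)
X_demo, y_num = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)
y_demo = np.where(y_num == 1, 'N', 'Y')  # 'N' plays the minority class

scoring = {
    'f1': make_scorer(f1_score, pos_label='N'),
    'precision': make_scorer(precision_score, pos_label='N'),
    'recall': make_scorer(recall_score, pos_label='N'),
}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One fit per fold; all three metrics are scored on that same fit
results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv_demo, scoring=scoring
)

for metric in scoring:
    fold_scores = results[f'test_{metric}']
    print(f"{metric}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Each metric's per-fold scores are returned under the key `test_<name>`, so the means and standard deviations can be collected the same way as in the loop above.
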

In [None]:
# Visualize CV results
fig = make_subplots(rows=1, cols=3, subplot_titles=('F1 Score', 'Precision', 'Recall'))

for i, metric in enumerate(['F1', 'Precision', 'Recall'], 1):
    fig.add_trace(
        go.Bar(
            x=cv_df['Model'],
            y=cv_df[f'{metric} Mean'],
            error_y=dict(type='data', array=cv_df[f'{metric} Std']),
            name=metric,
            showlegend=False
        ),
        row=1, col=i
    )

fig.update_layout(
    height=400,
    title_text='Cross-Validation Results by Regularization Type'
)
fig.update_xaxes(tickangle=45)
fig.show()

## 6. Save Trained Models

In [None]:
# Save trained models, closing each file handle after writing
for name, model in trained_models.items():
    filename = name.lower().replace(' ', '_').replace('(', '').replace(')', '')
    filepath = f'{course3_models}{filename}_trained.pkl'
    with open(filepath, 'wb') as f:
        pickle.dump(model, f)
    print(f"Saved: {filepath}")
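
The sketch below shows which filenames the naming rule produces and checks a save-and-reload round trip. It writes to a temporary directory and pickles a plain dictionary as a stand-in for a fitted pipeline; `slugify_model_name` is a hypothetical helper that mirrors the `replace` chain in the cell above.

```python
import os
import pickle
import tempfile

def slugify_model_name(name):
    """Mirror the notebook's rule: lowercase, spaces to underscores, drop parentheses."""
    return name.lower().replace(' ', '_').replace('(', '').replace(')', '')

model_names = ['Baseline (No Penalty)', 'L2 (Ridge)', 'L1 (Lasso)', 'ElasticNet']
slugs = [slugify_model_name(n) for n in model_names]
print(slugs)

# Stand-in for a fitted model (assumption: any picklable object behaves the same)
stand_in = {'name': 'L1 (Lasso)', 'coef': [0.9, 0.0, -0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, f"{slugify_model_name(stand_in['name'])}_trained.pkl")
    with open(path, 'wb') as f:
        pickle.dump(stand_in, f)
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)

print("Round trip preserved the object:", reloaded == stand_in)
```
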

## 7. Summary

In this notebook, we trained and compared regularized logistic regression models:

### Key Findings

1. **Coefficient Shrinkage**: Regularization reduces coefficient magnitudes compared to the baseline

2. **Feature Selection**: L1 (Lasso) naturally performs feature selection by zeroing out less important features

3. **Cross-Validation Performance**: Regularized models often match or exceed the baseline's cross-validated scores; the table below shows the results for this dataset

### Comparison Summary

In [None]:
# Final comparison table
summary = cv_df[['Model', 'F1 Mean', 'Precision Mean', 'Recall Mean']].copy()
summary = summary.round(3)
print("Cross-Validation Summary:")
display(summary)

### Next Steps

In the next notebook, we will:
1. Tune the regularization strength (`C` parameter) using grid search
2. Find optimal hyperparameters for each regularization type
3. Evaluate final models on the test set

**Proceed to:** `1.4 Tune Regularization Hyperparameters`