# 1.2 **Build** a Regularized Logistic Regression Model - Predict Student Departure with L1, L2, and ElasticNet

## Model Cycle: The 5 Key Steps

### **1. Build the Model : Create the pipeline with regularization.**  
### 2. Train the Model : Fit the model on the training data.  
### 3. Generate Predictions : Use the trained model to make predictions.  
### 4. Evaluate the Model : Assess performance using evaluation metrics.  
### 5. Improve the Model : Tune hyperparameters for optimal performance.

### **Table of Contents**

<div style="overflow-x: auto;">

- [Introduction](#scrollTo=intro)
- [1. Load Dependencies and Data](#scrollTo=section1)
- [2. Quick Recap: The Baseline Model](#scrollTo=section2)
- [3. Building Regularized Models](#scrollTo=section3)
  - [3.1 L2 Regularization (Ridge)](#scrollTo=section3_1)
  - [3.2 L1 Regularization (Lasso)](#scrollTo=section3_2)
  - [3.3 ElasticNet Regularization](#scrollTo=section3_3)
- [4. Save Models for Future Use](#scrollTo=section4)
- [5. Summary](#scrollTo=section5)

</div>

## Introduction

In this notebook, we build upon the baseline logistic regression model from Course 2 by adding regularization. We will create three regularized variants:

1. **L2 (Ridge)**: Shrinks all coefficients, handles multicollinearity
2. **L1 (Lasso)**: Performs feature selection by zeroing coefficients
3. **ElasticNet**: Combines L1 and L2 benefits

We use the same preprocessing pipeline and data from Course 2, allowing direct comparison between models.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Build regularized logistic regression pipelines in scikit-learn
2. Understand the key hyperparameters (`C`, `penalty`, `l1_ratio`)
3. Create multiple model variants for comparison

## 1. Load Dependencies and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import pickle

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

pd.options.display.max_columns = None

In [None]:
# Set up file paths - using Course 2 data
root_filepath = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/'
data_filepath = f'{root_filepath}data/'
course3_filepath = f'{root_filepath}course_3/'

In [None]:
# Load training data
df_training = pd.read_csv(f'{data_filepath}training.csv')

print(f"Training data shape: {df_training.shape}")
print("\nTarget distribution:")
print(df_training['SEM_3_STATUS'].value_counts(normalize=True))

In [None]:
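# Define feature matrix and target
# The preprocessor selects feature columns by name (remainder='drop'),
# but we drop the target explicitly so it can never leak into the features.
X_train = df_training.drop(columns=['SEM_3_STATUS'])
y_train = df_training['SEM_3_STATUS']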

## 2. Quick Recap: The Baseline Model

In Course 2, we built a baseline logistic regression model without regularization (`penalty=None`). Here's a quick recap of our preprocessing setup:

In [None]:
# Feature groupings from Course 2

# Columns with bounded values (e.g., 0–4 GPA or 0–1 ratios)
minmax_columns = [
    'HS_GPA',
    'GPA_1', 'GPA_2',
    'DFW_RATE_1', 'DFW_RATE_2'
]

# Columns with larger, more variable values
standard_columns = [
    'UNITS_ATTEMPTED_1', 'UNITS_ATTEMPTED_2',
]

# Categorical columns for one-hot encoding
categorical_columns = [
    'GENDER',
    'RACE_ETHNICITY',
    'FIRST_GEN_STATUS',
]

In [None]:
# Preprocessing transformer (same as Course 2)
preprocessor = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(), minmax_columns),
        ('standard', StandardScaler(), standard_columns),
        ('onehot', OneHotEncoder(handle_unknown='ignore', 
                                  drop=['Female', 'Other', 'Unknown'], 
                                  sparse_output=False), categorical_columns)
    ],
    remainder='drop'
)

print("Preprocessor configured successfully.")

## 3. Building Regularized Models

Now we create three regularized versions of our logistic regression model. Each uses the same preprocessing but with different regularization settings.

### Key Hyperparameters

| Parameter | Description | Values |
|:----------|:------------|:-------|
| `penalty` | Type of regularization | 'l1', 'l2', 'elasticnet', None |
| `C` | Inverse of regularization strength | float > 0 (default=1.0) |
| `solver` | Optimization algorithm | 'lbfgs' (default; L2 or none), 'liblinear' (L1 or L2), 'saga' (L1, L2, ElasticNet, or none) |
| `l1_ratio` | ElasticNet mixing (1=L1, 0=L2) | float between 0 and 1 |
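
Because `C` is the inverse of regularization strength, smaller values mean a stronger penalty. The sketch below illustrates this on synthetic data from `make_classification` (an assumption made for illustration; it is not the student dataset) by fitting the same L2-penalized model at three values of `C` and comparing the size of the fitted coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only (not the student dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=42
)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # No penalty argument is passed, so scikit-learn's default (L2) applies
    clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    clf.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: L2 norm of coefficients = {coef_norms[C]:.3f}")
```

The coefficient norm should be smallest at `C=0.01`, where the penalty dominates, and grow as `C` increases and the penalty fades.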

### 3.1 L2 Regularization (Ridge)

L2 is scikit-learn's default penalty for `LogisticRegression`. It shrinks all coefficients toward zero but does not set any of them exactly to zero.

In [None]:
# L2 (Ridge) Regularized Logistic Regression
l2_logistic_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(
        penalty='l2',           # L2 regularization
        C=1.0,                  # Inverse regularization strength (will tune later)
        class_weight='balanced', # Handle class imbalance
        solver='lbfgs',         # Standard solver for L2
        max_iter=1000,
        random_state=42
    ))
])

print("L2 (Ridge) Logistic Regression Model:")
l2_logistic_model

### 3.2 L1 Regularization (Lasso)

L1 regularization can shrink coefficients to exactly zero, effectively performing feature selection. The default 'lbfgs' solver does not support the L1 penalty, so we switch to 'saga' ('liblinear' is the other solver that supports L1).

In [None]:
# L1 (Lasso) Regularized Logistic Regression
l1_logistic_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(
        penalty='l1',           # L1 regularization
        C=1.0,                  # Inverse regularization strength (will tune later)
        class_weight='balanced', # Handle class imbalance
        solver='saga',          # Supports L1 (the default 'lbfgs' does not)
        max_iter=1000,
        random_state=42
    ))
])

print("L1 (Lasso) Logistic Regression Model:")
l1_logistic_model
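
The sparsity effect of L1 is easiest to see by fitting. The sketch below uses synthetic data in which only 3 of 20 features carry signal (an assumption for illustration, not the student dataset) and a deliberately strong penalty (`C=0.05`, chosen only to make the effect visible). It counts how many coefficients each penalty sets to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features, 17 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0,
    random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)  # saga works best on scaled data

zero_counts = {}
for penalty in ['l1', 'l2']:
    clf = LogisticRegression(
        penalty=penalty, C=0.05, solver='saga',
        max_iter=5000, random_state=42
    )
    clf.fit(X_demo, y_demo)
    # Count coefficients that are exactly zero
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(f"{penalty}: {zero_counts[penalty]} of {clf.coef_.size} coefficients are exactly zero")
```

With L2, coefficients become small but stay nonzero; with L1, many of the noise features are typically removed from the model altogether.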

### 3.3 ElasticNet Regularization

ElasticNet combines L1 and L2 penalties. The `l1_ratio` parameter controls the mix:
- `l1_ratio=1.0`: Pure L1 (Lasso)
- `l1_ratio=0.0`: Pure L2 (Ridge)  
- `l1_ratio=0.5`: Equal mix of both

In [None]:
# ElasticNet Regularized Logistic Regression
elasticnet_logistic_model = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(
        penalty='elasticnet',   # ElasticNet regularization
        C=1.0,                  # Inverse regularization strength (will tune later)
        l1_ratio=0.5,           # 50% L1, 50% L2
        class_weight='balanced', # Handle class imbalance
        solver='saga',          # Required for ElasticNet
        max_iter=1000,
        random_state=42
    ))
])

print("ElasticNet Logistic Regression Model:")
elasticnet_logistic_model
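
To get a feel for how `l1_ratio` moves the model between Ridge-like and Lasso-like behavior, the following sketch sweeps three values on synthetic data (again an illustrative assumption, not the student dataset, with `C=0.05` chosen only to make the effect visible) and counts the coefficients that end up exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features, 17 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0,
    random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

zeros_by_ratio = {}
for l1_ratio in [0.0, 0.5, 1.0]:
    clf = LogisticRegression(
        penalty='elasticnet', l1_ratio=l1_ratio, C=0.05,
        solver='saga', max_iter=5000, random_state=42
    )
    clf.fit(X_demo, y_demo)
    # Count coefficients that are exactly zero at this mix
    zeros_by_ratio[l1_ratio] = int(np.sum(clf.coef_ == 0))
    print(f"l1_ratio={l1_ratio}: {zeros_by_ratio[l1_ratio]} zero coefficients")
```

Intermediate values of `l1_ratio` trade off the two behaviors, which is why it is usually tuned together with `C` rather than fixed in advance.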

### Model Comparison Summary

In [None]:
# Summary of models we've built
models = {
    'L2 (Ridge)': l2_logistic_model,
    'L1 (Lasso)': l1_logistic_model,
    'ElasticNet': elasticnet_logistic_model
}

print("Models Built:")
print("="*50)
for name, model in models.items():
    classifier = model.named_steps['classifier']
    print(f"\n{name}:")
    print(f"  - penalty: {classifier.penalty}")
    print(f"  - C (inverse regularization strength): {classifier.C}")
    print(f"  - solver: {classifier.solver}")
    if hasattr(classifier, 'l1_ratio') and classifier.l1_ratio is not None:
        print(f"  - l1_ratio: {classifier.l1_ratio}")

## 4. Save Models for Future Use

We save these untrained model pipelines so they can be loaded and trained in subsequent notebooks.

In [None]:
# Create models directory for Course 3 if it doesn't exist
import os
models_path = f'{course3_filepath}models/'
os.makedirs(models_path, exist_ok=True)

# Save each model pipeline
for name, model in models.items():
    filename = name.lower().replace(' ', '_').replace('(', '').replace(')', '')
    filepath = f'{models_path}{filename}_logistic_model.pkl'
    with open(filepath, 'wb') as f:  # context manager closes the file handle
        pickle.dump(model, f)
    print(f"Saved: {filepath}")
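
For reference, a saved pipeline can be read back with `pickle.load`. The sketch below is self-contained: it round-trips a small stand-in pipeline through a temporary directory, and the file name `demo_logistic_model.pkl` is hypothetical. In a later notebook you would point `open` at one of the paths printed above instead.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for one of the untrained pipelines built in this notebook
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(penalty='l1', solver='saga', C=1.0)),
])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_logistic_model.pkl')  # hypothetical name

    # Save, then load the pipeline back from disk
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_pipeline, f)
    with open(demo_path, 'rb') as f:
        loaded_pipeline = pickle.load(f)

# The loaded object keeps its configuration and is still unfitted
loaded_clf = loaded_pipeline.named_steps['classifier']
print(f"solver={loaded_clf.solver}, C={loaded_clf.C}")
print(f"fitted: {hasattr(loaded_clf, 'coef_')}")
```

Pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that saved them, and only from files you trust.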

## 5. Summary

In this notebook, we built three regularized logistic regression models:

| Model | Key Characteristics | Best For |
|:------|:--------------------|:---------|
| **L2 (Ridge)** | Shrinks all coefficients, stable | Many small effects |
| **L1 (Lasso)** | Zeros out coefficients | Feature selection |
| **ElasticNet** | Combines L1 + L2 | Correlated features |

### Key Points

1. All models use `class_weight='balanced'` to handle class imbalance
2. ElasticNet requires `solver='saga'`; L1 works with `'saga'` or `'liblinear'` (we use `'saga'` for both)
3. The `C` parameter is the inverse of regularization strength
4. We're using the default `C=1.0` for now; we'll tune it later

### Next Steps

In the next notebook, we will:
1. Train these models on our data
2. Compare coefficient values across regularization types
3. Examine which features are selected by L1/ElasticNet

**Proceed to:** `1.3 Train and Compare Regularized Logistic Regression Models`