![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 10. Regularized Generalized Linear Model in Python

A Regularized Generalized Linear Model (GLM) extends the traditional GLM framework by incorporating regularization techniques to prevent overfitting and improve model generalization. Regularization is particularly useful when dealing with high-dimensional data or when the model is prone to overfitting due to multicollinearity, complex interactions, or a limited number of observations relative to the number of predictors. In Python, the `scikit-learn` library provides robust tools for fitting regularized GLMs through classes such as `Lasso`, `Ridge`, and `ElasticNet` for linear regression and `LogisticRegression` for classification, which together cover Lasso (L1), Ridge (L2), and Elastic Net regularization. Note that plain `LinearRegression` fits ordinary least squares without any penalty.

## Overview

In a standard GLM, the model assumes that the outcome $y$ is drawn from an exponential family distribution (e.g., normal, binomial, Poisson). The relationship between the predictor variables $\mathbf{X}$ and the expected value of $y$ is defined through a link function $g$ as follows:

$$
g(\mathbb{E}[y]) = \mathbf{X} \beta
$$

where:
- $\mathbf{X}$ is the design matrix containing $p$ predictor variables,
- $\beta$ is the vector of coefficients.

The coefficients $\beta$ are estimated by maximizing the likelihood function.

### Regularization in GLM

In a Regularized GLM, a penalty term is added to the objective function, which discourages large coefficient values and helps prevent overfitting. Three common types of regularization are **Lasso** (L1), **Ridge** (L2), and **Elastic Net** regularization:

1. **Lasso (L1) Regularization**: This adds a penalty proportional to the sum of the absolute values of the coefficients. Lasso encourages sparsity, meaning it often shrinks some coefficients to exactly zero. The objective function for Lasso regularization is given by:

 $$
 \text{minimize} \quad - \log L(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|
 $$

2. **Ridge (L2) Regularization**: This adds a penalty proportional to the sum of the squared values of the coefficients. Ridge shrinks all coefficients toward zero but, unlike Lasso, rarely sets any of them exactly to zero. The objective function for Ridge regularization is:

  $$
  \text{minimize} \quad - \log L(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2
  $$

3. **Elastic Net Regularization**: Elastic Net is a combination of Lasso and Ridge regularization, balancing both L1 and L2 penalties. It’s especially useful when there are correlated predictors. The objective function for Elastic Net is:

  $$
  \text{minimize} \quad - \log L(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
  $$
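The three penalty terms above can be computed directly, which may help make the formulas concrete. The following is a minimal sketch using NumPy; the function names and the example coefficient vector are illustrative assumptions, not part of any library.

```python
import numpy as np

def l1_penalty(beta):
    """Sum of absolute coefficient values (Lasso penalty without lambda)."""
    return np.sum(np.abs(beta))

def l2_penalty(beta):
    """Sum of squared coefficient values (Ridge penalty without lambda)."""
    return np.sum(beta ** 2)

def elastic_net_penalty(beta, lam1, lam2):
    """Weighted sum of the L1 and L2 penalties."""
    return lam1 * l1_penalty(beta) + lam2 * l2_penalty(beta)

beta = np.array([0.5, -2.0, 0.0, 1.5])  # illustrative coefficients

print(l1_penalty(beta))   # 0.5 + 2.0 + 0.0 + 1.5 = 4.0
print(l2_penalty(beta))   # 0.25 + 4.0 + 0.0 + 2.25 = 6.5
print(elastic_net_penalty(beta, lam1=0.1, lam2=0.01))  # approximately 0.465
```

Note how the single large coefficient (-2.0) dominates the L2 penalty much more than the L1 penalty, which is why Ridge is especially harsh on large coefficients.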

## Objective Function with Regularization

The objective function with regularization combines the primary goal of the model (such as minimizing error or maximizing likelihood) with a penalty term to discourage overfitting. This penalty imposes constraints on the model's parameters, typically to ensure simplicity, prevent large coefficients, or enforce sparsity.

For a Generalized Linear Model (GLM), the regularized objective function can be written as:

$$
\text{minimize} \quad - \log L(\beta) + \lambda P
$$

where:
- $-\log L(\beta)$ is the negative log-likelihood for the GLM,
- $P$ is the regularization penalty (e.g., L1, L2, or a combination),
- $\lambda$ is the regularization parameter that controls the strength of the penalty.

The regularization parameter $\lambda$ is typically chosen through cross-validation to optimize the model’s performance on unseen data.
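As a rough illustration of how the likelihood and the penalty combine, the sketch below evaluates the penalized objective for a Poisson GLM with a log link. The Poisson family, the simulated data, and all function names are assumptions chosen for the example; the constant $\log(y!)$ term is dropped because it does not depend on $\beta$.

```python
import numpy as np

def poisson_neg_log_likelihood(beta, X, y):
    """Poisson negative log-likelihood with log link, omitting the log(y!) constant."""
    eta = X @ beta                      # linear predictor
    return np.sum(np.exp(eta) - y * eta)

def l2_penalty(beta):
    """Ridge penalty P = sum of squared coefficients."""
    return np.sum(beta ** 2)

def penalized_objective(beta, X, y, lam, penalty):
    """Negative log-likelihood plus lambda times the penalty P."""
    return poisson_neg_log_likelihood(beta, X, y) + lam * penalty(beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([0.3, -0.2, 0.0])   # illustrative coefficients
y = rng.poisson(np.exp(X @ true_beta))

# The same beta is charged more as lambda grows
for lam in (0.0, 1.0, 10.0):
    print(lam, penalized_objective(true_beta, X, y, lam, l2_penalty))
```

With $\lambda = 0$ the objective reduces to the unpenalized negative log-likelihood; each unit increase in $\lambda$ adds $P$ to the value.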

## Optimization of the Generalized Linear Model (GLM)

Optimization of a Generalized Linear Model (GLM) refers to the process of finding the coefficients that minimize the loss function, which quantifies the difference between the predicted values and the actual values in the training data. In a regularized GLM, we still aim to find the parameters that maximize the likelihood function, but the loss also includes a **penalty** (regularization) term that discourages large coefficients and thereby helps prevent overfitting. Let's break down three common approaches to fitting regularized GLMs: **Maximum Likelihood Estimation (MLE)**, **Gradient Descent (GD)**, and **Coordinate Descent**. Each can be combined with regularization terms like **Lasso (L1)**, **Ridge (L2)**, and **Elastic Net**.

For the Gaussian family (linear regression), the loss is typically mean squared error (MSE), and the objective simplifies accordingly.

### 1. **Maximum Likelihood Estimation (MLE)**

**Maximum Likelihood Estimation (MLE)** is a fundamental method used in statistical modeling to estimate the parameters of a model. In the context of GLMs, the goal is to **maximize the likelihood** of observing the data given a set of model parameters (coefficients).

***For a GLM***:
Given the data $\{\mathbf{X}, y\}$, where $\mathbf{X}$ is the matrix of feature vectors and $y$ is the vector of responses, we want to estimate the coefficients $\beta$ that maximize the likelihood function $L(\beta)$.

For a GLM with **Gaussian distribution** (continuous outcome), the likelihood assumes normally distributed errors:

$$
L(\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{X}_i \beta)^2}{2\sigma^2} \right)
$$

The **log-likelihood** is:

$$
\log L(\beta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{X}_i \beta)^2
$$

Maximizing this is equivalent to minimizing the mean squared error.
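The equivalence can be checked numerically. The sketch below, on simulated data with $\sigma^2$ fixed at 1 (both assumptions for illustration), evaluates the Gaussian log-likelihood at the least-squares solution and at a slightly perturbed coefficient vector.

```python
import numpy as np

def gaussian_log_likelihood(beta, X, y, sigma2=1.0):
    """Gaussian log-likelihood for a linear model with known variance sigma2."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=100)

# Least-squares solution (minimizes the sum of squared errors)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

ll_ols = gaussian_log_likelihood(beta_ols, X, y)
ll_shifted = gaussian_log_likelihood(beta_ols + 0.1, X, y)  # move away from the optimum

print(ll_ols, ll_shifted)
```

Because the log-likelihood depends on $\beta$ only through the sum of squared residuals, any move away from the least-squares solution lowers it.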

***Regularized MLE***:
In **regularized GLMs**, we subtract a penalty term from the log-likelihood to avoid overfitting. Writing the penalty as $\lambda R(\beta)$, common choices of $R(\beta)$ are:
- **Ridge (L2 regularization)**: $R(\beta) = \sum_{j=1}^{p} \beta_j^2$
- **Lasso (L1 regularization)**: $R(\beta) = \sum_{j=1}^{p} |\beta_j|$
- **Elastic Net**: A weighted combination of the L1 and L2 penalties.

Thus, the **regularized log-likelihood** becomes:

$$
\log L(\beta) - \lambda R(\beta)
$$

The goal is to **maximize** this objective function with respect to $\beta$, which is the same as minimizing the penalized negative log-likelihood. In practice, this is typically done numerically, since a closed-form solution generally does not exist once a penalty is applied. The main exception is Ridge with a Gaussian response, which has a closed-form solution.
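For the Gaussian Ridge case, the closed-form solution can be written in a few lines. This is a minimal sketch that assumes no intercept and the objective $\|y - \mathbf{X}\beta\|^2 + \lambda \|\beta\|^2$, so that constants such as $2\sigma^2$ are absorbed into $\lambda$; the function name and data are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 by solving (X'X + lam*I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

for lam in (0.0, 1.0, 10.0, 100.0):
    b = ridge_closed_form(X, y, lam)
    print(lam, b, np.linalg.norm(b))  # the coefficient norm shrinks as lam grows
```

Using `np.linalg.solve` rather than forming an explicit inverse is the usual choice for numerical stability.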

### 2. **Gradient Descent (GD)**

**Gradient Descent** is a first-order optimization algorithm that iteratively adjusts the parameters in the direction of the steepest decrease of the objective function. This is used to **minimize** a loss function, which in the case of regularized GLMs includes both the **negative log-likelihood** and the regularization term.

**Steps in Gradient Descent**:
- The **objective function** for regularized GLMs is the **negative log-likelihood** plus the regularization term.
  $$
  \mathcal{L}(\beta) = -\log L(\beta) + \lambda R(\beta)
  $$
  where $L(\beta)$ is the likelihood function, and $R(\beta)$ is the regularization term.
- **Gradient**: We compute the gradient (the derivative) of the objective function with respect to the parameters $\beta$. For the Gaussian family, with the negative log-likelihood rescaled to $\frac{1}{2n} \sum_{i=1}^{n} (y_i - \mathbf{X}_i \beta)^2$ (constants and $\sigma^2$ absorbed into $\lambda$), the gradient is:
  $$
  \nabla \mathcal{L}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{X}_i^T (\mathbf{X}_i \beta - y_i) + \lambda \nabla R(\beta)
  $$
  where $\nabla R(\beta)$ is the gradient of the regularization term: $\nabla R(\beta) = 2\beta$ for Ridge. For Lasso, $R(\beta)$ is not differentiable at zero, so the subgradient $\text{sign}(\beta)$ is used instead.
- **Update Rule**: Using the gradient, we update the parameters in the direction of the negative gradient:
  $$
  \beta^{(t+1)} = \beta^{(t)} - \eta \nabla \mathcal{L}(\beta^{(t)})
  $$
  where $\eta$ is the **learning rate** that determines the size of each step.
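The steps above can be sketched as batch gradient descent for a Ridge-penalized Gaussian model. The sketch assumes the rescaled objective $\frac{1}{2n}\|y - \mathbf{X}\beta\|^2 + \lambda \|\beta\|^2$, no intercept, and a fixed learning rate; the function name, step size, and data are illustrative choices.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, eta=0.1, n_iter=2000):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||^2 by batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Gradient of the loss term plus lam times the Ridge gradient 2*beta
        grad = X.T @ (X @ beta - y) / n + lam * 2.0 * beta
        beta = beta - eta * grad
    return beta

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

lam = 0.1
beta_gd = ridge_gradient_descent(X, y, lam)

# Closed-form minimizer of the same objective, used only as a check
beta_exact = np.linalg.solve(X.T @ X / n + 2 * lam * np.eye(3), X.T @ y / n)
print(beta_gd)
print(beta_exact)
```

On well-conditioned data like this the two results agree closely; with poorly scaled features a smaller learning rate or feature standardization would likely be needed.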

***Variants***:
- **Batch Gradient Descent**: Uses the entire dataset to compute the gradient.
- **Stochastic Gradient Descent (SGD)**: Uses one sample at a time to compute the gradient, which can be more efficient for large datasets.
- **Mini-batch Gradient Descent**: A compromise between Batch GD and SGD, using a small random subset of the data to compute the gradient.

***Pros***:
- **Flexibility**: Works with any regularization term (L1, L2, or Elastic Net).
- **Scalability**: Works well for large datasets, especially with **SGD**.

***Cons***:
- **Slow Convergence**: Can take many iterations to converge, especially if the learning rate is not chosen well.
- **Requires tuning**: Needs careful tuning of the learning rate and regularization parameter $\lambda$.

### 3. **Coordinate Descent**

**Coordinate Descent** is an optimization algorithm that updates one parameter (or coordinate) at a time, while keeping all other parameters fixed. This method is particularly efficient for models with **Lasso (L1)** regularization and is used to solve the **Lasso regression** problem, which involves **sparse solutions** (many coefficients becoming zero).

**Steps in Coordinate Descent**:
- The **objective function** for regularized GLMs is the same as in Gradient Descent:
  $$
  \mathcal{L}(\beta) = -\log L(\beta) + \lambda R(\beta)
  $$
   with the goal to minimize the negative log-likelihood and apply regularization.
- In **Coordinate Descent**, we update one coefficient at a time, while holding the others fixed. The update rule for each coefficient $\beta_j$ is:
  $$
  \beta_j^{(t+1)} = \text{soft-threshold}( \hat{\beta}_j, \lambda)
  $$
  where $\hat{\beta}_j$ is the coefficient obtained by solving the one-dimensional subproblem for $\beta_j$ without the penalty (a least-squares fit of the partial residual on predictor $j$, assuming standardized predictors), and **soft-thresholding** applies the Lasso (L1) penalty:
  $$
  \text{soft-threshold}(z, \lambda) = \text{sign}(z) \cdot \max(|z| - \lambda, 0)
  $$
  This operation shrinks the coefficient $\beta_j$ toward zero and sets it exactly to zero if $|\hat{\beta}_j| \le \lambda$.
- **Update**: For each coordinate $j$, we compute the residuals and update $\beta_j$ to minimize the objective with respect to that coordinate.
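A compact version of this procedure for the Lasso with a Gaussian response might look as follows. It is a sketch under stated assumptions: the objective is $\frac{1}{2n}\|y - \mathbf{X}\beta\|^2 + \lambda \|\beta\|_1$, there is no intercept, and a fixed number of sweeps replaces a convergence check. Dividing by the column scale handles predictors that are not exactly standardized. Function names and data are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return 0 when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = np.sum(X ** 2, axis=0) / n   # equals 1 for standardized columns
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove every feature's contribution except feature j
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

lam = 0.2
beta_hat = lasso_coordinate_descent(X, y, lam)
print(beta_hat)
```

This objective uses the same scaling as scikit-learn's `Lasso`, so the result should be comparable to `Lasso(alpha=lam, fit_intercept=False)` up to solver tolerance.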

***Pros***:
- **Efficiency**: Especially efficient for **Lasso** (L1 regularization) and **Elastic Net**.
- **Sparsity**: It leads to sparse solutions where many coefficients become exactly zero (important for feature selection).
- **Simple to implement**.

***Cons***:
- **Inefficient for Ridge**: Coordinate Descent is not as efficient for **Ridge** (L2) regularization since it doesn’t exploit the closed-form solution.
- **Convergence Speed**: It may take many iterations to converge, especially when the coefficients are not sparse or the regularization is small.

### Summary Comparison of the Methods

| Method | Description | Best For | Pros | Cons |
|---------------|---------------|---------------|---------------|---------------|
| **Maximum Likelihood Estimation (MLE)** | Maximizes the likelihood function, adding regularization as a penalty term | Defining the objective for any GLM family; Gaussian Ridge can be solved directly | Statistically well-founded estimates | No closed-form solution with L1 regularization, so a numerical optimizer is required |
| **Gradient Descent (GD)** | Iterative optimization using the gradient of the objective function | General GLMs with regularization | Flexible for various regularization terms | Can be slow, requires tuning of learning rate |
| **Coordinate Descent** | Updates coefficients one by one, often used for Lasso (L1) | Lasso, Elastic Net | Efficient for sparse models, leads to sparse solutions | Slow for Ridge, needs many iterations for convergence |

### Benefits of Regularized GLM

Regularized GLMs provide several advantages:
- **Improved Generalization**: By penalizing large coefficients, regularized GLMs tend to generalize better to new data.
- **Handling Multicollinearity**: Regularization can stabilize estimates when predictors are highly correlated.
- **Sparse Solutions**: L1 regularization promotes sparsity, which can result in simpler models by setting some coefficients to zero, making interpretation easier.

Regularized GLMs are a powerful tool for modeling complex data, especially in high-dimensional settings, by balancing model complexity with predictive performance.

::: callout-note
In machine learning, a **loss function** is a mathematical function that measures the difference between the predicted output of a model and the actual output given a set of input data. The loss function quantifies how well the model performs and provides a way to optimize the model's parameters during training.

The goal of machine learning is to minimize the loss function, which represents the error between the predicted output of the model and the actual output. By minimizing the loss function, the model learns to make better predictions on new data that it has not seen before.
:::

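The following table describes the type of penalized model that results based on the values specified for the `lambda` and `alpha` options (adapted from H2O documentation, where `lambda` is the regularization strength and `alpha` is the mixing parameter). Note that `scikit-learn` names these differently: its `alpha` is the regularization strength (H2O's `lambda`), and its `l1_ratio` is the mixing parameter (H2O's `alpha`).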

| `lambda` value | `alpha` value | Result                          |
|----------------|---------------|---------------------------------|
| `lambda` == 0  | `alpha` = any value | No regularization. `alpha` is ignored. |
| `lambda` > 0   | `alpha` == 0 | Ridge Regression               |
| `lambda` > 0   | `alpha` == 1 | LASSO                          |
| `lambda` > 0   | 0 < `alpha` < 1 | Elastic Net Penalty            |
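To connect the table to `scikit-learn`, the sketch below maps the H2O-style names onto `ElasticNet` keyword arguments and checks that a mixing parameter of 1 reproduces the `Lasso` class. The helper `h2o_to_sklearn` is hypothetical and only renames arguments; whether H2O scales the loss and penalty identically to `scikit-learn` is not verified here.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def h2o_to_sklearn(lambda_, alpha):
    """Hypothetical helper: rename H2O-style options to scikit-learn keyword arguments."""
    return {"alpha": lambda_, "l1_ratio": alpha}

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=100)

# lambda > 0 with mixing parameter 1 corresponds to the LASSO row of the table
enet_as_lasso = ElasticNet(**h2o_to_sklearn(lambda_=0.1, alpha=1.0)).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(enet_as_lasso.coef_)
print(lasso.coef_)
```

For the `lambda == 0` row, the `scikit-learn` documentation advises using `LinearRegression` rather than `ElasticNet(alpha=0)`.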

## Regularized GLM Models in Python

In Python, the `scikit-learn` library provides a powerful and efficient way to fit Regularized Generalized Linear Models (GLMs) for Gaussian outcomes (linear regression) using classes like `Lasso`, `Ridge`, and `ElasticNet`. These specialize in Lasso (L1) and Ridge (L2) regularization techniques, making them ideal for dealing with high-dimensional data where the number of predictors exceeds the number of observations or when multicollinearity poses a challenge.

This library offers efficient algorithms for fitting penalized regression models, including the elastic net - a combination of Lasso and Ridge regression. Additionally, `scikit-learn` provides cross-validation variants like `ElasticNetCV` to identify the best regularization parameter, and fitted models expose their coefficients through the `coef_` and `intercept_` attributes. The library is widely used in machine learning, statistics, and data science, and is a good choice for feature selection, prediction, and variable importance assessment tasks.

When working with `scikit-learn`'s linear models, users can adjust the fitting process through a number of arguments. The most commonly used ones are outlined below; for the full list, refer to the official documentation at [scikit-learn.org](https://scikit-learn.org).

The `ElasticNet` class, which fits linear models with Lasso, Ridge, or Elastic Net regularization, uses the **Coordinate Descent** optimization method.

```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=1.0)
```

- `alpha`: The regularization strength (equivalent to `lambda` in some contexts). Higher values increase regularization.
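- `l1_ratio`: The elastic net mixing parameter, with range `[0,1]` (default is 0.5). `l1_ratio=1` is Lasso and `l1_ratio=0` is a pure L2 (Ridge) penalty; for the latter, the dedicated `Ridge` class is usually the better choice.
- `max_iter`: The maximum number of iterations (default is 1000).
- `tol`: The tolerance for optimization convergence (default is 1e-4).
- `normalize`: Formerly normalized the regressors X before regression. It was deprecated in scikit-learn 1.0 and removed in 1.2; standardize the features beforehand instead, for example with `StandardScaler`.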
- `fit_intercept`: Whether to calculate the intercept for this model (default True).
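Since penalties act on coefficient magnitudes, features on very different scales are penalized unevenly. One common way to handle this is to standardize inside a pipeline, as in the sketch below. The simulated data and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three features on very different scales
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, 0.3, -0.01]) + rng.normal(size=100)

# Standardize first, then fit the penalized model on the scaled features
pipe = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=1000, tol=1e-4, fit_intercept=True),
)
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name
print(pipe.named_steps["elasticnet"].coef_)       # coefficients on the standardized scale
print(pipe.named_steps["elasticnet"].intercept_)
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically at prediction time and within each cross-validation fold.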

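**K-fold cross-validation** can be performed using the `ElasticNetCV` class. In addition to the base parameters, `ElasticNetCV` accepts `cv` (the number of folds, default 5), `alphas` (the grid of regularization strengths to try, generated automatically if not given), and a list of candidate values for `l1_ratio`. The best combination is selected by mean squared error on the held-out folds; `ElasticNetCV` does not take a `scoring` argument.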
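When a metric other than mean squared error is wanted for model selection, one option is to wrap `ElasticNet` in the general-purpose `GridSearchCV`, which does accept a `scoring` argument. The grid values and simulated data below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=120)

# Candidate values for regularization strength and L1/L2 mix
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}

search = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean absolute error of the best setting
```

`GridSearchCV` refits each candidate from scratch, so it is typically slower than `ElasticNetCV`, which reuses solutions along the regularization path.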

### Step-by-Step Guide

1. **Install and Import the scikit-learn Library**
    ```python
    # Install if needed: pip install scikit-learn
    from sklearn.linear_model import ElasticNet, ElasticNetCV
    import numpy as np
    import pandas as pd  # For data handling
    ```

2. **Prepare the Data**
    `scikit-learn` expects the predictor matrix $\mathbf{X}$ and response vector $y$ as array-like objects, typically NumPy arrays:
    - $\mathbf{X}$: A 2D array of predictor variables with shape `(n_samples, n_features)`.
    - $y$: A 1D array of the response variable.
    ```python
    # Example assuming 'your_dataframe' is a pandas DataFrame
    # with a column named 'response_variable'
    X = your_dataframe.drop(columns='response_variable').to_numpy()  # predictors as 2D NumPy array
    y = your_dataframe['response_variable'].to_numpy()  # response as 1D NumPy array
    ```

3. **Fitting Regularized GLMs for Gaussian (Linear Regression)**
    `ElasticNet` combines the Lasso (L1) and Ridge (L2) penalties, with the mix controlled by the `l1_ratio` parameter:
   - `l1_ratio = 1`: Lasso regression
   - `l1_ratio = 0`: Ridge regression
   - `0 < l1_ratio < 1`: Elastic Net, a mixture of L1 and L2 regularization

#### Gaussian Model (Linear Regression)
For a continuous response variable (Gaussian family), use `ElasticNet`, which minimizes a penalized squared-error loss:
```python
# Gaussian model (Linear Regression) with Lasso regularization
fit_gaussian = ElasticNet(alpha=1.0, l1_ratio=1.0)  # Lasso
fit_gaussian.fit(X, y)
# Alternatively, use l1_ratio=0 for Ridge or l1_ratio=0.5 for Elastic Net
```
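To see how the choice of penalty affects sparsity, the sketch below fits three models to simulated data in which only 5 of 20 features are informative. The dataset and the `alpha` values are assumptions for illustration. The dedicated `Ridge` class is used for the pure L2 case, and its `alpha` is on a different scale from `ElasticNet`'s because the two classes scale the loss differently.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Ridge

# Simulated data: 20 features, only 5 of which influence y
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso_fit = ElasticNet(alpha=5.0, l1_ratio=1.0).fit(X, y)   # Lasso
enet_fit = ElasticNet(alpha=5.0, l1_ratio=0.5).fit(X, y)    # Elastic Net
ridge_fit = Ridge(alpha=5.0).fit(X, y)                      # Ridge (pure L2)

for name, model in [("lasso", lasso_fit), ("elastic net", enet_fit), ("ridge", ridge_fit)]:
    # Number of coefficients that are not exactly zero
    print(name, np.count_nonzero(model.coef_))
```

The Lasso fit typically keeps only a handful of features, while Ridge retains all of them with shrunken values.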

### Cross-Validation to Select Optimal Alpha
The `ElasticNetCV` class performs cross-validation to find the value of `alpha` (regularization strength) with the lowest cross-validated error, aiming for a good balance between bias and variance.
```python
# Cross-validation for the Gaussian model
cv_fit_gaussian = ElasticNetCV(l1_ratio=1.0, cv=5)  # Lasso example; the alpha grid is generated automatically by default
cv_fit_gaussian.fit(X, y)
best_alpha_gaussian = cv_fit_gaussian.alpha_  # Optimal alpha value
```
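`ElasticNetCV` can also search over the mixing parameter by passing a list to `l1_ratio`. The following sketch uses simulated data and an illustrative grid; the selected values are available as `alpha_` and `l1_ratio_` after fitting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

l1_grid = [0.1, 0.5, 0.9, 1.0]  # candidate L1/L2 mixes
cv_model = ElasticNetCV(l1_ratio=l1_grid, cv=5, max_iter=10000)
cv_model.fit(X, y)

print(cv_model.alpha_)     # selected regularization strength
print(cv_model.l1_ratio_)  # selected mixing parameter
print(cv_model.coef_)      # coefficients of the model refit on all data
```

Values of `l1_ratio` close to 1 are often given a denser grid, since the transition toward Lasso-like sparsity tends to happen there.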

### Making Predictions
After cross-validation, `ElasticNetCV` refits the model on the full training data with the best `alpha`, so the fitted object can be used directly to make predictions on new data:
```python
# Predict on new data using the optimal alpha
new_X = new_data.iloc[:, :-1].values  # predictors only; assumes the same column layout as the training DataFrame
# Gaussian model prediction
pred_gaussian = cv_fit_gaussian.predict(new_X)
```
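To judge how well such predictions generalize, a held-out test set can be scored with standard regression metrics. The sketch below is self-contained and uses simulated data; the split size and metrics are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Cross-validation for alpha happens on the training split only
model = ElasticNetCV(l1_ratio=1.0, cv=5).fit(X_train, y_train)

pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"alpha={model.alpha_:.4f}  test MSE={mse:.2f}  test R^2={r2:.3f}")
```

Selecting `alpha` on the training split and scoring on the untouched test split avoids an optimistic estimate of performance.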

## Summary

By following this approach, you can perform regularized GLMs for Gaussian models (linear regression) in Python using the `scikit-learn` library. The regularization parameter `alpha` is chosen through cross-validation to enhance model performance on unseen data.

## Resources

Here are some books related to the Regularized Generalized Linear Model:

1. [Generalized Linear Models - Taylor & Francis eBooks](https://www.taylorfrancis.com/books/mono/10.1201/9780203753736/generalized-linear-models-mccullagh)

2. [Generalized Linear Models (Chapman & Hall/CRC Monographs on Statistics)](https://www.amazon.com/Generalized-Chapman-Monographs-Statistics-Probability/dp/0412317605)

3. [An Introduction to Generalized Linear Models - Taylor & Francis eBooks](https://www.taylorfrancis.com/books/mono/10.1201/9781315182780/introduction-generalized-linear-models-adrian-barnett-annette-dobson)