# Regularization

In [1]:
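import findspark
findspark.init()  # locate the Spark installation and make pyspark importable
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Start a local Spark context and wrap it in a SparkSession.
sc = SparkContext(master='local', appName="Linear Regression")
spark = SparkSession(sparkContext=sc)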

In [2]:
import numpy as np
import matplotlib.pyplot as plt

| **Aspect**                           | **Ridge Regularization**                                                                                                                                                                              | **Lasso Regularization**                                                                                                                                                                                  |
|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Penalty Term**                     | - Employs a squared L2 norm penalty: $\lambda \sum_{j=1}^p \beta_j^2$. <br> - Adds the sum of squared coefficients to the loss function. | - Employs an L1 norm penalty: $\lambda \sum_{j=1}^p \lvert\beta_j\rvert$. <br> - Adds the sum of the absolute values of the coefficients to the loss function. |
| **Coefficient Shrinkage vs. Selection** | - Shrinks all coefficients continuously towards zero. <br> - Does not force any coefficient to be exactly zero, thus retaining all predictors in the model.                                       | - Encourages sparsity by driving some coefficients exactly to zero. <br> - Performs variable selection by effectively eliminating less important predictors from the model.                        |
| **Handling of Correlated Predictors** | - Distributes the coefficient weight among correlated predictors, shrinking their coefficients toward one another rather than favoring one of them. <br> - Useful when it is believed that all predictors contribute some information. | - Tends to select one predictor from a group of correlated predictors while setting the others to zero. <br> - Can lead to a more interpretable model, though the choice of predictor can be unstable under multicollinearity. |
| **Computational Considerations**     | - Has a closed-form solution for any $\lambda > 0$ (via the regularized normal equations): $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$, making it computationally straightforward. | - In general lacks a closed-form solution because the L1 norm is not differentiable at zero. <br> - Requires iterative optimization techniques (e.g., coordinate descent) to determine the optimal coefficients. |
| **Bias-Variance Trade-off**          | - Primarily reduces variance by shrinking all coefficients toward zero, at the cost of a moderate increase in bias. | - May result in higher bias if important predictors are omitted, but often substantially reduces variance through model simplification and sparsity. |
| **Interpretability and Model Complexity** | - Retains all predictors, which may complicate model interpretation when many predictors are involved. <br> - More suitable when it is believed that all predictors have some effect on the response. | - Yields a sparse model by selecting only a subset of predictors. <br> - Enhances interpretability and simplifies the model, although over-regularization may lead to the exclusion of relevant predictors.  |
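
The shrinkage-versus-selection contrast in the table can be checked numerically. The following is a minimal sketch, not a reference implementation. It assumes the loss is the residual sum of squares $\lVert y - X\beta \rVert_2^2$, no intercept, and unstandardized features. The function names (`ridge_closed_form`, `soft_threshold`, `lasso_coordinate_descent`), the value `lam = 100`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # Minimizes ||y - X b||^2 + lam * sum(b_j^2) by solving the
    # regularized normal equations (X^T X + lam * I) b = X^T y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    # Shrinks z toward zero by t and returns exactly 0 when |z| <= t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    # Minimizes ||y - X b||^2 + lam * sum(|b_j|) by cyclic coordinate descent.
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # The threshold is lam / 2 because the loss is not halved.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data (assumed for illustration): 3 of 5 true coefficients are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=200)

lam = 100.0
beta_ridge = ridge_closed_form(X, y, lam)
beta_lasso = lasso_coordinate_descent(X, y, lam)
print("ridge:", np.round(beta_ridge, 3))
print("lasso:", np.round(beta_lasso, 3))
```

In this setup the lasso estimate should contain exact zeros for the three irrelevant predictors, while every ridge coefficient is shrunk but stays nonzero. Other conventions scale the loss differently (for example by $1/(2n)$), which rescales the effective $\lambda$.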

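The table's claim about correlated predictors can be illustrated with an extreme case. This is a rough sketch under stated assumptions: the two predictors are exact duplicates, the loss is the residual sum of squares, and the variable names (`X_dup`, `y_dup`, `beta_dup`) and the value `lam_dup = 1.0` are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)

# Two perfectly correlated predictors: the same column repeated twice.
X_dup = np.column_stack([x, x])
y_dup = 4.0 * x + 0.1 * rng.normal(size=n)

# Ridge solution from the regularized normal equations.
lam_dup = 1.0
beta_dup = np.linalg.solve(X_dup.T @ X_dup + lam_dup * np.eye(2),
                           X_dup.T @ y_dup)

print("ridge coefficients:", beta_dup)
print("sum of coefficients:", beta_dup.sum())
```

Because the penalty is strictly convex and the two columns are interchangeable, ridge splits the effect equally between them, and the two coefficients together approximately recover the single underlying effect of 4.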