# Week 8: Advanced Regression

Complex regression types like Polynomial, Ridge, Lasso, Elastic Net, etc.

https://www.geeksforgeeks.org/machine-learning/ml-different-regression-types/

- Regression Analysis
- Regularization Techniques
- Types of Regression in ML
    - Linear
    - Logistic
    - Polynomial
    - Softmax Regression
    - Ridge Regression
    - Lasso Regression
    - Elastic Net Regression
- Need for Regression
- Gradient Descent

https://www.geeksforgeeks.org/machine-learning/ml-linear-regression/

https://www.analyticsvidhya.com/blog/2022/01/different-types-of-regression-models/

https://scikit-learn.org/stable/modules/linear_model.html



---


#### Polynomial Regression

Concept: Modeling non-linear relationships by introducing polynomial features such as $x^2$ and $x^3$.

Key Challenge: The risk of Overfitting with high-degree polynomials.

What if the data is more complex than a simple straight line? Surprisingly, a linear model can still fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression. For a single feature $x$, the model is a polynomial of degree $n$:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \varepsilon.$$
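The idea of "add powers of each feature, then train a linear model" can be sketched with scikit-learn as follows. This is a minimal illustration, not a prescribed workflow: the synthetic data (a noisy quadratic), the sample size, and the choice of degree 2 are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed for illustration): y = 0.5*x^2 + x + 2 + noise
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(0, 0.5, size=200)

# Extend the feature set with x^2. include_bias=False because
# LinearRegression fits the intercept (beta_0) itself.
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)  # columns: [x, x^2]

# An ordinary linear model trained on the extended features.
model = LinearRegression().fit(x_poly, y)
print("intercept (beta_0):", model.intercept_)
print("coefficients (beta_1, beta_2):", model.coef_)
```

The fitted coefficients should land near the values used to generate the data. Raising `degree` well beyond what the data needs is where the overfitting risk noted above comes in.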

### Regularized Regression 
Concept: Introduction to the concept of a penalty term or regularizer added to the loss function to constrain model complexity.

#### Ridge Regression
($L_2$ Regularization)

Ridge Regression is a regularized version of Linear Regression in which a regularization term is added to the cost function. This forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, evaluate its performance using the unregularized performance measure. The ridge regression cost function is:
$$J(\mathbf{\theta}) = \text{MSE}(\mathbf{\theta}) + \alpha \frac{1}{2}\sum_{i=1}^{n} \theta_i^2$$

Explanation of the Formula Components:

* $J(\mathbf{\theta})$: The **Cost Function** (or loss function) that the model minimizes.
* $\text{MSE}(\mathbf{\theta})$: The **Mean Squared Error** component, which measures the model's performance on the training data. This is the standard part of the cost function for linear regression.
* $\alpha \frac{1}{2}\sum_{i=1}^{n} \theta_i^2$: The **Regularization Term** (or penalty term).
    * $\mathbf{\theta}$: The vector of model **parameters** (the weights or coefficients).
    * $\theta_i$: The $i$-th **parameter** (excluding the intercept, $\theta_0$).
    * $\sum_{i=1}^{n} \theta_i^2$: The sum of the squared parameters (the **squared L2-norm** of the weight vector).
    * $\alpha$ (alpha): The **regularization hyperparameter** that controls the strength of the penalty. A larger $\alpha$ forces the model to use smaller weights.
    * $\frac{1}{2}$: A constant factor often included for mathematical convenience, so that the derivative of the term with respect to each $\theta_i$ is simply $\alpha\theta_i$.

Ridge Regression, also known as **L2 Regularization**, adds this penalty term to prevent **overfitting** by keeping the model weights small.

- Adds the squared magnitude of coefficients to the loss function.
- Shrinks coefficients towards zero, reducing variance.
- Supplementary: The λ (or α) hyperparameter and its role.
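The ridge cost function above can be minimized in closed form. The sketch below assumes $\text{MSE}(\mathbf{\theta}) = \frac{1}{m}\sum (\hat{y} - y)^2$ over $m$ training samples and leaves the intercept unpenalized by centering the data; the helper name `ridge_fit` and the synthetic data are hypothetical, introduced only for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize MSE(theta) + alpha * 0.5 * sum(theta_i^2); intercept not penalized."""
    m, n = X.shape
    # Centering X and y takes the intercept out of the problem,
    # so only the feature weights are penalized.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Setting the gradient (2/m) * Xc^T (Xc theta - yc) + alpha * theta to zero
    # gives (Xc^T Xc + 0.5 * alpha * m * I) theta = Xc^T yc.
    theta = np.linalg.solve(Xc.T @ Xc + 0.5 * alpha * m * np.eye(n), Xc.T @ yc)
    theta0 = y_mean - x_mean @ theta
    return theta0, theta

# Synthetic data (assumed): five features, the last two irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(0, 1.0, size=100)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    _, theta = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>6}: theta={np.round(theta, 3)}")
```

As `alpha` grows, every weight shrinks toward zero, but none is set exactly to zero. Library implementations may scale the data-fit term differently (scikit-learn's `Ridge`, for instance, uses the sum of squared errors rather than the mean), so the same numeric `alpha` does not mean the same penalty strength across formulations.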

#### Lasso Regression 
($L_1$ Regularization)

Like Ridge Regression, Lasso (Least Absolute Shrinkage and Selection Operator) is a regularized version of Linear Regression: it adds a regularization term to the cost function, but it uses the $\ell_1$ norm of the weight vector instead of half the square of the $\ell_2$ norm. The Lasso cost function is:

$$J(\mathbf{\theta}) = \text{MSE}(\mathbf{\theta}) + \alpha \sum_{i=1}^{n} |\theta_i|$$

Explanation of the Formula Components

This equation is very similar to the Ridge Regression cost function, but it uses the **absolute value of the weights** instead of the square, which has a different effect on the model.

* $J(\mathbf{\theta})$: The **Cost Function** that the model minimizes.
* $\text{MSE}(\mathbf{\theta})$: The **Mean Squared Error** component (the loss from prediction accuracy).
* $\alpha \sum_{i=1}^{n} |\theta_i|$: The **Regularization Term** (**L1 Penalty**).
    * $\mathbf{\theta}$: The vector of model **parameters** (weights).
    * $\sum_{i=1}^{n} |\theta_i|$: The sum of the absolute values of the parameters (**L1-norm**).
    * $\alpha$ (alpha): The **regularization hyperparameter**.

The key distinction of Lasso ($\sum |\theta_i|$) is its ability to drive the coefficients of unimportant features to **exactly zero**, effectively performing **feature selection**.

- Adds the absolute magnitude of coefficients to the loss function.
- Can drive some coefficients exactly to zero, effectively performing Feature Selection.
- Supplementary: The λ (or α) hyperparameter and its role.
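Lasso's feature-selection behavior can be observed on a small synthetic problem. The following is a sketch under stated assumptions: the data (five features, of which only the first two affect the target) and `alpha=0.1` are chosen for illustration, and scikit-learn's `Lasso` scales the data-fit term as $\frac{1}{2m}\sum(\hat{y}-y)^2$ rather than plain MSE, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): only features 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)

print("coefficients:", lasso.coef_)
# Indices of the features whose coefficients were kept (non-zero).
print("selected features:", np.flatnonzero(lasso.coef_))
```

The coefficients of the three irrelevant features come out as exactly zero, while the two informative ones are kept, slightly shrunk toward zero.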





## Regularization Comparison: L2 vs. L1

| Feature | L2 (Ridge Regression) | L1 (Lasso Regression) |
| :--- | :--- | :--- |
| **Penalty Term** | Sum of **squared** coefficients ($\sum \theta_i^2$). | Sum of the **absolute values** of coefficients ($\sum \lvert\theta_i\rvert$). |
| **Effect on Coefficients** | **Drives coefficients down** (shrinks them) toward zero. | **Drives coefficients of unimportant features to *exactly* zero.** |
| **Primary Benefit** | Reduces variance and prevents **overfitting** by keeping all model weights small. | Performs **automatic feature selection** by eliminating unimportant features. |
| **Outcome** | All original features remain in the model, but with reduced influence. | The resulting model is simpler and more interpretable as it only includes the most relevant features. |

***
### Key Takeaway on Magnitude and Prediction

> "so the more magnitude a feature has, the more it will be felt and the less or close to 0, it will not affect the prediction"

* **L2 (Ridge):** Penalizes *large* coefficients heavily because the penalty is proportional to $\theta_i^2$. It forces large coefficients to shrink, meaning that a feature with a very high magnitude (influence) will have its contribution to the prediction ***reduced***.
* **L1 (Lasso):** Also penalizes large coefficients, but its unique geometry allows it to push coefficients for irrelevant features **all the way to zero** instead of just close to zero. This means that features driven to $0$ truly have **zero effect** on the prediction, fulfilling the goal of **feature selection**.

In short:

- L2 (Ridge Regression): Shrinks coefficients by penalizing large weights. Keeping the overall weights small helps prevent overfitting.

- L1 (Lasso Regression): Drives the coefficients of unimportant features to exactly zero, which makes it usable for feature selection.

For features on comparable scales, the larger a coefficient's magnitude, the more its feature influences the prediction. Both L1 and L2 reduce this influence, but L1 goes one step further and sets it to zero for useless features.

- Coefficient≈0⟹No/Minimal effect on prediction.

- Coefficient=0⟹Zero effect on prediction (Lasso’s unique ability).
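Why L1 reaches exactly zero while L2 only gets close can be seen in a one-coefficient problem. The sketch assumes a single weight $\theta$ whose unregularized least-squares value is $z$, so the data-fit term is $\frac{1}{2}(\theta - z)^2$. Adding the ridge penalty $\frac{\alpha}{2}\theta^2$ gives the minimizer $z / (1 + \alpha)$; adding the lasso penalty $\alpha|\theta|$ gives the soft-thresholding rule $\text{sign}(z)\max(|z| - \alpha, 0)$. The function names are illustrative.

```python
import numpy as np

def ridge_shrink(z, alpha):
    # Minimizer of 0.5*(theta - z)^2 + 0.5*alpha*theta^2: proportional shrinkage.
    return z / (1.0 + alpha)

def lasso_shrink(z, alpha):
    # Minimizer of 0.5*(theta - z)^2 + alpha*|theta|: soft thresholding.
    # Any z with |z| <= alpha is mapped to exactly zero.
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

z = np.array([3.0, -1.0, 0.4, 0.2])  # unregularized coefficients (assumed values)
alpha = 0.5

print("ridge:", ridge_shrink(z, alpha))
print("lasso:", lasso_shrink(z, alpha))
```

Ridge scales every coefficient by the same factor, so small ones stay small but non-zero. Lasso subtracts a fixed amount from each magnitude, so the coefficients that start below `alpha` are cut to zero.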

#### Role of the hyperparameter λ (α)

The hyperparameter λ (lambda), usually called α (alpha) in practical implementations such as Python's scikit-learn, is the main control in both L1 (Lasso) and L2 (Ridge) regularization: it sets the strength of the penalty applied to the model's coefficients.

By setting that strength, λ (or α) manages the bias-variance trade-off so that the model generalizes well, mitigating the risks of both overfitting (too complex) and underfitting (too simple).

##### Role of λ / α in Regularization
$$\text{Cost} = \text{Loss (data fit)} + \lambda \times \text{Penalty (model complexity)}$$

The λ or α value is the constant that multiplies the penalty term, determining the trade-off between:
1. Fitting the data well (minimizing the original Loss term, which risks overfitting).
2. Keeping the model simple (minimizing the Penalty term, which risks underfitting).

### Effect of $\lambda$ / $\alpha$ on Model Behavior

| $\lambda$ / $\alpha$ Value | Regularization Strength | Effect on Coefficients | Model Outcome |
| :--- | :--- | :--- | :--- |
| **$\lambda = 0$** | **Zero** (No regularization) | Coefficients are unconstrained. | Model reverts to **Ordinary Least Squares (OLS)**. No protection against overfitting. |
| **Small $\lambda$** | **Weak** | Small penalty; coefficients are shrunk slightly. | Model is complex. Still risks overfitting. |
| **Optimal $\lambda$** | **Balanced** | Achieves the best trade-off between bias and variance. | **Optimal Generalization** (the goal). |
| **Large $\lambda$** | **Strong** | Coefficients are heavily penalized (pushed very close to zero). | Model becomes overly simple. High risk of **underfitting** (high bias). |
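The rows of this table can be traced by sweeping `alpha` and comparing training error with error on held-out data. The sketch below uses scikit-learn's `Ridge`; the synthetic data (few samples, many features, so that the unregularized model overfits), the alpha grid, and the 50/50 split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): 60 samples, 30 features, only 5 informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
true_w = np.zeros(30)
true_w[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ true_w + rng.normal(0, 1.0, size=60)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

results = []
for alpha in [1e-3, 1e-1, 1.0, 10.0, 100.0, 1e4]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    results.append((alpha, train_mse, val_mse))
    print(f"alpha={alpha:<8g} train MSE={train_mse:8.3f}  validation MSE={val_mse:8.3f}")
```

Training error can only rise as `alpha` increases, while validation error typically falls and then rises again. The `alpha` with the lowest validation error is the "optimal" row of the table; in practice it is usually chosen by cross-validation rather than a single split.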

### $\lambda$ / $\alpha$ Specific Role for L1 and L2

| Regularization Type | Penalty Term | The Role of $\lambda$ |
| :--- | :--- | :--- |
| **L2 (Ridge)** | $\lambda \sum \theta_i^2$ (Sum of **squared** coefficients) | $\lambda$ controls the overall **magnitude** of coefficients, making them small without eliminating any features. |
| **L1 (Lasso)** | $\lambda \sum \lvert\theta_i\rvert$ (Sum of **absolute values** of coefficients) | $\lambda$ controls the **number of features used**. A higher $\lambda$ forces more unimportant features' coefficients to become **exactly zero** (Feature Selection). |

Specific Action on Model Coefficients:
1. L2 Regularization (Ridge)
 - Primary Action: The λ value controls the overall magnitude of the coefficients.
 - Mechanism: It works by making coefficients small or shrinking them toward zero. Coefficients with very high weights are penalized more aggressively to reduce their individual influence on the prediction.
 - Result: Coefficients are reduced but, in general, not set exactly to zero, so no features are eliminated. The model keeps all features but with smaller, more evenly distributed weights.

2. L1 Regularization (Lasso)
 - Primary Action: The λ value controls feature selection and model sparsity.
 - Mechanism: It penalizes the absolute value of the coefficients. Because the slope of this penalty does not diminish as a coefficient approaches zero, it keeps pushing small coefficients until the weights of irrelevant or redundant features become exactly zero.
 - Result: It effectively removes the effect of useless features from the prediction, achieving the goal of feature selection.

| Regularization Type | Coefficient Treatment | Feature Status |
| :--- | :--- | :--- |
| **L2 (Ridge)** | Shrinks coefficients (reduces their effect) | All features are retained; coefficients are not set exactly to zero |
| **L1 (Lasso)** | Shrinks coefficients and sets irrelevant ones to zero | Some features are eliminated (feature selection) |
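How the penalty strength controls the number of features Lasso keeps can be sketched by counting non-zero coefficients across a range of `alpha` values. The synthetic data (ten features with true weights of decreasing size, four of them zero) and the alpha grid are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): true weights decay, last four features are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([5, 3, 2, 1, 0.5, 0.2, 0, 0, 0, 0], dtype=float)
y = X @ true_w + rng.normal(0, 0.1, size=200)

alphas = [0.001, 0.1, 0.3, 1.0, 100.0]
counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    counts.append(int(np.count_nonzero(coef)))
    print(f"alpha={alpha:<6g} non-zero coefficients: {counts[-1]}")
```

With a very small `alpha` all informative features survive; as `alpha` grows, the weakest features drop out first, and a large enough `alpha` zeroes every coefficient, which is the underfitting extreme.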

Elastic Net (Optional but Recommended): A hybrid of Ridge and Lasso whose penalty combines the L1 and L2 terms.
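A minimal Elastic Net sketch with scikit-learn follows. In `ElasticNet`, `alpha` sets the overall penalty strength and `l1_ratio` sets the mix between the two penalties; the synthetic data and the specific values `alpha=0.1`, `l1_ratio=0.5` are assumptions for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (assumed): only features 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

# l1_ratio=1.0 would be a pure L1 (Lasso) penalty, l1_ratio=0.0 a pure L2 penalty;
# 0.5 weights the two equally.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("coefficients:", enet.coef_)
print("R^2 on training data:", enet.score(X, y))
```

Both hyperparameters are normally tuned together, for example with `ElasticNetCV`, since the best mix depends on how many features are irrelevant and how correlated they are.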