# Week 8: Advanced Regression

Complex regression types like Polynomial, Ridge, Lasso, Elastic Net, etc.

https://www.geeksforgeeks.org/machine-learning/ml-different-regression-types/

- Regression Analysis
- Regularization Techniques
- Types of Regression in ML
  - Linear
  - Logistic
  - Polynomial
  - Softmax Regression
  - Ridge Regression
  - Lasso Regression
  - Elastic Net Regression
- Need for Regression
- Gradient Descent

https://www.geeksforgeeks.org/machine-learning/ml-linear-regression/

https://www.analyticsvidhya.com/blog/2022/01/different-types-of-regression-models/

https://scikit-learn.org/stable/modules/linear_model.html



---


#### Polynomial Regression

Concept: Understanding how to model non-linear relationships by introducing polynomial features such as $x^2$ and $x^3$.

Key Challenge: The risk of Overfitting with high-degree polynomials.

What if our data is more complex than a simple straight line? Surprisingly, we can use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression. The equation below represents a polynomial model of degree $n$:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \varepsilon.$$
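
The idea of "add powers of each feature, then fit a linear model" can be sketched with scikit-learn. This is a minimal illustration, not a prescribed workflow: the synthetic quadratic data, the random seed, and the names `poly_model` and `linear` are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: y = 0.5 x^2 + x + 2 plus Gaussian noise
rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.5, size=200)

# Step 1: expand x into [x, x^2]. Step 2: fit an ordinary linear model on them.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X, y)

linear = poly_model.named_steps["linearregression"]
print("intercept (beta_0):", linear.intercept_)
print("coefficients for [x, x^2]:", linear.coef_)
```

The fitted values should land near the generating values (2, 1, 0.5). Raising `degree` far beyond what the data needs is where the overfitting risk mentioned above comes from.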

### Regularized Regression 
Concept: Introduction to the concept of a penalty term or regularizer added to the loss function to constrain model complexity.

##### Ridge Regression 
$L_2$ Regularization

Ridge Regression is a regularized version of Linear Regression in which a regularization term is added to the cost function. This forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model's performance using the unregularized performance measure. The formula for ridge regression is:
$$J(\mathbf{\theta}) = \text{MSE}(\mathbf{\theta}) + \alpha \frac{1}{2}\sum_{i=1}^{n} \theta_i^2$$

Explanation of the Formula Components:

* $J(\mathbf{\theta})$: The **Cost Function** (or loss function) that the model minimizes.
* $\text{MSE}(\mathbf{\theta})$: The **Mean Squared Error** component, which measures the model's performance on the training data. This is the standard part of the cost function for linear regression.
* $\alpha \frac{1}{2}\sum_{i=1}^{n} \theta_i^2$: The **Regularization Term** (or penalty term).
    * $\mathbf{\theta}$: The vector of model **parameters** (the weights or coefficients).
    * $\theta_i$: The $i$-th **parameter** (excluding the intercept, $\theta_0$).
    * $\sum_{i=1}^{n} \theta_i^2$: The sum of the squared parameters (the squared **L2-norm** of the weight vector).
    * $\alpha$ (alpha): The **regularization hyperparameter** that controls the strength of the penalty. A larger $\alpha$ forces the model to use smaller weights.
    * $\frac{1}{2}$: A constant factor often included for mathematical convenience, so that the derivative of the term with respect to each $\theta_i$ is simply $\alpha \theta_i$.

Ridge Regression, also known as **L2 Regularization**, adds this penalty term to prevent **overfitting** by keeping the model weights small.
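
As a worked step, here is why the $\frac{1}{2}$ factor is convenient. This sketch assumes plain gradient descent with a learning rate $\eta$ (the symbol $\eta$ is introduced here for illustration and is not part of the formula above). Differentiating the penalty with respect to a single weight $\theta_j$ gives:

$$\frac{\partial}{\partial \theta_j}\left(\alpha \frac{1}{2}\sum_{i=1}^{n} \theta_i^2\right) = \alpha\,\theta_j, \qquad j = 1, \dots, n$$

so one gradient descent update of that weight would be:

$$\theta_j \leftarrow \theta_j - \eta\left(\frac{\partial\,\text{MSE}(\mathbf{\theta})}{\partial \theta_j} + \alpha\,\theta_j\right)$$

Each step therefore pulls $\theta_j$ toward zero in proportion to its current size, which is the shrinkage effect described above.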

- Adds the squared magnitude of coefficients to the loss function.
- Shrinks coefficients towards zero, reducing variance.
- Supplementary: The λ (or α) hyperparameter and its role.
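
A small sketch of the shrinkage effect, comparing plain linear regression with Ridge on synthetic data. The data, seed, and `alpha=10.0` are arbitrary assumptions for illustration. Note also that scikit-learn's `Ridge` minimizes the residual sum of squares plus `alpha` times the squared L2 norm, so its `alpha` is scaled differently from the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: 5 features, two of which have no real effect
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_theta = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_theta + rng.normal(scale=1.0, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge weights are smaller overall, but none of them is exactly zero
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", np.round(ols.coef_, 3), "norm:", round(ols_norm, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3), "norm:", round(ridge_norm, 3))
```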

#### Lasso Regression 
$L_1$ Regularization

Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and Selection Operator) is another regularized version of Linear Regression. It also adds a regularization term to the cost function, but it uses the $\ell_1$ norm of the weight vector instead of half the square of the $\ell_2$ norm. The Lasso cost function is given as:

$$J(\mathbf{\theta}) = \text{MSE}(\mathbf{\theta}) + \alpha \sum_{i=1}^{n} |\theta_i|$$

Explanation of the Formula Components

This equation is very similar to the Ridge Regression cost function, but it uses the **absolute value of the weights** instead of the square, which has a different effect on the model.

* $J(\mathbf{\theta})$: The **Cost Function** that the model minimizes.
* $\text{MSE}(\mathbf{\theta})$: The **Mean Squared Error** component (the loss from prediction accuracy).
* $\alpha \sum_{i=1}^{n} |\theta_i|$: The **Regularization Term** (**L1 Penalty**).
    * $\mathbf{\theta}$: The vector of model **parameters** (weights).
    * $\sum_{i=1}^{n} |\theta_i|$: The sum of the absolute values of the parameters (**L1-norm**).
    * $\alpha$ (alpha): The **regularization hyperparameter**.

The key distinction of Lasso ($\sum |\theta_i|$) is its ability to drive the coefficients of unimportant features to **exactly zero**, effectively performing **feature selection**.

- Adds the absolute magnitude of coefficients to the loss function.
- Can drive some coefficients exactly to zero, effectively performing Feature Selection.
- Supplementary: The λ (or α) hyperparameter and its role.
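
The feature-selection behavior can be sketched as follows. The synthetic data (3 informative features out of 10), the seed, and `alpha=0.5` are assumptions chosen for illustration. scikit-learn's `Lasso` also scales the squared-error term by $\frac{1}{2n}$, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first 3 of 10 features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_theta = np.zeros(10)
true_theta[:3] = [4.0, -3.0, 2.0]
y = X @ true_theta + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)

# Coefficients of irrelevant features are typically set to exactly 0.0
n_zero = int(np.sum(lasso.coef_ == 0))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("coefficients that are exactly zero:", n_zero)
```

The surviving coefficients are also shrunk relative to the generating values, which is the price paid for the penalty.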





## Regularization Comparison: L2 vs. L1

| Feature | L2 (Ridge Regression) | L1 (Lasso Regression) |
| :--- | :--- | :--- |
| **Penalty Term** | Sum of **squared** coefficients ($\sum \theta_i^2$). | Sum of the **absolute values** of coefficients ($\sum \lvert\theta_i\rvert$). |
| **Effect on Coefficients** | **Drives coefficients down** (shrinks them) toward zero. | **Drives coefficients of unimportant features to *exactly* zero.** |
| **Primary Benefit** | Reduces variance and prevents **overfitting** by keeping all model weights small. | Performs **automatic feature selection** by eliminating unimportant features. |
| **Outcome** | All original features remain in the model, but with reduced influence. | The resulting model is simpler and more interpretable as it only includes the most relevant features. |

***
### Key Takeaway on Magnitude and Prediction

> "so the more magnitude a feature has, the more it will be felt and the less or close to 0, it will not affect the prediction"

* **L2 (Ridge):** Penalizes *large* coefficients heavily because the penalty is proportional to $\theta_i^2$. It forces large coefficients to shrink, meaning that a feature with a very high magnitude (influence) will have its contribution to the prediction ***reduced***.
* **L1 (Lasso):** Also penalizes large coefficients, but its unique geometry allows it to push coefficients for irrelevant features **all the way to zero** instead of just close to zero. This means that features driven to $0$ truly have **zero effect** on the prediction, fulfilling the goal of **feature selection**.

Basically:

- L2 (Ridge Regression): Reduces the magnitude of coefficients by penalizing large weights. This keeps the overall model weights small to prevent overfitting.

- L1 (Lasso Regression): Drives the coefficients of unimportant features to exactly zero through its penalty term, which allows it to be used for feature selection.

The larger a coefficient's magnitude, the more its feature influences the prediction (assuming the features are on comparable scales). L1 and L2 both reduce this influence, but L1 goes the extra step of setting it to zero for useless features.

- Coefficient≈0⟹No/Minimal effect on prediction.

- Coefficient=0⟹Zero effect on prediction (Lasso’s unique ability).

#### Role of hyperparameter lambda and alpha 

The hyperparameter λ (lambda), called α (alpha) in practical implementations such as Python's scikit-learn, is the main control mechanism in both L1 (Lasso) and L2 (Ridge) regularization. It controls the strength of the penalty term added to the loss function. This penalty manages the bias-variance trade-off so that the model generalizes well, mitigating the risks of both overfitting (too complex) and underfitting (too simple).

##### Role of λ / α in Regularization
$$\text{Cost} = \text{Loss (data fit)} + \lambda \times \text{Penalty (model complexity)}$$

The λ or α value is the constant that multiplies the penalty term, determining the trade-off between:
1. Fitting the data well (minimizing the original Loss term, which risks overfitting).
2. Keeping the model simple (minimizing the Penalty term, which risks underfitting).

### Effect of $\lambda$ / $\alpha$ on Model Behavior

| $\lambda$ / $\alpha$ Value | Regularization Strength | Effect on Coefficients | Model Outcome |
| :--- | :--- | :--- | :--- |
| **$\lambda = 0$** | **Zero** (No regularization) | Coefficients are unconstrained. | Model reverts to **Ordinary Least Squares (OLS)**. High risk of overfitting. |
| **Small $\lambda$** | **Weak** | Small penalty; coefficients are shrunk slightly. | Model is complex. Still risks overfitting. |
| **Optimal $\lambda$** | **Balanced** | Achieves the best trade-off between bias and variance. | **Optimal Generalization** (the goal). |
| **Large $\lambda$** | **Strong** | Coefficients are heavily penalized (pushed very close to zero). | Model becomes overly simple. High risk of **underfitting** (high bias). |
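
The trend in the table can be checked empirically by sweeping the penalty strength. This is a sketch on made-up data: the grid of `alphas`, the seed, and the data-generating weights are all assumptions. It tracks the size of the Ridge weight vector and the number of non-zero Lasso coefficients as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 3 informative features out of 8
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 8))
true_theta = np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_theta + rng.normal(scale=1.0, size=80)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_norms = []    # L2 norm of the Ridge coefficients for each alpha
lasso_nonzero = []  # number of non-zero Lasso coefficients for each alpha

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_nonzero.append(int(np.count_nonzero(lasso.coef_)))
    print(f"alpha={alpha:>6}: ridge norm={ridge_norms[-1]:.3f}, "
          f"lasso non-zero={lasso_nonzero[-1]}")
```

In practice the "optimal" row of the table is found by cross-validation rather than by inspecting a sweep like this by eye.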

### $\lambda$ / $\alpha$ Specific Role for L1 and L2

| Regularization Type | Penalty Term | The Role of $\lambda$ |
| :--- | :--- | :--- |
| **L2 (Ridge)** | $\lambda \sum \theta_i^2$ (Sum of **squared** coefficients) | $\lambda$ controls the overall **magnitude** of coefficients, making them small without eliminating any features. |
| **L1 (Lasso)** | $\lambda \sum \lvert\theta_i\rvert$ (Sum of **absolute values** of coefficients) | $\lambda$ controls the **number of features used**. A higher $\lambda$ forces more unimportant features' coefficients to become **exactly zero** (Feature Selection). |

Specific Action on Model Coefficients:
1. L2 Regularization (Ridge)
 - Primary Action: The λ value controls the overall magnitude of the coefficients.
 - Mechanism: It works by making coefficients small or shrinking them toward zero. Coefficients with very high weights are penalized more aggressively to reduce their individual influence on the prediction.
 - Result: Coefficients are reduced but generally not set exactly to zero, so no features are eliminated. The model keeps all features but with smaller, more evenly distributed weights.

2. L1 Regularization (Lasso)
 - Primary Action: The λ value controls feature selection and model sparsity.
 - Mechanism: It penalizes the absolute value of the coefficients. While it shrinks all coefficients, it is mathematically more aggressive, forcing the weights of irrelevant or redundant features to become exactly zero.
 - Result: It effectively removes the effect of useless features from the prediction, achieving the goal of feature selection.

| Regularization Type | Coefficient Treatment | Feature Status |
| :--- | :--- | :--- |
| **L2 (Ridge)** | Shrinks coefficients (reduces their effect) | All features are retained; coefficients are small but generally nonzero |
| **L1 (Lasso)** | Shrinks coefficients and sets irrelevant ones to zero | Some features are eliminated (feature selection) |

### Elastic Net (Optional but Recommended): A hybrid of Ridge and Lasso.

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. Its regularization term is a simple mix of the Ridge and Lasso regularization terms, controlled by the mix ratio $r$. When $r=0$, Elastic Net is equivalent to Ridge Regression, and when $r=1$, it is equivalent to Lasso Regression. The cost function for Elastic Net Regression is given as:
$$J(\boldsymbol{\theta}) = \text{MSE}(\boldsymbol{\theta}) + r\alpha\sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2}\alpha\sum_{i=1}^{n} \theta_i^2$$

This cost function combines the penalty terms from Lasso (L1) and Ridge (L2) regression. It consists of three main parts: the loss function ($\text{MSE}$), the $L_1$ penalty (Lasso), and the $L_2$ penalty (Ridge).

The model's goal is to find the set of parameter values, $\boldsymbol{\theta}$, that **minimizes** this function.

#### 1. Loss Term: $\text{MSE}(\boldsymbol{\theta})$

| Term | Name | Purpose |
| :--- | :--- | :--- |
| $\text{MSE}(\boldsymbol{\theta})$ | **Mean Squared Error** (MSE) | This is the primary **loss function**. It measures the average of the squared differences between the predicted values ($\hat{y}$) and the actual target values ($y$). Its purpose is to make the model's predictions as close as possible to the observed data. |

---

#### 2. Regularization Terms (The Penalty)

These two terms are added to **prevent overfitting** by penalizing large coefficients, which helps to simplify the model.

##### A. $L_1$ Penalty (Lasso)

| Term | Name | Effect |
| :--- | :--- | :--- |
| $r\alpha\sum_{i=1}^{n}\lvert\theta_i\rvert$ | **$L_1$ Regularization** | The sum of the absolute values of the coefficients. It forces some coefficients ($\theta_i$) to be **exactly zero**, which is highly effective for **automatic feature selection** by dropping irrelevant features. |

##### B. $L_2$ Penalty (Ridge)

| Term | Name | Effect |
| :--- | :--- | :--- |
| $\frac{1-r}{2}\alpha\sum_{i=1}^{n} \theta_i^2$ | **$L_2$ Regularization** | The sum of the squared values of the coefficients. It **shrinks the magnitude** of the coefficients toward zero (but rarely to exactly zero), which stabilizes the model and reduces its sensitivity to noise. |

---

#### 3. Hyperparameters (The Control Knobs)

These values are set **before** the training process to control the learning.

| Term | Name | Role |
| :--- | :--- | :--- |
| $\alpha$ | **Regularization Strength** | The **overall penalty strength**. A larger $\alpha$ means a stronger penalty, leading to smaller coefficients and a simpler model. If $\alpha=0$, there is no regularization. |
| $r$ | **$L_1$ Ratio** | The **mixing parameter** that balances the $L_1$ and $L_2$ penalties ($0 \le r \le 1$). |

#### The Role of $r$:

* If **$r = 1$**: The model is pure **Lasso Regression** (only $L_1$ penalty is active).
* If **$r = 0$**: The model is pure **Ridge Regression** (only $L_2$ penalty is active).
* If **$0 < r < 1$**: The model is **Elastic Net**, combining the benefits of both Lasso and Ridge.

Elastic Net is often preferred over Lasso when there are many highly correlated features. Lasso tends to arbitrarily select one of them and drop the rest, while Elastic Net tends to keep the whole group and shrink their weights together, providing a more stable and robust model.
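
The behavior with correlated features can be sketched with two nearly identical columns. Everything here is an assumption for illustration: the synthetic data, the seed, and the settings `alpha=0.1` and `l1_ratio=0.5`. In scikit-learn's `ElasticNet`, `alpha` plays the role of $\alpha$ and `l1_ratio` the role of $r$, with the squared-error term scaled by $\frac{1}{2n}$.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Elastic Net's L2 part favors splitting the weight across the correlated pair;
# Lasso has no such preference and may concentrate it on one column.
print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```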

### Effect of hyperparameter

The two hyperparameters, **$\alpha$** and **$r$**, are the "control knobs" for the Elastic Net model. They directly determine the **overall strength** and the **type** of regularization applied.

---

#### 1. Effect of $\alpha$ (Overall Regularization Strength)

$\boldsymbol{\alpha}$ controls the total magnitude of the penalty applied to the coefficients. It manages the **bias-variance tradeoff**.

| $\alpha$ Value | Impact on Penalty | Effect on Model | Risk |
| :---: | :--- | :--- | :--- |
| $\alpha \rightarrow 0$ | Penalty is negligible. | Behaves like standard **Linear Regression**. Model complexity is high. | High **Overfitting** |
| $\alpha \rightarrow \infty$ | Penalty dominates the cost function. | Coefficients are aggressively shrunk towards zero. Model complexity is low. | High **Underfitting** |

---

#### 2. Effect of $r$ (Mixing Parameter or $L_1$ Ratio)

$r$ controls the blend between the $L_1$ (Lasso) and $L_2$ (Ridge) penalties, where $0 \le r \le 1$.

| $r$ Value | Model Type | Dominant Penalty Term | Primary Effect |
| :---: | :--- | :--- | :--- |
| **$r = 1$** | **Lasso Regression** | $r\alpha\sum_{i=1}^{n} \lvert\theta_i\rvert$ | Strong **Feature Selection** (forces coefficients to $\mathbf{0}$). |
| **$r = 0$** | **Ridge Regression** | $\frac{1-r}{2}\alpha\sum_{i=1}^{n} \theta_i^2$ | **Coefficient Shrinkage** (reduces magnitude for stability). |
| **$0 < r < 1$** | **Elastic Net** | Both terms are active. | **Blend** of both effects, offering both stability and feature selection. |

#### Differences and Similarities of Ridge, Lasso, and Elastic Net
- The strength of the penalty is controlled by α in all three.
- The mix of the penalties in Elastic Net is controlled by r. By tuning r, you ensure the model is stable (like Ridge) without sacrificing the ability to remove irrelevant features (like Lasso).

Here is a comparison of the three main regularization techniques:

| Model | Penalty Term ($\mathbf{P(\boldsymbol{\theta})}$) | Core Mechanism | Primary Benefit |
| :--- | :--- | :--- | :--- |
| **Ridge Regression** | $L_2 = \sum_{i=1}^{n} \theta_i^2$ | **Shrinks** coefficients **towards zero** but rarely to exactly zero. | **Model Stability**, especially when features are highly correlated (multicollinearity). |
| **Lasso Regression** | $L_1 = \sum_{i=1}^{n} \lvert\theta_i\rvert$ | **Forces** the coefficients of **irrelevant features to exactly zero**. | **Automatic Feature Selection** and model interpretability (sparsity). |
| **Elastic Net** | $r L_1 + \frac{1-r}{2} L_2$ | **Combines** both $L_1$ and $L_2$ penalties based on the mixing ratio $r$. | **Best of both:** Offers both **feature selection** and **enhanced stability** (outperforms Lasso when correlated features exist). |

### When to use and how to use
In practice, the best approach is often to use the Elastic Net. Because Ridge and Lasso are just special cases of Elastic Net (by setting r=0 or r=1, respectively), a robust cross-validation procedure on Elastic Net will naturally find the optimal blend, including pure Lasso or pure Ridge, if they are the best fit for your data.

## Factors to Consider When Choosing a Model

| Factor | Description | Best Choice | Rationale |
| :--- | :--- | :--- | :--- |
| **Feature Set Size ($\mathbf{p}$)** | The number of features relative to the number of samples ($\mathbf{n}$), in particular when $\mathbf{p \gg n}$ (many features, few samples). | **Lasso or Elastic Net** | Feature selection is necessary to simplify the model and resolve the underdetermined problem. |
| **Multicollinearity** | Whether two or more features are highly correlated with each other. | **Ridge or Elastic Net** | Ridge handles correlated features well by shrinking their coefficients equally, providing stability. |
| **Sparsity/Interpretability** | Whether you need a model that only relies on a small, select group of features. | **Lasso or Elastic Net** ($\mathbf{r}$ closer to $\mathbf{1}$) | Forces irrelevant coefficients to exactly zero, making the model simpler and easier to interpret. |
| **Belief in True Model** | Do you believe all features contribute, or only a few contribute significantly? | **Ridge** (many small effects) or **Lasso** (few large effects) | Ridge for stable, dispersed coefficient values; Lasso for sparse, dominant features. |

---

## Regularization Mechanism Summary

* **Ridge Regression** ($L_2$ Penalty): Shrinks coefficients **toward zero** to stabilize the model.
* **Lasso Regression** ($L_1$ Penalty): Forces irrelevant coefficients to **exactly zero**, effectively picking only those features that strongly impact the model.
* **Elastic Net**: A mix of both ($L_1$ and $L_2$) that ensures the model is more **stable** (like Ridge) without losing the ability to **remove irrelevant features** (like Lasso), depending on the strength ($\alpha$) and ratio ($r$) of the penalty.

### When to Use and How to Use Regularization Techniques

#### 1. Ridge Regression ($L_2$ Penalty)

#### When to Use:
* You have **multicollinearity** (highly correlated features) and you want a **stable model**.
* You believe **all features are relevant** but want to reduce their overall impact and stabilize their weights.
* You are primarily focused on **prediction accuracy and stability**, not feature selection.

#### How to Use:
* Set the hyperparameter $\boldsymbol{\alpha}$ (or $\boldsymbol{\lambda}$) via **cross-validation**.
* $\boldsymbol{\alpha}$ controls the strength: choose the $\boldsymbol{\alpha}$ that minimizes the cross-validated error.

***

#### 2. Lasso Regression ($L_1$ Penalty)

#### When to Use:
* You suspect many of your features are **irrelevant noise**.
* You want to perform **automatic feature selection** and create a sparse, highly interpretable model.
* You have a dataset where the number of features ($\mathbf{p}$) is much larger than the number of observations ($\mathbf{n}$) ($\mathbf{p \gg n}$).

#### How to Use:
* Set the hyperparameter $\boldsymbol{\alpha}$ via **cross-validation**.
* $\boldsymbol{\alpha}$ controls the sparsity: a higher $\boldsymbol{\alpha}$ will set more coefficients to **zero**.

***

#### 3. Elastic Net Regularization ($L_1 + L_2$ Penalty)

#### When to Use:
* This is the **default recommendation** when you have no strong prior knowledge.
* You have many **correlated features** (like Ridge), *but* you also need **feature selection** (like Lasso).
* You encounter the issue where Lasso selects only one feature from a group of correlated features and you want to keep them all with reduced magnitude.

#### How to Use:
* Requires tuning **two** hyperparameters using **cross-validation** on a 2D grid:
    * $\boldsymbol{\alpha}$: The overall regularization strength (how strong the penalty is).
    * $\mathbf{r}$ ($L_1$ ratio): The mix between $L_1$ and $L_2$ (how much feature selection vs. stability).
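
One possible way to run this two-dimensional search is scikit-learn's `ElasticNetCV`, which cross-validates over a list of `l1_ratio` values and a grid of `alphas`. The grids, the 5-fold setting, and the synthetic data below are assumptions for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Hypothetical data: 4 informative features out of 15
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
true_theta = np.zeros(15)
true_theta[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ true_theta + rng.normal(scale=1.0, size=120)

alphas = np.logspace(-3, 1, 30)    # candidate penalty strengths
l1_ratios = [0.1, 0.5, 0.9, 1.0]   # candidate mixes (1.0 is pure Lasso)

search = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5, max_iter=10000)
search.fit(X, y)

print("chosen alpha:   ", search.alpha_)
print("chosen l1_ratio:", search.l1_ratio_)
print("non-zero coefficients:", int(np.count_nonzero(search.coef_)))
```

Features should normally be standardized before fitting any of these penalized models, since the penalty treats all coefficients on the same scale. The data here is already on a common scale, so that step is skipped.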

***

### Understanding $p \gg n$ (Features $\gg$ Observations)

The statement "**You have a dataset where the number of features ($\mathbf{p}$) is much larger than the number of observations ($\mathbf{n}$)**" is crucial in statistical modeling. It describes a situation known as a **high-dimensional problem** or when a model is **underdetermined**.

#### Explanation

* $\mathbf{p}$ (Features/Variables): The number of independent variables (columns) in your dataset.
    * *Example:* Predicting house price ($\mathbf{y}$) using $5,000$ characteristics ($\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{5000}$), so $\mathbf{p}=5000$.
* $\mathbf{n}$ (Observations/Samples): The number of data points (rows) you have.
    * *Example:* Data for only $50$ houses, so $\mathbf{n}=50$.
* When $\mathbf{p \gg n}$ (e.g., $p=5000, n=50$), you have **far more potential explanatory variables than data points** to reliably estimate their effects.

#### The Problem

This scenario leads to several major issues in standard linear regression (Ordinary Least Squares, or OLS):

* **Non-Unique Solution:** OLS has no unique solution for the coefficients because the system is **underdetermined** (the matrix $X^\top X$ is singular when $p > n$). There are **infinitely many coefficient vectors** ($\boldsymbol{\theta}$) that fit the limited training data perfectly.
* **Overfitting:** With many more features than samples, the model tends to learn the noise of the small dataset **perfectly**, leading to very high variance and extremely poor performance on unseen data.
* **Multicollinearity:** It is highly likely that many features will be highly correlated, which **destabilizes the coefficient estimates** (they can swing wildly).

#### The Solution (Regularization)

Regularization methods are specifically designed to handle $\mathbf{p \gg n}$:

* **Lasso ($L_1$)**: Is excellent here because it performs **feature selection**, immediately forcing the coefficients of many superfluous features to **exactly zero**. This reduces the effective number of features, $\mathbf{p}$, down to a manageable size.
* **Elastic Net**: Is often the better choice because it combines the **feature selection** of Lasso with the **stability** of Ridge, which helps manage the high likelihood of correlated features in this high-dimensional setting.
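
The $p \gg n$ problem and the effect of Lasso can be sketched on synthetic data with $p=500$ and $n=50$. The data, the seed, the choice of three truly relevant features, and `alpha=0.3` are all assumptions for illustration. When $p > n$, scikit-learn's `LinearRegression` returns one of the infinitely many perfect-fit solutions (the minimum-norm one).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Hypothetical high-dimensional data: 500 features, only 3 of them matter
rng = np.random.default_rng(0)
n, p = 50, 500
true_theta = np.zeros(p)
true_theta[:3] = [4.0, -3.0, 2.0]

X = rng.normal(size=(n, p))
y = X @ true_theta + rng.normal(scale=0.5, size=n)
X_test = rng.normal(size=(1000, p))
y_test = X_test @ true_theta + rng.normal(scale=0.5, size=1000)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.3, max_iter=10000).fit(X, y)

# R^2 on training data vs. unseen data
ols_train_r2, ols_test_r2 = ols.score(X, y), ols.score(X_test, y_test)
lasso_test_r2 = lasso.score(X_test, y_test)
n_selected = int(np.count_nonzero(lasso.coef_))

print(f"OLS   train R^2={ols_train_r2:.3f}  test R^2={ols_test_r2:.3f}")
print(f"Lasso test R^2={lasso_test_r2:.3f}  features kept={n_selected} of {p}")
```

OLS fits the 50 training rows perfectly yet generalizes poorly, while Lasso keeps only a small subset of the 500 features.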

#### When to use - revisited 

Ridge Regression (L2):

When to Use: Use Ridge when you have multicollinearity (correlated features) and your domain knowledge suggests all features are relevant. The goal is to shrink the magnitude of all coefficients toward zero to stabilize the weights and improve model accuracy, without explicitly removing any features (no "cherry picking").

Lasso Regression (L1):

When to Use: Use Lasso when your core belief is that the final model should be sparse because many features are irrelevant. This is highly beneficial in high-dimensional data where the number of features (p) is much greater than the number of observations (n) (e.g., p=500 columns, n=50 rows). The key mechanism is feature selection by forcing coefficients to exactly zero.

Elastic Net:

When to Use: Elastic Net is the default compromise. It removes irrelevant features (Lasso benefit) while also managing multicollinearity to improve stability (Ridge benefit). It addresses the specific Lasso drawback where, given a group of highly correlated features, Lasso often arbitrarily picks one and drops the rest. Elastic Net will tend to keep the entire group of correlated features while shrinking their weights.

### Model Evaluation: RMSE and MAE

Learn how to calculate and interpret key model evaluation metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

RMSE and MAE are both model evaluation metrics that measure the average magnitude of the errors between predicted and actual values, and both are expressed in the original units of the target. RMSE squares the errors before averaging, so it penalizes larger errors more heavily and is sensitive to outliers. MAE averages the absolute errors, so each error contributes in proportion to its size, which makes it the more robust measure.

#### Mean Absolute Error (MAE)
- MAE is the average of the absolute differences between the predicted and actual values.
- Robust to Outliers: MAE is less sensitive to extreme values (outliers) because it takes the absolute value of errors, rather than squaring them.
- Linear Weighting: Each error contributes in proportion to its size, so an error of 40 counts exactly four times as much as an error of 10 (under a squared penalty it would count sixteen times as much).
- Interpretation: Provides a straightforward average error in the original units of the data.
- Penalty term: Absolute value; errors are weighted linearly.
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}, i} - y_{\text{true}, i}|$$


#### Root Mean Squared Error (RMSE)
- RMSE is the square root of the average of the squared errors.
- Sensitive to Outliers: Because it squares the errors, larger errors contribute disproportionately to the RMSE. This makes it a good measure when large errors are particularly undesirable, but it also makes it highly sensitive to outliers.
- Penalizes Larger Errors: Gives more weight to larger errors. Before the square root is taken, being off by 10 contributes four times as much as being off by 5.
- Interpretation: Also reports the error in the original units of the data.
- Penalty term: Squared value; penalizes large errors disproportionately (magnifies the impact of outliers).
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_{\text{pred}, i} - y_{\text{true}, i})^2}{n}}$$
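
Both formulas are easy to compute directly with NumPy. The sketch below uses two made-up prediction vectors (the numbers and the helper names `mae` and `rmse` are assumptions for illustration) that have the same total absolute error, once spread evenly and once concentrated in a single large miss.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_pred - y_true|."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_even = np.array([12.0, 18.0, 32.0, 38.0])     # errors: 2, 2, 2, 2
y_pred_outlier = np.array([10.0, 20.0, 30.0, 48.0])  # errors: 0, 0, 0, 8

# Even errors:   MAE = 2.0, RMSE = 2.0
print("even:   ", mae(y_true, y_pred_even), rmse(y_true, y_pred_even))
# One big miss:  MAE = 2.0, RMSE = 4.0
print("outlier:", mae(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier))
```

MAE cannot tell the two cases apart, while RMSE doubles when the same total error is concentrated in one prediction.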

When to use:
1. Choose MAE when:
   - You want a metric that is less affected by outliers and weights every error in proportion to its size.
   - Outliers in the data are likely noise that should not dominate the score.
   - You want a straightforward average error in the original units.
2. Choose RMSE when:
   - Large errors are particularly problematic for your application and you want to penalize them more severely.
   - Large deviations are genuine cases the model must handle, so they should weigh heavily in the score.



