# Lebesgue $p$-Norms

The $L$ in $L_p$ norms stands for "Lebesgue," named after Henri Lebesgue, a French mathematician known for his work on integration and measure theory. $L_p$ norms are a family of norms in a vector space, where "$p$" is a parameter that determines the type of norm. The most common ones are:
- **L1 norm**: Also known as the "taxicab" or "Manhattan" norm, it sums the absolute values of the components.
- **L2 norm**: Also known as the Euclidean norm, it calculates the square root of the sum of the squares of the components.

The $L_p$ norm generalizes these: it is the $p$-th root of the sum of the absolute values of the components, each raised to the power $p$. In other words, take the absolute value of each component, raise it to the power $p$, sum the results, and then take the $p$-th root of the total. Setting $p = 1$ or $p = 2$ recovers the $L_1$ and $L_2$ norms, respectively.
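Written out for a vector $x = (x_1, \dots, x_d)$ (the symbols $x_i$ and $d$ are notation introduced here for illustration), the definition reads:

$$
\|x\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}
$$

so that $\|x\|_1 = \sum_{i=1}^{d} |x_i|$ and $\|x\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2}$.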

Lebesgue $p$-norms are defined for any real $p \geq 1$, not only for integers. (For $0 < p < 1$ the same formula fails the triangle inequality, so it is not a norm.) As $p$ approaches infinity, the norm converges to what's called the L-infinity norm, which is the maximum absolute value among the components.
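A minimal sketch of the norm family in plain Python may make the limiting behavior concrete. The function name `lp_norm` and the sample vector are illustrative choices, not part of any library:

```python
def lp_norm(x, p):
    """Return the L_p norm of a sequence of numbers for real p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for the result to be a norm")
    if p == float("inf"):
        # L-infinity norm: the largest absolute component.
        return max(abs(v) for v in x)
    # p-th root of the sum of |x_i|^p.
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, -4.0, 1.0]  # arbitrary example vector

l1 = lp_norm(x, 1)               # sum of absolute values
l2 = lp_norm(x, 2)               # Euclidean length
linf = lp_norm(x, float("inf"))  # maximum absolute value

# As p grows, the norm decreases towards the L-infinity norm.
for p in (1, 2, 4, 16, 64):
    print(p, lp_norm(x, p))
```

The loop prints a decreasing sequence that approaches `linf`, the largest absolute component of `x`.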

# Effect of Lasso and Ridge Regression on Parameters

Ridge regression and Lasso regression are types of linear regression that include regularization to prevent overfitting, especially when dealing with multicollinearity or when the number of predictors exceeds the number of observations. Here's how they work:

## **Ridge Regression**
- **Also known as**: Tikhonov regularization or L2 regularization.
- **Penalty term**: Adds the sum of the squared coefficients (multiplied by a regularization parameter lambda) to the loss function.
- **Effect on coefficients**: Shrinks the coefficients towards zero but, in general, never exactly to zero. This means all variables remain in the model, but their impact is reduced.
- **When to use**: Useful when you have many small/medium-sized effects that you don't want to exclude.

## **Lasso Regression**
- **Also known as**: Least Absolute Shrinkage and Selection Operator or L1 regularization.
- **Penalty term**: Adds the sum of the absolute values of the coefficients (multiplied by lambda) to the loss function.
- **Effect on coefficients**: Can shrink some coefficients to exactly zero, effectively performing **variable selection**.
- **When to use**: Helpful when you want a simpler, more interpretable model, as it tends to select a subset of the provided predictors.
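
The difference between the two penalties is easiest to see in a special case. Assume, purely for illustration, that the predictors are orthonormal ($X^\top X = I$) and that the objectives are scaled as $\tfrac{1}{2}\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2$ for ridge and $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ for lasso. Under those assumptions both solutions have closed forms in terms of the ordinary least-squares coefficients: ridge rescales each coefficient by $1/(1+\lambda)$, while lasso applies soft-thresholding. The sketch below uses hypothetical function names and made-up coefficient values:

```python
def ridge_shrink(b_ols, lam):
    """Ridge solution under an orthonormal design: proportional shrinkage."""
    return [b / (1.0 + lam) for b in b_ols]


def lasso_soft_threshold(b_ols, lam):
    """Lasso solution under an orthonormal design: soft-thresholding."""
    out = []
    for b in b_ols:
        magnitude = max(abs(b) - lam, 0.0)
        if magnitude == 0.0:
            out.append(0.0)  # coefficient is removed from the model
        else:
            out.append(magnitude if b > 0 else -magnitude)
    return out


b_ols = [3.0, -0.4, 0.1, -2.0]  # made-up least-squares coefficients
lam = 0.5                       # made-up regularization strength

ridge = ridge_shrink(b_ols, lam)
lasso = lasso_soft_threshold(b_ols, lam)

print("ridge:", ridge)  # every coefficient is smaller, none is zero
print("lasso:", lasso)  # coefficients with |b| <= lam become exactly zero
```

For general, correlated predictors there is no such closed form for the lasso, but the qualitative behavior is the same: ridge shrinks smoothly, lasso can zero coefficients out.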



# How to calculate/estimate a value for Lambda

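Choosing lambda in ridge and lasso regression is crucial for balancing the trade-off between bias and variance: a larger lambda means stronger shrinkage (more bias, less variance), while a lambda of zero recovers ordinary least squares. There are a few approaches to finding an appropriate lambda: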

  * Cross-Validation
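    - **K-fold Cross-Validation**: The dataset is divided into $k$ subsets (folds). For a given candidate lambda, the model is trained on $k-1$ of the folds and validated on the remaining one; this is repeated $k$ times so that each fold serves once as the validation set, and the validation errors are averaged. The lambda with the best average performance is selected.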
    - **Leave-One-Out Cross-Validation**: A special case of $k$-fold where $k$ equals the number of data points. It's more computationally expensive but can be effective for smaller datasets.
  * Information Criteria
    - **Akaike Information Criterion (AIC)** and **Bayesian Information Criterion (BIC)**: These criteria balance model fit with model complexity. The criterion is computed for each candidate lambda, and the lambda with the lowest value is preferred.
  * Algorithms
    - **Grid Search**: You define a set of candidate lambda values (often spaced evenly on a logarithmic scale), and the algorithm evaluates the model's performance at each one, typically by cross-validation. The lambda that provides the best performance according to a chosen metric is selected.
    - **Random Search**: Similar to grid search but samples lambda values randomly from a specified range. Its efficiency advantage shows up mainly when several hyperparameters are tuned jointly; for a single lambda, a grid is usually sufficient.
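
To show how cross-validation and grid search fit together, here is a small sketch that uses only the Python standard library. It makes several simplifying assumptions that are not from the text above: a single predictor with no intercept, so the ridge fit has the closed form $\hat\beta = \sum x_i y_i / (\sum x_i^2 + \lambda)$; synthetic data; and mean squared error as the metric. All function names are hypothetical.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge coefficient for one predictor, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, beta):
    """Mean squared prediction error of the fit y = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_score(xs, ys, lam, folds):
    """Average validation MSE over the folds for one candidate lambda."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        beta = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse([xs[i] for i in held_out], [ys[i] for i in held_out], beta))
    return sum(scores) / len(scores)


# Synthetic data: y = 2x + noise (made up for this example).
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

folds = k_fold_indices(len(xs), 5, rng)            # 5-fold split, reused for every lambda
lambda_grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]   # log-spaced grid plus zero

cv_scores = {lam: cv_score(xs, ys, lam, folds) for lam in lambda_grid}
best_lam = min(cv_scores, key=cv_scores.get)
print("selected lambda:", best_lam, "CV MSE:", cv_scores[best_lam])
```

Reusing the same folds for every candidate keeps the comparison between lambdas fair. Setting the number of folds equal to the number of data points would turn this into leave-one-out cross-validation.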
