# Introduction:

This week, I am going to learn how to select the best model for a given dataset, including with regularisation techniques. After that, I will move on to feature engineering, and then I will learn about cross-validation theory.


## Model Selection

* In this part, I learned what model selection actually means — it's not just picking the most accurate model, but the one that performs well on new data. The main idea is to find a model that balances between underfitting and overfitting.
* Too simple models don't learn the patterns (underfitting), and too complex ones start memorizing the data (overfitting). So, model selection helps to figure out what level of complexity is just right.
* We also looked at different ways to select models, like subset selection, shrinkage methods (Ridge, Lasso), and using cross-validation for reliable evaluation.

The goal is to build a model that’s accurate, but also generalises well.



## Prediction Accuracy & Model Interpretability

### Prediction Accuracy

- Least squares works well when:
  - The relationship is approximately linear
  - \( n \gg p \) (many more observations than features): low bias and low variance, so it also predicts well on test data
- Issues arise when:
  - \( n \approx p \): high variance → overfitting and poor prediction on test data
  - \( p > n \): no unique least squares solution; a perfect fit on training data but poor performance on test data
- Solution:
  - Apply shrinkage (e.g., Ridge/Lasso)
  - Reduces variance significantly
  - Slight increase in bias, but much better test prediction

### Model Interpretability

- Many features may not be related to the response, so they should be removed from the model.
- Including them adds **unnecessary complexity**.
- Least squares almost never gives coefficients that are exactly zero → it doesn't eliminate useless variables.
- Use feature selection methods (like Lasso or subset selection) to:
  - Automatically drop irrelevant features
  - Improve interpretability


## Methods of Model Selection :

We have 3 main alternatives when least squares is not working well:

### 1. Subset Selection
* Choose only relevant features, drop useless ones
* Then apply least squares on that subset

**Pros:**
* Simple model
* Easy to explain and interpret

**Cons:**
* Finding best subset is slow and computationally expensive
* May miss best combination of features

### 2. Shrinkage (Regularization)

* Use all features, but shrink the coefficients
* Methods include:
  * Ridge: shrinks all coefficients
  * Lasso: shrinks and sets some coefficients to zero

**Pros:**
* Prevents overfitting
* Can perform variable selection (Lasso)
* Works well when number of features is large

**Cons:**
* Introduces bias
* Interpretation of model is less straightforward

### 3. Dimension Reduction

* Combine original variables into fewer components (e.g., using PCA)
* Use these new variables in regression

**Pros:**
* Reduces model complexity
* Handles multicollinearity among features

**Cons:**
* Hard to interpret new variables
* Original meaning of features is lost



# Subset Selection :
Subset selection comes in two main forms: best subset selection and stepwise selection.
## 1. Best Subset Selection :
### What is it?
Best Subset Selection is a method where we try every possible combination of predictors (features)  
and select the one that gives the best performance.  
It is a complete search of all 2^p possible models, where p is the number of predictors.

### Steps:
**Step 1:** Start with the null model (no predictors).  
It only predicts the mean of the response variable.

**Step 2:** For k = 1 to p:  
- Try all models that use exactly k predictors  
- From these, choose the model with the lowest RSS or highest R²  
- This gives the best model Mk for each subset size

**Step 3:** Choose the final best model from M0 to Mp using:  
- Validation error  
- Cp (AIC)  
- BIC  
- Adjusted R²  
- Cross-validation  

These help to avoid overfitting and focus on test accuracy, not just training accuracy.

### Example
If we have 4 predictors: A, B, C, D  
Then all combinations are tried: A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, etc.  
Total = 2^4 = 16 models
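
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the synthetic data, the helper names `rss` and `best_subset`, and the choice of true coefficients are assumptions made for the example, not part of the course material.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """Training RSS of a least squares fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Return ({k: (best_columns, rss)} for k = 0..p, number of models fitted)."""
    p = X.shape[1]
    best, n_fitted = {}, 0
    for k in range(p + 1):                      # k = 0 is the null model
        for cols in combinations(range(p), k):  # every subset of size k
            n_fitted += 1
            score = rss(X, y, cols)
            if k not in best or score < best[k][1]:
                best[k] = (cols, score)
    return best, n_fitted

# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

best, n_fitted = best_subset(X, y)
for k, (cols, score) in best.items():
    print(k, cols, round(score, 2))
print("models fitted:", n_fitted)
```

The output gives one candidate per size (M0 to M4). Choosing among them still needs Cp, BIC, adjusted R², or cross-validation, because training RSS keeps falling as k grows.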

### Important Terms
- Null model: a model with no predictors, only predicts the mean
- RSS: residual sum of squares (lower is better)
- R²: tells how well the model explains the output (higher is better)
- Cp, AIC, BIC, Adjusted R²: used to select the final model by estimating test error from training error
- Cross-validation: tests how well the model performs on new data

### Pros
- Tries all combinations of predictors  
- Finds best model based on training performance  
- Easy to understand

### Cons
- Very slow when number of predictors is large  
- Total models = 2^p  
  For p = 10 → 1024 models  
  For p = 20 → over 1 million models  
- Can overfit if final model is selected only based on training RSS or R²

### When to use
- Use when number of predictors is small (like p ≤ 10)  
- Not suitable for large p  
- For large p, use stepwise or shrinkage methods

## 2. Stepwise Selection :
#### Why Stepwise Selection?
* When the number of predictors (p) becomes large, Best Subset Selection becomes:
    * Too slow (computational problem)
    * More likely to overfit (statistical problem)
* This is because:
    * Best Subset Selection tries 2^p models
    * Large search space = higher chance of fitting patterns that do not generalize to test data
* So we use **Stepwise Selection** to solve these issues:
    * It explores fewer models
    * It is faster and more practical
    * It reduces overfitting risk
* There are three main types:
    1. **Forward Stepwise Selection**
    2. **Backward Stepwise Selection**
    3. **Hybrid Stepwise Selection**

## a. Forward Stepwise Selection

### Definition:
A stepwise method that starts from a model with **no predictors**,  
and adds one predictor at a time, selecting the one that improves the model the most.

### Algorithm Steps :
1. Start with the null model (no predictors)  
2. For **k = 0** to **p - 1**:
   - Check all **p - k** predictors not yet used
   - Add each one to current model, fit new models
   - Select the one with **lowest RSS** or **highest R²**
   - Add it to form new model **Mk+1**
3. Select the final model using:
   - Validation error  
   - AIC (Akaike Information Criterion)  
   - BIC (Bayesian Information Criterion)  
   - Adjusted R²  
   - Cross-validation
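
A minimal sketch of steps 1 and 2, assuming RSS as the selection criterion inside each round. The function names and the synthetic data are made up for illustration; step 3 (picking one model from the path) is left out.

```python
import numpy as np

def rss(X, y, cols):
    """Training RSS of a least squares fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the path [M0, M1, ..., Mp] as (selected_columns, rss) pairs."""
    p = X.shape[1]
    selected = []
    path = [((), rss(X, y, selected))]          # M0: the null model
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Greedy step: add the single predictor that lowers RSS the most.
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path = forward_stepwise(X, y)
for cols, score in path:
    print(cols, round(score, 2))
```

Each model in the path contains the previous one, which is what makes the method greedy: a predictor that has been added stays in all later models.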

### Example:
If you have predictors A, B, C, D:
- Start with none  
- Try A, B, C, D → pick the best (say A)  
- Next round: try only AB, AC, AD (A stays in the model) → pick the best  
- Repeat until all predictors are added, then choose among M0, …, Mp

### Time Advantage:
- Best Subset: 2^p models  
- Forward Stepwise: only **1 + p(p + 1)/2** models  
  For p = 20 → only 211 models (vs over 1 million)
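
The two counts can be checked with a short calculation. The function names below are just illustrative.

```python
def n_models_best_subset(p):
    """Best subset fits every subset of p predictors: 2^p models."""
    return 2 ** p

def n_models_forward_stepwise(p):
    """Null model plus (p - k) fits in round k, for k = 0..p-1: 1 + p(p+1)/2."""
    return 1 + sum(p - k for k in range(p))

for p in (4, 10, 20):
    print(p, n_models_best_subset(p), n_models_forward_stepwise(p))
```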

### Important Terms:
- Null Model: A model with no predictors; only predicts the mean  
- RSS (Residual Sum of Squares): Measures training error; lower is better  
- R²: Measures how much variance the model explains; higher is better  
- Adjusted R² / AIC / BIC: Used to evaluate and compare models  
- Cross-validation: Used to estimate test error

### Pros:
- Much faster than best subset  
- Works when **p > n**  
- Easy to understand and apply

### Cons:
- Greedy method — once a predictor is added, it can't be removed  
- May miss the best overall model

## b. Backward Stepwise Selection

### Definition:

A stepwise method that starts with a **full model** (all predictors),  
and removes one predictor at a time, selecting the least useful one to remove.

### Algorithm Steps: 
1. Start with the full model (all predictors included)  
2. For **k = p** down to **1**:
   - Consider all **k** models that remove one predictor
   - Choose the one with **lowest RSS** or **highest R²**
   - This becomes model **Mk-1**
3. Choose the final model using:
   - AIC, BIC, Adjusted R², or Cross-validation
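
A minimal sketch of steps 1 and 2, again using RSS to decide which predictor to drop. The helper names and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def rss(X, y, cols):
    """Training RSS of a least squares fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    """Return the path [Mp, ..., M1, M0] as (selected_columns, rss) pairs."""
    n, p = X.shape
    if n <= p:
        raise ValueError("backward stepwise needs n > p to fit the full model")
    selected = list(range(p))
    path = [(tuple(selected), rss(X, y, selected))]   # Mp: the full model
    while selected:
        # Greedy step: drop the predictor whose removal leaves the lowest RSS.
        drop = min(selected, key=lambda j: rss(X, y, [c for c in selected if c != j]))
        selected.remove(drop)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path = backward_stepwise(X, y)
for cols, score in path:
    print(cols, round(score, 2))
```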

### Important Requirement:
- Backward Stepwise needs the **full model to be fit** using least squares  
- So it requires **n > p** 
- Cannot be used when **p ≥ n**

### Pros:
- Also faster than Best Subset Selection  
- Good for datasets where **n > p**

### Cons:
- Cannot work when **p ≥ n**  
- Greedy like forward method — once removed, a predictor can't be added back  
- May miss the best model

## c. Hybrid Stepwise Selection

### Definition:

A method that combines forward and backward steps:  
- Start with no predictors  
- Add predictors one by one (like forward)  
- After each addition, check whether any predictor already in the model no longer improves the fit (for example, because the new predictor makes it redundant)
  - If so, remove it

### Goal:
To get closer to Best Subset Selection while keeping speed advantages  
of Stepwise methods.

## Comparison Table

| Feature                   | Best Subset        | Forward Stepwise   | Backward Stepwise     | Hybrid Stepwise |
|---------------------------|--------------------|--------------------|-----------------------|-----------------|
| Search Type               | All combinations   | Adds one at a time | Removes one at a time | Adds + Removes  |
| Speed                     | Very slow          | Fast               | Fast                  | Fast            |
| Works with p > n?         | No                 | Yes                | No                    | Yes             |
| Can undo an earlier step? | N/A (exhaustive)   | No                 | No                    | Yes             |
| Model Quality             | Best (on training) | Good               | Good                  | Better          |

## Summary:
- Best Subset is most complete, but too slow for large **p**  
- Forward Stepwise is fast and works for **p > n**, but because it is greedy it may miss the globally optimal model 
- Backward Stepwise is fast but only works when **n > p**  
- Hybrid gives flexibility by allowing both adding and removing predictors

Use Stepwise Selection when:
- You want a fast model selection  
- You have a large number of predictors  
- You want a balance between accuracy and efficiency

## Training Error vs Test Error in Regression Models : 
In regression, as we add more predictors to the model, training RSS decreases and R² increases.  
That means training error never goes up as we add more variables.  
But this does not mean that test error will also reduce.

When we add too many predictors, the model starts to overfit the training data.  
It learns noise or patterns that are only present in the training set, not in the real-world test data.  
This causes the test error to increase even if training error is very low.

So we cannot select the best model just by looking at training RSS or R²,  
because the full model (with all predictors) will always have the lowest RSS and highest R² on training data.  
But that model may perform badly on test data.
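
A small simulation makes this concrete. It is a sketch under assumed conditions: the data are synthetic, only the first predictor is truly related to the response, and the remaining 19 are pure noise.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test, d):
    """Fit least squares on the first d columns; return (training RSS, test MSE)."""
    def design(X):
        return np.column_stack([np.ones(len(X)), X[:, :d]])
    beta, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    train_resid = y_train - design(X_train) @ beta
    test_resid = y_test - design(X_test) @ beta
    return float(train_resid @ train_resid), float(np.mean(test_resid ** 2))

# Synthetic data (assumed): y depends on column 0 only.
rng = np.random.default_rng(1)
n_train, n_test, p = 30, 1000, 20
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = 2.0 * X_train[:, 0] + rng.normal(size=n_train)
y_test = 2.0 * X_test[:, 0] + rng.normal(size=n_test)

results = [fit_and_score(X_train, y_train, X_test, y_test, d) for d in range(p + 1)]
train_rss = [r[0] for r in results]
test_mse = [r[1] for r in results]
for d in (0, 1, 5, 10, 20):
    print(d, round(train_rss[d], 2), round(test_mse[d], 2))
```

Training RSS can only stay the same or fall as d grows, because each model contains the previous one. The test MSE column is the one to watch for overfitting.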

### How to estimate test error ?
Subset selection methods like best subset, forward selection, and backward selection  
give us a set of different models with different numbers of predictors.  
Now we need a way to choose the best model among them based on test error, not training error.

There are two ways to do this:

### 1. Indirect estimation of test error : 
We can adjust the training error by applying a penalty for model complexity.  
This helps to avoid overfitting.

Common metrics:
- Adjusted R²  
- AIC (Akaike Information Criterion)  
- BIC (Bayesian Information Criterion)  
- Cp statistic  

These metrics reduce the score if extra predictors are added without actual improvement.

### 2. Direct estimation of test error : 
We can estimate test error directly using:
- Validation set approach  
- Cross-validation (like k-fold CV)

* In the validation set approach, we split the data into a training set and a validation set.  
* In cross-validation, we divide the data into multiple parts and train/test the model multiple times.  
This gives a better estimate of how the model performs on unseen data.

### Conclusion
- Training error is not a good estimate of test error.  
- RSS and R² measure only how well the model fits the training data.  
- Use adjusted metrics or cross-validation to estimate test error.  
- Always select the model which gives the lowest test error, not just the lowest training error.


## Common Metrics for Selecting the Best Model Based on Estimated Test Error
## Cp Statistic (Mallows' Cp) :
In regression, training RSS always goes down when we add more predictors.  
But test error may go up due to overfitting.  
So training RSS is not a good estimate of test error.

Cp statistic is used to estimate test error by adjusting the training RSS.

### Cp Formula:
Cp = (1/n) * [ RSS + 2 * d * σ²_hat ]

Where:
- RSS = training residual sum of squares  
- d = number of predictors in the model  
- σ²_hat = estimated variance of error  
- n = number of observations

σ²_hat is usually estimated from the full model (model with all predictors)
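
A sketch of how the formula can be applied. It assumes that σ²_hat is taken as RSS_full / (n − p − 1) from the full model, which is one common choice. The data are synthetic, and the candidate models simply use the first d columns.

```python
import numpy as np

def training_rss(X, y, d):
    """Training RSS of least squares on an intercept plus the first d columns."""
    A = np.column_stack([np.ones(len(y)), X[:, :d]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def mallows_cp(rss, d, sigma2_hat, n):
    """Cp = (1/n) * (RSS + 2 * d * sigma2_hat)."""
    return (rss + 2 * d * sigma2_hat) / n

# Synthetic data (assumed): only the first two columns matter.
rng = np.random.default_rng(2)
n, p = 80, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Assumed estimator: error variance from the full model's residuals.
sigma2_hat = training_rss(X, y, p) / (n - p - 1)

cp_values = [mallows_cp(training_rss(X, y, d), d, sigma2_hat, n) for d in range(p + 1)]
best_d = int(np.argmin(cp_values))
print([round(c, 3) for c in cp_values], "best d:", best_d)
```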

### Why we use Cp ?
- Training error underestimates the real test error  
- Cp adds a penalty for model size  
- This penalty adjusts the RSS to give better test error estimate  
- Cp helps to compare models with different numbers of predictors

### How to use Cp ?
- Calculate Cp for all models  
- The model with the **lowest Cp value** is considered the best  
- Because it is expected to have the **lowest test error**

### Important points :
- Cp balances fit and complexity  
- It is used in model selection methods like best subset selection  
- Cp works only if σ²_hat is a good estimate of real variance  
- Lower Cp means better generalization to unseen data

## AIC (Akaike Information Criterion) :
AIC is used to select the best model by adjusting the training RSS with a penalty for model complexity.

### When to use AIC ?
AIC is defined for models fitted by maximum likelihood.  
In case of linear regression with Gaussian errors, maximum likelihood and least squares are the same.  
So AIC can be used for linear regression models.

### AIC Formula (for least squares model)
AIC = (1/n) * [ RSS + 2 * d * σ²_hat ]

Where:
- RSS = residual sum of squares (training error)  
- d = number of predictors used in the model  
- σ²_hat = estimate of variance of error  
- n = number of observations  

Constants are ignored because they don’t affect model comparison.

### AIC and Cp relation :
For least squares models, AIC and Cp are proportional to each other, 
so both select the same model in this case.

### Why we use AIC ?
- Training RSS always decreases as we add more variables  
- But test error may increase due to overfitting  
- AIC adds a penalty for the number of predictors to control overfitting  
- Helps estimate the test error more accurately

### How to use AIC ?
- Calculate AIC for all models  
- The model with the **lowest AIC value** is selected as the best model
- This model has the best trade-off between fit and complexity

### Key points
- AIC works when model is fitted by likelihood (like linear regression with normal errors)  
- AIC = fit + penalty  
- AIC and Cp are proportional for least squares models, so they pick the same model  
- Lower AIC means better model

## BIC (Bayesian Information Criterion) :
BIC is used to select the best model by estimating test error using training RSS with a stronger penalty.

### BIC Formula (for least squares model) :
BIC = (1/n) * [ RSS + log(n) * d * σ²_hat ]

Where:
- RSS = residual sum of squares  
- d = number of predictors in the model  
- n = number of observations  
- σ²_hat = estimated variance of error  
Constants are ignored as they don't affect comparison

### Intuition of BIC :
Like Cp and AIC, BIC also balances model fit and complexity  
But BIC uses a stronger penalty because log(n) > 2 when n > 7  
So BIC prefers smaller models unless additional predictors give large improvement

### Why we use BIC ?
- Training RSS always decreases with more predictors  
- But test error may increase due to overfitting  
- BIC penalizes model size more than Cp and AIC  
- Helps to select a simpler model that generalizes better

### How to use BIC ?
- Calculate BIC for each model  
- The model with the lowest BIC value is considered the best  
- Because it has best trade-off between error and complexity

### Comparison with Cp and AIC
- Cp and AIC penalty = 2dσ²_hat  
- BIC penalty = log(n) * d * σ²_hat  
- So BIC gives larger penalty for large n and prefers simpler models
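
The difference between the two penalties can be checked numerically. The RSS and variance values below are made-up numbers used only to exercise the formulas from these notes.

```python
import math

def cp_aic(rss, d, sigma2_hat, n):
    """Cp / AIC for least squares: (1/n) * (RSS + 2 * d * sigma2_hat)."""
    return (rss + 2 * d * sigma2_hat) / n

def bic(rss, d, sigma2_hat, n):
    """BIC for least squares: (1/n) * (RSS + log(n) * d * sigma2_hat)."""
    return (rss + math.log(n) * d * sigma2_hat) / n

# Made-up inputs: same fit, same model size, different sample sizes.
for n in (7, 8, 100, 10000):
    print(n, round(math.log(n), 3),
          round(cp_aic(50.0, 3, 1.5, n), 4),
          round(bic(50.0, 3, 1.5, n), 4))
```

From n = 8 onwards log(n) exceeds 2, so BIC charges more per predictor than Cp/AIC, and the gap widens as n grows.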

## Adjusted R² :

Adjusted R² is used to select the best model by adjusting the normal R² with a penalty for unnecessary variables.

### R² Recap :

R² = 1 − RSS / TSS  
RSS = residual sum of squares  
TSS = total sum of squares

R² always increases when we add more variables, even if they are not useful.  
So it is not reliable for model selection.

### Adjusted R² Formula : 

Adjusted R² = 1 − (RSS / (n − d − 1)) / (TSS / (n − 1))

Where:
- n = number of observations  
- d = number of predictors  
- RSS = residual sum of squares  
- TSS = total sum of squares  
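
A short worked example of the formula. The RSS and TSS values are invented so that adding a third predictor barely reduces RSS.

```python
def r2(rss, tss):
    """Plain R² = 1 - RSS / TSS."""
    return 1 - rss / tss

def adjusted_r2(rss, tss, n, d):
    """Adjusted R² = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Made-up numbers: model B adds one predictor and lowers RSS only slightly.
n, tss = 21, 100.0
model_a = {"d": 2, "rss": 40.0}
model_b = {"d": 3, "rss": 39.5}

for name, m in (("A", model_a), ("B", model_b)):
    print(name, round(r2(m["rss"], tss), 4), round(adjusted_r2(m["rss"], tss, n, m["d"]), 4))
```

Here plain R² prefers the larger model B, while adjusted R² prefers model A because the small drop in RSS does not pay for the lost degree of freedom.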

### Why we use Adjusted R² ?

- Unlike R², it penalizes for adding extra variables  
- If the added variable improves the fit enough, adjusted R² increases  
- If the added variable is mostly noise, adjusted R² decreases  
- So it helps in selecting the correct model size

### How to use Adjusted R² ?

- Calculate adjusted R² for different models  
- The model with the highest adjusted R² is selected  
- It gives a good balance between fit and simplicity

### Key points

- R² always increases with more variables  
- Adjusted R² may increase or decrease  
- It is easy to compute and useful in practice  
- But it has less theoretical justification compared to Cp, AIC, and BIC



## Validation and Cross-Validation

These are direct methods to estimate test error. Unlike Cp, AIC, BIC, or adjusted R², which adjust the training error mathematically, here we actually evaluate the model on held-out validation data to get the error.

### Validation Set Approach

- Split the available data into two parts:
  - Training set
  - Validation set (also called hold-out set)

- Fit the model on training data

- Predict on validation set

- Calculate validation error (usually MSE) to estimate test error

- The model with the lowest validation error is selected
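
The procedure can be sketched as follows. The 50/50 split, the synthetic data, and the use of nested candidate models (first d columns) are assumptions for illustration.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_val, y_val, d):
    """Fit least squares on the first d columns of the training set; return validation MSE."""
    def design(X):
        return np.column_stack([np.ones(len(X)), X[:, :d]])
    beta, *_ = np.linalg.lstsq(design(X_tr), y_tr, rcond=None)
    resid = y_val - design(X_val) @ beta
    return float(np.mean(resid ** 2))

# Synthetic data (assumed): only the first two columns matter.
rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Random split into a training half and a validation half.
idx = rng.permutation(n)
train_idx, val_idx = idx[: n // 2], idx[n // 2:]

val_errors = [
    validation_mse(X[train_idx], y[train_idx], X[val_idx], y[val_idx], d)
    for d in range(p + 1)
]
best_d = int(np.argmin(val_errors))
print([round(e, 3) for e in val_errors], "best d:", best_d)
```

Rerunning with a different seed for the split can change `best_d`, which is the variability drawback of the validation set approach.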

### Example

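In the Credit dataset, models of different sizes (2 to 11 variables) were compared using validation set error (MSE). The same approach applies when choosing model flexibility: when polynomial fits were compared, the quadratic model had lower validation error than the linear model, and the cubic model had slightly higher error than the quadratic, so the quadratic model was best.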

### Drawbacks of validation set

1. High variability – depends on how data is split
2. Wastes data – only part of data is used for training, which weakens the model

## Cross-Validation (CV)

To solve the above issues, we use CV. It is more stable and uses more data for training.

### k-Fold Cross-Validation

- Split the data into k equal parts (folds)

- Repeat k times:
  - One fold = validation
  - k-1 folds = training
  - Calculate error

- Final test error = average of k error values

- Usually k = 5 or 10 is used
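
A minimal k-fold sketch with NumPy only. The function name `kfold_cv_mse`, the random fold assignment, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def design(X, d):
    """Intercept column plus the first d predictors."""
    return np.column_stack([np.ones(len(X)), X[:, :d]])

def kfold_cv_mse(X, y, d, k=5, seed=0):
    """Average validation MSE over k folds for least squares on the first d columns."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    fold_mse = []
    for i in range(k):
        val_idx = folds[i]                                   # one fold = validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(design(X[train_idx], d), y[train_idx], rcond=None)
        resid = y[val_idx] - design(X[val_idx], d) @ beta
        fold_mse.append(np.mean(resid ** 2))
    return float(np.mean(fold_mse))                          # average of k errors

# Synthetic data (assumed): only the first two columns matter.
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 6))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=120)

cv_errors = [kfold_cv_mse(X, y, d, k=5) for d in range(7)]
print([round(e, 3) for e in cv_errors])
```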

### LOOCV (Leave-One-Out Cross-Validation)

- Special case of CV where k = n (every point is used once as validation)

- Gives a nearly unbiased estimate of test error, but is slow for large n because the model is fitted n times

- For linear models, a formula exists to compute LOOCV without fitting n models
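
The sketch below compares brute-force LOOCV with the standard leverage-based shortcut for least squares, CV = (1/n) Σ ((yᵢ − ŷᵢ) / (1 − hᵢ))², where hᵢ is the i-th diagonal entry of the hat matrix. The data are synthetic and the function names are illustrative.

```python
import numpy as np

def loocv_brute_force(A, y):
    """Refit the model n times, each time leaving one observation out."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors.append((y[i] - A[i] @ beta) ** 2)
    return float(np.mean(errors))

def loocv_shortcut(A, y):
    """Single fit: rescale each residual by 1 / (1 - leverage)."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    Q, _ = np.linalg.qr(A)            # reduced QR of the design matrix
    h = np.sum(Q ** 2, axis=1)        # leverages = diagonal of the hat matrix
    return float(np.mean((resid / (1.0 - h)) ** 2))

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(5)
n = 30
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)
A = np.column_stack([np.ones(n), X])  # design matrix with intercept

slow = loocv_brute_force(A, y)
fast = loocv_shortcut(A, y)
print(round(slow, 6), round(fast, 6))
```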

### Benefits of CV

- Gives a direct and realistic estimate of test error

- Works with any model (linear, logistic, tree, etc.)

- Does not require assumptions like error variance or degrees of freedom

## One-Standard-Error Rule

If multiple models have similar test error, choose the **simplest model** whose error is within 1 standard error of the lowest one.

This reduces overfitting and gives a stable, interpretable model.
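
One way to apply the rule in code, assuming the candidate models are ordered from simplest to most complex and that a standard error is available for each CV estimate. The error values below are made up.

```python
import numpy as np

def one_standard_error_choice(cv_errors, cv_ses):
    """Index of the simplest model whose CV error is within one SE of the minimum.

    Assumes index 0 is the simplest model and complexity increases with the index.
    """
    cv_errors = np.asarray(cv_errors, dtype=float)
    cv_ses = np.asarray(cv_ses, dtype=float)
    best = int(np.argmin(cv_errors))
    threshold = cv_errors[best] + cv_ses[best]
    # argmax on a boolean array returns the first True, i.e. the simplest qualifying model.
    return int(np.argmax(cv_errors <= threshold))

# Made-up CV errors and standard errors for models with 1..5 predictors.
errors = [10.0, 5.0, 4.2, 4.0, 4.1]
ses = [1.0, 0.5, 0.4, 0.3, 0.3]

chosen = one_standard_error_choice(errors, ses)
print("lowest error at index", int(np.argmin(errors)), "one-SE choice at index", chosen)
```

With these numbers the minimum is at index 3 (error 4.0, threshold 4.3), and the simplest model under the threshold is index 2.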

## Comparison with Cp, AIC, BIC, and Adjusted R²

| Method         | Type      | Uses real test data? | Needs error variance or d? | Works for all models? |
|----------------|-----------|----------------------|-----------------------------|------------------------|
| Cp, AIC, BIC   | Indirect  | No                   | Yes                         | No (linear/MLE only)   |
| Adjusted R²    | Indirect  | No                   | Yes                         | No (linear only)       |
| Validation/CV  | Direct    | Yes                  | No                          | Yes                    |



## Bias-Variance Trade-Off

The bias-variance trade-off is a fundamental concept for understanding model performance. 

- **Bias** is error due to simplifying assumptions made by the model. High bias means the model is too simple and underfits the data.  
- **Variance** is error due to model sensitivity to small fluctuations in the training data. High variance means the model overfits the data.

A model with low bias and low variance is ideal but usually hard to achieve. Increasing model complexity reduces bias but increases variance. Conversely, simplifying the model increases bias but reduces variance.

Regularization techniques such as Ridge, Lasso, and Elastic Net introduce bias by constraining model parameters, but this reduces variance, helping improve generalization. Understanding this trade-off explains why these shrinkage methods are effective.


# Shrinkage Methods :

Shrinkage is used when we want to make our model more stable and less sensitive to noise or correlation in the predictors. Instead of dropping variables like subset selection, shrinkage keeps all predictors but shrinks their coefficients.

### Why use shrinkage?

- Subset selection chooses only a few predictors, but can be unstable
- Small changes in data can give different selected subsets
- If predictors are correlated or p is large, model may overfit
- Shrinkage reduces model variance and helps avoid overfitting

### What is shrinkage?

Shrinkage means:
- Fit model using all p predictors
- But apply a penalty so that coefficient values are pulled closer to 0
- This makes the model more stable and generalizes better to test data

### Which techniques use shrinkage?

There are two main methods:

1. **Ridge Regression**
   - Shrinks all coefficients but none become exactly zero
   - Useful when all predictors are somewhat useful
   - Good for multicollinearity

2. **Lasso**
   - Shrinks some coefficients to exactly zero
   - So it also performs variable selection
   - Useful when we want a simpler model with fewer predictors

## 1. Ridge Regression

Ridge is a shrinkage method where we fit all predictors but add a penalty on the size of the coefficients. It helps in reducing model variance and controlling overfitting, especially when predictors are correlated or $p$ is large.

### Why use Ridge?

- Least squares works well only when $n \gg p$ and predictors are not highly correlated
- If predictors are correlated or model is too flexible, variance becomes high
- Ridge controls this by shrinking coefficients closer to zero
- It keeps all variables in the model but makes their influence smaller

### Ridge Regression Formula

Ridge minimizes this function:

$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

- First term = usual least squares (fit the data)
- Second term = shrinkage penalty (regularization)
- $\lambda$ is a tuning parameter $\geq 0$

### What lambda ($\lambda$) does

- $\lambda = 0$ → Ridge = OLS (no penalty)
- $\lambda > 0$ → coefficients are pushed closer to 0
- $\lambda \rightarrow \infty$ → all coefficients go to 0
- Larger $\lambda$ means stronger penalty

We choose the best $\lambda$ using cross-validation.
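
For this loss, the coefficients have a closed form on centered data, β = (XᵀX + λI)⁻¹Xᵀy. The sketch below uses it to illustrate the λ behaviour listed above. The function name `ridge_fit` and the synthetic data are assumptions, and the intercept is left unpenalized by centering.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge with an unpenalized intercept, via the closed form on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 4))
y = 5.0 + X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(size=60)

# Reference: ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(len(y)), X])
ols, *_ = np.linalg.lstsq(A, y, rcond=None)

b0_zero, beta_zero = ridge_fit(X, y, 0.0)     # lambda = 0  -> same as OLS
b0_mid, beta_mid = ridge_fit(X, y, 50.0)      # lambda > 0  -> shrunken coefficients
b0_huge, beta_huge = ridge_fit(X, y, 1e12)    # lambda huge -> coefficients near 0

print(np.round(beta_zero, 3), np.round(beta_mid, 3), np.round(beta_huge, 6))
```

With a very large λ, the fitted model reduces to the intercept alone, which equals the mean of y.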

### What is the penalty term?

Penalty term:  
$$
\lambda (\beta_1^2 + \beta_2^2 + \dots + \beta_p^2)
$$

- This term discourages large values of coefficients
- It is added to the RSS, so the optimizer prefers smaller $\beta$ values
- Helps in reducing overfitting and model complexity

### Why is $\lambda$ a penalty even though it's positive?

Because we are minimizing:

$$
\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
$$

- Adding $\lambda \sum \beta^2$ increases total loss
- So the model avoids large coefficients
- That’s why it acts as a penalty even though $\lambda$ is positive

### What is L2 regularization?

- The term "L2" comes from the L2 norm of the coefficient vector:

$$
||\beta||_2 = \sqrt{\beta_1^2 + \beta_2^2 + \dots + \beta_p^2}
$$

- In ridge, we use the **squared L2 norm** as the penalty
- So ridge regression is also called **L2 regularization**
- It limits the total "size" of coefficients without making any of them exactly zero

### Centering the data before Ridge

Before applying ridge, we must center the predictors:

$$
x_j^{\text{centered}} = x_j - \bar{x}_j
$$

- Intercept $\beta_0$ is not penalized
- Centering ensures $\beta_0$ only captures the mean of $y$
- Many libraries, such as scikit-learn, fit the intercept separately so that it is not penalized

### What happens if we don’t center?

- The intercept gets mixed with penalized coefficients
- Penalty behaves incorrectly
- Model can become biased or unstable

### Summary of Ridge Regression

- Adds penalty on large coefficients
- Keeps all predictors but shrinks their influence
- Doesn’t set any coefficient to zero
- $\lambda$ controls how strong the shrinkage is
- Choose $\lambda$ using cross-validation
- Always center the data before applying ridge
- Ridge is also called **L2 Regularization**, because it adds the squared L2 norm of coefficients as a penalty

## Ridge Regression – Application to the Credit Data

This example shows how ridge regression behaves on the Credit dataset with 10 predictors.

### What happens to coefficients as λ increases?

We fit ridge regression with different λ values and observe:

- Each predictor’s coefficient is plotted as a function of λ
- When λ = 0 → Ridge = OLS → coefficients are same as least squares
- As λ increases → coefficients shrink towards 0
- When λ is very large → all coefficients become almost 0 → this is like a null model with no predictors

Some variables like **income**, **limit**, **rating**, and **student** have the largest initial coefficients. These are shown in color in the plot.

While most coefficients shrink smoothly, some (like **rating**) may increase slightly at certain λ values. But overall, the **total size of all coefficients decreases** as λ increases.

### Plot with L2 Norm Ratio

Instead of showing λ on the x-axis, we can also plot:

$$
\frac{||\hat{\beta}_\lambda||_2}{||\hat{\beta}_{\text{OLS}}||_2}
$$

- This ratio tells us **how much the ridge coefficients have shrunk compared to OLS**
- At λ = 0 → ratio = 1 → no shrinkage
- At large λ → ratio → 0 → all coefficients are almost zero

This helps us understand the **overall shrinkage effect** in a single number.
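
The ratio can be computed along a grid of λ values. This sketch uses synthetic data rather than the Credit dataset, and the λ grid is arbitrary.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Ridge coefficients (intercept excluded) from the closed form on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
ols_norm = np.linalg.norm(ridge_coefs(X, y, 0.0))   # lambda = 0 gives the OLS fit
ratios = [float(np.linalg.norm(ridge_coefs(X, y, lam)) / ols_norm) for lam in lambdas]

for lam, ratio in zip(lambdas, ratios):
    print(lam, round(ratio, 4))
```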

### Ridge is not scale-invariant

- In OLS: If we multiply a variable (e.g. income) by 1000, the coefficient gets divided by 1000 → no effect on model
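- In Ridge: the penalty depends on the size of each coefficient, and that size depends on the units of the predictor, so rescaling a variable changes **how strongly it is penalized** and therefore changes the fitted model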

This means ridge regression is **sensitive to how each variable is scaled**.

### Solution: Standardize the predictors

To fix this, we **standardize** each predictor before fitting ridge:

$$
\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}
$$

After standardizing:

- Each predictor has mean = 0 and std deviation = 1
- All variables are on the same scale
- Ridge now treats all predictors equally

The coefficients plotted in this example are based on **standardized predictors**.
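
The effect of scaling can be checked directly. The sketch below multiplies one column by 1000 and compares fitted values before and after. The data are synthetic, and `ridge_fitted` and `standardize` are illustrative helpers.

```python
import numpy as np

def ridge_fitted(X, y, lam):
    """Fitted values of ridge with an unpenalized intercept (closed form)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y.mean() + Xc @ beta

def standardize(X):
    """Center each column and divide by its standard deviation (1/n version)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)   # np.std uses ddof=0 by default

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=50)

X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0   # same variable, different units

ols_same = np.allclose(ridge_fitted(X, y, 0.0), ridge_fitted(X_scaled, y, 0.0))
ridge_same = np.allclose(ridge_fitted(X, y, 10.0), ridge_fitted(X_scaled, y, 10.0))
ridge_std_same = np.allclose(
    ridge_fitted(standardize(X), y, 10.0),
    ridge_fitted(standardize(X_scaled), y, 10.0),
)
print(ols_same, ridge_same, ridge_std_same)
```

Least squares (λ = 0) is unaffected by the change of units and raw ridge is affected. Ridge on standardized predictors gives the same fit in both cases.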

### Summary of Ridge Regression on Credit Data

- As λ increases, coefficients shrink towards 0
- L2 norm ratio shows total shrinkage amount
- Ridge is **not scale-invariant**
- Always **standardize** predictors before applying ridge
- This example shows how ridge controls model complexity and handles multicollinearity

## Why Ridge Regression is Better than OLS

- Ridge regression improves over OLS by balancing bias and variance.
- OLS has low bias but can have very high variance, especially when the number of predictors (p) is close to or larger than the number of observations (n).
- Ridge adds a penalty ($\lambda$) to shrink coefficients and reduce model flexibility.
- As lambda increases:
  - Coefficients get smaller
  - Variance decreases
  - Bias increases slightly
- Ridge works well because reducing variance helps more than the small increase in bias.
- In some cases, OLS ($\lambda = 0$) can perform almost as badly as the null model (very large $\lambda$), while ridge with a well-chosen intermediate $\lambda$ gives much lower test error.
- OLS fails completely when p > n (no unique solution), but ridge still works fine by shrinking coefficients.
- Ridge is also much faster to compute than subset selection, which tries all possible models.
- Overall, ridge gives more stable, generalizable, and efficient models than OLS when predictors are many or correlated.

## 2. Lasso Regression

Lasso stands for **Least Absolute Shrinkage and Selection Operator**.  
It is a regularization method like Ridge, but it solves one main problem that Ridge cannot. Ridge regression shrinks coefficients but never sets any of them to zero, so it always keeps all variables in the model. Lasso, on the other hand, can shrink some coefficients exactly to zero. This means it does both shrinkage and variable selection.

### Why use Lasso?

- Ridge always keeps all variables, even if they are not important
- Lasso gives simpler and more interpretable models by removing unimportant predictors
- It is useful when we want to know which features actually matter

### Lasso Loss Function

Lasso minimizes this function:

$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

- First term is the usual least squares error (RSS)
- Second term is the L1 penalty (sum of absolute values of coefficients)
- λ is the tuning parameter
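
Unlike ridge, this loss has no closed-form solution in general. One common way to minimize it is cyclic coordinate descent with soft-thresholding. The sketch below is a minimal version under stated assumptions: the intercept is handled by centering, the data are synthetic, and the fixed iteration count is arbitrary. Library implementations may scale the objective differently, so their penalty values are not directly comparable to λ here.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards 0 by t, and return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum((y - b0 - X b)^2) + lam * sum(|b|) by cyclic coordinate descent."""
    Xc = X - X.mean(axis=0)      # centering leaves the intercept unpenalized
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back.
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data (assumed): only columns 0 and 2 matter.
rng = np.random.default_rng(9)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

Xc, yc = X - X.mean(axis=0), y - y.mean()
ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
lam_max = 2.0 * np.max(np.abs(Xc.T @ yc))   # above this, every coefficient is 0

beta_zero = lasso_cd(X, y, 0.0)             # lambda = 0 -> least squares
beta_mid = lasso_cd(X, y, 200.0)            # moderate lambda -> sparse solution
beta_big = lasso_cd(X, y, 1.01 * lam_max)   # large lambda -> null model

print(np.round(beta_zero, 3), np.round(beta_mid, 3), beta_big)
```

The soft-threshold step is what produces exact zeros: any coefficient whose correlation with the partial residual is below λ/2 in absolute value is set to 0 rather than merely shrunk.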

### L1 vs L2

- L1 norm = $|\beta_1| + |\beta_2| + \dots + |\beta_p|$
- L2 norm = $\sqrt{\beta_1^2 + \beta_2^2 + \dots + \beta_p^2}$

L1 norm has sharp corners in its geometry. So when the model tries to minimize the loss, it naturally sets some coefficients to exactly zero. This is why Lasso performs variable selection.

### What does Lasso do?

- Shrinks coefficients like Ridge
- But also sets some coefficients exactly to 0
- Gives sparse models (only a few variables used)
- More interpretable when p is large

### Example from Credit Dataset

- When λ = 0 → Lasso = OLS → all variables used
- As λ increases → Lasso removes unimportant variables
- In the Credit data, as λ decreases from a large value:
  - **rating** enters the model first
  - Then **student**, **limit**, and **income**
  - The other variables are only included when λ is small

So based on λ, Lasso can give models with 1, 2, 3, ... or all 10 variables. This makes it flexible and useful.

### Summary of Lasso

- Full form: **Least Absolute Shrinkage and Selection Operator**
- Adds L1 penalty to the loss
- Can shrink some coefficients to 0
- Performs both shrinkage and variable selection
- Produces simpler, more interpretable models
- Works best when some variables are not important
- Choose λ using cross-validation

## Another Form of Ridge and Lasso Regression

Ridge and Lasso can also be written in another way. Instead of adding a penalty, we can write them as constraint problems with a fixed budget.

This is just another way to write the same idea. Both forms give the same solution for the right value of λ or s.

### Lasso Constraint Form

Minimize RSS subject to:

$$
\sum_{j=1}^{p} |\beta_j| \leq s
$$

This means: Find the best model that keeps the total absolute size of coefficients within a fixed budget s.

When s is large → gives the least squares solution  
When s is small → forces some coefficients to 0

### Ridge Constraint Form

Minimize RSS subject to:

$$
\sum_{j=1}^{p} \beta_j^2 \leq s
$$

This means: Find the best model that keeps the total squared size of coefficients within the budget s.

As s gets smaller → coefficients shrink more  
Ridge will never set coefficients exactly to 0

### Geometry when p = 2

Lasso: constraint region is a diamond  
→ sharp corners → leads to sparse models (some β = 0)

Ridge: constraint region is a circle  
→ smooth boundary → all β are shrunk but generally non-zero

### Connection to Best Subset Selection

Best subset selection can also be written as a constrained problem. Minimize RSS subject to:

$$
\sum_{j=1}^{p} I(\beta_j \neq 0) \leq s
$$

This means: choose the best model using at most s predictors  
I() = 1 if the coefficient is non-zero, otherwise 0

It gives the most directly interpretable models but is computationally slow when p is large

### Comparison of All Three

Ridge uses squared penalty → shrinks coefficients  
Lasso uses absolute value penalty → shrinks and selects  
Best subset selection limits the number of predictors directly

Lasso is a good middle ground: it performs feature selection like best subset selection but is computationally feasible like ridge.

### Summary

Lasso and Ridge can be written with constraints instead of penalties  
Lasso: L1 constraint → forces some β to 0  
Ridge: L2 constraint → shrinks all β but none go to 0  
Subset: direct limit on number of variables  
Lasso gives sparse models with better interpretation and good computation


## Why Lasso Sets Some Coefficients to Zero but Ridge Does Not

We already saw that Ridge and Lasso both shrink coefficients using a constraint on their size. But only Lasso sets some coefficients exactly to 0. This happens because of the shape of their constraint regions.

### Understanding with Geometry

Both models try to minimize RSS (error), but they are restricted to stay within a certain region.

- The least squares solution is at the center of ellipses (error contours)
- The model stops where the first ellipse touches the constraint boundary

In the image below:

- **Left side** shows Lasso (diamond constraint)
- **Right side** shows Ridge (circular constraint)

At the point of contact:
- Lasso can hit a **corner**, which means one coefficient becomes **exactly zero**
- Ridge hits a **smooth edge**, so all coefficients are **non-zero**

This is the reason Lasso can **remove features** (feature selection), and Ridge only **shrinks them**

![Lasso vs Ridge Contours](lasso_ridge.png)

### When p > 2

- The ridge constraint region becomes a sphere (p = 3) or a hypersphere (p > 3)
- The lasso constraint region becomes a polyhedron (p = 3) or a polytope (p > 3)
- Lasso still has corners → gives sparse solutions
- Ridge stays smooth → no zero coefficients

### Summary

- Lasso uses L1 constraint → gives sharp corners → some β = 0
- Ridge uses L2 constraint → smooth boundary → all β ≠ 0
- Lasso does automatic variable selection, Ridge only shrinks
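The difference is easy to see in a special case. Assuming the columns of X are orthonormal and there is no intercept (a simplifying assumption, not the general case), both problems have closed-form solutions: ridge divides every least squares coefficient by (1 + λ), while lasso soft-thresholds each one at λ/2. The function names below are hypothetical helpers written for this sketch.

```python
import numpy as np

def ridge_orthonormal(beta_ols, lam):
    # Minimizes sum((b - b_ols)^2) + lam * sum(b^2): proportional shrinkage.
    return beta_ols / (1.0 + lam)

def lasso_orthonormal(beta_ols, lam):
    # Minimizes sum((b - b_ols)^2) + lam * sum(|b|): soft-thresholding at lam / 2.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])  # made-up least squares estimates
lam = 1.0
ridge_beta = ridge_orthonormal(beta_ols, lam)
lasso_beta = lasso_orthonormal(beta_ols, lam)

print("ridge:", ridge_beta)
print("lasso:", lasso_beta)
```

Ridge keeps all four coefficients non-zero, while lasso sets the two small ones (those with absolute value below λ/2) exactly to zero.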


## Elastic Net Regression

### What is Elastic Net?

Elastic Net is a regression technique that combines both Lasso (L1) and Ridge (L2) penalties.  
It is useful when:

- Features are correlated
- Some features are irrelevant
- We want both regularization and feature selection

### Elastic Net Objective Function (2 Forms)

#### 1. Separate penalties:

$$
\text{Loss} = \text{RSS} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
$$

#### 2. Combined λ with mixing parameter α:

$$
\text{Loss} = \text{RSS} + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]
$$

- $\lambda$ : total regularization strength  
- $\alpha$ in [0, 1] : controls mixing between L1 and L2  
  - $\alpha$ = 1 : Lasso  
  - $\alpha$ = 0 : Ridge  
  - 0 < $\alpha$ < 1 : Elastic Net
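As a quick check of the combined form, the sketch below evaluates the loss directly with NumPy. The function name `elastic_net_loss` and the tiny dataset are made up for illustration, and the intercept is omitted for simplicity.

```python
import numpy as np

def elastic_net_loss(beta, X, y, lam, alpha):
    """RSS + lam * [alpha * L1 + (1 - alpha) * L2], without an intercept."""
    rss = np.sum((y - X @ beta) ** 2)
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return rss + lam * (alpha * l1 + (1.0 - alpha) * l2)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -2.0])

lasso_loss = elastic_net_loss(beta, X, y, lam=2.0, alpha=1.0)  # pure L1 penalty
ridge_loss = elastic_net_loss(beta, X, y, lam=2.0, alpha=0.0)  # pure L2 penalty
mixed_loss = elastic_net_loss(beta, X, y, lam=2.0, alpha=0.5)  # half and half
print(lasso_loss, ridge_loss, mixed_loss)
```

Setting `alpha=1.0` reproduces the lasso objective and `alpha=0.0` the ridge objective, matching the cases listed above.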

### What Elastic Net Does

- Shrinks coefficients like Ridge
- Sets some coefficients to zero like Lasso
- Performs both variable selection and regularization
- Works better than Lasso when predictors are correlated

### Why Use Elastic Net?

- Lasso tends to keep one variable from a correlated group and drop the others, and which one it keeps can be arbitrary
- Ridge keeps all variables but can’t remove unimportant ones
- Elastic Net handles both situations:
  - Keeps useful correlated variables
  - Removes irrelevant ones

### Elastic Net vs Ridge vs Lasso

| Method        | Penalty Type | Feature Selection | Handles Correlation |
|---------------|--------------|-------------------|---------------------|
| Ridge         | L2           | No                | Yes                 |
| Lasso         | L1           | Yes               | Weak (tends to keep one of a correlated group) |
| Elastic Net   | L1 + L2      | Yes               | Yes                 |

### Real-life Example

If **area** and **number_of_rooms** are highly correlated:

- Lasso may keep one and drop the other, more or less arbitrarily
- Ridge keeps both but doesn’t remove others
- Elastic Net may keep both if needed and remove the rest

### Hyperparameters

- $\lambda$ : controls penalty size (higher = more shrinkage)
- $\alpha$: controls L1 vs L2 balance

Use **cross-validation** to choose best values for $\lambda$ and $\alpha$
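A minimal sketch of this tuning with scikit-learn's `ElasticNetCV`, on synthetic data that mimics the area / number_of_rooms example (the data and the candidate grid are assumptions). Note the naming: scikit-learn's `alpha` corresponds to λ here and its `l1_ratio` corresponds to α, and its objective uses slightly different scaling constants, so the numbers are not directly comparable to the formulas above.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
rooms = area + 0.05 * rng.normal(size=n)      # highly correlated with area
noise_features = rng.normal(size=(n, 5))      # irrelevant predictors
X = np.column_stack([area, rooms, noise_features])
y = 3.0 * area + 3.0 * rooms + rng.normal(size=n)

# Standardize, then let 5-fold CV pick both the penalty strength and the mix.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5),
)
model.fit(X, y)

enet = model.named_steps["elasticnetcv"]
print("chosen lambda (sklearn alpha):", enet.alpha_)
print("chosen alpha (sklearn l1_ratio):", enet.l1_ratio_)
print("coefficients:", enet.coef_)
```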

### Summary

- Elastic Net = Lasso + Ridge
- Works well with correlated features
- Gives sparse and stable models
- Helps avoid overfitting and improves interpretation


# Selecting the Tuning Hyperparameter (λ) Using Cross-Validation / Grid Search

When using **ridge regression** or **lasso**, choosing the right tuning parameter **λ** (lambda) is very important. This parameter controls how much the model coefficients are shrunk towards zero, which affects the model's accuracy and complexity.

## What is the tuning parameter λ?

- **λ controls shrinkage:**
  - A **large λ** means strong shrinkage → coefficients get closer to zero → simpler model.
  - A **small λ** means little shrinkage → model is similar to least squares → more complex.
- Choosing λ balances the trade-off between **bias and variance**.

## Why do we use cross-validation to select λ?

- We want to find the λ that results in the best prediction accuracy on new data.
- Cross-validation (CV) helps estimate the model’s prediction error on unseen data.
- By comparing CV errors for different λ values, we pick the one that minimizes the error.

## Step-by-step process to select λ with Cross-Validation:

1. **Choose a set (grid) of candidate λ values.**  
   For example, a sequence from very small (close to 0) to large values.

2. **For each λ value, fit the model on the training folds only** (the held-out fold must not be used for fitting).

3. **Perform cross-validation:**  
   - Split data into folds (e.g., 10-fold CV).  
   - For each fold:  
     - Fit the model on the training folds.  
     - Calculate prediction error on the validation fold.  
   - Average these errors to get the CV error for that λ.

4. **Select the λ with the smallest average CV error.**

5. **Refit the model on the entire dataset using the selected λ to get the final model.**
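The five steps above can be sketched as an explicit loop. This is only an illustration under assumptions: the data are synthetic, the grid is arbitrary, ridge is used as the model, and the predictors are already on the same scale (real data should be standardized inside each training fold).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=2.0, size=120)

lambdas = np.logspace(-3, 3, 13)                         # step 1: grid of candidates
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

cv_errors = []
for lam in lambdas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X):
        # step 2: fit on the training folds only
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        # step 3: error on the held-out fold
        pred = model.predict(X[val_idx])
        fold_mse.append(np.mean((y[val_idx] - pred) ** 2))
    cv_errors.append(np.mean(fold_mse))                  # average over folds

best_lambda = lambdas[int(np.argmin(cv_errors))]         # step 4: smallest CV error
final_model = Ridge(alpha=best_lambda).fit(X, y)         # step 5: refit on all data
print("best lambda:", best_lambda)
```

In practice, helpers such as scikit-learn's `RidgeCV`, `LassoCV`, or `GridSearchCV` wrap this loop.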

## What do the examples tell us?

- In ridge regression on Credit data (Figure 6.12), the optimal λ from leave-one-out CV was small, meaning only a little shrinkage was needed.  
- The CV error curve was flat near the minimum, showing many λ values work similarly well.  
- In such cases, using the least squares model (λ close to zero) may be fine.

- In the lasso example (Figure 6.13), ten-fold CV helped pick a λ that gave zero coefficients for noise variables and non-zero for true predictors (signal variables).  
- This shows lasso + CV can successfully identify important variables even with few observations and many predictors.

## Important points to remember:

- If the CV error curve is **flat around the minimum λ**, many values perform similarly.  
  You can pick the simplest model or even the least squares solution.

- **Types of CV:**  
  - Leave-One-Out CV (LOOCV): nearly unbiased estimate of test error, but it has higher variance and is slower because the model is fit n times.  
  - K-Fold CV (e.g., 10-fold): common and faster, good balance.

- For **small or high-dimensional datasets**, CV is especially useful to avoid overfitting.

- After selecting λ, always **refit the model on the entire dataset** using that λ.

- Cross-validation gives a good estimate but results can vary; consider combining it with domain knowledge.

## Summary Table

| Step               | What to do                                  |
|--------------------|--------------------------------------------|
| 1. Select λ grid   | Choose a range of λ values to try            |
| 2. Fit models      | Fit model on training folds for each λ       |
| 3. Compute CV error | Calculate average validation error per λ     |
| 4. Choose λ        | Pick λ with the lowest average CV error      |
| 5. Refit model     | Fit final model on all data with chosen λ    |

This method does not guarantee the best possible model, but it picks the tuning parameter λ with the lowest estimated error on new, unseen data, using the cross-validation error as a guide.


## 3. Dimension Reduction Methods

### Why Dimension Reduction?

When the number of predictors (**p**) is large:

- The model becomes complex  
- Variance increases  
- Risk of **overfitting** becomes high

Earlier methods like **Subset Selection** and **Shrinkage (Ridge/Lasso)** handled this by:

- Using a **subset** of predictors  
- Or **shrinking** the coefficients  

But both methods work **using the original variables**. Now, we look at a different approach: **transforming the variables** before fitting the model.

### What is Dimension Reduction?

Instead of using the original predictors $X_1, X_2, \ldots, X_p$, we create new predictors $Z_1, Z_2, \ldots, Z_M$ where $M < p$.

Each $Z_m$ is a **linear combination** of the original predictors:

$$
Z_m = \sum_{j=1}^{p} \phi_{jm} X_j
$$

Then, we fit a linear model:

$$
y_i = \theta_0 + \sum_{m=1}^{M} \theta_m Z_{im} + \epsilon_i
$$

This is still a linear model, but now with **M transformed variables** instead of $p$.

### Why This Helps?

We are estimating **fewer coefficients**: From $p + 1$ (in original regression) to $M + 1$ (in transformed model). This helps:

- Reduce **variance**  
- Keep only the most important components  
- Avoid overfitting

### Connection to Original Model

From the transformation:

$$
Z_m = \sum_{j=1}^{p} \phi_{jm} X_j
$$

So the model becomes:

$$
\sum_{m=1}^{M} \theta_m Z_m = \sum_{j=1}^{p} \beta_j X_j
$$

where

$$
\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}
$$

This means: The dimension reduction model is a **special case** of linear regression where coefficients $\beta_j$ are **constrained** to take a certain form.
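The identity $\beta_j = \sum_m \theta_m \phi_{jm}$ can be verified numerically. In the sketch below the weights $\phi_{jm}$ are just random numbers (an assumption for illustration; PCR and PLS choose them in specific ways), stored in a hypothetical matrix `Phi` with one column per component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 50, 4, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Phi = rng.normal(size=(p, M))   # column m holds phi_{1m}, ..., phi_{pm}
Z = X @ Phi                     # Z_m = sum_j phi_{jm} X_j

# Fit theta_0, theta_1, ..., theta_M by least squares on [1, Z].
design = np.column_stack([np.ones(n), Z])
theta = np.linalg.lstsq(design, y, rcond=None)[0]
theta0, theta_m = theta[0], theta[1:]

# Implied coefficients on the original predictors.
beta = Phi @ theta_m            # beta_j = sum_m theta_m phi_{jm}

# Both parameterizations give the same fitted values.
print(np.allclose(Z @ theta_m, X @ beta))
```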

### What Happens When $M = p$?

If:

- We take $M = p$  
- And the $Z_m$'s are linearly independent  

Then: No reduction happens, and this model is the same as original least squares. So, **dimension reduction only occurs when $M < p$**.

### Two-Step Process

Every dimension reduction method follows 2 main steps:

1. **Create transformed predictors** $Z_1, Z_2, \ldots, Z_M$  
2. **Fit linear regression** on these new predictors  

The challenge is: How do we choose the combinations (i.e., the $\phi_{jm}$'s)?  

There are different methods for that. In this chapter, we look at:  

- **Principal Component Regression (PCR)**  
- **Partial Least Squares (PLS)**  

### Summary

- **Dimension Reduction** transforms original features into new combinations  
- Reduces problem size from $p+1$ to $M+1$  
- Helps in **high-dimensional** settings and **correlated variables**  
- Still uses **linear regression** but with transformed predictors  
- Introduces **bias** but reduces **variance**  
- Two main methods: **PCR** and **PLS**  

## a. Principal Component Regression (PCR) :
## What is PCR?
Principal Components Regression (PCR) is a two-step process:
1. First, we use Principal Component Analysis (PCA) to reduce the number of input variables.
2. Then, we use linear regression on these principal components instead of the original variables.

PCR is useful when there are many predictors that are highly correlated or when we want to avoid overfitting.

## Steps of PCR

### Step 1: Standardize the data
Standardize each variable so that all have mean zero and standard deviation one.

$$
X_j^{scaled} = \frac{X_j - \bar{X}_j}{s_j}
$$

### Step 2: Perform PCA on the predictors
PCA finds new variables $Z_1, Z_2, \dots, Z_p$ that are linear combinations of the original variables:

$$
Z_1 = \phi_{11}(X_1 - \bar{X}_1) + \phi_{21}(X_2 - \bar{X}_2) + \dots + \phi_{p1}(X_p - \bar{X}_p)
$$

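These $Z$ variables are called principal components. The first component $Z_1$ is the linear combination with the largest variance, and each later component captures the most remaining variation while being uncorrelated with the earlier ones. (After Step 1 the predictors are already centered, so each $\bar{X}_j$ is zero.)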

### Step 3: Select top M components
We select the top M components that capture most of the variation. The value of M is usually chosen by cross-validation.

### Step 4: Run linear regression on the selected components

$$
Y = \theta_0 + \theta_1 Z_1 + \theta_2 Z_2 + \dots + \theta_M Z_M + \varepsilon
$$
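The four steps can be chained in a scikit-learn pipeline. This is a sketch on synthetic data (two hidden factors driving six correlated predictors, all assumed for illustration), with M chosen by 5-fold cross-validation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 150
latent = rng.normal(size=(n, 2))
# Six correlated predictors generated from two hidden factors plus noise.
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(n, 6))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.5 * rng.normal(size=n)

cv_mse = {}
for M in range(1, 7):
    # Steps 1-4: standardize, PCA, keep M components, linear regression.
    pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
    scores = cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[M] = -scores.mean()

best_M = min(cv_mse, key=cv_mse.get)
print("CV error by M:", cv_mse)
print("chosen M:", best_M)
```

Putting the scaler and PCA inside the pipeline means they are re-estimated on each training fold, so the validation fold does not leak into the transformation.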

## Important Notes
- PCR reduces dimension but does not select specific features.
- Each component is a combination of all original features.
- PCR may perform poorly when the response is related mainly to low-variance directions of the predictors, because those components are the first to be dropped.
- Always standardize the data before applying PCA.
- Choose number of components M using cross-validation.

## When to Use PCR
- When predictors are highly correlated
- When number of predictors is large compared to number of observations
- When we want to avoid overfitting by reducing model complexity

## b. Partial Least Squares (PLS)

PLS is a dimension reduction method that helps us when we have many predictors and we want to predict a response variable (Y). It works in a similar way to PCR but with one important difference.

### Key Point
PLS is **supervised**, while PCR is **unsupervised**.

### What PLS Does
- PLS creates new variables Z₁, Z₂, ..., Zₘ which are linear combinations of original predictors X₁, X₂, ..., Xₚ.
- These new variables are used in linear regression to predict Y.
- But when creating Zs, PLS uses Y to guide the process. So the new components are not just good at summarizing X but also good for predicting Y.

### How PLS Works (Step by Step)
1. Standardize all predictors X.
2. For each predictor Xⱼ, do a simple linear regression of Y on Xⱼ.
3. Use the slope (coefficient) from that regression as a weight φⱼ₁.
   - These weights are proportional to the correlation between Xⱼ and Y.
4. Use these weights to create Z₁:
   
   Z₁ = φ₁₁·X₁ + φ₂₁·X₂ + ... + φₚ₁·Xₚ

5. Z₁ gives more weight to predictors that are more related to Y.
6. Then adjust all predictors and Y by removing the effect of Z₁ (take residuals).
7. From these residuals, compute Z₂ in the same way.
8. Repeat this to get as many components as needed.
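Steps 1 to 6 for the first component can be written out in NumPy. This is a sketch on synthetic data (the data and variable names are assumptions), not a full PLS implementation; scikit-learn's `PLSRegression` provides a complete one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 0.5, 2.0])
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

# Step 1: standardize the predictors (and center the response).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

# Steps 2-3: slope of the simple regression of y on each X_j is the weight phi_j1.
phi1 = (Xs.T @ yc) / np.sum(Xs ** 2, axis=0)

# The weights are proportional to the correlation between X_j and y.
corr = np.array([np.corrcoef(Xs[:, j], yc)[0, 1] for j in range(p)])

# Step 4: first PLS direction.
Z1 = Xs @ phi1

# Step 6: remove the effect of Z1 from y and from every predictor (residuals).
theta1 = (Z1 @ yc) / (Z1 @ Z1)
y_resid = yc - theta1 * Z1
X_resid = Xs - np.outer(Z1, (Z1 @ Xs) / (Z1 @ Z1))

print("weights:", phi1)
print("weights / correlations:", phi1 / corr)
```

Repeating steps 2 to 4 on `X_resid` and `y_resid` would give the second component Z₂.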

### Choosing Number of Components
- The number of components M is a tuning parameter.
- We choose M using cross-validation to get best prediction performance.

### Important Notes
- PLS focuses on both explaining X and predicting Y.
- PCR only focuses on X (ignores Y), so it may miss important relationships.
- Because the components are built using Y, PLS can reduce bias compared with PCR.
- But since PLS is supervised, it may also increase variance, so in practice it often performs no better than PCR or ridge regression.

### When to Use PLS
- You have many predictors.
- Predictors are correlated.
- You want better prediction of Y.
- You want to reduce dimensionality but still keep relationship with Y.



# Feature Engineering

Feature engineering is the process of creating or modifying input features to improve model performance. It helps make the data more suitable for learning and often leads to better accuracy.

We do not change the model — we change the data that goes into the model.

### Why use Feature Engineering?

* Raw data may not be ready for modeling
* Many ML models perform better with clean, informative features
* Can help reveal hidden patterns
* Improves model performance and generalization
* Often more impactful than changing the algorithm

### What are common feature engineering techniques?

We apply transformations, create new variables, and prepare data so the model can learn better.
Some common techniques:

## 1. Transforming Features

Apply mathematical functions to variables to reduce skewness or handle nonlinearity.

* **Log transform**: useful for highly skewed data (e.g., income, price)
  **log(x + 1)**
* **Square root / square**: used when effect increases nonlinearly
  **sqrt(x)**, **x²**
* **Inverse**: when large values should have smaller effect
  **1 / x**
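A small NumPy sketch of these transforms, using made-up values:

```python
import numpy as np

income = np.array([0.0, 20_000.0, 35_000.0, 50_000.0, 1_000_000.0])
distance = np.array([0.5, 1.0, 2.0, 10.0])

log_income = np.log1p(income)      # log(x + 1), safe when x = 0
sqrt_income = np.sqrt(income)      # milder compression than the log
inverse_distance = 1.0 / distance  # large distances get a small effect (needs x != 0)

print(log_income.round(2))
print(inverse_distance)
```

The outlier of 1,000,000 is 20 times the next largest income on the raw scale, but only slightly larger on the log scale.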

## 2. Creating Interaction Terms

Combine two or more features to capture interactions between them.

* Example:
  **age × income**
  **bedrooms × area**

* Helps linear models learn more complex relationships
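For illustration, an interaction term is just the product of two columns added as a new column (the values below are made up):

```python
import numpy as np

age = np.array([25.0, 40.0, 60.0])
income = np.array([30.0, 80.0, 50.0])

# Design matrix with both main effects and their interaction.
X = np.column_stack([age, income, age * income])
print(X)
```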

## 3. Encoding Categorical Variables

Convert categories to numbers so models can use them.

* **One-Hot Encoding**
  Creates binary columns for each category
  Example: **color\_red**, **color\_blue**
  Used in linear models, logistic regression, etc.

* **Label Encoding**
  Assigns an integer to each category
  Example: Red = 0, Blue = 1
  Suitable for tree-based models; for linear models the integers imply an order that nominal categories do not have
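Both encodings can be sketched with pandas on a made-up `color` column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer per category, in order of first appearance.
codes, categories = pd.factorize(df["color"])

print(one_hot)
print(list(codes), list(categories))
```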

## 4. Binning (Discretization)

Convert continuous values into categories.

* Example:
  Age → **\[0–18]**, **\[19–60]**, **61+**
  Income → **low**, **medium**, **high**

* Reduces noise and helps detect threshold-based effects
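A sketch of the age example with `pandas.cut`. The bin edges are an assumption: age 60 is placed in the middle bin, so the last bin starts at 61, and 120 is used as an arbitrary upper limit.

```python
import pandas as pd

ages = pd.Series([5, 18, 19, 45, 60, 61, 80])

# Bins are (0, 18], (18, 60], (60, 120]; include_lowest also admits age 0.
age_group = pd.cut(
    ages,
    bins=[0, 18, 60, 120],
    labels=["0-18", "19-60", "61+"],
    include_lowest=True,
)
print(age_group.tolist())
```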

## 5. Handling Missing Values

Fill in missing data before modeling.

* **Imputation**

  * Numerical: use mean or median
  * Categorical: use mode (most frequent)

* **Missing Indicator**
  Add a new feature:
  Example: **is\_missing = 1** if value is missing
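A minimal pandas sketch of both ideas on a made-up table. For simplicity the median and mode are computed on the whole table; in a real workflow they should be computed on the training data only and then reused on new data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, 61.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Missing indicator: record where the value was missing before filling it.
df["income_is_missing"] = df["income"].isna().astype(int)

# Imputation: median for the numerical column, mode for the categorical one.
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```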

## 6. Scaling Features

Some models (e.g., ridge, lasso, SVM) are sensitive to feature scale.

* **Standardization**

  **z = (x - mean) / std**

* **Min-Max Scaling**
  Scales values between 0 and 1

* Tree-based models (e.g., Random Forest, XGBoost) do not need scaling
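Both scalings written out in NumPy on made-up values; scikit-learn's `StandardScaler` and `MinMaxScaler` apply the same formulas column by column.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: mean 0 and standard deviation 1.
z = (x - x.mean()) / x.std()

# Min-max scaling: smallest value maps to 0, largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

print(z)
print(x_minmax)
```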

## 7. Creating New Features

Use domain knowledge to create more informative variables.

Examples:

* From dates → extract **year**, **month**, **weekday**
* From text → **word count**, **length**, **keyword presence**
* From location → **distance**, **region**
* From transactions → **total spend**, **frequency**

New features can reveal patterns not obvious in raw data.
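As an illustration of the date example, the sketch below extracts calendar features with pandas. The column name `order_date` and the dates are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"order_date": pd.to_datetime(["2024-01-15", "2024-03-02", "2024-12-25"])}
)

df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["weekday"] = df["order_date"].dt.dayofweek   # Monday = 0, Sunday = 6
df["is_weekend"] = (df["weekday"] >= 5).astype(int)

print(df)
```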

### Summary of Feature Engineering

* Helps models perform better by improving input data
* Includes transformations, encoding, scaling, and feature creation
* Should be based on data understanding and model needs
* Good features often matter more than the model itself


# Conclusion :
In this notebook, I learned how to select a model that generalises well and how to apply regularization techniques such as ridge, lasso, and elastic net. I gained a deeper understanding of the bias-variance trade-off and how the tuning parameter λ controls it. I also covered choosing λ with cross-validation, dimension reduction with PCR and PLS, and common feature engineering techniques, all of which are essential for building robust predictive models.