# Bias-Variance Tradeoff and Regularization Techniques: Ridge, LASSO, and Elastic Net

## Bias-Variance Tradeoff

Learning Goals
- Model complexity vs. error
- Bias and variance of a model
- Sources of model error
- The bias-variance tradeoff

3 Sources of Model Error
- Bias: being wrong
  - Tendency of predictions to miss true values.
    - Worsened by missing information, overly-simplistic assumptions.
    - Miss real patterns (underfit).
- Variance: being unstable
  - Tendency of predictions to fluctuate.
    - Characterized by sensitivity of output to small changes in input data.
    - Often due to overly complex or poorly-fit models.
- Irreducible Error: unavoidable randomness
  - Intrinsic uncertainty/randomness in the data.
    - Present in even the best possible model.


Summary of bias-variance tradeoff:
- Model adjustments that decrease bias often increase variance, and vice versa.
- The bias-variance tradeoff is analogous to a complexity tradeoff.
- Finding the best model means choosing the right level of complexity.
- We want a model elaborate enough to not underfit, but not so elaborate that it overfits.

Discussion:
- The higher the degree of a polynomial regression, the more complex the model (lower bias, higher variance).
- At lower degrees, we can see visual signs of bias: predictions are too rigid to capture the curved pattern in the data.
- At higher degrees, we see visual signs of variance: predictions fluctuate wildly because of the model's sensitivity.
- The goal is to find the right degree, such that the model has sufficient complexity to describe the data without overfitting.
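
The discussion above can be sketched numerically. The following is a minimal illustration, assuming synthetic data (a noisy sine curve) and a simple half-and-half train/test split; the degrees 1, 3, and 9 and the helper `mse` are illustrative choices, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve.
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)

# Hold out the second half to estimate generalization error.
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def mse(coefs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training half only.
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree={degree}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

Training error can only go down as the degree grows, while held-out error typically falls and then rises again. The degree with the lowest held-out error is the one to prefer.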

## Regularization and Model Selection

Topics:

- Model complexity and error

- Regularization as an approach to over-fitting

- Standard approaches to regularization including Ridge, Lasso, and Elastic Net

- Recursive feature elimination

### Tuning the Model

Can we tune with more granularity than choosing polynomial degrees?

Yes, by using regularization.

#### What does Regularization Accomplish?


Adjusted cost function: $M(w) + \lambda R(w)$

$M(w)$: model error

$R(w)$: function of estimated parameter(s)

$\lambda$: regularization strength parameter

Regularization adds an (adjustable) regularization strength parameter directly into the cost function.

This $\lambda$ (lambda) adds a penalty proportional to the size of the estimated model parameter, or a function of the parameter.

Increasing $\lambda$ increases the penalty, so $\lambda$ controls how strongly large parameters are penalized.

The takeaway is that the larger $\lambda$ is, the more we penalize large parameters. The more heavily large parameters are penalized, the less complex the model can be, because minimizing the adjusted cost function means keeping all parameters small while also minimizing the original model error.

The regularization strength parameter $\lambda$ (lambda) allows us to manage the complexity tradeoff:
- More regularization yields a simpler model with more bias.
- Less regularization makes the model more complex and increases variance.

If our model is overfit (variance too high), regularization can improve the generalization error and reduce variance.
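
To see how $\lambda$ manages this tradeoff, here is a small NumPy sketch. It assumes the penalty $R(w)$ is the sum of squared weights (one common choice), uses synthetic data, and omits the intercept for brevity; `fit_penalized` is a hypothetical helper written for this example, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): 50 samples, 8 features, no intercept.
X = rng.normal(size=(50, 8))
true_w = rng.normal(size=8)
y = X @ true_w + rng.normal(scale=0.5, size=50)

def fit_penalized(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms, train_errors = [], []
for lam in lambdas:
    w = fit_penalized(X, y, lam)
    norms.append(float(np.linalg.norm(w)))               # size of the parameters
    train_errors.append(float(np.mean((X @ w - y) ** 2)))  # model error M(w)
    print(f"lambda={lam:6.1f}  ||w||={norms[-1]:.3f}  train MSE={train_errors[-1]:.3f}")
```

As $\lambda$ grows, the weights shrink and the training error rises: the model becomes simpler (more bias) in exchange for less sensitivity to the particular training sample (less variance).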

### Regularization and Feature Selection

Regularization performs feature selection by shrinking the contribution of features.

For L1-regularization, this is accomplished by driving some coefficients to zero.

Feature selection can also be performed by removing features.

### Why is Feature Selection Important?

Reducing the number of features can prevent overfitting.

For some models, fewer features can improve fitting time and/or results.

Identifying most critical features can improve model interpretability.

## Ridge Regression

![Reg Cost Function: Ridge Regression](./images/08_RegCostFunctionRidgeRegression.png "Reg Cost Function: Ridge Regression")

### Standard scaling 

![Scale matters](./images/09_Scale_matter.png "Scale matters")



### Ridge Regression:

In Ridge regression, the complexity penalty is proportional to the squared coefficient values.

- The penalty term has the effect of "shrinking" coefficients toward 0.
- This imposes bias on the model, but also reduces variance.
- We can select the best regularization strength lambda via cross-validation.
- It's best practice to scale features (e.g., using StandardScaler) so penalties aren't impacted by variable scale.



$$ J(\beta_0,\beta_1)=\dfrac{1}{2m}\sum_{i=1}^m((\beta_0+\beta_1 x_{obs}^i)-y_{obs}^i)^2 + \lambda\sum_{j=1}^k\beta_j^2$$

The penalty shrinks the magnitude of all coefficients.

Larger coefficients are penalized more strongly because of the squaring.
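
The scaling and cross-validation advice above might look like this in scikit-learn. This is a sketch under assumptions: the data is synthetic, the candidate grid is arbitrary, and one feature is deliberately rescaled to mimic mixed units. Note that scikit-learn calls the regularization strength `alpha` rather than lambda, and its objective is scaled differently from the formula above, so the numeric values are not directly comparable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption); feature 0 is put on a much larger scale.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X[:, 0] *= 1000.0

# Scaling lives inside the pipeline, so it is re-fit on each training fold.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Candidate regularization strengths (arbitrary grid for this sketch).
alphas = np.logspace(-3, 3, 13)
search = GridSearchCV(pipeline, param_grid={"ridge__alpha": alphas},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print("selected alpha:", best_alpha)
```

Putting `StandardScaler` in the pipeline, rather than scaling once up front, keeps information from the validation folds out of the scaling step.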

![Ridge Regression in Action](./images/10_RidgeRegressionInAction.png "Ridge Regression in Action")

![Complexity Tradeoff](./images/11_ComplexityTradeoff.png "Complexity Tradeoff")

## Lasso Regression

![Lasso Regression](./images/12_LassoRegression.png "Lasso Regression")

With Ridge (L2), we penalize the squared coefficients; with LASSO (L1), we penalize the absolute value of each coefficient.

In LASSO regression, the complexity penalty (weighted by $\lambda$) is proportional to the absolute value of the coefficients.
- LASSO: Least Absolute Shrinkage and Selection Operator.
- Similar effect to Ridge in terms of complexity tradeoff: increasing $\lambda$ raises bias but lowers variance.
- LASSO is more likely than Ridge to perform feature selection, in that for a fixed $\lambda$, LASSO is more likely to result in coefficients being set to zero.

$$ J(\beta_0,\beta_1)=\dfrac{1}{2m}\sum_{i=1}^m((\beta_0+\beta_1 x_{obs}^i)-y_{obs}^i)^2 + \lambda\sum_{j=1}^k|\beta_j|$$

Penalty selectively shrinks some coefficients.

Can be used for feature selection.

Slower to converge than Ridge regression.
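
The sparsity difference can be checked directly by counting coefficients that are exactly zero. This is a minimal sketch assuming synthetic data in which only 5 of 20 features carry signal; `alpha=1.0` is an arbitrary choice, and scikit-learn's `alpha` plays the role of $\lambda$ here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): only 5 of the 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that were driven exactly to zero by each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients, LASSO:", n_zero_lasso)
print("zero coefficients, Ridge:", n_zero_ridge)
```

Ridge shrinks coefficients toward zero without typically reaching it, while LASSO removes some features outright, which is why it doubles as a feature selection method.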

## Between Ridge and Lasso: Elastic Net

![Elastic Net](./images/13_ElasticNet.png "Elastic Net")

$$ J(\beta_0,\beta_1)=\dfrac{1}{2m}\sum_{i=1}^m((\beta_0+\beta_1 x_{obs}^i)-y_{obs}^i)^2 + \lambda_1\sum_{j=1}^k|\beta_j| + \lambda_2\sum_{j=1}^k\beta_j^2$$


Elastic Net combines penalties from both Ridge and LASSO regression.

It requires tuning of an additional parameter that determines the emphasis of L1 vs. L2 regularization penalties.

The differences between L1 and L2 regularization:

L1 regularization penalizes the sum of absolute values of the weights, whereas L2 regularization penalizes the sum of squares of the weights. The L1 regularization solution is sparse. The L2 regularization solution is non-sparse.
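
As a hedged illustration of that extra tuning parameter: in scikit-learn's `ElasticNet`, `alpha` sets the overall penalty strength and `l1_ratio` (between 0 and 1) sets the mix, with values near 1 behaving more like LASSO and values near 0 more like Ridge. The data and the three `l1_ratio` values below are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): only 5 of the 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

zero_counts = {}
for l1_ratio in (0.1, 0.5, 0.9):
    # Same overall strength, different emphasis on the L1 vs. L2 penalty.
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X, y)
    zero_counts[l1_ratio] = int(np.sum(enet.coef_ == 0))
    print(f"l1_ratio={l1_ratio}: {zero_counts[l1_ratio]} coefficients set to zero")
```

In practice both `alpha` and `l1_ratio` would be chosen by cross-validation rather than fixed by hand.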




### Recursive Feature Elimination

Recursive Feature Elimination (RFE) is an approach that combines:

- A model or estimation approach
- A desired number of features

RFE then repeatedly applies the model, measures feature importance, and recursively removes less important features.

### Recursive Feature Elimination: The Syntax
Import the class containing the feature selection method
```python 
from sklearn.feature_selection import RFE
```
Create an instance of the class
```python
# est is any scikit-learn estimator that exposes coef_ or feature_importances_
rfeMod = RFE(est, n_features_to_select=5)
```
Fit the instance on the data and then predict the expected value
```python
rfeMod = rfeMod.fit(X_train, y_train)
y_predict = rfeMod.predict(X_test)
```
The RFECV class performs feature elimination using cross-validation to choose the number of features.
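
A minimal RFECV sketch, assuming synthetic data and a plain `LinearRegression` as the underlying estimator (both are illustrative choices, not prescribed by these notes):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): 3 of the 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Remove one feature per step; 5-fold cross-validation picks how many to keep.
selector = RFECV(LinearRegression(), step=1, cv=5)
selector.fit(X, y)

print("number of features kept:", selector.n_features_)
print("mask of kept features:", selector.support_)

# Reduce the data to the selected columns.
X_selected = selector.transform(X)
```

Unlike `RFE`, no target number of features is passed in; the count comes out of the cross-validation scores.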