<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Regularization

_Authors:_ Tim Book, Matt Brems

## Learning Objectives

1. Describe what a loss function is.
2. Define regularization.
3. Describe and differentiate LASSO and Ridge regularization.
4. Understand how regularization affects the bias-variance tradeoff.
5. Implement LASSO regression and Ridge regression.

## Review

<details><summary>What is the bias-variance tradeoff?</summary>

- Mean squared error can be decomposed into a bias component plus a variance component (plus a systematic error, but we don't have control over this part, so we often ignore it).
- The bias-variance tradeoff refers to the fact that taking steps to minimize bias usually comes at the expense of an increase in variance. Similarly, taking steps to minimize variance usually comes at the expense of an increase in bias.

</details>

---

<details><summary>What evidence/information would lead me to believe that my model suffers from high variance?</summary>
    
- After splitting my data into training and testing sets, if I see that my model performs way better on my training set than my testing set, this means that my model is not generalizing very well to "new" data.
- An example might be where our training MSE is substantially lower than our testing MSE, or where our training R-squared is substantially higher than our testing R-squared.
</details>

## Why is high variance bad?

High variance is bad because it means that our model doesn't generalize well to new data. This means that our model looks as though it performs well on our training data but won't perform as well on new, unseen data.

---
<details><summary>How might we try to fix a model that suffers from high variance?</summary>

- Gather more data. (Although this is usually expensive and time-consuming.)
- Drop features.
- Make our existing features less complex. (i.e. get rid of interaction terms or higher order terms.)
- Choose a simpler model.
- Regularization!
</details>

## Pop Math Quiz

### Problem 1
**What is the value of $b$ that minimizes...**

$$ (y - b)^2 $$

<details><summary></summary>
When $b = y$, this expression has value 0. Since it's squared, it can't go below that.
</details>

### Problem 2
**What is the value of $b$ that minimizes...**

$$ (y - b)^2 + \alpha b^2 $$

where $\alpha > 0$?

<details><summary></summary>
This is more complicated, isn't it? You can use calculus and come up with an answer:
    
$$ \hat{b} = \frac{y}{1 + \alpha} $$

But what is the effect of $\alpha$ on our solution?
</details>

## Overview of regularization

---

**Regularizing** regression models is to:
- **automatically** avoid overfitting 
- **while** we fit our model
- by adding a "penalty" to our loss function.

### Before regularziation (OLS):

$$
\begin{align}
\text{minimize: MSE} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 \\ \\
                     &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \\ \\
                     &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X\beta}\|^2
\end{align}
$$

### After regularization (Ridge):

$$
\begin{align}
\text{minimize: MSE + penalty} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 + \alpha \sum \beta_j^2 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \alpha \|\beta\|^2 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2 + \alpha \|\beta\|^2
\end{align}
$$

Adding this penalty term onto the end and then minimizing has a similar effect to the one described above. That is, **ridge regression shrinks our regression coefficients closer to zero to make our model simpler**. We are accepting more bias in exchange for decreased variance. We'll be tasked with picking the "best" $\alpha$ that optimizes this bias-variance tradeoff.

### Other Variations

| Name | Loss Function |
| --- | --- |
| OLS | MSE |
| Ridge Regression | MSE + $\alpha\|\beta\|^2_2$ |
| LASSO Regression | MSE + $\alpha\|\beta\|_1$ |
| $L_q$-Regression | MSE + $\alpha\|\beta\|^q_q$ |

### Sidenote on notation:
We'll be using $\alpha$ to denote our **regularization parameter**, since that's what Scikit-Learn uses. However, this is contrary to data science literature. It is normally denoted with a $\lambda$. Why? Only Google knows.

### [Neat parameter space visualization!](https://timothykbook.shinyapps.io/RegularizationPlot/)

## What is the effect of regularization?

---

**To demonstrate the effects of regularization, we will be using a dataset on wine quality.**

### Load the wine .csv

This version has red and white wines concatenated together and tagged with a binary 1/0 indicator (1 is red wine). There are many other variables purportedly related to the rated quality of the wine.

In [64]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')

In [65]:
# Load in the wine .csv.
wine = pd.read_csv('datasets/winequality_merged.csv')

# Convert all columns to lowercase and replace spaces in column names.
wine.columns = wine.columns.str.lower().str.replace(' ', '_')

In [66]:
# Check the first five rows.
wine.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality,red_wine
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [67]:
# How big is this dataset?
wine.shape

(6497, 13)

In [68]:
# Check for missing values.
wine.isnull().sum()

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
ph                      0
sulphates               0
alcohol                 0
quality                 0
red_wine                0
dtype: int64

In [7]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Create X and y.
X = wine.drop('quality', axis=1)
y = wine['quality']

In [10]:
X.columns

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'ph', 'sulphates', 'alcohol', 'red_wine'],
      dtype='object')

In [8]:
ss = StandardScaler()
ss.fit(X)
X_scaled = ss.transform(X)


In [9]:
X_scaled[:1, :]

array([[ 0.14247327,  2.18883292, -2.19283252, -0.7447781 ,  0.56995782,
        -1.10013986, -1.44635852,  1.03499282,  1.81308951,  0.19309677,
        -0.91546416,  1.75018984]])

In [11]:
features=['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'ph', 'sulphates', 'alcohol', 'red_wine']

In [15]:
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

In [16]:
X_overfit = poly.fit_transform(X_scaled)

In [18]:
poly.get_feature_names(X_scaled.columns)

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

In [69]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


# Create X and y.
X = wine.drop('quality', axis=1)
y = wine['quality']

# Instantiate our PolynomialFeatures object to create all two-way terms.
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit and transform our X data.
X_overfit = poly.fit_transform(X)

In [70]:
poly.get_feature_names(X.columns)

['fixed_acidity',
 'volatile_acidity',
 'citric_acid',
 'residual_sugar',
 'chlorides',
 'free_sulfur_dioxide',
 'total_sulfur_dioxide',
 'density',
 'ph',
 'sulphates',
 'alcohol',
 'red_wine',
 'fixed_acidity^2',
 'fixed_acidity volatile_acidity',
 'fixed_acidity citric_acid',
 'fixed_acidity residual_sugar',
 'fixed_acidity chlorides',
 'fixed_acidity free_sulfur_dioxide',
 'fixed_acidity total_sulfur_dioxide',
 'fixed_acidity density',
 'fixed_acidity ph',
 'fixed_acidity sulphates',
 'fixed_acidity alcohol',
 'fixed_acidity red_wine',
 'volatile_acidity^2',
 'volatile_acidity citric_acid',
 'volatile_acidity residual_sugar',
 'volatile_acidity chlorides',
 'volatile_acidity free_sulfur_dioxide',
 'volatile_acidity total_sulfur_dioxide',
 'volatile_acidity density',
 'volatile_acidity ph',
 'volatile_acidity sulphates',
 'volatile_acidity alcohol',
 'volatile_acidity red_wine',
 'citric_acid^2',
 'citric_acid residual_sugar',
 'citric_acid chlorides',
 'citric_acid free_sulfur_diox

In [71]:
# Check out the dimensions of X_overfit.
X_overfit.shape

(6497, 90)

#### Let's split our data up into training and testing sets. Why do we split our data into training and testing sets?

In [72]:
# Import train_test_split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [73]:
# Create train/test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_overfit,
    y,
    test_size=0.7,
    random_state=42
)

In [74]:
# Scale our data.
# Relabeling scaled data as "Z" is common.
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [75]:
print(f'Z_train shape is: {Z_train.shape}')
print(f'y_train shape is: {y_train.shape}')
print(f'Z_test shape is: {Z_test.shape}')
print(f'y_test shape is: {y_test.shape}')

Z_train shape is: (1949, 90)
y_train shape is: (1949,)
Z_test shape is: (4548, 90)
y_test shape is: (4548,)


## Standardizing predictors is required

Let's remind ourselves of our new loss function:

$$MSE + \alpha \|\beta\|^2$$

<details><summary>Why do you think regularization is required?</summary>
Recall that the size of each coefficient depends on the scale of its corresponding variable. Our penalty term depends on these coefficients. Scaling is required so that the regularization penalizes each variable equally fairly.
</details>

## But First: OLS

In [76]:
# Import the appropriate library and fit our OLS model.
from sklearn.linear_model import LinearRegression

In [77]:
ols = LinearRegression()
ols.fit(Z_train, y_train)

LinearRegression()

In [78]:
# How does the model score on the training and test data?
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))

0.40698195242682567
0.22021547039846145


In [30]:
ols.coef_

array([-6.25749886e+01, -3.02971610e+01, -5.01103963e+01, -1.31671753e+02,
       -6.05594065e+01, -6.69469156e+01,  1.06254306e+02,  1.17149975e+02,
       -3.63264002e+01, -5.28038087e+01,  9.10286690e+01, -2.06348431e-01,
       -7.51604769e-01, -1.51877470e-01, -2.69504397e-01, -6.37134850e-01,
       -1.46230396e+00, -5.61003144e-01,  6.66040229e-01,  6.43672195e+01,
       -1.63465852e-01, -2.26602773e-01,  4.96393198e-01,  3.61062695e-01,
       -1.51450896e-02, -5.61980374e-02, -2.76987557e-01,  3.72765182e-02,
       -8.77421185e-03,  2.51096607e-01,  2.84960544e+01,  3.51506311e-01,
       -1.52413698e-01,  1.45769693e+00,  2.63918881e-01, -1.26622223e-01,
       -2.43408218e-01,  1.68109546e-01,  8.29009668e-02,  4.50602888e-02,
        5.04419929e+01, -9.30023851e-01, -1.83270244e-01,  1.19871740e+00,
       -1.10452250e-01, -1.13190897e+00, -1.43140646e-01, -3.35884795e-01,
        6.51264578e-01,  1.37020668e+02, -4.10760519e+00, -4.42930669e-01,
        1.96024882e+00, -

(THREAD) What do these $R^2$s tell you?

## And Now: Ridge

### Let's think about this...

$$ \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\|\beta\|^2 $$

<details><summary>What's the optimal value of $\beta$ when $\alpha = 0$?</summary>
Our problem reduces to OLS, so it's the good old fashioned OLS solution! For the math nerds playing along from home, that's:
    
$$ \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$
</details>

<details><summary>What's the optimal value of $\beta$ when $\alpha = \infty$?</summary>
Anything besides $\hat{\beta} = \mathbf{0}$ will cause our whole loss function to be $\infty$. So, it must be that $\hat{\beta} = \mathbf{0}$!
</details>

<details><summary>Some facts...</summary>
$\alpha$ is a constant for the _strength_ of the regularization parameter. The higher the value, the greater the impact of this new component in the loss function. If the value was zero, we would revert back to just the least squares loss function. If the value was a billion, however, the residual sum of squares component would have a much smaller effect on the loss/cost than the regularization term.
</details>

### We can look at a traceplot to see this:

![](../imgs/ridge-trace.png)

### Ok, so which $\alpha$ is best?

We'll primarily choose the optimal $\alpha$ via **cross validation**.

In [36]:
# Ridge regressor lives here:
from sklearn.linear_model import Ridge

In [37]:
# Instantiate.
ridge_model = Ridge(alpha=10)

# Fit.
ridge_model.fit(Z_train, y_train)

# Evaluate model using R2.
print(ridge_model.score(Z_train, y_train))
print(ridge_model.score(Z_test, y_test))

0.37708346347575217
0.24423639703200628


(THREAD) What do these $R^2$s tell you?

## Brute-forcing the answer

In [39]:
from sklearn.linear_model import RidgeCV

In [40]:
# Set up a list of ridge alphas to check.
# np.logspace generates 100 values equally between 0 and 5,
# then converts them to alphas between 10^0 and 10^5.
r_alphas = np.logspace(0, 5, 100)

# Cross-validate over our list of ridge alphas.
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)

# Fit model using best ridge alpha!
ridge_cv = ridge_model.fit(Z_train, y_train)

In [41]:
# Here is the optimal value of alpha
ridge_cv.alpha

10

Our `ridge_model` object is actually already the model with the optimal $\alpha$. Let's get the corresponding value of $R^2$.

In [42]:
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.37708346347575217
0.24423639703200628


(THREAD) What do these $R^2$s tell you?

## Defining the LASSO

LASSO regression is largely the same as ridge, except with a different penalty term.

$$
\begin{align}
\text{minimize: MSE + penalty} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 + \alpha \sum |\beta_j| \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \alpha \|\beta\|_1 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2 + \alpha \|\beta\|_1
\end{align}
$$

The penalty is now made up from the **$\mathcal{l}_1$-norm**, otherwise known as **Manhattan distance**. It is simply the absolute sum of the vector components.

### The LASSO traceplot looks a little different...
But I don't want to show it to you yet! We'll see it soon and discuss what LASSO actually does differently from Ridge.

In [43]:
# Imports similar to Ridge
from sklearn.linear_model import Lasso, LassoCV

## LASSO Regression

In [44]:
# Reminders
print(" OLS ".center(18, "="))
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))
print()
print(" Ridge ".center(18, "="))
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.40698195242682567
0.22021547039846145

0.37708346347575217
0.24423639703200628


In [48]:
# Set up a list of Lasso alphas to check.
l_alphas = np.logspace(-3, 1, 100)

# Cross-validate over our list of Lasso alphas.
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=50000)

# Fit model using best ridge alpha!
lasso_cv.fit(Z_train, y_train);

In [49]:
# Here is the optimal value of alpha
lasso_cv.alpha_

0.007742636826811269

In [41]:
print(lasso_cv.score(Z_train, y_train))
print(lasso_cv.score(Z_test, y_test))

0.3366703928164061
0.28555443261565905


## Ridge vs LASSO, what's the diff?!
Let's check out the coefficients of the Lasso and Ridge models.

In [42]:
ridge_cv.coef_

array([ 0.0232576 , -0.21083501, -0.00531832,  0.06827918,  0.06703143,
        0.15182407,  0.0997072 , -0.09995024, -0.00398967,  0.02393939,
       -0.08495227, -0.03966673, -0.09405814, -0.0890898 , -0.0184676 ,
        0.46724216,  0.02233475, -0.04142989, -0.036124  ,  0.03309544,
        0.08587148,  0.08516666, -0.07044882,  0.2520272 ,  0.03247211,
       -0.01395507, -0.10934351, -0.01242831,  0.02979465,  0.01982511,
       -0.21132471, -0.18004511, -0.01112913,  0.39630623,  0.20066429,
       -0.07644422,  0.0185051 ,  0.06605118,  0.07014379, -0.00997621,
       -0.00407114, -0.06525476, -0.08554336,  0.15526435,  0.0293384 ,
        0.06240394,  0.08797771, -0.10236354,  0.08594535,  0.07878929,
       -0.3386912 , -0.02755032, -0.05000497, -0.01537785,  0.09256963,
        0.07097053, -0.04210641,  0.07177472, -0.11702032, -0.26385276,
       -0.11479456,  0.2241329 , -0.37864473, -0.08963746,  0.14850657,
        0.15819053,  0.01053663,  0.05926814, -0.22583862, -0.11

In [44]:
lasso_cv.coef_

array([-0.        , -0.        , -0.        ,  0.        , -0.        ,
        0.        ,  0.        , -0.        ,  0.        ,  0.        ,
        0.        ,  0.        , -0.        , -0.        , -0.        ,
        0.16835564,  0.        ,  0.        , -0.        , -0.        ,
       -0.        ,  0.        , -0.        ,  0.06311839,  0.        ,
       -0.        , -0.05120041, -0.        , -0.        , -0.07081332,
       -0.16039109, -0.        , -0.        , -0.        ,  0.00492753,
       -0.01006693,  0.        ,  0.        ,  0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        ,  0.01658083,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.        , -0.        , -0.        ,  0.        ,  0.        ,
        0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        ,  0.00238066, -0.28613203, -0.        ,  0.        ,
        0.13542067,  0.        ,  0.25586092, -0.06659136, -0.  

## Cliffsnotes: L.A.S.S.O.
LASSO is actually an acronym:

* **L**east
* **A**bsolute
* **S**hrinkage and
* **S**election
* **O**perator

**SHRINKAGE**: Higher $\alpha$ "shrinks" $\beta$ towards $\mathbf{0}$.

**SELECTION**: Higher $\alpha$ zeros out small $\beta$s.

![](../imgs/lasso-trace.svg)

## So, um, what was LASSO doing here?
If you're an ultra math nerd, you might have noticed something fishy about our "penalty parameter" $\alpha$. We're doing an optimization problem, so actually, this $\alpha$ is a **Lagrange multiplier**. This means that optimizing our loss function:

$$ \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\|\beta\|_1 $$

is equivalent to optimizing the **constrained loss function**:

$$ \|\mathbf{y} - \mathbf{X}\beta\|^2 \quad \text{such that} \quad \|\beta\|_1 \le t $$

## [BRING IN THE APP!](https://timothykbook.shinyapps.io/RegularizationPlot/)

# Regularizing Logistic Regression: You've been doing it all along!

In [50]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### Let's take a look at the LogisticRegression documentation:

In [51]:
LogisticRegression().get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

### Regularization is the hidden default for logistic regression. What a pain!
Unless regularization is necessary, **it should not be done!!** (It makes interpreting the coefficients much more difficult.) In newer version of Scikit-Learn, you can finally turn this feature off!

If you _do_ want to regularize, note that there is a much friendlier `LogisticRegressionCV` we will use.

In [52]:
LogisticRegression(penalty='none')

LogisticRegression(penalty='none')

In [53]:
from sklearn.datasets import make_classification

In [54]:
X, y = make_classification(
    n_samples=1000,
    n_features=200,
    n_informative=15,
    random_state=123
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [55]:
logreg = LogisticRegression(C=1e9, solver='lbfgs')
logreg.fit(X_train_sc, y_train)

# Overfit!
print(logreg.score(X_train_sc, y_train))
print(logreg.score(X_test_sc, y_test))

0.9306666666666666
0.636


In [53]:
logreg_cv = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear")
logreg_cv.fit(X_train_sc, y_train)

print(logreg_cv.score(X_train_sc, y_train))
print(logreg_cv.score(X_test_sc, y_test))

0.8146666666666667
0.812


In [55]:
logreg_cv.C_

array([0.04641589])

# Elephant in the Room: Categorical Variables
Think about it. What does it mean to scale a binary variable? How about a categorical variable dummified into several columns? What does it mean to shrink the coefficients associated with these columns? What happens if the LASSO zeros out one category, but not others? I don't know, either.

It turns out, it's not a great idea to combine scaling and categorical data. It often just doesn't make sense to do. This is true for all algorithms where we need to scale, including kNN. So what do we do? A few options:

* Set separate regularization parameters for each x-variable (not available in Scikit-Learn).
* Carry out the _grouped LASSO_ technique (not available in Scikit-Learn, and doesn't solve all problems anyway).
* Manually decide on a scale for these variables (time consuming, unintuitive, still doesn't work with regularization).
* Don't use those variables (but you want them!).
* Just do it anyway. Who knows, it'll probably be fine! (¯\_(ツ)_/¯)

## Important Notes
- The $\alpha$ hyperparameter for regularization is **unrelated** to significance level in hypothesis testing.
- In certain resources, including [ISLR](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), you'll see that $\lambda$ is used instead of $\alpha$ for regularization strength.
- We must standardize before regularizing, but regularization and standardization are not the same things!
- **FROM NOW ON, YOU MUST PAY ATTENTION TO REGULARIZATION WHEN CONDUCTING LOGISTIC REGRESSION!!!**
- Ridge regression is sometimes called **weight decay**, but usually only when regularizing neural networks.
- LASSO regression is sometimes called **basis pursuit**, but that's very old fashioned.
- The y-intercept for these models are not regularized.

## Recap
- Regularization is used when evidence suggests our model is suffering from high error due to variance.
- Evidence to suggest our model suffers from high error due to variance includes substantially better performance on our training set than our testing set.
- LASSO tends to be "more brutal" than Ridge regularization in that it will zero out coefficients.
- If you want to combine LASSO and Ridge regularization, there is a technique called "ElasticNet" that does exactly this.

## ElasticNet Regression (bonus)

---

Can't decide?

![](../imgs/why-not-both.jpg)

The Elastic Net combines the Ridge and Lasso penalties.  It adds *both* penalties to the loss function:

$$
\begin{eqnarray}
SSE + Ridge + Lasso &=& \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2 + \alpha\left[\rho\sum_{j=1}^p |\beta_j| + (1-\rho)\sum_{j=1}^p \beta_j^2\right] \\
&=& \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\left(\rho\|\beta\|_1 + (1 - \rho)\|\beta\|^2\right)
\end{eqnarray}
$$


In the elastic net, the effect of the ridge versus the lasso is balanced by the $\rho$ parameter.  It is the ratio of Lasso penalty to Ridge penalty and must be between zero and one.

`ElasticNet` in sklearn has two parameters:
- `alpha`: the regularization strength.
- `l1_ratio`: the amount of L1 vs L2 penalty (i.e., $\rho$). An l1_ratio of 0 is equivalent to the Ridge, whereas an l1_ratio of 1 is equivalent to the Lasso.


In [56]:
from sklearn.linear_model import ElasticNet

Calculate the coefficients with both alpha values and an l1_ratio of 0.05. Lasso can "overpower" the Ridge penalty in some datasets, and so rather than an equal balance I'm just adding a little bit of Lasso in.
- Using a $\rho$ value below 0.05 can empirically cause issues in `sklearn`.

In [57]:
from sklearn.linear_model import ElasticNetCV

In [58]:
# Set up a list of alphas to check.
enet_alphas = np.linspace(0.5, 1.0, 100)

# Set up our l1 ratio. (What does this do?)
enet_ratio = 0.5

# Instantiate model.
enet_model = ElasticNetCV(alphas=enet_alphas, l1_ratio=enet_ratio, cv=5)

# Fit model using optimal alpha.
enet_model = enet_model.fit(X_train, y_train)

# Generate predictions.
enet_model_preds = enet_model.predict(X_test)
enet_model_preds_train = enet_model.predict(X_train)

# Evaluate model.
print(enet_model.score(X_train, y_train))
print(enet_model.score(X_test, y_test))

0.21246113948141965
0.21368840628470098


In [59]:
# Here is the optimal value of alpha.
enet_model.alpha_

0.5