# Regularization

## Overfitting 

**What is overfitting?**

- Building a model that matches the training data "too closely"
- Learning from the noise in the data, rather than just the signal

**How does overfitting occur?**

- Evaluating a model by testing it on the same data that was used to train it
- Creating a model that is "too complex"

**What is the impact of overfitting?**

- Model will do well on the training data, but won't generalize to out-of-sample data
- Model will have low bias, but high variance
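
The low-bias, high-variance pattern can be sketched on synthetic data. The snippet below is a minimal illustration, not part of the crime analysis: the noisy sine curve, the sample sizes, and the polynomial degrees are all assumptions chosen for the demo. It fits polynomials of increasing degree and reports training and testing RMSE for each.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_rows):
    """Synthetic data (an assumption for illustration): a noisy sine curve."""
    x = rng.uniform(0, 1, n_rows)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_rows)
    return x, y

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    # least-squares polynomial fit of the given degree
    coefs = np.polyfit(x_train, y_train, degree)
    train_rmse = rmse(y_train, np.polyval(coefs, x_train))
    test_rmse = rmse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_rmse, test_rmse)
    print(degree, train_rmse, test_rmse)
```

Training RMSE can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The testing RMSE is the number to watch: a complex model typically does worse there even though it matches the training data more closely.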

## Overfitting with linear models

**What are the general characteristics of linear models?**

- Low model complexity
- High bias, low variance
- Does not tend to overfit

Nevertheless, **overfitting can still occur** with linear models if you allow them to have **high variance**. Here are some common causes:

### Cause 1: Irrelevant features

Linear models can overfit if you include "irrelevant features", meaning features that are unrelated to the response. Why?

Because the model learns a coefficient for every feature you include, regardless of whether that feature carries **signal** or only **noise**.

This is especially a problem when **p (number of features) is close to n (number of observations)**, because that model will naturally have high variance.
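
A rough way to see this effect is to add pure-noise columns to a small synthetic problem and compare training and testing error. Everything in this sketch (the data-generating process, the row counts, the helper `make_xy`) is hypothetical and only meant to illustrate the point about p approaching n.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_xy(n_rows, n_noise):
    """One informative feature plus n_noise features unrelated to y."""
    signal = rng.normal(size=n_rows)
    noise = rng.normal(size=(n_rows, n_noise))
    y = 2.0 * signal + rng.normal(size=n_rows)
    # columns: intercept, informative feature, irrelevant features
    X = np.column_stack([np.ones(n_rows), signal, noise])
    return X, y

def rmse(X, y, beta):
    return float(np.sqrt(np.mean((y - X @ beta) ** 2)))

results = {}
for n_noise in (0, 10, 30):
    X_tr, y_tr = make_xy(40, n_noise)      # n = 40 training rows
    X_te, y_te = make_xy(1000, n_noise)    # large test set from the same process
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # ordinary least squares
    results[n_noise] = (rmse(X_tr, y_tr, beta), rmse(X_te, y_te, beta))
    print(n_noise, results[n_noise])
```

With 30 noise features and 40 rows, the model has almost as many coefficients as observations, so it can fit the training noise closely while predicting new rows poorly.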

### Cause 2: Correlated features

Linear models can overfit if the included features are highly correlated with one another. Why?

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares):

> "...coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance."
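
The sensitivity described in that quote can be checked numerically. The sketch below is an illustration on synthetic data, with all names and settings chosen for the demo. It refits ordinary least squares many times with fresh noise in the response and measures how much the first coefficient moves, once with nearly independent features and once with nearly identical ones.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows = 100

def coef_spread(extra_noise, n_repeats=200):
    """Condition number of X and std. dev. of the first OLS coefficient
    across repeated draws of the response noise."""
    x1 = rng.normal(size=n_rows)
    # small extra_noise makes x2 almost a copy of x1 (highly correlated)
    x2 = x1 + extra_noise * rng.normal(size=n_rows)
    X = np.column_stack([x1, x2])
    estimates = []
    for _ in range(n_repeats):
        y = x1 + x2 + rng.normal(size=n_rows)   # true coefficients are (1, 1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta[0])
    return float(np.linalg.cond(X)), float(np.std(estimates))

cond_indep, spread_indep = coef_spread(extra_noise=10.0)   # weakly correlated
cond_corr, spread_corr = coef_spread(extra_noise=0.01)     # nearly collinear
print(cond_indep, spread_indep)
print(cond_corr, spread_corr)
```

The nearly collinear design has a much larger condition number, and its coefficient estimates swing far more from one noisy response to the next.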

### Cause 3: Large coefficients

Linear models can overfit if the coefficients (after feature standardization) are too large. Why?

Because the **larger** the absolute value of the coefficient, the more **power** it has to change the predicted response, resulting in a higher variance.

## Regularization of linear models

- Regularization is a method for "constraining" or "regularizing" the **size of the coefficients**, thus "shrinking" them towards zero.
- It lowers model variance and thus **reduces overfitting**.
- If the model is too complex, regularization tends to reduce variance more than it increases bias, resulting in a model that is **more likely to generalize**.

Our goal is to locate the **optimum model complexity**, and thus regularization is useful when we believe our model is too complex.

<img src="Images/bias_variance.png" width="50%">

### How does regularization work?

For a normal linear regression model, we estimate the coefficients using the least squares criterion, which **minimizes the residual sum of squares (RSS):**

<img src="Images/estimating_coefficients.png" width="80%">


For a regularized linear regression model, we **minimize the sum of RSS and a "penalty term"** that penalizes coefficient size.

**Ridge regression** (or "L2 regularization") minimizes: $$\text{RSS} + \alpha \sum_{j=1}^p \beta_j^2$$

**Lasso regression** (or "L1 regularization") minimizes: $$\text{RSS} + \alpha \sum_{j=1}^p |\beta_j|$$

- $p$ is the **number of features**
- $\beta_j$ is a **model coefficient**
- $\alpha$ is a **tuning parameter:**
    - An $\alpha$ of zero imposes no penalty on the coefficient size, and is equivalent to a normal linear regression model. A tiny $\alpha$ imposes almost none.
    - Increasing the $\alpha$ penalizes the coefficients and thus shrinks them.
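
The two objectives translate directly into code. The following is a minimal NumPy sketch, assuming no intercept term and using synthetic data; the function names are made up for illustration. For ridge, the objective has a well-known closed-form minimizer, $(X^T X + \alpha I)^{-1} X^T y$, which makes the effect of $\alpha$ easy to inspect.

```python
import numpy as np

def rss(X, y, beta):
    residuals = y - X @ beta
    return float(residuals @ residuals)

def ridge_objective(X, y, beta, alpha):
    # RSS + alpha * sum of squared coefficients
    return rss(X, y, beta) + alpha * float(np.sum(beta ** 2))

def lasso_objective(X, y, beta, alpha):
    # RSS + alpha * sum of absolute coefficients
    return rss(X, y, beta) + alpha * float(np.sum(np.abs(beta)))

def ridge_fit(X, y, alpha):
    """Closed-form minimizer of the ridge objective (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=50)

norms = []
for alpha in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_fit(X, y, alpha)
    norms.append(float(np.linalg.norm(beta)))
    print(alpha, np.round(beta, 3), round(norms[-1], 3))
```

With `alpha=0.0` the result matches ordinary least squares, and the length of the coefficient vector falls as `alpha` grows. The lasso objective has no closed-form minimizer in general, which is why only its value is computed here.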

### Lasso and ridge regularization

A larger alpha results in more regularization:

- **Lasso regression** can shrink coefficients all the way to zero, thus removing those features from the model
- **Ridge regression** shrinks coefficients toward zero, but they rarely reach zero

Source code for the diagrams: [Lasso regression](http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars.html) and [Ridge regression](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html)
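
The difference between the two behaviors is easiest to see in a simplified special case. Assuming the features are orthonormal (an assumption that rarely holds exactly in practice), minimizing the penalized objectives defined earlier reduces to a per-coefficient rule applied to the least-squares coefficient `b`: ridge rescales it, while lasso subtracts a fixed amount and clips at zero. The coefficient values below are hypothetical.

```python
def ridge_shrink(b, alpha):
    """Ridge solution for one coefficient under an orthonormal design."""
    return b / (1.0 + alpha)

def lasso_shrink(b, alpha):
    """Lasso solution for one coefficient under an orthonormal design
    (soft-thresholding at alpha / 2)."""
    magnitude = max(abs(b) - alpha / 2.0, 0.0)
    return magnitude if b >= 0 else -magnitude

ols_coefs = [3.0, -1.5, 0.4, -0.1]   # hypothetical least-squares coefficients
for alpha in (0.5, 1.0, 4.0):
    ridge = [round(ridge_shrink(b, alpha), 3) for b in ols_coefs]
    lasso = [round(lasso_shrink(b, alpha), 3) for b in ols_coefs]
    print(alpha, ridge, lasso)
```

Ridge divides every coefficient by the same factor, so a nonzero coefficient stays nonzero. Lasso sets any coefficient whose magnitude is at most `alpha / 2` exactly to zero.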

## Advice for applying regularization

**How should you choose between Lasso regression and Ridge regression?**

- Lasso regression is preferred if we believe many features are irrelevant or if we prefer a sparse model.
- If model performance is your primary concern, it is best to try both.
- ElasticNet regression is a combination of lasso regression and ridge regression.
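
In the notation used for the ridge and lasso objectives, one generic way to write the combined penalty gives each term its own weight. The symbols $\alpha_1$ and $\alpha_2$ are introduced here only for illustration; scikit-learn expresses the same idea with a different parametrization (`alpha` and `l1_ratio`).

$$\text{RSS} + \alpha_1 \sum_{j=1}^p |\beta_j| + \alpha_2 \sum_{j=1}^p \beta_j^2$$

Setting $\alpha_2 = 0$ recovers the lasso objective, and setting $\alpha_1 = 0$ recovers the ridge objective.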

## Regularized regression in scikit-learn

- Communities and Crime dataset from the UCI Machine Learning Repository: [data](http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data), [data dictionary](http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime)
- **Goal:** Predict the violent crime rate for a community given socioeconomic and law enforcement data

### Load and prepare the crime dataset

In [1]:
# read in the dataset
import pandas as pd
url = './Datasets/communities.data.txt'
crime = pd.read_csv(url, header=None, na_values=['?'])
crime.head()

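|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | NaN | NaN | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.9 | 0.12 | ... | 0.12 | 0.26 | 0.2 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.2 |
| 1 | 53 | NaN | NaN | Tukwilacity | 1 | 0.0 | 0.16 | 0.12 | 0.74 | 0.45 | ... | 0.02 | 0.12 | 0.45 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.67 |
| 2 | 24 | NaN | NaN | Aberdeentown | 1 | 0.0 | 0.42 | 0.49 | 0.56 | 0.17 | ... | 0.01 | 0.21 | 0.02 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.43 |
| 3 | 34 | 5.0 | 81440.0 | Willingborotownship | 1 | 0.04 | 0.77 | 1.0 | 0.08 | 0.12 | ... | 0.02 | 0.39 | 0.28 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.12 |
| 4 | 42 | 95.0 | 6096.0 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | ... | 0.04 | 0.09 | 0.02 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.03 |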


In [2]:
# examine the response variable
crime[127].describe()

count    1994.000000
mean        0.237979
std         0.232985
min         0.000000
25%         0.070000
50%         0.150000
75%         0.330000
max         1.000000
Name: 127, dtype: float64

In [3]:
# remove categorical features
crime.drop([0, 1, 2, 3, 4], axis=1, inplace=True)

In [4]:
# remove rows with any missing values
crime.dropna(inplace=True)

In [5]:
# check the shape
crime.shape

(319, 123)

In [6]:
# define X and y
X = crime.drop(127, axis=1)
y = crime[127]

In [7]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### Linear regression

In [8]:
# build a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
# examine the coefficients
print(linreg.coef_)

[ -3.66188167e+00   6.98124465e-01  -2.61955467e-01  -2.85270027e-01
  -1.64740837e-01   2.46972333e-01  -1.09290051e+00  -5.96857796e-01
   1.11200239e+00  -7.21968931e-01   4.27346598e+00  -2.28040268e-01
   8.04875769e-01  -2.57934732e-01  -2.63458023e-01  -1.04616958e+00
   6.07784197e-01   7.73552561e-01   5.96468029e-02   6.90215922e-01
   2.16759430e-02  -4.87802949e-01  -5.18858404e-01   1.39478815e-01
  -1.24417942e-01   3.15003821e-01  -1.52633736e-01  -9.65003927e-01
   1.17142163e+00  -3.08546690e-02  -9.29085548e-01   1.24654586e-01
   1.98104506e-01   7.30804821e-01  -1.77337294e-01   8.32927588e-02
   3.46045601e-01   5.01837338e-01   1.57062958e+00  -4.13478807e-01
   1.39350802e+00  -3.49428114e+00   7.09577818e-01  -8.32141352e-01
  -1.39984927e+00   1.02482840e+00   2.13855006e-01  -6.18937325e-01
   5.28954490e-01   7.98294890e-02   5.93688560e-02  -1.68582667e-01
   7.31264051e-01  -1.39635208e+00   2.38507704e-01   5.50621439e-01
  -5.61447867e-01   6.18989764e-01

In [10]:
# make predictions
y_pred = linreg.predict(X_test)

In [11]:
# calculate RMSE
from sklearn import metrics
import numpy as np
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.233813676495
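
For reference, the RMSE computed by `np.sqrt(metrics.mean_squared_error(...))` is the square root of the average squared prediction error. A dependency-free sketch, with hypothetical values chosen so the arithmetic can be checked by hand:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of paired observations and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Only the last prediction is off (by 4), so the mean squared error is 16 / 4 = 4.
print(rmse([1, 2, 3, 4], [1, 2, 3, 8]))  # prints 2.0
```

Because the errors are squared before averaging, one large miss affects the RMSE more than several small ones.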


### Ridge regression

- [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) documentation
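- **alpha:** must be non-negative (0 means no regularization), increase for more regularization
- **normalize:** centers each feature and divides it by its L2 norm before fitting, so no separate StandardScaler step is needed (this parameter was deprecated in scikit-learn 1.0 and removed in 1.2)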
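
On scikit-learn versions that no longer accept `normalize`, a comparable setup is to scale the features explicitly in a pipeline. The sketch below uses synthetic data and hypothetical variable names. It is not numerically identical to `normalize=True`, because `StandardScaler` divides by the standard deviation rather than the L2 norm, so `alpha` values tuned for one approach do not carry over to the other.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# four features on very different scales (synthetic, for illustration only)
X_demo = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(size=100)

# the scaler is fit on the training data only, then applied before the ridge step
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_pipe.fit(X_demo, y_demo)

# coefficients are expressed on the standardized feature scale
print(ridge_pipe.named_steps['ridge'].coef_)
```

Keeping the scaler inside the pipeline means that calling `predict` on new data applies the same scaling learned during `fit`.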

In [12]:
# alpha=0 is equivalent to linear regression
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=0, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.233813676495


In [17]:
# try alpha=0.1
ridgereg = Ridge(alpha=0.1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.164279068049


In [18]:
# examine the coefficients
print(ridgereg.coef_)

[ -4.00298418e-03   3.51647445e-02   6.03535935e-02  -7.68532502e-02
  -1.76099849e-02   4.53791433e-02   8.81586468e-03  -2.88885814e-02
  -1.92143587e-02   3.36122201e-02   5.71590736e-04  -4.85438136e-02
   5.55725157e-02  -1.15934270e-01  -1.11880845e-01  -3.32742094e-01
  -1.12302031e-02   9.63833243e-02  -8.92057732e-02   8.42691702e-02
  -1.67246717e-02   7.42520308e-03  -1.21294025e-01  -6.70155789e-02
  -1.74250249e-03   1.69446833e-01   3.18217654e-02  -1.00209834e-01
   3.97535644e-02  -1.19173054e-01  -1.04445267e-01  -5.14946676e-03
   1.10071013e-01  -3.22958955e-02  -1.40601627e-01   7.72658029e-02
   9.07962536e-02  -3.78878862e-03   4.61941793e-02   6.30299731e-02
  -3.09236932e-02   1.02883578e-02   9.70425568e-02  -1.28936944e-01
  -1.38268907e-01  -6.37169778e-02  -8.80160419e-02  -4.01991014e-02
   8.11064596e-02  -6.30663975e-02   1.29756859e-01  -6.25210624e-02
   1.60531213e-02  -1.39061824e-01   6.39822353e-02   4.87118744e-02
  -7.68217532e-03  -1.53523412e-03

- [RidgeCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html): ridge regression with built-in cross-validation of the alpha parameter
- **alphas:** array of alpha values to try

In [19]:
# create an array of alpha values
alpha_range = 10.**np.arange(-2, 3)
alpha_range

array([  1.00000000e-02,   1.00000000e-01,   1.00000000e+00,
         1.00000000e+01,   1.00000000e+02])
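
The grid above is spaced by powers of ten, which is a common choice when the right order of magnitude for alpha is unknown. As a small aside, `np.logspace` builds the same grid from its endpoints' exponents:

```python
import numpy as np

alpha_range = 10.0 ** np.arange(-2, 3)
alpha_logspace = np.logspace(-2, 2, num=5)   # 10**-2 up to 10**2, five values

print(alpha_logspace)
print(np.allclose(alpha_range, alpha_logspace))  # prints True
```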

In [20]:
# select the best alpha with RidgeCV
from sklearn.linear_model import RidgeCV
ridgeregcv = RidgeCV(alphas=alpha_range, normalize=True, scoring='neg_mean_squared_error')
ridgeregcv.fit(X_train, y_train)
ridgeregcv.alpha_

1.0

In [21]:
# predict method uses the best alpha value
y_pred = ridgeregcv.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.163129782343
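
To make the idea of cross-validating alpha concrete, here is a hand-rolled sketch using K-fold splits and the closed-form ridge solution. It is an illustration on synthetic data, not a description of how `RidgeCV` is implemented internally, and the helper names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients for centered data (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5):
    """Mean squared error of ridge(alpha), averaged over held-out folds."""
    all_rows = np.arange(len(y))
    fold_errors = []
    for held_out in np.array_split(all_rows, n_folds):
        train = np.setdiff1d(all_rows, held_out)
        # center with training-fold statistics only, so held-out rows do not leak in
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        beta = ridge_fit(X[train] - x_mean, y[train] - y_mean, alpha)
        pred = (X[held_out] - x_mean) @ beta + y_mean
        fold_errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 40))   # many features relative to rows
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=60)

alpha_grid = 10.0 ** np.arange(-2, 3)
scores = {float(a): cv_mse(X_demo, y_demo, a) for a in alpha_grid}
best_alpha = min(scores, key=scores.get)   # alpha with the lowest held-out error
print(best_alpha, scores[best_alpha])
```

The selected alpha is the one with the lowest average error on rows the model did not see during fitting, which is the same criterion the built-in cross-validated estimators optimize.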


### Lasso regression

- [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) documentation
- **alpha:** must be positive, increase for more regularization
- **normalize:** scales the features (without using StandardScaler)

In [22]:
# try alpha=0.001 and examine coefficients
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.001, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)

[ 0.          0.          0.00891952 -0.27423369  0.          0.          0.
 -0.         -0.          0.          0.          0.         -0.         -0.
 -0.         -0.19414627  0.          0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.          0.          0.          0.
  0.04335664 -0.          0.         -0.          0.03491474 -0.
 -0.06685424  0.          0.         -0.          0.10575313  0.          0.
  0.00890807  0.         -0.1378172  -0.30954312 -0.         -0.         -0.
 -0.          0.          0.          0.          0.         -0.          0.
  0.          0.          0.          0.          0.         -0.          0.
  0.          0.         -0.          0.         -0.         -0.          0.
  0.05257892 -0.          0.         -0.         -0.          0.          0.
  0.          0.          0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.          0.         -0.         -0.          0.
  0.1386108

In [23]:
# try alpha=0.01 and examine coefficients
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)

[ 0.          0.          0.         -0.03974695  0.          0.          0.
  0.          0.         -0.          0.          0.         -0.         -0.
 -0.         -0.         -0.          0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.          0.
  0.          0.          0.         -0.          0.         -0.         -0.
  0.          0.         -0.          0.          0.          0.          0.
  0.         -0.         -0.27503063 -0.         -0.         -0.         -0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -0.          0.          0.
  0.          0.          0.          0.         -0.          0.          0.
 -0.          0.         -0.         -0.          0.          0.         -0.
  0.          0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.          0.          0.         -0.          0.          0.

In [24]:
# calculate RMSE (for alpha=0.01)
y_pred = lassoreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.198165225429
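
Since lasso coefficient arrays are mostly zeros, it is often more useful to count and locate the surviving features than to read the printed array. A small sketch on a hypothetical coefficient vector (the same calls would work on `lassoreg.coef_`):

```python
import numpy as np

# Hypothetical coefficients shaped like a lasso fit: mostly exact zeros.
coefs = np.array([0.0, -0.04, 0.0, 0.0, -0.275, 0.0, -0.0, 0.0])

n_selected = int(np.count_nonzero(coefs))   # number of features kept by the model
selected_idx = np.flatnonzero(coefs)        # positions of the nonzero coefficients
print(n_selected, selected_idx)             # prints: 2 [1 4]
```

Note that `-0.` in the printed output is still zero, so those features are excluded as well.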


- [LassoCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html): lasso regression with built-in cross-validation of the alpha parameter
- **n_alphas:** number of alpha values (automatically chosen) to try

In [25]:
# select the best alpha with LassoCV
from sklearn.linear_model import LassoCV
lassoregcv = LassoCV(n_alphas=100, normalize=True, random_state=1)
lassoregcv.fit(X_train, y_train)
lassoregcv.alpha_



0.0015161594598125873

In [26]:
# examine the coefficients
print(lassoregcv.coef_)

[ 0.          0.          0.         -0.28113506  0.          0.          0.
  0.          0.          0.          0.          0.         -0.         -0.
 -0.         -0.15481092  0.          0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.          0.         -0.          0.
  0.06451487  0.          0.         -0.          0.         -0.
 -0.01920421  0.          0.         -0.          0.03386202  0.          0.
  0.08901243  0.         -0.08759757 -0.36986917 -0.         -0.         -0.
 -0.          0.          0.          0.          0.         -0.          0.
  0.          0.          0.          0.          0.         -0.          0.
  0.          0.         -0.          0.          0.         -0.          0.
  0.01740599 -0.          0.         -0.         -0.          0.          0.
  0.          0.          0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.          0.         -0.         -0.          0.
  0.1347103

In [27]:
# predict method uses the best alpha value
y_pred = lassoregcv.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.160209558014


## Elastic Net