# Cross Validation and Regularized Regression

### Jack Bennetto

## Objectives

By the end of the day you should be able to

 * Describe the three kinds of model error.
 * State the purpose of cross validation
 * Explain k-fold cross validation
 * Explain the training, validation, testing data sets
 * State the purpose of Lasso and Ridge regression
 * Choose the regularization hyperparameter with cross validation

## Agenda

This afternoon we will talk about

* Bias and Variance
* Train-test split
* K-fold cross validation
* Ridge
* LASSO

In [None]:
import numpy as np
import pandas as pd  # needed below for pd.read_csv
import scipy.stats as scs
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

## Cross Validation

### Purpose

Let's plot some data.

In [None]:
rs = 8          # random seed (reused for x and for the noise)
npts = 6
b0 = 2          # true intercept
b1 = 0.5        # true slope
x = scs.uniform(0, 10).rvs(npts, random_state=rs)
y = b0 + b1 * x + scs.norm(0, 1).rvs(npts, random_state=rs)

fig, ax = plt.subplots()
ax.plot(x, y, 'bo', label='some data')
ax.set_ylim((0, 9))
ax.legend(loc='upper left')

Let's do linear regression!

In [None]:
x_col = x[:, None] # convert to column vector, to fit with sklearn
xpts = np.linspace(0,10)[:, None] # points for plotting

model1 = LinearRegression()
model1.fit(x_col, y)
yhat = model1.predict(xpts)
ax.plot(xpts, yhat, 'r:', label="linear fit")
ax.legend(loc='upper left')
fig

I don't know, that looks ok I guess, but it kind of looks quadratic. We can create a pipeline with `PolynomialFeatures`.

In [None]:
model2 = Pipeline([
        ('pf', PolynomialFeatures(2)),
        ('lr', LinearRegression())
        ])
model2.fit(x_col, y)
yhat = model2.predict(xpts)
ax.plot(xpts, yhat, 'g:', label="quadratic fit")
ax.legend(loc='upper left')
fig

That's better, but let's try a higher-order polynomial.

In [None]:
model3 = Pipeline([
        ('pf', PolynomialFeatures(4)),
        ('lr', LinearRegression())
        ])
model3.fit(x_col, y)
yhat = model3.predict(xpts)
ax.plot(xpts, yhat, 'k:', label='quartic fit')
ax.legend(loc='upper left')
fig

That looks great! We did a great job!

So how did we actually generate these points?

In [None]:
y_actual = b0 + b1 * xpts
ax.plot(xpts, y_actual, 'b:', label='actual function used to generate data')
ax.legend(loc='upper left')
fig

So we went off the rails there.

First, what we did is called **overfitting**: we fit the specific available data in a way that doesn't generalize to other data. This happens pretty often, whenever we have a very complicated model with many independent parameters. Making our model too complicated is bad.

The opposite, called **underfitting**, is bad too. Suppose we'd just used the mean of the $y$ values to estimate $\hat y$ for all the points. That's too simple of a model.

### Bias and Variance

We'll come back to these again and again and again through the course. The error of a model can be divided into three components.

1. **Irreducible error** is the error inherent in any value. Even if we had all possible data and could build a perfect model, we can't predict values exactly because there's error at each data point.
2. **Bias** is due to the failure of the model to match our training sample. It's easy to get rid of bias with a complicated model that predicts all the data in our sample exactly.
3. **Variance** is the error from the differences between our training sample and the larger population. If we had access to the entire population of data, we would have no variance.

In general, there is a tradeoff between bias and variance. A complex model might have very low bias, but will be highly dependent on the sample taken, so it will have high variance. A simple model might have higher bias, because it underfits, but lower variance, predicting other data nearly as well as the training sample.
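Here's a minimal simulation sketch of that tradeoff. It assumes the usual textbook decomposition (bias measured against the true function, averaged over many resampled training sets), which is a bit more formal than the description above. The true line, the noise level, the sample size, and the helper name `bias_variance` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1 = 2.0, 0.5                  # assumed true intercept and slope
x_eval = np.linspace(1, 9, 20)     # fixed points where we compare predictions
f_true = b0 + b1 * x_eval

def bias_variance(degree, n_train=10, n_repeats=500):
    """Estimate squared bias and variance of a polynomial fit by resampling."""
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # draw a fresh training sample and fit a polynomial to it
        x = rng.uniform(0, 10, n_train)
        y = b0 + b1 * x + rng.normal(0, 1, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x_eval)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (0, 1, 4)}
for d, (b, v) in results.items():
    print("degree {}: bias^2 = {:.3f}, variance = {:.3f}".format(d, b, v))
```

The constant model (degree 0) should show the largest bias, and the quartic should show the largest variance.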

Some models have **hyperparameters** that can be tuned. Most represent that tradeoff: moving them in one direction will lower the bias and raise the variance; moving them in the other will do the opposite.

Ok, so how do we tell which model is the best?

### CROSS VALIDATION!

The basic concept behind cross validation is that the data on which we train our model can't accurately assess its effectiveness. That's due to overfitting: no matter how hard we try to generalize the model, it's always based more on the data we used than the data we didn't.

Cross validation really has two separate purposes.

First, it is used for **model comparison**. Over the coming week we'll learn a bunch of different models, and we need to evaluate which will do best for our data. In addition, many of these models have hyperparameters, and we need cross validation to choose the appropriate values. 

Second, it's used to **evaluate your model**. Part of CRISP-DM is evaluation; you (usually) need to know how well your model will predict real-world results. There are many ways to measure that, like AUC/ROC, F-score, or some combination of precision and recall or sensitivity and specificity, depending on your specific business case, but in the end the key thing is that you can't measure it on your training data.*


\* *You can measure on your training data in some circumstances, either because your statistical measure allows some estimation of the error, or you have an ensemble model where different submodels see different data (out-of-bag error). But those aren't as general.*

## The train-test split

The simplest approach we can use is the train-test split. You probably shouldn't call this cross validation, just say "train-test split" or "hold-out validation."
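To make the idea concrete, here's a rough sketch of what a hold-out split does under the hood: shuffle the row indices and cut them in two. The helper `holdout_split` and the toy arrays are made up for illustration; in the rest of this notebook we use `train_test_split` from `sklearn`, whose default also holds out 25% of the rows.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.25, seed=None):
    """Shuffle row indices, then cut them into a training and a testing part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(-1, 1)   # made-up feature column
y_demo = 2 * X_demo[:, 0] + 1           # made-up target
X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo, seed=0)
print(len(X_tr), len(X_te))
```

Every row lands in exactly one of the two sets, and each `y` value stays attached to its own row.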

Let's start with the mtcars dataset that you've seen before in the interview.

In [None]:
cars = pd.read_csv('cars.csv')

There are a few rows without data for horsepower. We're just going to throw those away for now without worrying too much about whether that's ok.

In [None]:
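# drop rows with missing horsepower; .copy() avoids pandas' SettingWithCopyWarning
cars = cars[cars.horsepower != '?'].copy()
# plain float (float64) is portable; float128 isn't available on every platform
cars['horsepower'] = cars.horsepower.astype(float)
cars['mpg'] = cars.mpg.astype(float)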

In [None]:
fig, ax = plt.subplots()
ax.plot(cars.mpg, cars.horsepower, '.')
ax.set_xlabel("mpg")
ax.set_ylabel("horsepower")

In [None]:
cars.info()

In [None]:
X = cars[['mpg']]
y = cars.horsepower

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
def mean_squared_error(model, X, y):
    return np.mean((model.predict(X) - y) **2)

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on training data: {}".format(model.score(X_train, y_train)))
print("R^2 on testing data:  {}".format(model.score(X_test, y_test)))
                                             

Ok, we did a bit better on the training data, as expected...or did we?

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on training data: {}".format(model.score(X_train, y_train)))
print("R^2 on testing data:  {}".format(model.score(X_test, y_test)))

Apparently it's pretty sensitive to the random split. Let's explore more.

In [None]:
train_score = []
test_score = []
model = LinearRegression()

for _ in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model.fit(X_train, y_train)
    train_score.append(model.score(X_train, y_train))
    test_score.append(model.score(X_test, y_test))
                   
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(train_score, test_score, '.', alpha=0.2)
ax.plot([0, 1], [0, 1], ':')
ax.set_aspect('equal')
ax.set_xlabel('train $R^2$')
ax.set_ylabel('test $R^2$')
ax.set_xlim((.4,.8))
ax.set_ylim((.4,.8))

So we usually do better with the training set. Usually.

Let's see if we can reproduce that train-test-split graph.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

train_score = []
test_score = []

for degree in range(1, 11):
    model = Pipeline([
        ('pf', PolynomialFeatures(degree)),
        ('lr', LinearRegression())
        ])
    model.fit(X_train, y_train)
    train_score.append(mean_squared_error(model, X_train, y_train))
    test_score.append(mean_squared_error(model, X_test, y_test))
    #train_score.append(-model.score(X_train, y_train))
    #test_score.append(-model.score(X_test, y_test))

fig, ax = plt.subplots(figsize=(6,6))
ax.plot(range(1, 11), train_score, '.-', label="train set")
ax.plot(range(1, 11), test_score, '.-', label="test set")
ax.set_xlabel('complexity')
ax.set_ylabel('mean squared error')
ax.legend()
ax.set_xticks(range(1, 11))
plt.show()

In [None]:
fig, ax = plt.subplots()

for t in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    train_score = []
    test_score = []

    for degree in range(1, 11):
        model = Pipeline([
            ('pf', PolynomialFeatures(degree)),
            ('lr', LinearRegression())
            ])
        model.fit(X_train, y_train)
        train_score.append(mean_squared_error(model, X_train, y_train))
        test_score.append(mean_squared_error(model, X_test, y_test))
        #train_score.append(-model.score(X_train, y_train))
        #test_score.append(-model.score(X_test, y_test))
    if t == 0:
        ax.plot(range(1, 11), train_score, 'b.-', label="train set", alpha=0.1)
        ax.plot(range(1, 11), test_score, 'y.-', label="test set", alpha=0.1)
    else:
        ax.plot(range(1, 11), train_score, 'b.-', alpha=0.1)
        ax.plot(range(1, 11), test_score, 'y.-', alpha=0.1)
        
ax.set_xlabel('degree of polynomial')
ax.set_ylabel('mean squared error')
ax.legend()
ax.set_xticks(range(1, 11))
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

fig, ax = plt.subplots()
ax.set_xlabel("mpg")
ax.set_ylabel("horsepower")
ax.plot(X_train, y_train, 'b.')
ax.plot(X_test, y_test, 'y.')


model = Pipeline([
    ('pf', PolynomialFeatures(11)),
    ('lr', LinearRegression())
    ])
model.fit(X_train, y_train)
xpts = np.linspace(9, 47, 100).reshape(-1, 1)
ax.plot(xpts, model.predict(xpts), 'b-')
ax.set_ylim(40, 240)

Overall, this isn't all that consistent. We need something better.

## K-fold Cross Validation

With k-fold cross validation, we randomly partition the data into $k$ groups, $D_1$, $D_2$, ..., $D_k$. For each $i \in [1..k]$ we:

 * Build a model using all the groups $D_j$ with $j \ne i$ as the training data
 * Calculate the error of the model on $D_i$
 
We average all these errors to compute the overall error of the model, and compare those across different models to choose the best model.

There isn't a clear "best" value for $k$, but 5 is commonly chosen. The extreme version of k-fold cross validation, when $k=n$, is called leave-one-out cross validation.
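The procedure above can be sketched by hand with plain `numpy`. This is only an illustration: the helper `kfold_mse`, the toy data, and the choice of a straight-line least-squares model and mean squared error are all assumptions made for the example.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average test MSE of a least-squares line over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # D_1, ..., D_k
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # fit intercept + slope on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # measure the error on the held-out fold D_i
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        errors.append(np.mean((A_test @ beta - y[test_idx]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(40, 1))                # made-up data
y_demo = 2 + 0.5 * X_demo[:, 0] + rng.normal(0, 1, 40)
mse_5fold = kfold_mse(X_demo, y_demo, k=5)
mse_loo = kfold_mse(X_demo, y_demo, k=len(X_demo))       # leave-one-out
print("5-fold MSE:       ", mse_5fold)
print("leave-one-out MSE:", mse_loo)
```

Setting `k` to the number of rows gives leave-one-out cross validation with the same code.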

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits = 5, shuffle = True)
scores = []

for train, test in kf.split(X):
    model = LinearRegression()
    model.fit(X.values[train], y.values[train])
    scores.append(model.score(X.values[test], y.values[test]))
    
print(np.mean(scores))

Many `sklearn` models include "CV" versions that use cross validation to calculate hyperparameters automatically.
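As a quick sketch of those "CV" estimators, here are `LassoCV` and `RidgeCV` on made-up data. The data, the grid of candidate `alphas`, and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))                       # made-up features
true_beta = np.array([1.5, 0.0, 0.0, -2.0, 0.0])         # made-up coefficients
y_demo = X_demo @ true_beta + rng.normal(0, 0.5, 100)

alphas = np.logspace(-3, 1, 30)                          # candidate hyperparameter values
lasso = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo)
ridge = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

# after fitting, the chosen hyperparameter is stored in alpha_
print("LassoCV chose alpha =", lasso.alpha_)
print("RidgeCV chose alpha =", ridge.alpha_)
```

Each fitted object can then be used like any other model, with `predict` and `score`.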

**Stratified cross validation** is a variation in which the partitions are chosen so that each has a similar distribution of some variable, most commonly the target class in a classification problem.
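Here's a small sketch of one way to build stratified folds, assuming we stratify on a categorical label: shuffle the rows within each label value, then deal them out to the folds round-robin. The helper `stratified_folds` and the imbalanced toy labels are made up for illustration.

```python
import numpy as np

def stratified_folds(labels, k=5, seed=0):
    """Assign each row to one of k folds so every fold has a similar label mix."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for value in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == value))
        fold_of[idx] = np.arange(len(idx)) % k      # deal rows out round-robin
    return fold_of

labels = np.array([0] * 90 + [1] * 10)               # made-up, imbalanced classes
fold_of = stratified_folds(labels, k=5)
for f in range(5):
    print("fold", f, "positives:", labels[fold_of == f].sum(), "of", (fold_of == f).sum())
```

With a purely random partition, a fold could easily end up with no positive rows at all; here every fold gets the same share.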

## Overfitting on the testing data

There's a problem with all this. Because the model and hyperparameters are chosen based on the training and testing data, the errors of the model aren't an accurate representation of how it would behave on outside data. If we want to know how it will behave in general, we need to hold out additional data. In this case we have

 * **Training data** are used to fit the model.
 * **Validation data** are used to choose the model and hyperparameters. Once those are determined, the validation data are combined with the training data to re-fit the model.
 * **Testing data** are used to evaluate the final accuracy of the model.
 
Each of these can be used either with simple hold-out validation or with k-fold cross validation.
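A minimal sketch of the three-way split with simple hold-out validation follows. The data, the split sizes, the candidate `alphas`, and the use of `Ridge` (covered in the next section) are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))                       # made-up features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 1, 200)

# hold out the test set first, then split the rest into training and validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# choose the hyperparameter using the validation data only
alphas = [0.01, 0.1, 1, 10, 100]
val_scores = [Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val)
              for a in alphas]
best_alpha = alphas[int(np.argmax(val_scores))]

# re-fit on training + validation, then touch the test data exactly once
final_model = Ridge(alpha=best_alpha).fit(X_rest, y_rest)
test_r2 = final_model.score(X_test, y_test)
print("best alpha:", best_alpha)
print("test R^2:  ", test_r2)
```

The test score is the only number here that wasn't used to make any decision, which is why it's the one to report.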

## Regularized Regression

For the past couple days we've talked about linear regression, in which we find the coefficients $\beta_0$, $\beta_1$, ..., $\beta_p$ to minimize

$$RSS = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2$$

In **Ridge Regression** we find the values to minimize

$$\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Effectively we're penalizing extreme values of $\beta$. Note that we aren't including $\beta_0$ in the penalty. The value $\lambda$ is a hyperparameter of the model.
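For a feel of what the penalty does, here's a sketch of ridge regression solved in closed form with `numpy`. It assumes the standard trick of centering the data so the unpenalized intercept drops out of the problem. The helper `ridge_fit` and the toy data are made up for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2), leaving the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes beta_0 from the problem
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta           # recover the intercept afterwards
    return beta0, beta

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))            # made-up features
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 50)

for lam in (0, 10, 1000):
    beta0, beta = ridge_fit(X_demo, y_demo, lam)
    print("lambda = {:>4}: beta = {}".format(lam, np.round(beta, 3)))
```

With `lam = 0` this is ordinary least squares, and the coefficients shrink as `lam` grows.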

Question: how should we decide the appropriate value for $\lambda$?

In **LASSO Regression** (**L**east **A**bsolute **S**hrinkage and **S**election **O**perator) we find the values to minimize

$$\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} | \beta_j |$$

In many ways this is similar to Ridge:
  * We're penalizing large values of $\beta$.
  * We aren't including $\beta_0$
  * We have a hyperparameter $\lambda$

The difference is the exponent. Ridge is sometimes known as **L2 regularization**, while LASSO is **L1 regularization**. We'll talk more about this in a bit.
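To see what the exponent changes, here's a sketch under a strong simplifying assumption: the features are orthonormal, so each coefficient can be solved on its own, starting from its unpenalized estimate $z$. In that special case both penalties have a one-line solution. The function names and the values in `z` are made up for illustration.

```python
import numpy as np

def ridge_1d(z, lam):
    # minimizer of (z - b)^2 + lam * b^2: shrink proportionally
    return z / (1 + lam)

def lasso_1d(z, lam):
    # minimizer of (z - b)^2 + lam * |b|: soft thresholding
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2, 0)

z = np.array([-3.0, -0.4, 0.2, 1.0, 5.0])   # made-up unpenalized estimates
lam = 1.0
ridge_b = ridge_1d(z, lam)
lasso_b = lasso_1d(z, lam)
print("ridge:", ridge_b)
print("lasso:", lasso_b)
```

In this setup the squared penalty scales every coefficient by the same factor, while the absolute-value penalty subtracts a fixed amount and sets the small ones to exactly zero.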

For $\lambda = 0$, these both reduce to a standard linear model.

Questions:

What does it mean if $\lambda = 0$?

What does it mean if $\lambda \to \infty$?

How does this relate to the bias-variance trade-off? If $\lambda$ increases, what happens to the bias? What happens to the variance?

### An Example

Let's make up some data.

In [None]:
npts = 100
nfeatures = 6
x = np.zeros((npts, nfeatures))
x[:, 0] = scs.uniform(-10, 20).rvs(npts)
x[:, 1] = scs.uniform(-10, 20).rvs(npts)
x[:, 2] = scs.uniform(-10, 20).rvs(npts)  + 0.2*x[:, 0]
x[:, 3] = scs.uniform(-10, 20).rvs(npts)  + 0.4*x[:, 1]
x[:, 4] = scs.uniform(-10, 20).rvs(npts)  + 0.6*x[:, 2] - 1.4*x[:, 0]
x[:, 5] = scs.uniform(-10, 20).rvs(npts)  + 1.8*x[:, 3]

beta = np.array([0.4, 0.3, 0.1, 1.2, 0.7, 0.2])

y = np.sum(x * beta, axis=1) + scs.norm(0, 5).rvs(npts)

We can look at how the coefficients depend on the value of $\lambda$.

N.B.: the hyperparameter we've been calling $\lambda$ is called `alpha` in `sklearn`.

In [None]:
nalphas = 50
min_alpha_exp = -2
max_alpha_exp = 2
coefs = np.zeros((nalphas, nfeatures))
alphas = np.logspace(min_alpha_exp, max_alpha_exp, nalphas)
for i, alpha in enumerate(alphas):
    model = Lasso(alpha=alpha)
    model.fit(x, y)
    coefs[i] = model.coef_

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
for feature, color in zip(range(nfeatures),
                          ['r','g','b','c','m','k']):
    plt.plot(alphas, coefs[:, feature],
             color=color,
             label="$\\beta_{}$".format(feature))
    plt.plot([10**min_alpha_exp, 10**max_alpha_exp], [beta[feature], beta[feature]],
             ls=':',
             color=color,
             alpha=0.5)
ax.set_xscale('log')
ax.set_title("$\\beta$ as a function of $\\alpha$ for LASSO regression")
ax.set_xlabel("$\\alpha$")
ax.set_ylabel("$\\beta$")
ax.legend(loc="upper right")

Discussion: What's going on?

In [None]:
nalphas = 50
min_alpha_exp = 0
max_alpha_exp = 6
coefs = np.zeros((nalphas, nfeatures))
alphas = np.logspace(min_alpha_exp, max_alpha_exp, nalphas)
for i, alpha in enumerate(alphas):
    model = Ridge(alpha=alpha)
    model.fit(x, y)
    coefs[i] = model.coef_

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
for feature, color in zip(range(nfeatures),
                          ['r','g','b','c','m','k']):
    plt.plot(alphas, coefs[:, feature],
             color=color,
             label="$\\beta_{}$".format(feature))
    plt.plot([10**min_alpha_exp, 10**max_alpha_exp], [beta[feature], beta[feature]],
             ls=':',
             color=color,
             alpha=0.5)
ax.set_xscale('log')
ax.set_title("$\\beta$ as a function of $\\alpha$ for Ridge regression")
ax.set_xlabel("$\\alpha$")
ax.set_ylabel("$\\beta$")
ax.legend(loc="upper right")

Discussion: what's going on? How does this differ from LASSO?

### Scaling and regularization

Over the next couple weeks we'll talk about a variety of predictive models, and various ways in which they are different. One of those is whether it is necessary to standardize/normalize the features before fitting. First, some definitions:

**Standardization** (in this context) is the process of subtracting the mean from each feature, and then dividing by the standard deviation, so each feature has a mean of 0 and standard deviation of 1.

**Normalization** (again, in this context) is the process of subtracting the minimum value from each feature, and then dividing by the range (the maximum minus the minimum), so each feature ranges from 0 to 1.

Both accomplish the same purpose, of having all features on the same scale.
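Both transformations are a line of `numpy` each. In this sketch the normalization divides by the range (maximum minus minimum), which is what makes the result span exactly 0 to 1. The two toy features are made up and deliberately on very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
# two made-up features on very different scales
X_demo = np.column_stack([rng.normal(50, 10, 100), rng.uniform(0, 0.01, 100)])

# standardization: subtract the column mean, divide by the column standard deviation
standardized = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# normalization: subtract the column minimum, divide by the column range
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
normalized = (X_demo - col_min) / (col_max - col_min)

print("standardized mean:", standardized.mean(axis=0).round(6))
print("standardized std: ", standardized.std(axis=0).round(6))
print("normalized min:   ", normalized.min(axis=0))
print("normalized max:   ", normalized.max(axis=0))
```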

Ok, so for some models you need to standardize (or normalize) the features before fitting the model, at least if they have significantly different ranges. It's not that hard to write code to do that, but the transformers in `sklearn` make that a lot easier.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

model = Pipeline([('standardize', StandardScaler()),
                   ('regressor', Lasso())])

The pipeline model will take care of the transformations in fitting and prediction automatically. You can normalize using `sklearn.preprocessing.MinMaxScaler`.
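Here's a sketch of what "automatically" means: the scaler learns the mean and standard deviation from the training rows only, and `predict` reuses those same numbers on new rows. The data, the train/test cut, and `alpha=0.1` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3)) * np.array([1.0, 100.0, 0.01])   # mismatched scales
y_demo = X_demo @ np.array([1.0, 0.02, 50.0]) + rng.normal(0, 0.1, 80)
X_train, X_test = X_demo[:60], X_demo[60:]
y_train, y_test = y_demo[:60], y_demo[60:]

pipe = Pipeline([('standardize', StandardScaler()),
                 ('regressor', Lasso(alpha=0.1))])
pipe.fit(X_train, y_train)        # the scaler sees the training rows only

# doing the two steps by hand gives the same predictions as pipe.predict
scaler = pipe.named_steps['standardize']
lasso = pipe.named_steps['regressor']
by_hand = lasso.predict(scaler.transform(X_test))
print(np.allclose(pipe.predict(X_test), by_hand))
```

Fitting the scaler inside the pipeline also keeps information about the test rows from leaking into the model when the pipeline is used with cross validation.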

In usual linear regression without regularization, scaling **does not matter**. If you change the scale of a feature, it will change the corresponding coefficient, but the predictions will be exactly the same.

This changes when we add regularization. Since we include a penalty term that depends on the size of the $\beta$, the actual predictions will change if we rescale the values.
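Both claims are easy to check numerically. This sketch uses a closed-form ridge fit (with `lam=0` giving ordinary least squares) and multiplies one feature by 1000, as if we had changed its units. The helper `fit_predict` and the toy data are made up for illustration.

```python
import numpy as np

def fit_predict(X, y, lam):
    """Closed-form ridge predictions on the training rows (lam=0 is plain least squares)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean + Xc @ beta

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))                     # made-up features
y_demo = X_demo @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)
X_rescaled = X_demo * np.array([1000.0, 1.0])         # change the units of the first feature

for lam in (0.0, 10.0):
    same = np.allclose(fit_predict(X_demo, y_demo, lam),
                       fit_predict(X_rescaled, y_demo, lam))
    print("lambda = {}: predictions unchanged by rescaling? {}".format(lam, same))
```

Without the penalty the predictions should match; with it they shouldn't, because the rescaled feature needs a much smaller coefficient and so is barely penalized.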

As a rule of thumb, if rescaling the values will change the predictions of a model, you need to standardize (or normalize) the values.

Discussion: is standardizing or normalizing better? When should you do one or the other?