# Data-X Spring 2018
## Tutorial on Regularization, Ridge, and Lasso Regression 
Reference:  
* AARSHAY JAIN - A Complete Tutorial on Ridge and Lasso Regression in Python
* https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

## Overview
Ridge and Lasso Regression are regularization techniques generally used for creating parsimonious models (Occam's Razor) in the presence of a large number of features. The main goal is to combat the tendency of models to overfit and to balance the statistical-computational tradeoffs that are ubiquitous in high-dimensional statistics.

* Ridge Regression
    - L2-regularization 
    - Obj = Loss Function + $\alpha\|\beta\|_2^2$
* Lasso Regression
    - L1-regularization
    - Obj = Loss Function + $\alpha\|\beta\|_1$
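
Written out with the residual sum of squares (RSS) as the loss, the two objectives take the following form. This is a sketch under that assumption: the symbols $n$ (number of samples), $p$ (number of features), $x_{ij}$ and $y_i$ are introduced here for illustration, the intercept is omitted, and libraries such as scikit-learn may scale the loss term differently.

$$
\text{Ridge:}\quad \min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\alpha\sum_{j=1}^{p}\beta_j^2
$$

$$
\text{Lasso:}\quad \min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^2+\alpha\sum_{j=1}^{p}|\beta_j|
$$

Setting $\alpha = 0$ in either objective recovers ordinary least squares, and larger values of $\alpha$ put more weight on keeping the coefficients small.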

## Why penalize the magnitude of coefficients?

Consider the following simulation of a sine curve

In [None]:
#Importing libraries. The same will be used throughout the article.
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

plt.rcParams['figure.figsize'] = 14, 10

import warnings
warnings.filterwarnings('ignore')

#Define input array with angles from 60deg to 300deg converted to radians
x_full = np.array([i*np.pi/180 for i in range(60,300,2)])
np.random.seed(10)  #Setting seed for reproducibility
y_full = np.sin(x_full) + np.random.normal(0,0.15,len(x_full)) # Adding Noise

x, x_test, y, y_test = train_test_split(x_full,y_full,test_size=0.25)

data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
data_test = pd.DataFrame(np.column_stack([x_test,y_test]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.',ms=16)

In [None]:
combine = [data, data_test]

for df in combine:
    df.sort_values('x',inplace=True)

### First Thoughts: Polynomial Regression 
Features: $X^1, X^2, \dots, X^{15}$
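
As an aside, scikit-learn's `PolynomialFeatures` can build the same powers in a single call. The snippet below is a minimal sketch rather than the approach this notebook uses; the array `x_demo` is a made-up input for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical input: a few angles in radians (not the notebook's data)
x_demo = np.linspace(1.0, 5.0, 8)

# include_bias=False drops the constant column, leaving x^1 ... x^15
poly = PolynomialFeatures(degree=15, include_bias=False)
X_poly = poly.fit_transform(x_demo.reshape(-1, 1))

# One row per sample, one column per power of x
print(X_poly.shape)
```

The columns come out in increasing order of power, so column index `k` holds $x^{k+1}$.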

In [None]:
from sklearn.linear_model import Ridge

In [None]:
for df in combine:

    for i in range(2,16):  #power of 1 is already there
        colname = 'x^%d'%i      #new var will be x^power
        df[colname] = df['x']**i


data.head()

Let's consider building 15 models as follows:
* Model 1: Features - $X^1$
* Model 2: Features - $X, X^2$
* $\dots$
* Model 15: Features - $X, X^2, \dots, X^{15}$

In [None]:
#Import Linear Regression model from scikit-learn.
from sklearn.linear_model import LinearRegression
def linear_regression(data, power, models_to_plot):
    #initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x^%d'%i for i in range(2,power+1)])
    
    #Fit the model (note: normalize=True requires scikit-learn older than 1.2)
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.',ms=10)
        plt.title('Plot for power: %d'%power)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    
    y_pred_test = linreg.predict(data_test[predictors])
    rss_test = sum((y_pred_test-data_test['y'])**2)
    ret.extend([rss_test])
    
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    


    return ret

In [None]:
#Initialize a dataframe to store the results:
col = ['rss_train','rss_test','intercept'] + ['coef_x^%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

#Define the powers for which a plot is required:
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

#Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+3] = \
    linear_regression(data, power=i, models_to_plot=models_to_plot)

This aligns with our initial understanding: as model complexity increases, the models tend to fit even smaller deviations in the training data set. We start to fit the Gaussian noise in the data (patterns that are not really there).

In other words, as we add more features, we should expect to overfit!

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple

It is evident that the size of the coefficients increases exponentially with increasing model complexity.

What does a large coefficient signify? 

It means that we’re putting a lot of emphasis on that feature, i.e. the particular feature is a good predictor for the outcome. When it becomes too large, the algorithm starts modelling intricate relations to estimate the output and ends up overfitting to the particular training data.

## Ridge Regression

Note the ‘Ridge’ class used here. It takes ‘alpha’ as a parameter on initialization. Also, keep in mind that normalizing the inputs is generally a good idea in every type of regression and should be used in ridge regression as well. The `normalize=True` argument used in this notebook requires a scikit-learn version older than 1.2, where it was removed.
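
On scikit-learn 1.2 or later, one possible substitute for `normalize=True` is to standardize the features in a pipeline. This is a sketch under that assumption, not the notebook's original code: `StandardScaler` divides by the standard deviation, whereas `normalize=True` divided by the L2 norm, so alpha values are not directly comparable between the two. The helper `fit_ridge_scaled` and the synthetic data are illustrative names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic noisy sine data, similar in spirit to the notebook's simulation
rng = np.random.RandomState(10)
x_demo = np.linspace(np.pi / 3, 5 * np.pi / 3, 90)
y_demo = np.sin(x_demo) + rng.normal(0, 0.15, len(x_demo))

# Feature matrix with columns x^1 ... x^15
X_demo = np.column_stack([x_demo ** p for p in range(1, 16)])

def fit_ridge_scaled(X, y, alpha):
    """Standardize the features, then fit ridge regression (hypothetical helper)."""
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X, y)
    return model

# Print the size of the coefficient vector for a few penalties
for alpha in [1e-3, 1e-1, 10]:
    coefs = fit_ridge_scaled(X_demo, y_demo, alpha).named_steps['ridge'].coef_
    print(alpha, np.linalg.norm(coefs))
```

The coefficients reported this way are on the standardized scale, so they should be compared across alphas rather than against the tables in this notebook.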

In [None]:
from sklearn.linear_model import Ridge
def ridge_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    ridgereg = Ridge(alpha=alpha,normalize=True)
    ridgereg.fit(data[predictors],data['y'])
    y_pred = ridgereg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    
    y_pred_test = ridgereg.predict(data_test[predictors])
    rss_test = sum((y_pred_test-data_test['y'])**2)
    ret.extend([rss_test])
    
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret

Now, let's analyze the results of ridge regression for 10 different values of α ranging from 1e-15 to 20. Note that each of these 10 models contains all 15 variables; only the value of alpha differs.

In [None]:
#Initialize predictors to be set of 15 powers of x
predictors=['x']
predictors.extend(['x^%d'%i for i in range(2,16)])

#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

#Initialize the dataframe for storing coefficients.
col = ['rss_train','rss_test','intercept'] + ['coef_x^%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)

Here we can clearly observe that as the value of alpha increases, the model complexity decreases. Though higher values of alpha reduce overfitting, very high values can cause underfitting as well (e.g. alpha = 5). Thus alpha should be chosen wisely. A widely accepted technique is cross-validation: alpha is iterated over a range of values and the one giving the best cross-validation score is chosen.

**USE CROSS VALIDATION**
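
As a hedged sketch of that advice, scikit-learn's `RidgeCV` tries a list of alphas and keeps the one with the best cross-validation score (by default an efficient form of leave-one-out). The synthetic data, the alpha grid, and the use of `StandardScaler` in place of `normalize=True` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic noisy sine data with polynomial features x^1 ... x^15
rng = np.random.RandomState(10)
x_cv = np.linspace(np.pi / 3, 5 * np.pi / 3, 90)
y_cv = np.sin(x_cv) + rng.normal(0, 0.15, len(x_cv))
X_cv = np.column_stack([x_cv ** p for p in range(1, 16)])

# Candidate penalties (an illustrative grid, not the notebook's list)
alphas_cv = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]

cv_model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas_cv))
cv_model.fit(X_cv, y_cv)

# alpha_ holds the penalty selected by cross-validation
best_alpha = cv_model.named_steps['ridgecv'].alpha_
print('alpha chosen by cross-validation:', best_alpha)
```

The fitted pipeline can then be used like any other estimator, for example `cv_model.predict(X_cv)`.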

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge

### Inference

This straight away gives us the following inferences:

1. The rss_train increases with increasing alpha, while rss_test is lowest when alpha is around 0.001 (note that the edge cases are not perfect).
2. An alpha as small as 1e-15 gives us a significant reduction in the magnitude of the coefficients. How? Compare the coefficients in the first row of this table to the last row of the simple linear regression table.
3. High alpha values can lead to significant underfitting. Note the rapid increase in RSS for values of alpha greater than 1.
4. Though the coefficients are **very, very small, they are NOT zero**.


In [None]:
# count how many of the coefficients are zero (in the full coefficient matrix)

coef_matrix_ridge.apply(lambda x: sum(x.values==0),axis=1)

We note that none of the coefficients are 0!

## LASSO Regression 
#### (Least Absolute Shrinkage and Selection Operator)

In [None]:
from sklearn.linear_model import Lasso
def lasso_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=100000)  #max_iter must be an integer
    lassoreg.fit(data[predictors],data['y'])
    y_pred = lassoreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    
    y_pred_test = lassoreg.predict(data_test[predictors])
    rss_test = sum((y_pred_test-data_test['y'])**2)
    ret.extend([rss_test])
    
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret

In [None]:
#Initialize predictors to all 15 powers of x
predictors=['x']
predictors.extend(['x^%d'%i for i in range(2,16)])

#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]

#Initialize the dataframe to store coefficients
col = ['rss_train','rss_test','intercept'] + ['coef_x^%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

#Define the models to plot
models_to_plot = {1e-10:231, 1e-5:232,1e-4:233, 1e-3:234, 1e-2:235, 1:236}

#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_lasso

A model is sparse when many of its coefficients are exactly zero.

The fit seems to be best around $\alpha = 10^{-5}$ or $10^{-4}$.

### Inference:

Apart from the expected inference of higher RSS for higher alphas, we can see the following:

1. For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (compare row 1 of the two tables).
2. For the same alpha, lasso has a higher RSS (poorer fit) than ridge regression.
3. Many of the coefficients are zero even for very small values of alpha.


In [None]:
coef_matrix_lasso.apply(lambda x: sum(x.values==0),axis=1)

We can see that as $\alpha$ increases, the number of coefficients being set to zero increases from 0 to 15. Implicitly, Lasso conducts feature selection as well!

- Lasso works best when the underlying model is sparse
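
One way to see why lasso produces exact zeros while ridge does not is the one-dimensional case. Assuming the simplified objectives $\frac{1}{2}(\beta - z)^2 + \alpha|\beta|$ for lasso and $\frac{1}{2}(\beta - z)^2 + \frac{\alpha}{2}\beta^2$ for ridge, where $z$ stands for the unpenalized estimate, the minimizers have closed forms: soft-thresholding for lasso and proportional shrinkage for ridge. The function names below are illustrative, and this is a sketch of the idea rather than what scikit-learn runs internally.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Minimizer of 0.5*(b - z)**2 + alpha*|b|: shrink toward zero, then clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_shrink(z, alpha):
    """Minimizer of 0.5*(b - z)**2 + 0.5*alpha*b**2: scale by a constant factor."""
    return z / (1.0 + alpha)

# Hypothetical unpenalized estimates, two small and two large
z = np.array([-2.0, -0.5, 0.5, 2.0])

print('lasso:', soft_threshold(z, 1.0))
print('ridge:', ridge_shrink(z, 1.0))
```

With $\alpha = 1$, the two small inputs $\pm 0.5$ become exactly zero under soft-thresholding but only shrink to $\pm 0.25$ under ridge, which mirrors the zero counts observed in the two coefficient tables.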

## Conclusion

* Key Differences:
    - Ridge: Typically includes all (or none) of the features in the model. 
        - Main advantage: Coefficient Shrinkage & Reducing Model Complexity
        - It generally works well even in presence of highly correlated features as it will include all of them in the model but the coefficients will be distributed among them depending on the correlation.
        - Prevents overfitting but does not reduce computational challenges
        - Simple gradient descent is capable of the optimization
    - Lasso: 
        - Main advantage: Reduces model complexity & performs coefficient shrinking as well as feature selection
        - Provides sparse solutions
        - Requires subgradient methods rather than plain gradient descent, since the L1 penalty is not differentiable at zero

* You can combine Lasso and Ridge Regression via Elastic Nets
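
A minimal sketch of that combination uses scikit-learn's `ElasticNet`, whose `l1_ratio` parameter mixes the two penalties (1 is pure lasso, values near 0 approach ridge). The synthetic data, the helper name `count_zero_coefs`, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic noisy sine data with polynomial features x^1 ... x^15
rng = np.random.RandomState(10)
x_en = np.linspace(np.pi / 3, 5 * np.pi / 3, 90)
y_en = np.sin(x_en) + rng.normal(0, 0.15, len(x_en))
X_en = np.column_stack([x_en ** p for p in range(1, 16)])

def count_zero_coefs(alpha, l1_ratio=0.5):
    """Fit a standardized elastic net and count exactly-zero coefficients (hypothetical helper)."""
    model = make_pipeline(
        StandardScaler(),
        ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=100000),
    )
    model.fit(X_en, y_en)
    return int(np.sum(model.named_steps['elasticnet'].coef_ == 0))

# Stronger penalties should zero out more of the 15 coefficients
for alpha in [1e-3, 1e-1, 10]:
    print(alpha, count_zero_coefs(alpha))
```

The L1 part of the penalty is what allows coefficients to reach exactly zero, while the L2 part spreads weight across correlated features.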

## Illustration: Regularizing with Early Stopping

In [None]:
# unregularized fit (alpha=0) with the stochastic average gradient ('sag') solver, stopped early
iterations = [1,5,50,500,5000,10000]  # max_iter must be an integer
plt.subplots(2,3,figsize=(20,14))
for i in range(6):
    plt.subplot(2,3,i+1)
    its = iterations[i]
    
    model = Ridge(alpha=0, normalize=True,max_iter=its,solver='sag',tol=1e-9)
    model.fit(data[predictors], data['y'])
    y_pred = model.predict(data[predictors])
    plt.plot(data['x'],data['y'],'.',ms=16)
    plt.plot(data['x'],y_pred)
    plt.title('Max Iterations {}'.format(its))

In [None]:
plt.subplot()