# Regression with regularization

Ridge and Lasso regression are two simple techniques that reduce model complexity and help prevent the over-fitting that can result from plain (unregularized) linear regression.

## Ridge regression

In ridge regression, the cost function is altered by adding a penalty proportional to the squared magnitude (squared L2 norm) of the coefficient vector. This is also known as L2 regularization.

<img src="Ridge_formula.png">
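For reference, one common way to write the ridge cost function is sketched below. The notation is an assumption and may differ from the image by constant factors: $y_i$ is the target of sample $i$, $x_{ij}$ the value of feature $j$ for that sample, $w_0$ the intercept, $w_j$ the coefficients, and $\alpha \ge 0$ the regularization strength (the `alpha` parameter in sklearn).

$$
J_{\text{ridge}}(w) = \sum_{i=1}^{n} \Big( y_i - w_0 - \sum_{j=1}^{p} w_j x_{ij} \Big)^2 + \alpha \sum_{j=1}^{p} w_j^2
$$

The first term is the usual least-squares error and the second is the L2 penalty. The intercept $w_0$ is not penalized.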

Now we will apply ridge regression to the Boston Housing data set and compare the results with the linear regression performed in the previous session. First load the data set from scikit-learn and convert it to a pandas data frame.

```python
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Use a larger default font in all plots
matplotlib.rcParams.update({'font.size': 12})

# Load the Boston Housing data set
boston_dataset = load_boston()
```

```python
# Build a data frame with one column per feature, then add the target (MEDV)
boston_df = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
boston_df['MEDV'] = boston_dataset.target
```

Now define the features `X` and the target `y`, and split them randomly into training and test sets with sklearn. Use 30% of the data for testing and set ```random_state=3```.
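The split can be sketched as follows. This is a minimal illustration on a synthetic data frame: `demo_df` and its feature names are hypothetical stand-ins, and only the `MEDV` target column name is taken from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data frame: 100 rows, 3 features, 1 target.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
demo_df["MEDV"] = 2.0 * demo_df["f1"] - demo_df["f2"] + rng.normal(scale=0.1, size=100)

# Features are every column except the target; the target is the MEDV column.
X = demo_df.drop(columns="MEDV")
y = demo_df["MEDV"]

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` means every run, and every student, gets the same split, so scores are comparable.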

First, define a linear regression model and fit it to the training data.

Now define a ridge regression model and set ```alpha=0.3```. What does the alpha parameter do? What would you expect to happen when alpha is very small? And when it is very large? Set the option ```normalize=True```. Why is it good practice to normalize? What effect would the regularization term have on two features, one of which is 10 times larger than the other?
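To experiment with these questions outside sklearn, here is a small NumPy sketch of ridge regression in closed form. It is an illustration under stated assumptions, not sklearn's implementation: `ridge_closed_form` is a hypothetical helper, the data are synthetic, and no intercept is fitted.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Ridge coefficients without intercept: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data (an assumption for illustration): two features, true weights 3 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Ordinary least squares as the reference for alpha -> 0.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS           ", w_ols)

# Coefficients for increasing regularization strength.
for alpha in (1e-5, 0.3, 10.0, 1e6):
    print(f"alpha={alpha:<8g}", ridge_closed_form(X, y, alpha))

# Same data, but the first feature is expressed on a scale 10 times larger.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 10.0
w_rescaled = ridge_closed_form(X_rescaled, y, 10.0)
print("rescaled, alpha=10:", w_rescaled)
```

Comparing the printed rows for different `alpha` values, and the rescaled fit against the original one, gives concrete numbers to reason about before answering.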

Build two more ridge regression models, with alpha set to 1e-5 and 10. Then compute the scores on the training and test sets for all models and print them.
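One possible way to organize this comparison is sketched below on synthetic data from `make_regression`, not on the Boston set. Note one deliberate difference from the exercise text: the `normalize` option is omitted, because it was removed from `Ridge` in scikit-learn 1.2, so the numbers are not comparable with a `normalize=True` fit.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)

# One unregularized baseline plus one ridge model per alpha value.
models = {"linear": LinearRegression()}
for alpha in (1e-5, 0.3, 10):
    models[f"ridge alpha={alpha}"] = Ridge(alpha=alpha)

# Fit every model on the training set and report R^2 on both sets.
for name, model in models.items():
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"{name:>20}: train R^2 = {train_score:.3f}, test R^2 = {test_score:.3f}")
```

Keeping the models in a dictionary makes it easy to reuse the same fitted objects later, for example to plot their `coef_` arrays.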

Finally, plot the fitted coefficients of the different models. Use the ```plt.plot()``` function and vary the ```marker='*', 'd', 's', 'o'``` and ```color='red', 'blue', 'yellow', 'green'``` parameters to distinguish the models. Add a legend so the plot is interpretable. Comment on the plot: why do you recover the same results as linear regression when alpha is very small? What happens as alpha grows?
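A coefficient plot of this kind might look like the sketch below. The data are synthetic (`make_regression`), the models are fitted on the full synthetic set for brevity, and `normalize` is left out because recent scikit-learn versions no longer accept it.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem (an assumption for illustration).
X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)

# (label, fitted model, marker, color) for each curve in the plot.
fitted = [
    ("Linear regression", LinearRegression().fit(X, y), "*", "red"),
    ("Ridge alpha=1e-5", Ridge(alpha=1e-5).fit(X, y), "d", "blue"),
    ("Ridge alpha=0.3", Ridge(alpha=0.3).fit(X, y), "s", "yellow"),
    ("Ridge alpha=10", Ridge(alpha=10).fit(X, y), "o", "green"),
]

plt.figure(figsize=(8, 5))
for label, model, marker, color in fitted:
    # linestyle="none" draws one marker per coefficient without connecting lines.
    plt.plot(model.coef_, linestyle="none", marker=marker, color=color, label=label)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient value")
plt.legend()
```

In a notebook the figure is rendered at the end of the cell. When running this as a script, add `plt.show()` as the last line.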

## Lasso regression

The cost function for Lasso (least absolute shrinkage and selection operator) regression can be written as

<img src="lasso_formula.png">
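For reference, one common way to write the Lasso cost function is sketched below. The notation is an assumption and may differ from the image by constant factors (sklearn, for instance, divides the squared-error term by $2n$): $y_i$ is the target of sample $i$, $x_{ij}$ the value of feature $j$, $w_0$ the intercept, $w_j$ the coefficients, and $\alpha \ge 0$ the regularization strength.

$$
J_{\text{lasso}}(w) = \sum_{i=1}^{n} \Big( y_i - w_0 - \sum_{j=1}^{p} w_j x_{ij} \Big)^2 + \alpha \sum_{j=1}^{p} |w_j|
$$

The only change with respect to ridge regression is the penalty, which uses absolute values (L1 norm) instead of squares.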

This type of regularization (L1) can drive some coefficients to exactly zero, i.e. some features are completely ignored when computing the output. Lasso regression therefore not only helps reduce over-fitting but can also help with feature selection.

Generate a Lasso model with ```alpha=1e-1``` and ```normalize=True```.

Now check the number of coefficients that are different from 0. To which features would you expect them to correspond? You can check whether you are right by looking at exercise 1 from the previous session.

Repeat the Lasso regression with the alpha parameter changed to 1e-5. How many coefficients are different from 0 now? And if you use a large alpha, say 100, what should happen?
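Counting the non-zero coefficients can be sketched as follows. The example uses synthetic data in which only 3 of the 10 features carry signal, and the alpha values are chosen for the scale of that synthetic data, so they are assumptions rather than the values from the exercise. `normalize` is omitted because recent scikit-learn versions no longer accept it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in (1e-5, 1.0, 1e4):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    # Lasso sets eliminated coefficients to exactly 0.0, so != 0 is a valid test.
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    kept = np.flatnonzero(lasso.coef_)
    print(f"alpha={alpha:g}: {nonzero_counts[alpha]} non-zero, indices {kept}")
```

`np.flatnonzero` returns the indices of the surviving coefficients, which can be mapped back to feature names with the columns of the data frame.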

Finally, train a linear model, plot the coefficients obtained for the different Lasso models, and compare them with those of the linear regression.

In practice, we will not look for the best alpha manually. In both Lasso and ridge regression, alpha is a hyperparameter, so we find the best value via cross-validation (CV). Fortunately, sklearn provides an object that performs CV to find the optimal alpha automatically.

Implement a CV with the ```LassoCV``` object. Print the optimal alpha and the score on the test set. Note that there are similar objects for ridge regression and for Elastic Net, whose penalty combines L1 and L2 regularization. Play with the `cv` parameter of the `LassoCV` object to perform k-fold cross-validation. What is the minimum value that you can give to `cv`, and what is the maximum?
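A minimal sketch of this workflow is shown below on synthetic data (the data set and the choice `cv=5` are assumptions for illustration). The ridge and Elastic Net counterparts in sklearn are `RidgeCV` and `ElasticNetCV`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)

# 5-fold cross-validation over an automatically generated grid of alphas.
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)

# alpha_ holds the selected value; the model is refit with it on all training data.
print("optimal alpha:", lasso_cv.alpha_)
print("test R^2:", lasso_cv.score(X_test, y_test))
```

Only the training set is used to choose alpha, so the test score remains an honest estimate of performance on unseen data.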