# Penalized Regression
Penalized regression attempts to balance bias and variance.

Remember, for standard linear regression of a single variable, we use this equation:

$\bar{y} = m*x + b$

However, if we want to make a more complex model, we can add more terms:

$\bar{y} = ... + \beta_3*x^3 + \beta_2*x^2 + \beta_1*x + \beta_0$


Again, for standard linear regression of a single variable, the error is:

$mse = \frac{1}{N} \sum{( y_i - \bar{y}(x_i) )^2}$

So, for more complex models, we have a lot more terms:

$mse = \frac{1}{N} \sum{(y_i - (... + \beta_3*x_i^3 + \beta_2*x_i^2 + \beta_1*x_i + \beta_0))^2}$

As we saw in the bias/variance tradeoff, we can get a more precise solution with a larger model, but two things happen:
1. we lose generalization (overfitting)
2. we have a lot of coefficients that get larger and larger.
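
Both effects can be seen in a small experiment. The sketch below is only an illustration: the noisy sine data, the train/test split, and the chosen degrees are assumptions, not part of the lesson's dataset. It fits polynomials of increasing degree with `np.polyfit` and reports the training error, the test error, and the largest coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Hold out every other point as a small test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 3, 9):
    # np.polyfit returns the coefficients with the highest power first.
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = {
        "train_mse": mse(y_train, np.polyval(coefs, x_train)),
        "test_mse": mse(y_test, np.polyval(coefs, x_test)),
        "max_abs_coef": float(np.max(np.abs(coefs))),
    }
    print(degree, results[degree])
```

With ten training points, the degree 9 polynomial can pass through every one of them, so its training error is essentially zero; compare that against its test error and coefficient size.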

To address that, we want to shrink or remove some of the coefficients in order to make the model more robust.

We can add a term to the error equation which penalizes coefficients:

$pmse = \frac{1}{N} \sum{( y_i - \bar{y}(x_i) )^2} + \lambda \sum_b{\beta_b^2}$

or:

$pmse = \frac{1}{N} \sum{(y_i - (... + \beta_3*x_i^3 + \beta_2*x_i^2 + \beta_1*x_i + \beta_0))^2} + \lambda \sum_b{\beta_b^2}$

This says that large coefficients in the model increase the error.  As such, this regression attempts to keep all of the coefficients as small as possible.  It may also keep the coefficients well-balanced (essentially distributing the work across the different terms).  This is called __Ridge Regression__.  It does "L2 Regularization" (a term that you will hear, but do not necessarily need to know).
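
For this squared penalty, the minimizing coefficients can be written in closed form. The sketch below assumes a model with no intercept and uses made-up data; `ridge_fit` is a hypothetical helper, not a library function. It minimizes the error exactly as written above (with the $\frac{1}{N}$ factor), which is scaled differently from scikit-learn's `Ridge`, so the same number used as `alpha` there will not give the same coefficients.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimize (1/N) * ||y - X @ beta||^2 + lam * ||beta||^2 (no intercept)."""
    n_samples, n_features = X.shape
    # Setting the gradient to zero gives (X^T X + N * lam * I) beta = X^T y.
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
# Make the last two features nearly identical (illustrative assumption).
X[:, 3] = X[:, 2] + 0.01 * rng.normal(size=50)
y = X @ np.array([2.0, -1.0, 0.5, 0.5]) + rng.normal(scale=0.1, size=50)

lams = [0.0, 0.01, 1.0, 100.0]
norms = []
for lam in lams:
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:>6}: beta={np.round(beta, 3)}")
```

The overall size of the coefficient vector shrinks as $\lambda$ grows, and the two nearly identical features tend to end up with similar coefficients, which is the "distributing the work" behavior described above.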

Another form of penalized regression attempts to force some coefficients to zero (essentially throwing them out of the model).  This attempts to eliminate *"unnecessary"* coefficients and simplify the model.  This is called __LASSO Regression__. It does "L1 Regularization" (a term that you will hear, but do not necessarily need to know).

$pmse = \frac{1}{N} \sum{( y_i - \bar{y}(x_i) )^2} + \lambda \sum_b{||\beta_b||}$
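
Why does the absolute-value penalty throw coefficients out while the squared penalty only shrinks them? A one-coefficient version of each problem makes this visible. The sketch below is a simplification, not the full LASSO algorithm: it assumes each coefficient can be handled separately, with `z` standing for its unpenalized value, and `soft_threshold` is a hypothetical helper name.

```python
import numpy as np


def soft_threshold(z, t):
    """Minimizer of 0.5 * (b - z)**2 + t * |b|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


# Unpenalized coefficient values (made up for illustration).
unpenalized = np.array([3.0, -0.5, 1.0, -2.0])

# L1 penalty with t = 1: every value moves 1 toward zero and stops at zero.
lasso_like = soft_threshold(unpenalized, 1.0)

# L2 penalty: minimizer of 0.5 * (b - z)**2 + 0.5 * lam * b**2 with lam = 1.
ridge_like = unpenalized / (1.0 + 1.0)

print("L1 penalty:", lasso_like)
print("L2 penalty:", ridge_like)
print("non-zero with L1:", np.count_nonzero(lasso_like))
print("non-zero with L2:", np.count_nonzero(ridge_like))
```

The L1 version sets any value smaller in magnitude than the threshold to exactly zero, while the L2 version scales everything down by the same factor and leaves nothing at zero.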

Lastly, you may want to have very few terms and have them be small.  This is __Elastic Net Regression__, which has both the Ridge and LASSO penalties built in:

$pmse = \frac{1}{N} \sum{( y_i - \bar{y}(x_i) )^2}  + \lambda_1 \sum_b{\beta_b^2} + \lambda_2 \sum_b{||\beta_b||}$


__In all three cases, we have a $\lambda$ term__.  What is that for?  It is simply a constant that lets us set the relative influence of the penalizing term.
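
The three penalized errors above can be written as one small function. This is a minimal sketch: `penalized_mse` and its argument names are hypothetical, the data are made up, and every coefficient is penalized just as in the formulas. Setting `lam_lasso=0` gives the Ridge error, setting `lam_ridge=0` gives the LASSO error, and using both gives the Elastic Net error.

```python
import numpy as np


def penalized_mse(X, y, beta, lam_ridge=0.0, lam_lasso=0.0):
    """Mean squared error plus optional L2 and L1 penalties on beta."""
    residuals = y - X @ beta
    mse = np.mean(residuals ** 2)
    l2_penalty = lam_ridge * np.sum(beta ** 2)
    l1_penalty = lam_lasso * np.sum(np.abs(beta))
    return float(mse + l2_penalty + l1_penalty)


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

beta_exact = np.array([1.0, 2.0])   # fits the data perfectly
beta_small = np.array([0.5, 1.0])   # fits worse, but has smaller coefficients

for lam in (0.01, 1.0):
    print(
        f"lambda={lam}: exact={penalized_mse(X, y, beta_exact, lam_ridge=lam):.3f}, "
        f"small={penalized_mse(X, y, beta_small, lam_ridge=lam):.3f}"
    )
```

With a small $\lambda$ the perfectly fitting coefficients have the lower penalized error; with a large $\lambda$ the smaller coefficients win even though they fit the data worse.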

<font color='red'>
## When would you want to use a large or small $\lambda$?



In [None]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import matplotlib

# NOTE: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet


In [None]:
matplotlib.rcParams.update({'font.size': 12})

### Time to look at using penalized regression in Python.

We are going to start with our previous data (Boston housing prices).

In [None]:
boston=load_boston()
boston_df=pd.DataFrame(boston.data,columns=boston.feature_names)
print(boston_df.info())

In [None]:
# Add a column with the house prices; scikit-learn datasets store these as the target
boston_df['Price']=boston.target
print(boston_df.head(3))

In [None]:
newY=boston_df['Price']
newX=boston_df.drop('Price',axis=1)
print(newX[0:3]) # check 
print("Price")
print(newY[0:3]) # check 


### Training and Testing
For the purpose of this analysis, we need two datasets.  The first is the **training dataset**.  We use this to build the model. The second is the **testing dataset** to test how well the model did. 


In [None]:
X_train,X_test,y_train,y_test=train_test_split(newX,newY,test_size=0.3,random_state=3)
print(len(X_test), len(y_test))

### Comparing Linear Regression to Ridge Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

# The higher the alpha value, the more the coefficients are restricted. With a low alpha the coefficients are
# barely restricted, so ridge regression behaves almost like plain linear regression.
rr = Ridge(alpha=0.1) 
rr.fit(X_train, y_train)

rr100 = Ridge(alpha=100) #  comparison with alpha value
rr100.fit(X_train, y_train)

train_score=lr.score(X_train, y_train)
test_score=lr.score(X_test, y_test)

Ridge_train_score = rr.score(X_train,y_train)
Ridge_test_score = rr.score(X_test, y_test)
Ridge_train_score100 = rr100.score(X_train,y_train)
Ridge_test_score100 = rr100.score(X_test, y_test)

print("linear regression train score:", train_score)
print("linear regression test score:", test_score)
print("ridge regression train score low alpha:", Ridge_train_score)
print("ridge regression test score low alpha:", Ridge_test_score)
print("ridge regression train score high alpha:", Ridge_train_score100)
print("ridge regression test score high alpha:", Ridge_test_score100)
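
Rather than comparing two hand-picked alpha values, alpha can also be chosen by cross-validation on the training data. The sketch below is one possible approach, not part of the lesson's own workflow: it uses scikit-learn's `RidgeCV` on synthetic data from `make_regression`, and the grid of alphas and all `_demo` variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 13 features, like the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=3
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)

# Candidate alphas spaced evenly on a log scale from 0.001 to 1000.
alphas_demo = np.logspace(-3, 3, 13)

# RidgeCV tries every candidate with cross-validation on the training set.
ridge_cv = RidgeCV(alphas=alphas_demo).fit(X_tr, y_tr)

print("selected alpha:", ridge_cv.alpha_)
print("test score:", ridge_cv.score(X_te, y_te))
```

The test set is only used at the end to score the model, so the choice of alpha is not tuned to it.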



In [None]:
plt.plot(rr.coef_,alpha=0.7,linestyle='none',marker='*',markersize=5,color='red',label=r'Ridge; $\alpha = 0.1$',zorder=7) # zorder for ordering the markers
plt.plot(rr100.coef_,alpha=0.5,linestyle='none',marker='d',markersize=6,color='blue',label=r'Ridge; $\alpha = 100$') # alpha here is for transparency
plt.plot(lr.coef_,alpha=0.4,linestyle='none',marker='o',markersize=7,color='green',label='Linear Regression')
plt.xlabel('Coefficient Index',fontsize=16)
plt.ylabel('Coefficient Magnitude',fontsize=16)
plt.legend(fontsize=13,loc=4)
plt.show()

### Comparing Linear Regression to LASSO Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

# The higher the alpha value, the more the coefficients are restricted and the more of them are set to zero.
# With a low alpha the coefficients are barely restricted, so LASSO behaves almost like plain linear regression.
lar = Lasso(alpha=0.1) 
lar.fit(X_train, y_train)

lar100 = Lasso(alpha=100) #  comparison with alpha value
lar100.fit(X_train, y_train)

train_score=lr.score(X_train, y_train)
test_score=lr.score(X_test, y_test)

Lasso_train_score = lar.score(X_train,y_train)
Lasso_test_score = lar.score(X_test, y_test)
Lasso_train_score100 = lar100.score(X_train,y_train)
Lasso_test_score100 = lar100.score(X_test, y_test)

lr_coeff_used = np.sum(lr.coef_!=0)
coeff_used = np.sum(lar.coef_!=0)
coeff_used_100 = np.sum(lar100.coef_!=0)

print("linear regression train score:", train_score)
print("linear regression test score:", test_score)
print("Lasso regression train score low alpha:", Lasso_train_score)
print("Lasso regression test score low alpha:", Lasso_test_score)
print("Lasso regression train score high alpha:", Lasso_train_score100)
print("Lasso regression test score high alpha:", Lasso_test_score100)

print("Linear regression number of coefficients:", lr_coeff_used)
print("Lasso regression number of coefficients low alpha:", coeff_used)
print("Lasso regression number of coefficients high alpha:", coeff_used_100)
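
To see how the number of coefficients that survive depends on alpha, it helps to sweep over several values instead of two. The sketch below uses synthetic data from `make_regression` in which only 5 of the 13 features carry signal; the data, the alpha grid, and the `_demo` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 13 features, only 5 of which are informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, n_informative=5, noise=10.0, random_state=3
)

alphas_demo = [0.01, 1.0, 10.0, 1e6]
counts = []
for a in alphas_demo:
    # max_iter is raised so the solver has room to converge at small alpha.
    model = Lasso(alpha=a, max_iter=10000).fit(X_demo, y_demo)
    n_used = int(np.sum(model.coef_ != 0))
    counts.append(n_used)
    print(f"alpha={a}: {n_used} non-zero coefficients")
```

At a very large alpha every coefficient is driven to zero and the model predicts a constant; at a very small alpha the fit is close to ordinary linear regression.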




In [None]:
plt.plot(lar.coef_,alpha=0.7,linestyle='none',marker='*',markersize=5,color='red',label=r'Lasso; $\alpha = 0.1$',zorder=7) # zorder for ordering the markers
plt.plot(lar100.coef_,alpha=0.5,linestyle='none',marker='d',markersize=6,color='blue',label=r'Lasso; $\alpha = 100$') # alpha here is for transparency
plt.plot(lr.coef_,alpha=0.4,linestyle='none',marker='o',markersize=7,color='green',label='Linear Regression')
plt.xlabel('Coefficient Index',fontsize=16)
plt.ylabel('Coefficient Magnitude',fontsize=16)
plt.legend(fontsize=13,loc=4)
plt.show()

## Time for some real work to get done.
We are going to use a whole new dataset and work out our own solutions.

This is a breast cancer prediction dataset that looks at image features to try to predict breast cancer.

In [None]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(cancer_df.head(3))

X = cancer.data
Y = cancer.target
X_train,X_test,y_train,y_test=train_test_split(X,Y, test_size=0.3, random_state=31)




In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)
lr_train_score=lr.score(X_train,y_train)
lr_test_score=lr.score(X_test,y_test)
lr_coeff_used = np.sum(lr.coef_!=0)
print("LR training score:", lr_train_score)
print( "LR test score: ", lr_test_score)
print("number of features used: ", lr_coeff_used)


In [None]:
# Don't forget to set your alpha!!!
ridge = Ridge(alpha = _______)
ridge.fit(X_train,y_train)
train_score=ridge.score(X_train,y_train)
test_score=ridge.score(X_test,y_test)
ridge_coeff_used = np.sum(ridge.coef_!=0)
print("training score:", train_score) 
print("test score: ", test_score)
print("number of features used: ", ridge_coeff_used)



In [None]:
lasso = Lasso(...
              
              
              
print("training score:", train_score) 
print("test score: ", test_score)
print("number of features used: ", lasso_coeff_used)

In [None]:
elastic = ElasticNet(...
                     
                     
print("training score:", train_score) 
print("test score: ", test_score)
print("number of features used: ", elastic_coeff_used)

In [None]:
# Plot your findings