# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice fitting Ridge and Lasso regression models!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [25]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [26]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, make a selection of the data by removing the columns with `dtype = object`. This way, our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [27]:
# Load necessary packages
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

# Remove "object"-type features and SalePrice from `X`
X = df.drop([col for col in list(df.columns) if df[col].dtype == 'object'], axis=1)
X = X.drop('SalePrice', axis=1)

# Impute null values with the median of each feature
# (assigning back avoids chained in-place fills, which newer pandas versions may not apply)
for col in ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']:
    X[col] = X[col].fillna(X[col].median())
# Create y
y = df.SalePrice
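The selection and imputation steps above can also be written without naming individual columns. The following is a minimal sketch on a small made-up frame (`toy`, `X_toy`, and `y_toy` are hypothetical names, and the values are illustrative only); it assumes every numeric column except the target should be kept as a predictor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Fill every missing value with the median of its own column
X_toy = X_toy.fillna(X_toy.median())

# Store the target separately
y_toy = toy["SalePrice"]

print(X_toy)
```

Because `fillna` is given a Series of per-column medians, each column is imputed with its own median in a single call.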

Look at the information of `X` again

In [28]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute R-squared and MSE for both the train and test sets.

In [29]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the model and print R2 and MSE for train and test
lin = LinearRegression()
lin.fit(X_train, y_train)

r2_train = r2_score(y_train, lin.predict(X_train))
mse_train = mean_squared_error(y_train, lin.predict(X_train))

r2_test = r2_score(y_test, lin.predict(X_test))
mse_test = mean_squared_error(y_test, lin.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.8508149394655694
MSE Train: 1016603870.6482387

R2 Test: 0.49605696423712076
MSE Test: 2408670072.715824
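The same four metrics are computed after every model in this lab, so it may be convenient to wrap them in a small helper. This is only a sketch: `report_metrics` is a hypothetical function name, and the data come from `make_regression` rather than the housing set so the block runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return R2 and MSE on the train and test sets for a fitted model."""
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train),
                                 ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        results[name] = {
            "r2": r2_score(y_part, pred),
            "mse": mean_squared_error(y_part, pred),
        }
    return results


# Synthetic stand-in data (not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
metrics = report_metrics(model, X_tr, X_te, y_tr, y_te)
for split, vals in metrics.items():
    print("R2 {}: {:.4f}  MSE {}: {:.2f}".format(split, vals["r2"], split, vals["mse"]))
```

Passing `random_state` to `train_test_split` makes the split, and therefore the printed metrics, repeatable between runs.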


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [30]:
from sklearn import preprocessing

# Scale the data and perform train test split
transformed = preprocessing.scale(X)
X = pd.DataFrame(transformed, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y)
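The cell above scales all of `X` before splitting, so the means and standard deviations used for scaling also reflect the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering with `StandardScaler` on random stand-in data; the array names are hypothetical and it is not the approach the lab itself uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in predictors and target (not the housing data)
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_raw = X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, random_state=0)

# ... then learn the scaling from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training means and stds

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1; the test columns are only approximately standardized, because they are transformed with the training statistics.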

Perform the same linear regression on this data and print out R-squared and MSE.

In [31]:
# Your code here
lin = LinearRegression()
lin.fit(X_train, y_train)

r2_train = r2_score(y_train, lin.predict(X_train))
mse_train = mean_squared_error(y_train, lin.predict(X_train))

r2_test = r2_score(y_test, lin.predict(X_test))
mse_test = mean_squared_error(y_test, lin.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.8159543522240756
MSE Train: 1172370130.7166448

R2 Test: 0.7911321409501595
MSE Test: 1274556066.8671708


## Include dummy variables

Your model hasn't included dummy variables so far: let's use the "object" variables again and create dummies

In [32]:
# Create X_cat which contains only the categorical variables
X_cat = df.drop([col for col in list(df.columns) if df[col].dtype != 'object'], axis=1)

In [36]:
# Make dummies
X_cat = pd.get_dummies(X_cat)

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [40]:
# Your code here
X_full = pd.concat([X, X_cat], axis=1)
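To see what the dummy creation and the merge do, here is a minimal sketch on a three-row made-up frame (the `toy` frame and its values are hypothetical). Each category of the `Street` column becomes its own indicator column, and `pd.concat` with `axis=1` places those columns next to the numeric ones by row index.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

numeric = toy[["LotArea"]]
dummies = pd.get_dummies(toy[["Street"]])  # one indicator column per category

# Align on the row index and place the columns side by side
combined = pd.concat([numeric, dummies], axis=1)
print(combined.columns.tolist())
```

Both frames must share the same row index for the concatenation to line up, which holds in the lab because `X` and `X_cat` are derived from the same `df`.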

Perform the same linear regression on this data and print out R-squared and MSE.

In [44]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_full, y)

lin = LinearRegression()
lin.fit(X_train, y_train)

r2_train = r2_score(y_train, lin.predict(X_train))
mse_train = mean_squared_error(y_train, lin.predict(X_train))

r2_test = r2_score(y_test, lin.predict(X_test))
mse_test = mean_squared_error(y_test, lin.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.9328972187094093
MSE Train: 381323369.1022831

R2 Test: -1.9959881743844896e+21
MSE Test: 1.6231172092577727e+31


Notice the severe overfitting above: the training R-squared is quite high, but the testing R-squared is hugely negative, so the predictions on unseen data are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.
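One factor that may contribute to such unstable coefficients is that a full set of dummy columns always sums to 1, which makes it perfectly collinear with the model's intercept. The sketch below checks this on a tiny made-up categorical column (the `street` data is hypothetical); it illustrates the collinearity only and is not a diagnosis of the exact failure above.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two levels
street = pd.Series(["Pave", "Grvl", "Pave", "Grvl", "Pave"], name="Street")
dummies = pd.get_dummies(street).astype(float)

# Design matrix: intercept column plus one dummy column per level
design = np.column_stack([np.ones(len(street)), dummies.to_numpy()])

# The dummy columns add up to the intercept column, so one column is redundant
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```

A rank lower than the number of columns means the least-squares solution is not unique, which is one situation where a penalty on the coefficients can help.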

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to fit both a Lasso and a Ridge regression model. Each time, look at R-squared and MSE.
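For reference, these are the objectives that scikit-learn's `Ridge` and `Lasso` estimators minimize, as given in the scikit-learn documentation. Here $w$ is the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization strength passed as `alpha`; the intercept is not penalized.

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

The two loss terms are scaled differently, so the same value of `alpha` does not imply the same amount of shrinkage in both models.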

## Lasso

With default parameter (alpha = 1)

In [47]:
# Your code here
lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)

r2_train = r2_score(y_train, lasso.predict(X_train))
mse_train = mean_squared_error(y_train, lasso.predict(X_train))

r2_test = r2_score(y_test, lasso.predict(X_test))
mse_test = mean_squared_error(y_test, lasso.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.9487853986215966
MSE Train: 291035989.4066221

R2 Test: 0.824955239220053
MSE Test: 1423446126.878766


With a higher regularization parameter (alpha = 10)

In [48]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)

r2_train = r2_score(y_train, lasso.predict(X_train))
mse_train = mean_squared_error(y_train, lasso.predict(X_train))

r2_test = r2_score(y_test, lasso.predict(X_test))
mse_test = mean_squared_error(y_test, lasso.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.9471750421563708
MSE Train: 300187123.54689443

R2 Test: 0.8321990029645165
MSE Test: 1364540579.519689


## Ridge

With default parameter (alpha = 1)

In [49]:
# Your code here
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)

r2_train = r2_score(y_train, ridge.predict(X_train))
mse_train = mean_squared_error(y_train, ridge.predict(X_train))

r2_test = r2_score(y_test, ridge.predict(X_test))
mse_test = mean_squared_error(y_test, ridge.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.9350140404770928
MSE Train: 369294156.71022874

R2 Test: 0.8132270730648443
MSE Test: 1518818376.8943455


With a higher regularization parameter (alpha = 10)

In [50]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)

r2_train = r2_score(y_train, ridge.predict(X_train))
mse_train = mean_squared_error(y_train, ridge.predict(X_train))

r2_test = r2_score(y_test, ridge.predict(X_test))
mse_test = mean_squared_error(y_test, ridge.predict(X_test))

print('R2 Train: {}\nMSE Train: {}\n'.format(r2_train, mse_train))
print('R2 Test: {}\nMSE Test: {}'.format(r2_test, mse_test))

R2 Train: 0.9179677914960495
MSE Train: 466162467.7845314

R2 Test: 0.7975575439364618
MSE Test: 1646241388.2911267


## Look at the metrics, what are your main conclusions?   

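Among the four regularized models, Lasso with alpha = 10 produces the lowest MSE (about 1.36e9) and the highest R-squared (about 0.83) on the testing data. All four generalize far better than the unregularized model with dummy variables, whose testing R-squared was negative.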
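Rather than reading the four outputs side by side, the comparison can be collected into one table. The sketch below does this on synthetic data from `make_regression` (not the housing data), so its numbers say nothing about the results above; the `summary` frame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 30 features, only 5 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rows = []
for name, cls in [("Lasso", Lasso), ("Ridge", Ridge)]:
    for alpha in [1, 10]:
        model = cls(alpha=alpha).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rows.append({
            "model": name,
            "alpha": alpha,
            "r2_test": r2_score(y_te, pred),
            "mse_test": mean_squared_error(y_te, pred),
        })

# Best (lowest test MSE) model first
summary = pd.DataFrame(rows).sort_values("mse_test")
print(summary.to_string(index=False))
```

All models here share one train/test split, which keeps the comparison fair.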

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total number of parameters and draw conclusions!

In [57]:
# number of Ridge params almost zero (`ridge` is the last model fitted, alpha=10)
ridge_params_zero = len([param for param in ridge.coef_ if abs(param) < 10**(-10)])

In [58]:
# number of Lasso params almost zero (`lasso` is the last model fitted, alpha=10)
lasso_params_zero = len([param for param in lasso.coef_ if abs(param) < 10**(-10)])

Lasso effectively performed variable selection, shrinking about 24% of the coefficients to (essentially) zero, whereas Ridge did so for fewer than 3% of them!

In [59]:
# your code here
print('Lasso: ', lasso_params_zero/len(lasso.coef_))
print('Ridge: ', ridge_params_zero/len(ridge.coef_))

Lasso:  0.2422145328719723
Ridge:  0.02768166089965398
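The same count can be written with NumPy boolean masks instead of list comprehensions. The sketch below uses synthetic data in which most features carry no signal; the data, the `alpha=5.0` setting, and the `_demo` names are assumptions chosen for illustration, not values from the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data: 30 features, only 5 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients whose absolute value is below a small tolerance
tol = 1e-10
lasso_zero = int(np.sum(np.abs(lasso_demo.coef_) < tol))
ridge_zero = int(np.sum(np.abs(ridge_demo.coef_) < tol))

print("Lasso:", lasso_zero / lasso_demo.coef_.size)
print("Ridge:", ridge_zero / ridge_demo.coef_.size)
```

The L1 penalty can set coefficients exactly to zero, while the L2 penalty typically only shrinks them toward zero, which is why Lasso is often used for variable selection.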


## Summary

Great! You now know how to perform Lasso and Ridge regression.