# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice what you've learned about Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [63]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at df.info

In [64]:
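# Your code here
# Note: without parentheses, `df.info` shows the bound method (and the frame's repr);
# call `df.info()` to get the per-column summary instead
df.info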

<bound method DataFrame.info of         Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave   NaN      Reg   
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
2        3          60       RL         68.0    11250   Pave   NaN      IR1   
3        4          70       RL         60.0     9550   Pave   NaN      IR1   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   
5        6          50       RL         85.0    14115   Pave   NaN      IR1   
6        7          20       RL         75.0    10084   Pave   NaN      Reg   
7        8          60       RL          NaN    10382   Pave   NaN      IR1   
8        9          50       RM         51.0     6120   Pave   NaN      Reg   
9       10         190       RL         50.0     7420   Pave   NaN      Reg   
10      11          20       RL         70.0    11200   Pave   NaN      Reg   
11      12          

We'll make a first selection of the data by removing all columns with `dtype = object`. This way, our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [65]:
# Load necessary packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype != object and col != 'SalePrice']
# Copy so that the imputation below does not write to a slice of `df`
X = df[features].copy()
# Impute null values with each column's median
X = X.fillna(X.median())

# Create y
y = pd.DataFrame(df["SalePrice"])
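
The selection and imputation steps can be tried on a small frame first. The sketch below is only an illustration: the four-row DataFrame and its values are made up (they are not read from `train.csv`), and it uses `select_dtypes` as one possible alternative to the list comprehension above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the housing frame (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Fill each column's missing values with that column's median
X_toy = X_toy.fillna(X_toy.median())

# Store the target separately
y_toy = toy["SalePrice"]

print(X_toy)
```

The missing `LotFrontage` entry is replaced by the median of the three observed values. Assigning the result of `fillna` back to `X_toy` avoids modifying a slice of the original frame in place.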

Look at the information of `X` again

In [66]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

In [67]:
y

   SalePrice
0     208500
1     181500
2     223500
3     140000
4     250000
5     143000
6     307000
7     200000
8     129900
9     118000


## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the train and test sets.

In [68]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.20)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(Xtrain, ytrain)
y_train_pred = linreg.predict(Xtrain)
y_test_pred = linreg.predict(Xtest)


print(linreg.score(Xtrain, ytrain))
print(linreg.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.8146738880599219
0.79181822544734
1199672140.2734427
1173925595.82121


## Normalize your data

We haven't scaled our data yet. Let's create a new model that uses `preprocessing.scale` to standardize our predictors, so that each feature has zero mean and unit variance!

In [69]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)

Xtrain, Xtest, ytrain, ytest = train_test_split(X_scaled, y, test_size=.20)
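
Note that `preprocessing.scale` computes the means and standard deviations from all rows, including the ones that later end up in the test set. A common alternative, sketched below on made-up synthetic data, is to fit a `StandardScaler` on the training rows only and reuse it for the test rows. The names `X_demo`, `X_tr`, and `X_te` are placeholders introduced for this example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (made-up data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(100, 2))

# preprocessing.scale: every column ends up with mean 0 and standard deviation 1
X_all_scaled = preprocessing.scale(X_demo)
print(X_all_scaled.mean(axis=0).round(6), X_all_scaled.std(axis=0).round(6))

# Alternative: learn the scaling from the training rows only
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std
```

With this approach the test rows do not influence the scaling parameters, so the test score is a slightly more honest estimate of performance on unseen data.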

Perform the same linear regression on this data and print out R-squared and MSE.

In [70]:
# Your code here
linreg = LinearRegression()
linreg.fit(Xtrain, ytrain)
y_train_pred = linreg.predict(Xtrain)
y_test_pred = linreg.predict(Xtest)


print(linreg.score(Xtrain, ytrain))
print(linreg.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.8137426380367498
0.802067179813251
1167863274.2364774
1272825211.9813595
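
The scores before and after scaling differ slightly, but the two runs also use different random splits because `train_test_split` is called without `random_state`. For ordinary least squares with an intercept, rescaling the features should not change the predictions at all. The sketch below checks this on synthetic data from `make_regression` (a stand-in for the housing data); `holdout_r2` is a helper name made up for this example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem with deliberately uneven feature scales
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 0.1, 5.0])

def holdout_r2(features, seed):
    """Fit OLS on a split fixed by `seed` and return the test R squared."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y_demo, test_size=0.2, random_state=seed
    )
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

r2_raw = holdout_r2(X_demo, seed=42)
r2_scaled = holdout_r2(preprocessing.scale(X_demo), seed=42)
r2_other_split = holdout_r2(X_demo, seed=7)

print(r2_raw, r2_scaled)  # same split: the two scores agree up to rounding
print(r2_other_split)     # a different split generally gives a different score
```

Scaling starts to matter once a penalty on the coefficient sizes is added, as in Ridge and Lasso, because the penalty treats all coefficients alike regardless of the units of their features.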


## Include dummy variables

We haven't included dummy variables so far. Let's bring back our "object" variables and create dummies from them.

In [71]:
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype == object]
X_cat = df[features_cat]

In [72]:
# Make dummies
X_cat = pd.get_dummies(X_cat)

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [73]:
# Your code here
X_new = pd.concat([pd.DataFrame(X_scaled), X_cat], axis=1)
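
The dummy encoding and the merge can be checked on a tiny example. The frame below is made up for illustration, and `scaled` is a hand-written stand-in for the output of a scaling step. One detail worth noting: wrapping a NumPy array in `pd.DataFrame` drops the column names, so this sketch passes them back in explicitly.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# One indicator column per category level
dummies = pd.get_dummies(toy[["Street"]])
print(list(dummies.columns))

# Stand-in for a scaled NumPy array; restore column names and index
scaled = np.array([[-1.0], [0.0], [1.0]])
numeric = pd.DataFrame(scaled, columns=["LotArea"], index=toy.index)

# Column-wise merge: rows are aligned on the index
X_toy = pd.concat([numeric, dummies], axis=1)
print(X_toy.shape)
```

Keeping string column names throughout also avoids a frame whose column labels mix integers and strings, which recent scikit-learn versions may reject when fitting.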

Perform the same linear regression on this data and print out R-squared and MSE.

In [74]:
# Your code here
Xtrain, Xtest, ytrain, ytest = train_test_split(X_new, y, test_size=.20)

linreg = LinearRegression()
linreg.fit(Xtrain, ytrain)
y_train_pred = linreg.predict(Xtrain)
y_test_pred = linreg.predict(Xtest)


print(linreg.score(Xtrain, ytrain))
print(linreg.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.9349425449221082
-7.89940488025107e+18
412264110.0839041
4.884442735691153e+28


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (scaled features and dummy categorical variables) and perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.

## Lasso

With the default parameter (alpha = 1)

In [75]:
# Your code here
lasso = Lasso()
lasso.fit(Xtrain, ytrain)
y_train_pred = lasso.predict(Xtrain)
y_test_pred = lasso.predict(Xtest)


print(lasso.score(Xtrain, ytrain))
print(lasso.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.9348242982065984
0.889073096354922
413013430.4635839
685894845.1376156


With a higher regularization parameter (alpha = 10)

In [76]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(Xtrain, ytrain)
y_train_pred = lasso.predict(Xtrain)
y_test_pred = lasso.predict(Xtest)


print(lasso.score(Xtrain, ytrain))
print(lasso.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.9330353640820506
0.8964632638125635
424349768.99653864
640199187.9315116


## Ridge

With the default parameter (alpha = 1)

In [77]:
# Your code here
ridge = Ridge()
ridge.fit(Xtrain, ytrain)
y_train_pred = ridge.predict(Xtrain)
y_test_pred = ridge.predict(Xtest)


print(ridge.score(Xtrain, ytrain))
print(ridge.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.9213362551627169
0.8948112623404857
498486126.1846443
650414016.4052548


With a higher regularization parameter (alpha = 10)

In [78]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(Xtrain, ytrain)
y_train_pred = ridge.predict(Xtrain)
y_test_pred = ridge.predict(Xtest)


print(ridge.score(Xtrain, ytrain))
print(ridge.score(Xtest, ytest))
print(mean_squared_error(ytrain, y_train_pred))
print(mean_squared_error(ytest, y_test_pred))

0.897187982447313
0.8936678275228889
651511880.8681003
657484222.2919165
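
Rather than refitting by hand for each value of alpha, the comparison can be written as a loop. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing data, and the grid of alpha values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 60 features, only 10 of which carry signal (made-up stand-in)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=60, n_informative=10, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Map (model name, alpha) to (train R squared, test R squared)
scores = {}
for alpha in (0.1, 1, 10, 100):
    for model in (Lasso(alpha=alpha, max_iter=10000), Ridge(alpha=alpha)):
        model.fit(X_tr, y_tr)
        name = type(model).__name__
        scores[(name, alpha)] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for (name, alpha), (r2_train, r2_test) in scores.items():
    print(f"{name:5s} alpha={alpha:<5} train R2={r2_train:.3f}  test R2={r2_test:.3f}")
```

For both models the training R-squared can only go down as alpha grows, since a stronger penalty restricts the fit. The test R-squared is the number to watch when choosing alpha.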


## Look at the metrics, what are your main conclusions?

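All four regularized models fix the breakdown of the unregularized model: the test R-squared rises from a large negative value to between 0.889 and 0.896. Lasso with alpha = 10 has the highest test R-squared (0.896) and the lowest test MSE, while Ridge with alpha = 10 has the smallest gap between train and test scores (0.897 vs. 0.894). The test scores of the four models are close, so the ranking may change with a different train-test split.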

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [81]:
# number of Ridge params almost zero
# (ridge.coef_ is 2-D with shape (1, 289) because y is a DataFrame; np.sum counts over all entries)
print(np.sum(np.abs(ridge.coef_) < 10**(-10)))

8


In [59]:
len(lasso.coef_)

289

In [79]:
len(ridge.coef_[0])

289

In [80]:
print(sum(abs(ridge.coef_[0]) < 10**(-10)))

8


In [82]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

72


Compare with the total length of the parameter space and draw conclusions!

In [83]:
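# your code here
# Note: because y is a DataFrame, ridge.coef_ has shape (1, 289), so
# len(ridge.coef_) is 1 and this difference is not a meaningful comparison
(len(lasso.coef_) - len(ridge.coef_))

288

Both models have 289 coefficients. Lasso (alpha = 10) set 72 of them, about 25%, to zero, while only 8, about 3%, are that close to zero for Ridge (alpha = 10). Lasso therefore performs a form of feature selection, whereas Ridge shrinks coefficients but keeps almost all of them in the model.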
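
The same comparison can be sketched on synthetic data where the number of uninformative features is known. Everything below is an illustration under assumptions: the data come from `make_regression`, the value alpha = 2.0 is arbitrary, and `count_near_zero` is a helper name made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of them informative (made-up stand-in)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0
)

lasso_demo = Lasso(alpha=2.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=2.0).fit(X_demo, y_demo)

def count_near_zero(model, tol=1e-10):
    """Count coefficients whose absolute value is below `tol`.

    np.ravel flattens coef_, so this works whether coef_ has shape
    (n_features,) or (1, n_features).
    """
    return int(np.sum(np.abs(np.ravel(model.coef_)) < tol))

n_coef = X_demo.shape[1]
n_zero_lasso = count_near_zero(lasso_demo)
n_zero_ridge = count_near_zero(ridge_demo)

print(f"Lasso: {n_zero_lasso} of {n_coef} coefficients near zero")
print(f"Ridge: {n_zero_ridge} of {n_coef} coefficients near zero")
```

Dividing each count by the total number of coefficients gives the share of features a model effectively ignores, which is a more useful summary than comparing the lengths of the two coefficient arrays.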

## Summary

Great! You now know how to perform Lasso and Ridge regression.