# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [42]:
# Your code here

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`, so that our first model contains only numeric features, which we treat as **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.
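
As a rough illustration of these steps, the sketch below applies them to a tiny made-up dataframe. The name `toy` and its values are assumptions for the example, not rows from the housing data.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with one numeric feature, one categorical
# feature, and the target (values are illustrative only)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype=object),
    "SalePrice": [200000, 150000, 250000, 300000],
})

# Keep only non-object columns and drop the target from the predictors
X_toy = toy.select_dtypes(exclude=["object"]).drop(columns="SalePrice")

# Replace each missing value with the median of its column
X_toy = X_toy.fillna(X_toy.median())

# Store the target separately
y_toy = toy["SalePrice"]

print(X_toy)
```

The missing `LotFrontage` entry is filled with 80.0, the median of the three observed values.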

In [43]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Remove "object"-type features and SalePrice from `X`
X = df.select_dtypes(exclude=['object']).drop(columns='SalePrice')

# Impute null values with the median of each column
X = X.fillna(X.median())

# Create y
y = df["SalePrice"]


Look at the information of `X` again

In [44]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both train and test set.
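
To make the two metrics concrete, here is a small sketch that computes them by hand on made-up numbers and checks them against scikit-learn. The arrays `y_true` and `y_pred` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy target values and predictions (made up for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: the average squared residual
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should agree with scikit-learn's
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Note that MSE is expressed in squared units of the target, which is why the values for house prices are so large.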

In [45]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

r2_train = linreg.score(X_train, y_train)
r2_test = linreg.score(X_test, y_test)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
MSE_train = mean_squared_error(y_train, y_hat_train)
MSE_test = mean_squared_error(y_test, y_hat_test)


print(f"Train R^2: {r2_train}\nTest R^2: {r2_test}")
print(f"Train MSE: {MSE_train}\nTest MSE: {MSE_test}")







Train R^2: 0.8792812746143956
Test R^2: 0.45926221860708294
Train MSE: 768735679.7667004
Test MSE: 3266991413.4591403


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!
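
The sketch below shows what `preprocessing.scale` does on a small made-up matrix: each column ends up with mean 0 and unit variance. It also shows one possible alternative, assuming you would rather learn the scaling from the training rows only: fit a `StandardScaler` on the training rows and reuse it on the test rows. The array `X_toy` and the 3/1 row split are illustrative assumptions.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two columns on very different scales (made-up values)
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0],
                  [4.0, 4000.0]])

# preprocessing.scale standardizes each column to mean 0 and unit variance
X_toy_scaled = preprocessing.scale(X_toy)
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))

# Alternative: learn the mean and standard deviation on the training rows
# only, then apply the same transformation to the held-out row
scaler = StandardScaler().fit(X_toy[:3])
X_train_scaled = scaler.transform(X_toy[:3])
X_test_scaled = scaler.transform(X_toy[3:])
print(X_test_scaled)
```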

In [46]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)

X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [47]:
linreg_norm = LinearRegression()
linreg_norm.fit(X_scaled_train, y_scaled_train)

# Score the model that was fit on the scaled data
r2_scaled_train = linreg_norm.score(X_scaled_train, y_scaled_train)
r2_scaled_test = linreg_norm.score(X_scaled_test, y_scaled_test)

y_hat_scaled_train = linreg_norm.predict(X_scaled_train)
y_hat_scaled_test = linreg_norm.predict(X_scaled_test)
MSE_scaled_train = mean_squared_error(y_scaled_train, y_hat_scaled_train)
MSE_scaled_test = mean_squared_error(y_scaled_test, y_hat_scaled_test)

# Exact values vary from run to run because train_test_split is not seeded
print("For Normalized Dataset:")
print(f"Train R^2: {r2_scaled_train}\nTest R^2: {r2_scaled_test}")
print(f"Train MSE: {MSE_scaled_train}\nTest MSE: {MSE_scaled_test}")


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.
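
As a quick illustration of what creating dummies means, the sketch below one-hot encodes a single made-up categorical column with `pd.get_dummies`. The dataframe `toy` and its values are assumptions for the example.

```python
import pandas as pd

# Toy categorical column (values chosen for illustration only)
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# Each distinct category becomes its own indicator column,
# named <original column>_<category>
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

One column with two categories becomes two indicator columns, which is why the number of predictors grows so quickly once all categorical features are encoded.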

In [52]:
# Create X_cat which contains only the categorical variables
X_cat = df.select_dtypes(include=['object'])
# X_cat.shape = (1460, 43) -> 43 columns


In [53]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
# X_cat.shape = (1460, 252) -> 252 columns

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [55]:
# Your code here

X_all = pd.concat([pd.DataFrame(X_scaled), X_cat], axis = 1)

In [59]:
X_all.shape, X_scaled.shape, X_cat.shape

((1460, 289), (1460, 37), (1460, 252))

In [61]:
print(X_all)

             0         1         2         3         4         5         6  \
0    -1.730865  0.073375 -0.220875 -0.207142  0.651479 -0.517200  1.050994   
1    -1.728492 -0.872563  0.460320 -0.091886 -0.071836  2.179628  0.156734   
2    -1.726120  0.073375 -0.084636  0.073480  0.651479 -0.517200  0.984752   
3    -1.723747  0.309859 -0.447940 -0.096897  0.651479 -0.517200 -1.863632   
4    -1.721374  0.073375  0.641972  0.375148  1.374795 -0.517200  0.951632   
5    -1.719002 -0.163109  0.687385  0.360616 -0.795151 -0.517200  0.719786   
6    -1.716629 -0.872563  0.233255 -0.043379  1.374795 -0.517200  1.084115   
7    -1.714256  0.073375 -0.039223 -0.013513  0.651479  0.381743  0.057371   
8    -1.711883 -0.163109 -0.856657 -0.440659  0.651479 -0.517200 -1.333700   
9    -1.709511  3.147673 -0.902070 -0.310370 -0.795151  0.381743 -1.068734   
10   -1.707138 -0.872563  0.006190  0.068469 -0.795151 -0.517200 -0.207594   
11   -1.704765  0.073375  0.6873

Perform the same linear regression on this data and print out R-squared and MSE.

In [None]:
# Your code here


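# Recent scikit-learn versions reject dataframes whose column labels mix
# integers and strings, so convert all labels to strings first
X_all.columns = X_all.columns.astype(str)
X_all_train, X_all_test, y_all_train, y_all_test = train_test_split(X_all, y)

linreg_all = LinearRegression()
linreg_all.fit(X_all_train, y_all_train)

print(f"Train R^2: {linreg_all.score(X_all_train, y_all_train)}")
print(f"Test R^2: {linreg_all.score(X_all_test, y_all_test)}")
print(f"Train MSE: {mean_squared_error(y_all_train, linreg_all.predict(X_all_train))}")
print(f"Test MSE: {mean_squared_error(y_all_test, linreg_all.predict(X_all_test))}")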

Notice the severe overfitting above: our training R squared is quite high, but the testing R squared is negative, meaning the model predicts the test set worse than simply predicting the mean price would. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
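
A negative R squared can look surprising, so here is a small sketch on made-up numbers: predicting the mean everywhere gives an R squared of 0, and predictions that miss by more than the mean does push it below 0. The arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: predict the mean of the target for every observation
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))

# Predictions that are further from the truth than the mean is
y_bad = np.array([4.0, 3.0, 2.0, 1.0])
r2_bad = r2_score(y_true, y_bad)

print(r2_mean, r2_bad)
```

Here the residual sum of squares for `y_bad` is 20 while the total sum of squares is 5, so R squared is 1 - 20/5 = -3.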

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to perform Lasso and Ridge regression, each with two values of the regularization parameter. Each time, look at R-squared and MSE.
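
One way the four fits could be organized is sketched below. It assumes synthetic data from `make_regression` as a stand-in for the housing predictors and target, so the numbers it prints say nothing about the lab's dataset. The names `X_demo`, `y_demo`, and `results` are illustrative and not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: many features, few informative
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = [
    ("Lasso alpha=1", Lasso(alpha=1)),
    ("Lasso alpha=10", Lasso(alpha=10)),
    ("Ridge alpha=1", Ridge(alpha=1)),
    ("Ridge alpha=10", Ridge(alpha=10)),
]

# Fit each model and collect R^2 and MSE on both splits
results = {}
for name, model in models:
    model.fit(X_tr, y_tr)
    results[name] = {
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        "train_mse": mean_squared_error(y_tr, model.predict(X_tr)),
        "test_mse": mean_squared_error(y_te, model.predict(X_te)),
    }
    print(name, results[name])
```

In the lab itself, you would replace the synthetic arrays with your own train and test splits of the combined predictors.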

## Lasso

With default parameter (alpha = 1)

In [None]:
# Your code here

With a higher regularization parameter (alpha = 10)

In [None]:
# Your code here

## Ridge

With default parameter (alpha = 1)

In [None]:
# Your code here

With a higher regularization parameter (alpha = 10)

In [None]:
# Your code here

## Look at the metrics, what are your main conclusions?

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [None]:
# number of Ridge params almost zero

In [None]:
# number of Lasso params almost zero

Compare with the total number of parameters and draw conclusions!

In [None]:
# your code here
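
The comparison could be sketched as follows, again assuming synthetic data from `make_regression` in place of the housing predictors. The threshold `tol` for "almost zero" is a hypothetical choice, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in for the housing data: 50 features, 5 informative
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0,
                                 random_state=0)

ridge = Ridge(alpha=10).fit(X_demo, y_demo)
lasso = Lasso(alpha=10).fit(X_demo, y_demo)

# Hypothetical threshold below which a coefficient counts as "almost zero"
tol = 1e-10

n_ridge_zero = int(np.sum(np.abs(ridge.coef_) < tol))
n_lasso_zero = int(np.sum(np.abs(lasso.coef_) < tol))
n_params = len(lasso.coef_)

print(f"Ridge: {n_ridge_zero} of {n_params} coefficients almost zero")
print(f"Lasso: {n_lasso_zero} of {n_params} coefficients almost zero")
```

Lasso's L1 penalty can set coefficients exactly to zero, so it effectively performs feature selection, whereas Ridge's L2 penalty only shrinks coefficients toward zero.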

## Summary

Great! You now know how to perform Lasso and Ridge regression.