
L1 and L2 Regularization: Lasso and Ridge Regression

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
dataset = pd.read_csv('https://raw.githubusercontent.com/risitadas/machine-learning-in-python/main/melbourne-house-prices-csv.csv')
dataset.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Postcode,Regionname,Propertycount,Distance,CouncilArea
0,Abbotsford,49 Lithgow St,3,h,1490000.0,S,Jellis,01-04-2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council
1,Abbotsford,59A Turner St,3,h,1220000.0,S,Marshall,01-04-2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council
2,Abbotsford,119B Yarra St,3,h,1420000.0,S,Nelson,01-04-2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council
3,Aberfeldie,68 Vida St,3,h,1515000.0,S,Barry,01-04-2017,3040,Western Metropolitan,1543,7.5,Moonee Valley City Council
4,Airport West,92 Clydesdale Rd,2,h,670000.0,S,Nelson,01-04-2017,3042,Western Metropolitan,3464,10.4,Moonee Valley City Council


In [4]:
dataset.nunique()

Suburb             380
Address          57754
Rooms               14
Type                 3
Price             3417
Method               9
SellerG            476
Date               112
Postcode           225
Regionname           8
Propertycount      368
Distance           180
CouncilArea         34
dtype: int64

In [7]:
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 
               'Distance', 'CouncilArea', 'Price']
dataset = dataset[cols_to_use].copy()  # explicit copy, so later column assignments do not act on a slice

In [8]:
dataset.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Price
0,Abbotsford,3,h,S,Jellis,Northern Metropolitan,4019,3.0,Yarra City Council,1490000.0
1,Abbotsford,3,h,S,Marshall,Northern Metropolitan,4019,3.0,Yarra City Council,1220000.0
2,Abbotsford,3,h,S,Nelson,Northern Metropolitan,4019,3.0,Yarra City Council,1420000.0
3,Aberfeldie,3,h,S,Barry,Western Metropolitan,1543,7.5,Moonee Valley City Council,1515000.0
4,Airport West,2,h,S,Nelson,Western Metropolitan,3464,10.4,Moonee Valley City Council,670000.0


In [9]:
dataset.shape

(63023, 10)

Checking for NaN values

In [10]:
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           0
Propertycount        0
Distance             0
CouncilArea          0
Price            14590
dtype: int64

handling missing values

In [13]:
# Propertycount and Distance show no NaNs in the check above, so this fill is only a safeguard
cols_to_fill_zero = ['Propertycount', 'Distance']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)


Dropping rows where Price is missing

In [14]:
dataset = dataset.dropna(subset=['Price'])  # Price is the only column with missing values

In [15]:
dataset.shape

(48433, 10)
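
The fill-then-drop pattern above can be tried on a tiny frame first. This is only a sketch: the `toy` frame and its values are made up for illustration and are not taken from the Melbourne data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in a feature and in the target
toy = pd.DataFrame({
    "Rooms": [3, 2, 4, 3],
    "Distance": [3.0, np.nan, 7.5, 10.4],
    "Price": [1490000.0, np.nan, 1515000.0, np.nan],
})

# Fill the feature gap with 0, then drop only rows whose target is missing
toy["Distance"] = toy["Distance"].fillna(0)
toy_clean = toy.dropna(subset=["Price"])

print(toy_clean.shape)  # (2, 3)
```

Passing `subset` keeps the intent explicit: rows are removed because the target is unknown, not because of a gap in some other column.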

Let's one-hot encode the categorical features

In [16]:
dataset = pd.get_dummies(dataset, drop_first=True)

In [17]:
dataset.head()

Unnamed: 0,Rooms,Propertycount,Distance,Price,Suburb_Aberfeldie,Suburb_Airport West,Suburb_Albanvale,Suburb_Albert Park,Suburb_Albion,Suburb_Alphington,...,CouncilArea_Moreland City Council,CouncilArea_Murrindindi Shire Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
0,3,4019,3.0,1490000.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,3,4019,3.0,1220000.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,3,4019,3.0,1420000.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,1543,7.5,1515000.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,3464,10.4,670000.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
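
To see what `drop_first=True` does, here is a minimal sketch on a made-up three-row frame (the `toy` name and values are illustrative only). Each categorical column becomes one indicator column per category, minus the first category, which is then implied when all the others are 0.

```python
import pandas as pd

# Toy frame: one numeric column and one categorical column with three levels
toy = pd.DataFrame({
    "Rooms": [3, 2, 4],
    "Type": ["h", "u", "t"],
})

# dtype=int asks for 0/1 columns; recent pandas versions default to booleans
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['Rooms', 'Type_t', 'Type_u']
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for an unregularized linear model.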


Splitting the dataset into training and test sets

In [18]:
X = dataset.drop('Price', axis=1)
y = dataset['Price']

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

Training a linear regression model on the training set and checking its R² score on the test set

In [21]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

In [22]:
reg.score(X_test, y_test)

0.6604739391787537

In [23]:
reg.score(X_train, y_train)

0.660013083756118

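From the scores above, the training and test R² values are almost identical (about 0.66), so plain linear regression shows little sign of overfitting on this data. Overfitting would show up as a training score well above the test score.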
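
For a regressor, `score` returns R², so one way to look for overfitting is the gap between the training and test R². The sketch below recomputes R² by hand on synthetic data; the data, the `r2` helper and the `_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three features, known coefficients, some noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=2)
demo_reg = LinearRegression().fit(X_tr, y_tr)

def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

train_r2 = r2(y_tr, demo_reg.predict(X_tr))
test_r2 = r2(y_te, demo_reg.predict(X_te))
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}, gap = {train_r2 - test_r2:.3f}")
```

A large positive gap suggests the model has memorized the training data; a gap near zero suggests it generalizes about as well as it fits.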

Using a Lasso (L1-regularized) regression model

In [24]:
from sklearn import linear_model
lasso_reg = linear_model.Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(X_train, y_train)

Lasso(alpha=50, max_iter=100, tol=0.1)

In [25]:
lasso_reg.score(X_test, y_test)

0.6615824233343732

In [26]:
lasso_reg.score(X_train, y_train)

0.6554328357736653
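
scikit-learn's `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||² + alpha * ||w||₁`. The L1 term can push coefficients to exactly zero, which is why Lasso also acts as a feature selector. A small sketch on synthetic data, where only the first 3 of 20 features matter (the data and the two alpha values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only the first three influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -3.0, 2.0]
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=300)

zero_counts = {}
for alpha in (0.01, 0.5):
    demo_lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    zero_counts[alpha] = int(np.sum(demo_lasso.coef_ == 0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of 20 coefficients are exactly zero")

# Coefficients of the last fit (alpha=0.5): informative features should survive
strong_coef = demo_lasso.coef_
```

A larger `alpha` means a stronger penalty and usually more zeroed coefficients, at the cost of shrinking the useful ones too.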

Using a Ridge (L2-regularized) regression model

In [28]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(X_train, y_train)

Ridge(alpha=50, max_iter=100, tol=0.1)

In [29]:
ridge_reg.score(X_test, y_test)


0.6573232856316465

In [30]:
ridge_reg.score(X_train, y_train)

0.6471379169766673
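
scikit-learn's `Ridge` minimizes `||y - Xw||² + alpha * ||w||²`. Without an intercept this has the closed-form solution `w = (XᵀX + alpha·I)⁻¹ Xᵀy`. The sketch below checks that against `Ridge` on synthetic data (the data and `_demo` names are assumptions for illustration) and shows that the weights shrink but do not become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with five features and known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=100)

alpha = 50.0

# Closed-form ridge solution (no intercept term)
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(5), X_demo.T @ y_demo)

# Same problem solved by scikit-learn
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

# Ordinary least squares for comparison
w_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("OLS weight norm:  ", np.linalg.norm(w_ols))
print("Ridge weight norm:", np.linalg.norm(w_closed))
```

Unlike Lasso, the L2 penalty shrinks all weights smoothly toward zero rather than removing features outright.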

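Lasso and Ridge regularization help most when a plain linear regression model overfits, that is, when its training score is much higher than its test score.

On this dataset the differences are small: Lasso scores 0.6616 on the test set and Ridge 0.6573, against 0.6605 for plain linear regression. On datasets where the unregularized model overfits badly, the improvement can be much larger.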
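
How much regularization helps depends on `alpha`, which was fixed at 50 in the cells above. One common way to choose it is cross-validation over a few candidates. A sketch with `RidgeCV` on synthetic data; the candidate list and the data are assumptions made for illustration, not values tuned for the Melbourne dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data: ten features, two of them informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# Hypothetical candidate grid; RidgeCV keeps the alpha with the best CV score
candidate_alphas = [0.1, 1.0, 10.0, 50.0, 100.0]
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X_demo, y_demo)

print("selected alpha:", ridge_cv.alpha_)
```

`LassoCV` offers the same idea for the L1 penalty.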

L1 and L2 regularization are also used in neural networks.
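
In a network trained by gradient descent, the same penalties are added to the loss, so their gradients are added to each weight update. Below is a minimal NumPy sketch using a single linear layer as a stand-in for a network; the `train` helper, learning rate and penalty strengths are assumptions for illustration.

```python
import numpy as np

# Synthetic data: five inputs, two of them irrelevant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

def train(l1=0.0, l2=0.0, lr=0.05, steps=500):
    """Gradient descent on MSE + l1 * sum(|w|) + l2 * sum(w**2)."""
    w = np.zeros(5)
    for _ in range(steps):
        grad_mse = 2.0 * X_demo.T @ (X_demo @ w - y_demo) / len(y_demo)
        # Subgradient of the L1 term plus gradient of the L2 term
        grad_penalty = l1 * np.sign(w) + 2.0 * l2 * w
        w -= lr * (grad_mse + grad_penalty)
    return w

w_plain = train()
w_l2 = train(l2=0.5)
w_l1 = train(l1=0.5)

print("no penalty:", np.round(w_plain, 2))
print("L2 penalty:", np.round(w_l2, 2))
print("L1 penalty:", np.round(w_l1, 2))
```

Both penalties pull the weights toward zero; deep learning libraries typically expose the L2 version as weight decay.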