# Regularization

Overfitting is a common problem in machine learning: the model learns not only the underlying patterns in the training data but also its noise and random fluctuations.

It occurs when a model becomes too complex and fits the training data too closely, outliers included, so it performs well on the training set but poorly on unseen data.

L1 and L2 regularization are two techniques that can be used to reduce overfitting.
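To make the two penalties concrete, here is a minimal sketch of the objectives that scikit-learn's `Lasso` and `Ridge` minimize, written with plain NumPy. The function names `lasso_objective` and `ridge_objective` and the toy arrays are illustrative assumptions, not part of this notebook, and the intercept is left out for brevity.

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """L1-penalized loss: (1 / (2 * n)) * ||y - Xw||^2 + alpha * sum(|w|)."""
    n = X.shape[0]
    residual = y - X @ w
    return residual @ residual / (2 * n) + alpha * np.sum(np.abs(w))

def ridge_objective(X, y, w, alpha):
    """L2-penalized loss: ||y - Xw||^2 + alpha * sum(w^2)."""
    residual = y - X @ w
    return residual @ residual + alpha * np.sum(w ** 2)

# Toy data (hypothetical, for illustration only)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])

lasso_value = lasso_objective(X, y, w, alpha=1.0)
ridge_value = ridge_objective(X, y, w, alpha=1.0)
print(lasso_value, ridge_value)
```

In both cases a larger `alpha` makes large weights more expensive, which pushes the fit toward simpler models.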

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
# read dataset
df = pd.read_csv('Melbourne_housing_FULL.csv')
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [3]:
df.shape

(34857, 21)

In [12]:
# discarding unnecessary columns
cols_to_use = ['Suburb','Rooms','Type','Method','SellerG','Regionname','Propertycount',
                 'Distance','CouncilArea','Bathroom','Bedroom2','Car','Landsize','BuildingArea','Price']
# .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
df = df[cols_to_use].copy()
df.shape

(34857, 15)

In [13]:
# checking which columns contain null values (1 = has nulls, 0 = none)
df.isna().any().astype(int)

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       1
Propertycount    1
Distance         1
CouncilArea      1
Bathroom         1
Bedroom2         1
Car              1
Landsize         1
BuildingArea     1
Price            1
dtype: int32

In [16]:
# filling some null columns with 0 value
cols_to_fill_zero = ['Propertycount','Distance','Bedroom2','Bathroom','Car']
df[cols_to_fill_zero] = df[cols_to_fill_zero].fillna(0)
df.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        0
Distance             0
CouncilArea          3
Bathroom             0
Bedroom2             0
Car                  0
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [17]:
# filling some null columns with mean value
df['Landsize'] = df['Landsize'].fillna(df.Landsize.mean())
df['BuildingArea'] = df['BuildingArea'].fillna(df.BuildingArea.mean())
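The cell above computes the means over the whole dataset, before the train/test split. One alternative, sketched here on a hypothetical toy frame rather than the housing data, is to learn the means from the training rows only and reuse them on the test rows, for example with scikit-learn's `SimpleImputer`. The frames `train` and `test` below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for a train/test split
train = pd.DataFrame({"Landsize": [100.0, np.nan, 300.0],
                      "BuildingArea": [80.0, 120.0, np.nan]})
test = pd.DataFrame({"Landsize": [np.nan, 500.0],
                     "BuildingArea": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="mean")

# Column means are learned from the training rows only ...
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# ... and the same means are reused to fill the test rows
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```

Filling test rows with training means keeps information from the test set out of the preprocessing step.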

In [18]:
df.isna().sum()

Suburb              0
Rooms               0
Type                0
Method              0
SellerG             0
Regionname          3
Propertycount       0
Distance            0
CouncilArea         3
Bathroom            0
Bedroom2            0
Car                 0
Landsize            0
BuildingArea        0
Price            7610
dtype: int64

In [19]:
# dropping the rows that still contain null values
df.dropna(inplace=True)
df.isna().sum()

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bathroom         0
Bedroom2         0
Car              0
Landsize         0
BuildingArea     0
Price            0
dtype: int64

# Dummy Encoding

In [21]:
# text columns are replaced by dummy (one-hot) columns; drop_first drops one level per column
df = pd.get_dummies(df, drop_first=True)
df.head()

Unnamed: 0,Rooms,Propertycount,Distance,Bathroom,Bedroom2,Car,Landsize,BuildingArea,Price,Suburb_Aberfeldie,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,2,4019.0,2.5,1.0,2.0,1.0,202.0,160.2564,1480000.0,False,...,False,False,False,False,False,False,False,False,True,False
2,2,4019.0,2.5,1.0,2.0,0.0,156.0,79.0,1035000.0,False,...,False,False,False,False,False,False,False,False,True,False
4,3,4019.0,2.5,2.0,3.0,0.0,134.0,150.0,1465000.0,False,...,False,False,False,False,False,False,False,False,True,False
5,3,4019.0,2.5,2.0,3.0,1.0,94.0,160.2564,850000.0,False,...,False,False,False,False,False,False,False,False,True,False
6,4,4019.0,2.5,1.0,3.0,2.0,120.0,142.0,1600000.0,False,...,False,False,False,False,False,False,False,False,True,False
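As a small illustration of what `pd.get_dummies(..., drop_first=True)` does, here is a sketch on a hypothetical three-row frame (the frame `toy` is not part of the housing data). Numeric columns pass through unchanged, and each text column becomes one indicator column per category, minus the first category in sorted order.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
toy = pd.DataFrame({"Rooms": [2, 3, 4], "Type": ["h", "u", "t"]})

# 'Type' has categories h, t, u; drop_first removes the first one (h)
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

A row whose indicator columns are all false belongs to the dropped category, so no information is lost, and the redundant column that would be perfectly collinear with the others is avoided.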


In [22]:
x = df.drop('Price', axis='columns')
y = df['Price']

# Train Test Split

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, random_state=2)

In [24]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, Y_train)

In [25]:
reg.score(X_test, Y_test)

0.13853683161758845

In [26]:
reg.score(X_train, Y_train)

0.6827792395792723

The model is clearly overfitting. On the training set it performs reasonably well (R² of about 0.68), but on the unseen test set it performs very poorly (R² of about 0.14).
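The gap between training and test scores can be measured directly. The sketch below uses synthetic data from `make_regression` instead of the housing dataset, and the helper `r2_gap` is a hypothetical name introduced here. It fits an unregularized linear model with almost as many features as training rows, a setting where overfitting is expected.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def r2_gap(model, X_tr, y_tr, X_te, y_te):
    """Return (train R^2, test R^2, difference) for a fitted model."""
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    return train_r2, test_r2, train_r2 - test_r2

# Synthetic data (assumption): few rows, many mostly uninformative features
X_syn, y_syn = make_regression(n_samples=60, n_features=40, n_informative=5,
                               noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
train_r2, test_r2, gap = r2_gap(model, X_tr, y_tr, X_te, y_te)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}, gap = {gap:.3f}")
```

A large positive gap is the same symptom seen in the two scores above: the model fits the rows it was trained on much better than rows it has not seen.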

# Lasso (L1 Regularization)

In [27]:
from sklearn import linear_model
lasso_reg = linear_model.Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(X_train, Y_train)

In [28]:
lasso_reg.score(X_test, Y_test)

0.6636077695922897

In [29]:
lasso_reg.score(X_train, Y_train)

0.6766985871040827

With L1 regularization the test score rises to about 0.66, close to the training score of about 0.68. The gap between the two is now small, so the model generalizes much better to unseen data.
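One way to see why Lasso helps is to count how many coefficients it sets exactly to zero. The sketch below is an illustration on synthetic data, not the housing dataset. The value `alpha=5.0` and the `make_regression` settings are assumptions chosen so that the effect is easy to see.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 50 features, only 5 of them informative
X_syn, y_syn = make_regression(n_samples=200, n_features=50, n_informative=5,
                               noise=10.0, random_state=0)

lasso = Lasso(alpha=5.0, max_iter=10000).fit(X_syn, y_syn)
ridge = Ridge(alpha=5.0).fit(X_syn, y_syn)

# The L1 penalty can drive coefficients exactly to zero; the L2 penalty only shrinks them
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"Lasso zero coefficients: {lasso_zeros} of {lasso.coef_.size}")
print(f"Ridge zero coefficients: {ridge_zeros} of {ridge.coef_.size}")
```

On a one-hot encoded dataset with hundreds of dummy columns, this behaviour acts as a form of feature selection.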

# Ridge (L2 Regularization)

In [30]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(X_train, Y_train)

In [31]:
ridge_reg.score(X_test, Y_test)

0.6670848945194959

In [32]:
ridge_reg.score(X_train, Y_train)

0.6622376739684328
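Both models above use a hand-picked `alpha=50`. A possible refinement, sketched here on synthetic data, is to let cross-validation choose the penalty strength with scikit-learn's `RidgeCV` and `LassoCV`. The candidate grid `alphas` and the `make_regression` settings are assumptions, and a different grid would likely be needed for the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data (assumption), split the same way as in the notebook
X_syn, y_syn = make_regression(n_samples=200, n_features=30, n_informative=5,
                               noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          random_state=2)

# Candidate penalty strengths on a log scale (hypothetical grid)
alphas = np.logspace(-2, 2, 20)

ridge_cv = RidgeCV(alphas=alphas).fit(X_tr, y_tr)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_tr, y_tr)

print("Ridge: best alpha =", ridge_cv.alpha_, "test R^2 =", ridge_cv.score(X_te, y_te))
print("Lasso: best alpha =", lasso_cv.alpha_, "test R^2 =", lasso_cv.score(X_te, y_te))
```

The selected `alpha_` is chosen using the training rows only, so the test score remains an honest estimate of performance on unseen data.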