# L1 and L2 Regularization

Link to the YouTube video tutorial: https://www.youtube.com/watch?v=VqKq78PVO9g&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=18

1) A regression model in the cases of underfit, overfit and balanced fit respectively: <br />
<img src="hidden\issues.png" alt="This image visualizes a regression model in the cases of underfit, overfit and balanced fit respectively" style="width: 400px;"/>  <br />

2) To reduce overfitting:
    1) Here is the overfit line/regression model along with its equation: <br />
    <img src="hidden\overfitline.png" alt="This image visualizes an overfit regression model/line" style="width: 400px;"/>  <br />

    2) Shrink the parameters (Theta 0 to Theta 4), i.e. keep their values small, so that you get a better equation for your prediction function.  <br />
    <img src="hidden\overfitline2.png" alt="This image visualizes an overfit regression model/line, part2" style="width: 400px;"/>  <br />

3) Types of cost function:  <br />
    1) The cost function (MSE) used to determine the optimal regression model is shown below:  <br />
    <img src="hidden\costfunction.png" alt="This image shows MSE, part1" style="width: 400px;"/>  <br />
    <img src="hidden\costfunction2.png" alt="This image shows MSE, part2" style="width: 400px;"/>  <br />

    2) LASSO regression = linear regression with L1 regularization. Its cost function is formed by adding a penalty term, lambda times the sum of the absolute values of the theta parameters, to the MSE cost function used to find the optimal linear regression equation, as shown below:  <br />
    <img src="hidden\costfunction3.png" alt="This image shows the MSE cost function with the L1 (LASSO) penalty" style="width: 400px;"/>  <br />

    3) Ridge regression = linear regression with L2 regularization. Its cost function is formed by adding a penalty term, lambda times the sum of the squares of the theta parameters, to the MSE cost function used to find the optimal linear regression equation, as shown below:  <br />
    <img src="hidden\costfunction4.png" alt="This image shows the MSE cost function with the L2 (Ridge) penalty" style="width: 400px;"/>  <br />

4) Detailed information about **Regularization in Machine Learning (with Code Examples)**:  <br />
    https://www.dataquest.io/blog/regularization-in-machine-learning/
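
The three cost functions described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact objective any library uses: the helper names (`mse`, `lasso_cost`, `ridge_cost`), the toy data and the choice to penalize every theta are assumptions made for this example. Conventions differ between sources; scikit-learn's `Lasso`, for instance, scales the squared-error term by 1/(2n) and calls lambda `alpha`.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear prediction X @ theta."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

def lasso_cost(theta, X, y, lam):
    """MSE plus the L1 penalty: lam * sum(|theta_i|)."""
    return mse(theta, X, y) + lam * np.sum(np.abs(theta))

def ridge_cost(theta, X, y, lam):
    """MSE plus the L2 penalty: lam * sum(theta_i ** 2)."""
    return mse(theta, X, y) + lam * np.sum(theta ** 2)

# toy data: 3 samples, 2 features (made-up numbers, for illustration only)
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y_demo = np.array([5.0, 4.0, 8.0])
theta_demo = np.array([2.0, 1.0])

plain = mse(theta_demo, X_demo, y_demo)
l1 = lasso_cost(theta_demo, X_demo, y_demo, lam=0.5)
l2 = ridge_cost(theta_demo, X_demo, y_demo, lam=0.5)
print(plain, l1, l2)
```

For the same theta, both penalized costs are larger than the plain MSE, so an optimizer minimizing them is pushed toward smaller parameter values.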
  <br />
# Load the dataset

In [69]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# suppress warnings for clean notebook
import warnings 
warnings.filterwarnings('ignore')

# load the dataset from a CSV file into a dataframe called dataset
dataset = pd.read_csv("Melbourne_housing_FULL.csv")
dataset.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


# Data exploration

In [70]:
# show the number of distinct elements (unique values) in each column of the dataset dataframe
dataset.nunique()

Suburb             351
Address          34009
Rooms               12
Type                 3
Price             2871
Method               9
SellerG            388
Date                78
Distance           215
Postcode           211
Bedroom2            15
Bathroom            11
Car                 15
Landsize          1684
BuildingArea       740
YearBuilt          160
CouncilArea         33
Lattitude        13402
Longtitude       14524
Regionname           8
Propertycount      342
dtype: int64

In [71]:
# show the shape of the dataframe
dataset.shape

(34857, 21)

In [72]:
# cols_to_use lists the columns of the dataframe that contain useful data
cols_to_use = ['Suburb','Rooms','Type','Method','SellerG','Regionname','Propertycount',
               'Distance','CouncilArea','Bedroom2','Bathroom','Car','Landsize','BuildingArea','Price']

# keep only the columns listed in cols_to_use; .copy() makes later in-place edits act on an independent dataframe
dataset = dataset[cols_to_use].copy()

dataset.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,,
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,202.0,,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,156.0,79.0,1035000.0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,0.0,,
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,134.0,150.0,1465000.0


In [73]:
# check the number of missing values (NA/NaN) in each column of the dataset dataframe
dataset.isna().sum() 

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

# Data preprocessing

In [74]:
cols_to_fill_zero = ['Propertycount','Distance','Bedroom2','Bathroom','Car']

# fill the missing values (NA/NaN) in the columns listed in cols_to_fill_zero with 0
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)

# check the number of missing values (NA/NaN) in each column of the dataset dataframe
dataset.isna().sum() 

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        0
Distance             0
CouncilArea          3
Bedroom2             0
Bathroom             0
Car                  0
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [75]:
# fill the missing values (NA/NaN) in the Landsize column with the mean of the available Landsize values
dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())

# fill the missing values (NA/NaN) in the BuildingArea column with the mean of the available BuildingArea values
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())

# check the number of missing values (NA/NaN) in each column of the dataset dataframe
dataset.isna().sum() 

Suburb              0
Rooms               0
Type                0
Method              0
SellerG             0
Regionname          3
Propertycount       0
Distance            0
CouncilArea         3
Bedroom2            0
Bathroom            0
Car                 0
Landsize            0
BuildingArea        0
Price            7610
dtype: int64
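
The two filling strategies used above can be compared on a tiny dataframe. This is only a sketch: the `toy` frame and its values are made up for illustration, and which strategy suits which column is a judgment call that depends on the data.

```python
import numpy as np
import pandas as pd

# made-up values; the column names mirror two columns of the housing dataset
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0],
    "Landsize": [100.0, np.nan, 300.0],
})

filled = toy.copy()
# strategy 1: treat a missing count as zero
filled["Car"] = filled["Car"].fillna(0)
# strategy 2: replace a missing measurement with the column mean (NaN is skipped when computing the mean)
filled["Landsize"] = filled["Landsize"].fillna(filled["Landsize"].mean())

print(filled)
```

Filling with the mean keeps the column average unchanged, while filling with 0 pulls it down, which is why 0 is usually reserved for columns where "none" is a plausible value.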

In [76]:
# drop the rows of the dataset dataframe that still contain a missing value (NA/NaN)
dataset.dropna(inplace=True)

# check the number of missing values (NA/NaN) in each column of the dataset dataframe
dataset.isna().sum() 

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
Price            0
dtype: int64
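
Note that `dropna()` with its default `axis=0` removes rows, not columns. A small sketch on a made-up dataframe (the `toy` frame is an assumption for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# made-up values: one row has a missing Price
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1480000.0, np.nan, 1600000.0],
})

rows_dropped = toy.dropna()        # default axis=0: drop rows that contain NaN
cols_dropped = toy.dropna(axis=1)  # axis=1: drop columns that contain NaN

print(rows_dropped.shape)
print(cols_dropped.shape)
```

With the default, the row with the missing price is removed and the `Price` column is kept, which is what the notebook relies on, since `Price` is the prediction target.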

In [77]:
# encode the categorical text columns of the dataset dataframe as dummy (one-hot) columns using get_dummies, so that no text columns remain in the dataframe.
# Columns that are already numeric remain unchanged.
# Set drop_first=True to avoid the dummy variable trap: the first dummy column of each categorical feature is dropped.
dataset = pd.get_dummies(dataset,dtype='int',drop_first=True)

dataset.head()

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price,Suburb_Aberfeldie,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,2,4019.0,2.5,2.0,1.0,1.0,202.0,160.2564,1480000.0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,4019.0,2.5,2.0,1.0,0.0,156.0,79.0,1035000.0,0,...,0,0,0,0,0,0,0,0,1,0
4,3,4019.0,2.5,3.0,2.0,0.0,134.0,150.0,1465000.0,0,...,0,0,0,0,0,0,0,0,1,0
5,3,4019.0,2.5,3.0,2.0,1.0,94.0,160.2564,850000.0,0,...,0,0,0,0,0,0,0,0,1,0
6,4,4019.0,2.5,3.0,1.0,2.0,120.0,142.0,1600000.0,0,...,0,0,0,0,0,0,0,0,1,0
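
The effect of `drop_first=True` is easier to see on a small example. The `toy` dataframe below is made up for illustration; only the `get_dummies` call mirrors the notebook.

```python
import pandas as pd

# made-up rows: one numeric column and one text column with three categories
toy = pd.DataFrame({"Rooms": [2, 3, 3], "Type": ["h", "u", "t"]})

encoded = pd.get_dummies(toy, dtype="int", drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The three categories of `Type` become two dummy columns. The dropped category (`h`, the first in sorted order) is represented by a row in which both remaining dummy columns are 0, so no information is lost.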


In [78]:
# load the independent variables of the dataset into variable X
X = dataset.drop('Price',axis=1)

# load the dependent variable of the dataset into variable Y
Y = dataset.Price

In [79]:
# split the dataset into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3,random_state=2)
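
As a quick illustration of what `test_size` and `random_state` do, the sketch below splits a made-up array (the names `X_toy` and `y_toy` are assumptions for this example) and repeats the split with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=2)
print(X_tr.shape, X_te.shape)

# repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=2)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

`test_size=0.3` holds out 30% of the rows for testing, and fixing `random_state` makes the shuffle repeatable, so the scores reported below can be reproduced.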

# Develop the machine learning model (Linear regression model)

In [80]:
from sklearn.linear_model import LinearRegression

# create and train the linear regression model
reg = LinearRegression().fit(X_train,Y_train)

In [81]:
# show the score of the trained model on the test set (for regressors, score() returns R^2, the coefficient of determination)
print('The accuracy of the trained model over test set:',reg.score(X_test,Y_test))

# show the R^2 score of the trained model on the train set
print('The accuracy of the trained model over train set:',reg.score(X_train,Y_train))

'''
The test score is far lower than the train score. This shows that the model is overfit.
'''

The accuracy of the trained model over test set: 0.1385368316157104
The accuracy of the trained model over train set: 0.6827792395792723


'\nThe test score is far lower than the train score. This shows that the model is overfit.\n'
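
The value returned by `score()` on a scikit-learn regressor is the R² coefficient of determination rather than a classification-style accuracy. The sketch below checks this on synthetic data; the data and the names `X_toy`, `y_toy` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

score = model.score(X_toy, y_toy)
via_metric = r2_score(y_toy, pred)
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
by_hand = 1 - np.sum((y_toy - pred) ** 2) / np.sum((y_toy - y_toy.mean()) ** 2)

print(score, via_metric, by_hand)
```

R² is 1 for a perfect fit, 0 for a model that always predicts the mean, and can be negative on unseen data, so a large gap between the train and test values is the overfitting signal to look for.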

# Solutions to the model overfitting issue

## LASSO Regression (L1 Regularization)

### Develop the machine learning model (Linear regression model) using LASSO Regression

In [82]:
from sklearn import linear_model

# create the LASSO regression model; alpha is the regularization strength (the lambda of the cost function).
# lasso_reg is a linear regression model trained with L1 regularization.
# Its cost function adds lambda times the sum of the absolute theta values to the MSE cost function used to find the optimal regression equation.
lasso_reg = linear_model.Lasso(alpha=50, max_iter=100,tol=0.1)

# fit the regularized regression model on the train set
lasso_reg.fit(X_train,Y_train)

### Evaluate the score of the linear regression model trained with L1 Regularization

In [83]:
# show the R^2 score of the trained model on the test set
print('The accuracy of the trained model developed with L1 regularization over test set:',lasso_reg.score(X_test,Y_test))

# show the R^2 score of the trained model on the train set
print('The accuracy of the trained model developed with L1 regularization over train set:',lasso_reg.score(X_train,Y_train))

'''
The scores for the test set and the train set are now close to each other. This shows that L1 regularization alleviates the overfitting issue of the model (as shown above).
'''

The accuracy of the trained model developed with L1 regularization over test set: 0.6636111369404488
The accuracy of the trained model developed with L1 regularization over train set: 0.6766985624766824


'\nThe scores for the test set and the train set are now close to each other. This shows that L1 regularization alleviates the overfitting issue of the model (as shown above).\n'
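
One characteristic of the L1 penalty is that it can set some coefficients exactly to zero. The sketch below illustrates this on synthetic data in which only two of ten features influence the target; the data, the names and `alpha=0.5` are assumptions for this example and are unrelated to the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# synthetic data: only the first two of ten features affect the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:2] = [3.0, -2.0]
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

# count coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ols, n_zero_lasso)
```

Plain least squares assigns a small non-zero weight to every noise feature, while LASSO removes most of them, which is one way it reduces the variance that shows up as overfitting.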

## Ridge Regression (L2 Regularization)

### Develop the machine learning model (Linear regression model) using Ridge Regression

In [84]:
from sklearn.linear_model import Ridge

# create the Ridge regression model; alpha is the regularization strength (the lambda of the cost function).
# ridge_reg is a linear regression model trained with L2 regularization.
# Its cost function adds lambda times the sum of the squared theta values to the MSE cost function used to find the optimal linear regression equation.
ridge_reg = Ridge(alpha=50, max_iter=100,tol=0.1)

# fit the regularized regression model on the train set
ridge_reg.fit(X_train,Y_train)

### Evaluate the score of the linear regression model trained with L2 Regularization

In [85]:
# show the R^2 score of the trained model on the test set
print('The accuracy of the trained model developed with L2 regularization over test set:',ridge_reg.score(X_test,Y_test))

# show the R^2 score of the trained model on the train set
print('The accuracy of the trained model developed with L2 regularization over train set:',ridge_reg.score(X_train,Y_train))

'''
The scores for the test set and the train set are now close to each other. This shows that L2 regularization alleviates the overfitting issue of the model (as shown above).
'''

The accuracy of the trained model developed with L2 regularization over test set: 0.6670848945194958
The accuracy of the trained model developed with L2 regularization over train set: 0.6622376739684328


'\nThe scores for the test set and the train set are now close to each other. This shows that L2 regularization alleviates the overfitting issue of the model (as shown above).\n'
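
Unlike the L1 penalty, the L2 penalty shrinks coefficients toward zero without usually making them exactly zero. The sketch below fits Ridge with increasing `alpha` on synthetic data and records the size of the coefficient vector; the data, the names and the `alphas` list are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# synthetic data: five features with different true weights plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.5, -1.0]) + rng.normal(scale=0.5, size=100)

alphas = [0.1, 10.0, 1000.0]
norms = []
for a in alphas:
    ridge = Ridge(alpha=a).fit(X_toy, y_toy)
    # Euclidean length of the coefficient vector for this alpha
    norms.append(float(np.linalg.norm(ridge.coef_)))

print(dict(zip(alphas, norms)))
```

The coefficient vector gets shorter as `alpha` grows, which is the "keep the parameters small" idea from the introduction. The value `alpha=50` used in the notebook is one point on this trade-off, and a different dataset may call for a different value.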