In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import linear_model

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Melbourne_housing_FULL.csv')

In [3]:
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [4]:
#finding unique values

df.nunique()

Suburb             351
Address          34009
Rooms               12
Type                 3
Price             2871
Method               9
SellerG            388
Date                78
Distance           215
Postcode           211
Bedroom2            15
Bathroom            11
Car                 15
Landsize          1684
BuildingArea       740
YearBuilt          160
CouncilArea         33
Lattitude        13402
Longtitude       14524
Regionname           8
Propertycount      342
dtype: int64

In [5]:
# Keep only the columns that are relevant for predicting Price
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 
               'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']
# .copy() gives an independent DataFrame, so the column assignments below do not act on a view of df
dataset = df[cols_to_use].copy()

In [6]:
dataset.isnull().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [7]:
# Handling missing values

# For some features a missing value can be treated as zero, i.e. as a separate
# "not available / feature absent" class:
#   - 0 for Propertycount or Bedroom2 marks the value as not recorded
#   - 0 for Car means the house has no car parking spot
cols_to_fill_zero = ['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)

# The remaining continuous features are imputed with their mean for speed, since the focus here
# is on reducing overfitting with Lasso and Ridge regression
dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())
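
--> The two imputation strategies can be tried on a small example. This is a minimal sketch on a hypothetical toy frame (the values are made up and only the column names match the housing data); it is not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, np.nan],
    "Landsize": [100.0, 200.0, np.nan, 300.0],
})

# Work on a copy so the original toy frame stays untouched
filled = toy.copy()

# Zero-fill: a missing Car value is read as "no parking spot"
filled["Car"] = filled["Car"].fillna(0)

# Mean-fill: mean() skips NaN, so only the observed values are averaged
landsize_mean = toy["Landsize"].mean()
filled["Landsize"] = filled["Landsize"].fillna(landsize_mean)

print(filled)
```

Mean imputation keeps every row but pulls the filled values toward the centre of the distribution, which is acceptable here because the imputation is not the focus of the notebook.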


In [8]:
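# Drop the rows where Price is missing: it is the target variable, so we do not impute it.
# Note: dropna() without arguments also drops the few rows still missing Regionname or CouncilArea.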

dataset.dropna(inplace=True)
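
--> dropna() with no arguments removes a row if any column is missing, not only Price. The sketch below contrasts that with the subset argument on a hypothetical three-row frame (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row lacks Regionname, another lacks Price
toy = pd.DataFrame({
    "Regionname": ["North", None, "South"],
    "Price": [500000.0, 600000.0, np.nan],
})

# No arguments: a row is removed if ANY column is missing
any_na_dropped = toy.dropna()

# subset=['Price']: only rows with a missing target are removed
price_na_dropped = toy.dropna(subset=["Price"])

print(len(any_na_dropped), len(price_na_dropped))
```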

In [9]:
dataset.shape

(27244, 15)

In [10]:
dataset.isnull().sum()

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
Price            0
dtype: int64

In [11]:
#finding categorical columns

dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27244 entries, 1 to 34856
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         27244 non-null  object 
 1   Rooms          27244 non-null  int64  
 2   Type           27244 non-null  object 
 3   Method         27244 non-null  object 
 4   SellerG        27244 non-null  object 
 5   Regionname     27244 non-null  object 
 6   Propertycount  27244 non-null  float64
 7   Distance       27244 non-null  float64
 8   CouncilArea    27244 non-null  object 
 9   Bedroom2       27244 non-null  float64
 10  Bathroom       27244 non-null  float64
 11  Car            27244 non-null  float64
 12  Landsize       27244 non-null  float64
 13  BuildingArea   27244 non-null  float64
 14  Price          27244 non-null  float64
dtypes: float64(8), int64(1), object(6)
memory usage: 3.3+ MB


In [12]:
# One-hot encode the categorical (object) columns so that the linear models receive numeric input
dataset = pd.get_dummies(dataset, drop_first=True)


--> pd.get_dummies() is a pandas function that converts each categorical variable into a set of binary (0 or 1) columns, one per category. This process is known as one-hot encoding. Most machine learning algorithms require numerical input, so categorical data has to be converted into a numerical format first.


--> With drop_first=True, one of the dummy columns created for each categorical feature is dropped to avoid the dummy variable trap.

--> What is the Dummy Variable Trap?

The dummy variable trap occurs when the dummy columns of a feature are perfectly collinear: in every row they sum to 1, so any one of them can be computed exactly from the others, and their sum duplicates the intercept. This multicollinearity causes problems in statistical models, especially linear regression, because the coefficients are no longer uniquely determined and become unstable and unreliable.

To avoid this, we use drop_first=True in pd.get_dummies(). This parameter drops the first category of each feature, so that category serves as the baseline and the others are represented relative to it.

--> Why Use drop_first=True?

Avoiding Multicollinearity: It removes the exact linear dependence among the dummy columns of each feature, which keeps the coefficients of a linear model well defined.
Reducing Dimensionality: It creates one fewer column per categorical feature, a small reduction in the number of model inputs.
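
--> The effect of drop_first can be seen on a single column. This is a small sketch with a hypothetical Type column; the labels 'h', 't' and 'u' are assumed for illustration.

```python
import pandas as pd

# Hypothetical categorical column with three categories
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

# One dummy column per category
full = pd.get_dummies(toy)

# First category in sorted order ('h') is dropped and becomes the baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# Every row of the full encoding sums to 1: this is the exact
# linear dependence behind the dummy variable trap
print(full.sum(axis=1).tolist())
```

In the reduced encoding, a row with zeros in all remaining Type columns is a house of the baseline category.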

In [14]:
dataset.head()

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price,Suburb_Aberfeldie,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,2,4019.0,2.5,2.0,1.0,1.0,202.0,160.2564,1480000.0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,4019.0,2.5,2.0,1.0,0.0,156.0,79.0,1035000.0,0,...,0,0,0,0,0,0,0,0,1,0
4,3,4019.0,2.5,3.0,2.0,0.0,134.0,150.0,1465000.0,0,...,0,0,0,0,0,0,0,0,1,0
5,3,4019.0,2.5,3.0,2.0,1.0,94.0,160.2564,850000.0,0,...,0,0,0,0,0,0,0,0,1,0
6,4,4019.0,2.5,3.0,1.0,2.0,120.0,142.0,1600000.0,0,...,0,0,0,0,0,0,0,0,1,0


In [15]:
X = dataset.drop('Price', axis = 1)
Y = dataset['Price']

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2)

In [17]:
#model training linear regression
reg = linear_model.LinearRegression()

In [18]:
reg.fit(X_train, Y_train)

LinearRegression()

In [22]:
#training data score
reg.score(X_train, Y_train)

0.6827792395792723

In [23]:
#test data score
reg.score(X_test, Y_test)

0.13853683161562136

Here the training R² score is 0.68 but the test R² score is only 0.14, which is very low.
Plain linear regression is clearly overfitting the data, so let's try regularized models.
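
--> For a regressor, score() returns the coefficient of determination R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (the data and coefficients are made up for illustration) and compares it with score().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem, not the housing data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

On unseen data R² can even be negative, which means the model predicts worse than always guessing the mean of the test targets.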

In [36]:
#Using Lasso (L1 Regularized) Regression Model

reg1 = linear_model.Lasso(alpha=50, max_iter=100, tol=0.1)

The line above creates an instance of a Lasso regression model using the Lasso class from scikit-learn's linear_model module. Let’s break down each of the parameters used:

Parameters Explained

alpha=50:

The alpha parameter controls the strength of the regularization. In Lasso regression (L1 regularization), it determines how heavily the absolute values of the coefficients are penalized.
A higher alpha value means more regularization, which shrinks more coefficients to exactly zero. This leads to a simpler model but may underfit if alpha is set too high.
In this case, alpha=50 is well above scikit-learn's default of 1.0; how strong the effect actually is depends on the scale of the target and the features.

max_iter=100:

The max_iter parameter sets the maximum number of iterations the solver may use when searching for the optimal coefficients.
It caps the computation time if the model does not converge quickly.
Here, max_iter=100 means the algorithm runs for at most 100 iterations (the default is 1000).

tol=0.1:

The tol parameter is the tolerance for the optimization: it controls how close to the optimal solution the algorithm needs to be before it stops iterating.
A higher tol value allows a less precise solution, which can speed up computation but may give a slightly less accurate model.
tol=0.1 is far more lenient than the default of 1e-4: once the coefficient updates fall below tol, the solver checks the duality gap and stops when that is below tol as well.
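
The effect of alpha can be illustrated on synthetic data. This is a rough sketch, assuming a made-up problem in which only the first two of ten features carry signal; the alpha values are chosen for this toy scale and are not recommendations for the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only the first two influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in (0.01, 0.5, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    # Count the coefficients the L1 penalty has set exactly to zero
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(alpha, zero_counts[alpha])
```

As alpha grows, more coefficients are driven to zero; with a large enough alpha every coefficient is zero and the model predicts only the mean.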

In [37]:
reg1.fit(X_train,Y_train)

Lasso(alpha=50, max_iter=100, tol=0.1)

In [38]:
#training data score
reg1.score(X_train, Y_train)

0.6766985624766824

In [39]:
#test data score
reg1.score(X_test, Y_test)

0.6636111369404488

In [40]:
#Using Ridge (L2 Regularized) Regression Model

reg2 = linear_model.Ridge(alpha=50, max_iter=100, tol=0.1)

In [41]:
reg2.fit(X_train, Y_train)

Ridge(alpha=50, max_iter=100, tol=0.1)

In [42]:
#training data score
reg2.score(X_train, Y_train)

0.6622376739684328

In [43]:
#test data score
reg2.score(X_test, Y_test)

0.6670848945194959

We see that Lasso and Ridge regularization are beneficial when a plain linear regression model overfits: the test R² rises from 0.14 to about 0.66-0.67 while the training R² barely changes. The gap will not always be this large, but regularization often makes a significant difference. L1 and L2 regularization are also used in neural networks.
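
The same three-model comparison can be reproduced without the CSV file. The following is a minimal sketch on synthetic data with many irrelevant features; the data, the alpha values and the resulting scores are illustrative assumptions and do not reproduce the numbers above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 60 features, only the first two influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 60))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),   # L1 penalty
    "ridge": Ridge(alpha=10.0),  # L2 penalty
}

scores = {}
zero_coefs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # (training R^2, test R^2) for each model
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    # Number of coefficients that are exactly zero
    zero_coefs[name] = int(np.sum(model.coef_ == 0))
    print(name, scores[name], zero_coefs[name])
```

Unregularized least squares always has the highest training R², because that is exactly what it optimizes; the regularized models give up some training fit in exchange for smaller coefficients. Lasso sets some coefficients exactly to zero, while Ridge only shrinks them.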