
# Getting started

I made this notebook for learning purposes. It is based on the code in the YouTube lesson below, with my own code modifications and markdown notes added here and there:  
https://youtu.be/DY10uyDy3vQ?si=Du8dIucJMRAq9B2L

## Import module

In [1]:
import pandas as pd

## Import dataset from Kaggle to Jupyter Notebook directory

In [2]:
!kaggle datasets download -d dansbecker/melbourne-housing-snapshot

Dataset URL: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot
License(s): CC-BY-NC-SA-4.0
Downloading melbourne-housing-snapshot.zip to /kaggle/working
100%|█████████████████████████████████████████| 451k/451k [00:00<00:00, 921kB/s]
100%|█████████████████████████████████████████| 451k/451k [00:00<00:00, 919kB/s]


## Unzip dataset file to the same directory

In [3]:
import zipfile

# Extract the archive into the current working directory
with zipfile.ZipFile('melbourne-housing-snapshot.zip') as z:
    z.extractall()

## Setting DataFrame display

In [4]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 30)

## Read csv file as Pandas DataFrame

In [5]:
df = pd.read_csv('melb_data.csv')
df.head()

,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


# Data Exploration

## Dimension of dataset

In [6]:
df.shape

(13580, 21)

## List of dataset columns

In [7]:
df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

## Summary of dataset

In [8]:
df.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


Suppose we want to find the largest `Landsize` in the dataset. Each of the following three expressions returns the same value:

In [9]:
df.describe().loc['max', 'Landsize']

433014.0

In [10]:
df.describe()['Landsize']['max']

433014.0

In [11]:
df['Landsize'].max()

433014.0
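
As a small sketch of the same idea on made-up data (the `toy` DataFrame and its values are assumptions for illustration, not rows from the real dataset), the three expressions agree, and `idxmax()` can additionally tell us which row holds the maximum:

```python
import pandas as pd

# Toy data standing in for the real DataFrame (values are made up)
toy = pd.DataFrame({'Landsize': [202.0, 156.0, 134.0], 'Rooms': [2, 2, 3]})

via_describe_loc = toy.describe().loc['max', 'Landsize']
via_describe_idx = toy.describe()['Landsize']['max']
via_column = toy['Landsize'].max()   # cheapest: no full summary is computed

# idxmax() returns the index label of the row holding the maximum
largest_row = toy.loc[toy['Landsize'].idxmax()]

print(via_describe_loc, via_describe_idx, via_column)
print(largest_row)
```

Calling `.max()` on the column directly is usually the simplest choice, since `describe()` computes every summary statistic just to read one of them.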

## Data type of each column

In [12]:
df.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

## Number of null/NaN/missing values in DataFrame

In [13]:
df.isnull().sum().to_frame() # alternative: df.isna().sum()

,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0
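
The `count` row of `df.describe()` above is lower than 13580 for `Car`, `BuildingArea` and `YearBuilt`, which suggests those columns contain missing values. A minimal sketch of how to list only the columns with gaps, using a made-up `toy` DataFrame (its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with a few deliberately missing entries (values are made up)
toy = pd.DataFrame({
    'Car': [1.0, np.nan, 2.0, 0.0],
    'BuildingArea': [np.nan, 79.0, np.nan, 142.0],
    'Rooms': [2, 2, 3, 4],
})

missing = toy.isna().sum()            # number of NaN values per column
missing = missing[missing > 0]        # keep only the columns that have gaps
missing_share = toy.isna().mean()     # fraction of rows missing, per column

print(missing.sort_values(ascending=False))
print(missing_share)
```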


# Data cleaning

## Replace NaN (missing values) with 0

In [14]:
df.fillna(0, inplace=True)
df.head()

,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,0.0,0.0,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,0.0,0.0,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


## Change column data types

In [15]:
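# avoid naming the variable `list`, which would shadow Python's built-in list type
int_columns = ['Price', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'YearBuilt', 'Propertycount']

df[int_columns] = df[int_columns].astype(int) # change from float to integer
df.head()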

,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000,S,Biggin,3/12/2016,2.5,3067,2,1,1,202.0,0.0,0,Yarra,-37.7996,144.9984,Northern Metropolitan,4019
1,Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/02/2016,2.5,3067,2,1,0,156.0,79.0,1900,Yarra,-37.8079,144.9934,Northern Metropolitan,4019
2,Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/03/2017,2.5,3067,3,2,0,134.0,150.0,1900,Yarra,-37.8093,144.9944,Northern Metropolitan,4019
3,Abbotsford,40 Federation La,3,h,850000,PI,Biggin,4/03/2017,2.5,3067,3,2,1,94.0,0.0,0,Yarra,-37.7969,144.9969,Northern Metropolitan,4019
4,Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/06/2016,2.5,3067,3,1,2,120.0,142.0,2014,Yarra,-37.8072,144.9941,Northern Metropolitan,4019


In [16]:
df.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Price              int64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode           int64
Bedroom2           int64
Bathroom           int64
Car                int64
Landsize         float64
BuildingArea     float64
YearBuilt          int64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount      int64
dtype: object
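
The order of the two cleaning steps matters: a float column that still contains NaN cannot be cast to a plain integer type, so the missing values have to be filled first. A minimal sketch on a made-up column (the `toy` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'YearBuilt': [1900.0, np.nan, 2014.0]})

try:
    toy['YearBuilt'].astype(int)      # NaN has no integer representation
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print(f'Cast failed: {err}')

# Filling first makes the cast possible; 0 then acts as a "missing" marker
filled = toy['YearBuilt'].fillna(0).astype(int)
print(filled.tolist())
```

Keep in mind that after this step a value of 0 in a column such as `YearBuilt` means "unknown", not an actual year.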

# Machine Learning Basics

## Prediction target (output) selection

Select the column of the dataset that we want the machine learning model to predict.

In [17]:
y = df['Price']
y.head().to_frame()

,Price
0,1480000
1,1035000
2,1465000
3,850000
4,1600000


## Features (input) selection

Select the columns of the dataset that the model will use as inputs to predict the target.

In [18]:
features = ['Rooms', 'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = df[features]
X.head()

,Rooms,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude
0,2,2,1,202.0,-37.7996,144.9984
1,2,2,1,156.0,-37.8079,144.9934
2,3,3,2,134.0,-37.8093,144.9944
3,3,3,2,94.0,-37.7969,144.9969
4,4,3,1,120.0,-37.8072,144.9941


## Building model

### Model selection

Build the machine learning model with a Decision Tree Regressor.

Decision trees can be used for both classification and regression problems. Because the prediction target here is `Price`, which is numerical, a regression tree is used.

In [19]:
from sklearn.tree import DecisionTreeRegressor

### Configure model

In [20]:
housing_model = DecisionTreeRegressor(random_state=1)

### Training model

The `.fit()` method is where the model "learns". Think of it as if  
- `X` is the problems/questions to be solved, and
- `y` is the answer key.

In [21]:
housing_model.fit(X, y)
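
To make the "problems and answer key" analogy concrete, here is a minimal sketch on made-up data (`X_toy`, `y_toy` and `toy_model` are hypothetical names, and the prices are invented). The model is fitted on four rows and then asked about one row it has seen and one it has not:

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up "practice problems" (one feature: number of rooms) and "answer key"
X_toy = [[1], [2], [3], [4]]
y_toy = [300_000, 450_000, 600_000, 800_000]

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(X_toy, y_toy)

seen = toy_model.predict([[2]])     # a row the model was trained on
unseen = toy_model.predict([[5]])   # a row it never saw
print(seen, unseen)
```

For the row it has seen, the tree returns the memorized answer (450000). For the unseen row it can only reuse the value of the closest leaf it learned, which is why testing on new data matters.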

### Doing prediction

In [22]:
housing_model.predict(X.head()) # the predicted value

array([1480000., 1035000., 1465000.,  850000., 1600000.])

In [23]:
y.head().to_frame() # the real value

,Price
0,1480000
1,1035000
2,1465000
3,850000
4,1600000


As we can see above, the predicted values and the real values are exactly the same for these five rows.  

This happens because the dataset was NOT split into training and testing sets (the way it should have been), so the model is being tested on the very rows it learned from. A fully grown decision tree can essentially memorize its training data, so these predictions look perfect without telling us how the model performs on unseen houses.

Later on, the dataset is split into two parts: a training set and a testing set.

## Model evaluation

### Importing evaluation metric (`mean_absolute_error`)

In [24]:
from sklearn.metrics import mean_absolute_error
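
Mean absolute error is the average of the absolute differences between the real values and the predictions. As a rough sketch, it can be computed by hand with NumPy and compared with scikit-learn's result (the `predicted` values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([1_480_000, 1_035_000, 850_000])
predicted = np.array([1_500_000, 1_000_000, 850_000])  # made-up predictions

# Absolute errors are 20000, 35000 and 0; MAE is their mean
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```

Because the errors are not squared, the result is in the same unit as the target (here: the price), which makes it easy to interpret.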

In [25]:
y_hat = housing_model.predict(X)
# by convention, predictions are often stored in a variable named 'y_hat'

mean_absolute_error(y, y_hat)

979.8441826215021
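
The in-sample error is small but not exactly zero, even though the tree was evaluated on its own training data. One plausible explanation (an assumption, not verified against this dataset) is that some rows share identical feature values but have different prices; a tree cannot separate such rows and predicts their mean. A minimal sketch with made-up numbers:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Two rows with the same feature value but different targets (made-up data)
X_dup = [[1], [1], [2]]
y_dup = [100, 200, 300]

dup_model = DecisionTreeRegressor(random_state=1).fit(X_dup, y_dup)
dup_pred = dup_model.predict(X_dup)          # identical rows get their mean, 150
dup_mae = mean_absolute_error(y_dup, dup_pred)
print(dup_pred, dup_mae)
```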

## Splitting dataset into Training and Testing dataset

In [26]:
from sklearn.model_selection import train_test_split

### Splitting dataset into two parts

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Think of it as:  
- `X_train` : practice problems (learning materials)
- `y_train`  : answer keys to practice problems
- `X_test` : examination problems (testing materials)
- `y_test`  : answer keys to the examination problems

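Just as a good examination does not simply repeat the practice problems, a model should be evaluated on data it has not seen during training. That is why the dataset is split into training and testing sets. By default, `train_test_split` holds out 25% of the rows for testing.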
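
A minimal sketch of what the split does, on made-up arrays (`X_toy` and `y_toy` are hypothetical stand-ins for the real features and target): with the default settings, 100 rows are divided into 75 training rows and 25 testing rows, and no row appears in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)   # 100 rows, 2 features
y_toy = np.arange(100)                   # one unique target per row

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(len(X_tr), len(X_te))

# The two sets do not share any row
overlap = set(y_tr) & set(y_te)
print(overlap)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the notebook gives the same split.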

### Configure and train model

In [28]:
housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(X_train, y_train)

### Model evaluation

In [29]:
y_hat = housing_model.predict(X_test)
mean_absolute_error(y_test, y_hat)

240548.23888070692

## Model optimization

In [30]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_hat)
    return mae

### Comparing mean absolute error with varying values of `max_leaf_nodes` to find the best value

`DecisionTreeRegressor` has several adjustable hyperparameters. Here we vary `max_leaf_nodes`, which caps the number of leaves and therefore controls how complex the tree can become.

In [31]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    leaf_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print(f'Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {int(leaf_mae)}')

Max leaf nodes: 5 	 Mean Absolute Error: 356157
Max leaf nodes: 50 	 Mean Absolute Error: 264538
Max leaf nodes: 500 	 Mean Absolute Error: 224550
Max leaf nodes: 5000 	 Mean Absolute Error: 240279


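A higher `max_leaf_nodes` does NOT guarantee a lower error (= better performance).  

As seen above, the mean absolute error for `max_leaf_nodes=500` (224550) is lower than for `max_leaf_nodes=5000` (240279), so the smaller tree performs better on the test set. Too few leaves underfit the data, while too many leaves let the tree overfit the training set.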
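
Instead of reading the best value off the printed output, it can be picked programmatically. A small sketch using the MAE values printed above (the dictionary name `mae_by_leaves` is an assumption for illustration):

```python
# MAE per max_leaf_nodes, copied from the printed output above
mae_by_leaves = {5: 356157, 50: 264538, 500: 224550, 5000: 240279}

# Pick the candidate with the lowest error
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)
print(f'Best max_leaf_nodes: {best_leaves} (MAE {mae_by_leaves[best_leaves]})')
```

In the notebook itself, the dictionary could be filled by calling `get_mae` for each candidate rather than typing the numbers in by hand.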

## Prediction with Random Forest

### Importing RandomForestRegressor

Random Forest is a popular machine learning model that builds on the Decision Tree model.

Notice how one is called "Tree" and the other "Forest":  
a random forest is an ensemble of many decision trees, each trained on a random sample of the data, and for regression its prediction is the average of the individual trees' predictions.
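
The "group of trees" idea can be checked directly. The sketch below, on synthetic data (all names and values here are assumptions for illustration), fits a small forest and confirms that its prediction matches the average of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on the first feature plus some noise
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_toy, y_toy)

# Predict with every tree separately, then average across trees
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_pred = forest.predict(X_toy[:5])

print(np.allclose(manual_average, forest_pred))
```

Averaging many trees that each saw a slightly different sample tends to smooth out the mistakes of any single tree.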

In [32]:
from sklearn.ensemble import RandomForestRegressor

In [33]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
# n_estimators is the number of decision trees in the random forest

rf_model.fit(X_train, y_train)
y_hat = rf_model.predict(X_test)
print(f'Mean Absolute Error: {int(mean_absolute_error(y_test, y_hat))}')

Mean Absolute Error: 180501


It turns out that the mean absolute error of the random forest (180501) is lower than that of both the default decision tree (240548) and the best tuned tree (224550), so the random forest performs better on this test set.