<a href="https://www.kaggle.com/code/zmkalila/melbourne-housing-price-prediction?scriptVersionId=200049612" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Getting started

I made this notebook for learning purposes, with my own code modifications and markdown notes here and there. It is based on the code in this YouTube lesson:  
https://youtu.be/DY10uyDy3vQ?si=Du8dIucJMRAq9B2L

## Import module

In [None]:
import pandas as pd

## Import dataset from Kaggle to Jupyter Notebook directory

In [None]:
!kaggle datasets download -d dansbecker/melbourne-housing-snapshot

## Unzip dataset file to the same directory

In [None]:
import zipfile
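# the context manager closes the archive once extraction is done
with zipfile.ZipFile('melbourne-housing-snapshot.zip') as z:
    z.extractall()  # extract into the current working directory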

## Setting DataFrame display

In [None]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 30)

## Read csv file as Pandas DataFrame

In [None]:
df = pd.read_csv('melb_data.csv')
df.head()

# Data Exploration

## Dimension of dataset

In [None]:
df.shape

## List of dataset columns

In [None]:
df.columns

## Summary of dataset

In [None]:
df.describe()

Let's say we want to find the biggest Landsize in the dataset:

In [None]:
df.describe().loc['max', 'Landsize']

In [None]:
df.describe()['Landsize']['max']

In [None]:
df['Landsize'].max()
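The three lookups above return the same number, because `describe()` reports each numeric column's maximum in its `max` row. A minimal sketch on a made-up DataFrame (the values below are hypothetical, not taken from the Melbourne data):

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({'Landsize': [202.0, 156.0, 134.0, 940.0],
                    'Rooms': [2, 2, 3, 4]})

via_loc = toy.describe().loc['max', 'Landsize']   # row label, then column label
via_chain = toy.describe()['Landsize']['max']     # column first, then row label
via_method = toy['Landsize'].max()                # skips the summary table entirely

print(via_loc, via_chain, via_method)
```

The last form is the cheapest of the three, since it does not compute the other summary statistics first.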

## Data type of each column

In [None]:
df.dtypes

## Number of null/NaN/missing values in DataFrame

In [None]:
df.isnull().sum().to_frame() # alternative: df.isna().sum()

# Data cleaning

## Replace NaN (missing values) with 0

In [None]:
df.fillna(0, inplace=True)
df.head()

## Change column data types

In [None]:
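# avoid naming the variable `list`, which would shadow Python's built-in list type
int_columns = ['Price', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'YearBuilt', 'Propertycount']

df[int_columns] = df[int_columns].astype(int) # change from float to integer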
df.head()

In [None]:
df.dtypes
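The order of the two cleaning steps matters: a column that still contains NaN cannot be cast to integer, so the missing values have to be filled first. A small sketch on a hypothetical toy column (not the real `Car` data) illustrates this:

```python
import pandas as pd

# Hypothetical toy column with one missing value
car = pd.Series([1.0, None, 2.0], name='Car')

try:
    car.astype(int)              # NaN has no integer representation
    cast_failed = False
except ValueError as err:        # pandas raises a ValueError (or a subclass) here
    cast_failed = True
    print(f'astype(int) failed: {err}')

filled = car.fillna(0).astype(int)   # filling first makes the cast possible
print(filled.tolist())
```

Keep in mind that a filled 0 is indistinguishable from a genuine 0, which matters for any column where 0 is not a plausible value.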

# Machine Learning Basics

## Prediction target (output) selection

Select the column that we want the machine learning model to predict.

In [None]:
y = df['Price']
y.head().to_frame()

## Features (input) selection

Select the column(s) that the machine learning model will use as input to predict the target.

In [None]:
features = ['Rooms', 'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = df[features]
X.head()

## Building model

### Model selection

Build the machine learning model using a Decision Tree Regressor.

A Decision Tree model can be used for both classification and regression problems. Because the prediction target in this case is `Price` (which is numerical data), regression is used.

In [None]:
from sklearn.tree import DecisionTreeRegressor

### Configure model

In [None]:
housing_model = DecisionTreeRegressor(random_state=1)

### Training model

The `.fit()` method is how the machine "learns". Think of it as if  
- `X` is the problems/questions to be solved, and
- `y` is the answer key.

In [None]:
housing_model.fit(X, y)

### Doing prediction

In [None]:
housing_model.predict(X.head()) # the predicted value

In [None]:
y.head().to_frame() # the real value

As we can see above, the predicted values and the real values are exactly the same.  

This happens because the dataset was NOT split into training and testing sets (as it should have been). The model was tested on the exact rows it learned from, and an unrestricted decision tree can memorize those rows, so it gives a perfect prediction result.

Later on, the dataset will be split into two parts: a training set and a testing set.
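The memorization effect is easy to reproduce on synthetic data. The sketch below uses randomly generated stand-in data (an assumption, not the Melbourne dataset) and compares the error on rows the tree has seen with the error on rows it has not:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature, noisy linear target
X_all = rng.uniform(0, 100, size=(200, 1))
y_all = 3 * X_all[:, 0] + rng.normal(0, 20, size=200)

# Hold back the last 50 rows so the model never sees them
X_seen, y_seen = X_all[:150], y_all[:150]
X_unseen, y_unseen = X_all[150:], y_all[150:]

tree = DecisionTreeRegressor(random_state=1).fit(X_seen, y_seen)

mae_seen = mean_absolute_error(y_seen, tree.predict(X_seen))
mae_unseen = mean_absolute_error(y_unseen, tree.predict(X_unseen))
print(f'MAE on rows used for training: {mae_seen:.2f}')
print(f'MAE on held-out rows: {mae_unseen:.2f}')
```

The error on the training rows says little about how the model behaves on new houses; only the held-out error does.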

## Model evaluation

### Importing evaluation metric (`mean_absolute_error`)

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
y_hat = housing_model.predict(X)
# by convention, predictions are often assigned to a variable named 'y_hat'

mean_absolute_error(y, y_hat)
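`mean_absolute_error` averages the absolute differences between each real value and its prediction. A small sketch with made-up prices (hypothetical numbers) computes the same quantity by hand and compares it with scikit-learn's result:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical real and predicted prices
y_true = [1_000_000, 850_000, 1_200_000]
y_pred = [950_000, 900_000, 1_200_000]

# Mean of |real - predicted| over all rows
manual_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(manual_mae)
print(mean_absolute_error(y_true, y_pred))
```

Because the error is expressed in the same unit as the target, the result reads directly as "off by this much, on average".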

## Splitting dataset into Training and Testing dataset

In [None]:
from sklearn.model_selection import train_test_split

### Splitting dataset into two parts

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Think of it as:  
- `X_train` : practice problems (learning materials)
- `y_train`  : answer keys to practice problems
- `X_test` : examination problems (testing materials)
- `y_test`  : answer keys to the examination problems

A good examination does not simply repeat the practice problems; otherwise it only measures memorization. For the same reason, the dataset has to be split into training and testing sets.
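When no `test_size` is given, `train_test_split` holds out 25% of the rows by default. The sketch below checks this on a hypothetical 100-row toy dataset, along with two properties worth confirming: no row lands in both parts, and features stay aligned with their targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows
X_toy = pd.DataFrame({'Rooms': np.arange(100) % 5 + 1})
y_toy = pd.Series(np.arange(100) * 1000, name='Price')

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

print(len(X_tr), len(X_te))                     # sizes of the two parts
print(set(X_tr.index).isdisjoint(X_te.index))   # no row appears in both parts
print(X_tr.index.equals(y_tr.index))            # features and targets stay aligned
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in each part on every run.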

### Configure and train model

In [None]:
housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(X_train, y_train)

### Model evaluation

In [None]:
y_hat = housing_model.predict(X_test)
mean_absolute_error(y_test, y_hat)

## Model optimization

In [None]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_hat)
    return mae

### Comparing mean absolute error with varying values of `max_leaf_nodes` to find the best value

`max_leaf_nodes` is one of several adjustable hyperparameters of the DecisionTreeRegressor model (others include `max_depth` and `min_samples_leaf`). It is the one being varied here.

In [None]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    leaf_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print(f'Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {int(leaf_mae)}')

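A higher `max_leaf_nodes` does NOT automatically mean lower error (= better performance).  

As seen above, the mean absolute error for `max_leaf_nodes=500` is lower than for `max_leaf_nodes=5000`, meaning the smaller tree performs better on the test set. Too few leaves underfit the data, while too many leaves overfit the training set.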
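Rather than reading the best value off the printed output, the scores can be collected in a dictionary and the minimum picked programmatically. This is a minimal sketch on synthetic stand-in data (an assumption, not the Melbourne dataset); `mae_for` is a hypothetical helper that mirrors `get_mae`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data, not the Melbourne dataset
X_syn = rng.uniform(0, 10, size=(1000, 2))
y_syn = np.sin(X_syn[:, 0]) * 100 + X_syn[:, 1] * 10 + rng.normal(0, 15, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)

def mae_for(max_leaf_nodes):
    """Fit a tree with the given leaf limit and return its test-set MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))

candidates = [5, 50, 500, 5000]
scores = {n: mae_for(n) for n in candidates}
best = min(scores, key=scores.get)   # candidate with the lowest test MAE
print(scores)
print(f'Best max_leaf_nodes: {best}')
```

The winning value depends on the data, so the result on this synthetic set says nothing about which value is best for the housing data.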

## Random Forest model

### Importing RandomForestRegressor

Random forest is a popular machine learning model that builds on the decision tree model.

Notice how one is called "Tree" and the other "Forest":  
a RandomForest consists of a group of DecisionTrees, and for regression it averages their predictions.
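The "group of trees" idea can be checked directly: scikit-learn exposes the fitted trees in the `estimators_` attribute, and the forest's regression output is the mean of their individual predictions. A minimal sketch on synthetic stand-in data (an assumption, not the Melbourne dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data, not the Melbourne dataset
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = X_syn[:, 0] * 50 + X_syn[:, 1] ** 2 + rng.normal(0, 5, size=300)

forest = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_syn, y_syn)

# estimators_ holds the fitted DecisionTreeRegressor objects
per_tree = np.array([tree.predict(X_syn[:5]) for tree in forest.estimators_])

print(len(forest.estimators_))     # number of trees in the forest
print(per_tree.mean(axis=0))       # average of the individual trees
print(forest.predict(X_syn[:5]))   # the forest's own prediction
```

Each tree is trained on a different random sample of the rows, so individual trees disagree with each other; averaging them smooths out those differences.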

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
# n_estimators is the number of DecisionTree within the RandomForest

rf_model.fit(X_train, y_train)
y_hat = rf_model.predict(X_test)
print(f'Mean Absolute Error: {int(mean_absolute_error(y_test, y_hat))}')

As it turns out, the Mean Absolute Error of the RandomForest is lower than that of the DecisionTree model, so the RandomForest performs better here.