# Prediction of Resale HDB Prices

This notebook is a workflow that applies several Python-based machine learning models to predict HDB resale prices.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

How can we use various Python-based machine learning models and the given parameters to predict the resale prices of HDB flats?

# 2. Data

The dataset comes from Data.gov.sg.

Link: https://data.gov.sg/dataset/resale-flat-prices?resource_id=42ff9cfe-abe5-4b54-beda-c88f9bb438ee

# 3. Evaluation

Models will be evaluated with the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE) and the R² score (the coefficient of determination, which the code in this notebook labels 'Accuracy').
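
To make the three metrics concrete, here is a minimal sketch that computes them from their definitions. The prices are made up for illustration and are not taken from the dataset.

```python
import math

# Hypothetical actual and predicted resale prices (illustration only).
y_true = [300000.0, 450000.0, 500000.0, 650000.0]
y_pred = [310000.0, 430000.0, 520000.0, 640000.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean Absolute Error: average size of the errors, in dollars.
mae = sum(abs(e) for e in errors) / n

# Root Mean Square Error: like MAE, but penalises large errors more.
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

# R^2: share of the variance in y_true explained by the predictions.
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")   # MAE:  15000.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 15811.39
print(f"R2:   {r2:.3f}")    # R2:   0.984
```

RMSE and MAE are in the same unit as the price (SGD), while R² is unitless and equals 1 for perfect predictions.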

# 4. Features

Data.gov.sg link: https://data.gov.sg/dataset/resale-flat-prices?resource_id=42ff9cfe-abe5-4b54-beda-c88f9bb438ee

## Features /  Inputs
| No. | Name | Title | Type | Unit | Description |
|---|---|---|---|---|---|
| 1 | month | Month | Datetime (Month) "YYYY-MM" | - | - |
| 2 | town | Town | Text (General) | - | - |
| 3 | flat_type | Flat type | Text (General) | - | - |
| 4 | block | Block | Text (General) | - | - |
| 5 | street_name | Street name | Text (General) | - | - |
| 6 | storey_range | Storey range | Text (General) | - | - |
| 7 | floor_area_sqm | Floor area sqm | Numeric (General) | Sqm | - |
| 8 | flat_model | Flat model | Text (General) | - | - |
| 9 | lease_commence_date | Lease commence date | Datetime (Year) "YYYY" | - | - |
| 10 | remaining_lease | Remaining lease | Text (General) | - | Years and Months |
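
The `remaining_lease` column is stored as text in years and months. If a numeric version is ever needed, the sketch below shows one possible conversion. It assumes values look like `61 years 04 months` or `70 years`, which is an assumption about the CSV rather than something confirmed here, and `lease_to_years` is a hypothetical helper name.

```python
import re


def lease_to_years(text):
    """Convert e.g. '61 years 04 months' to a float number of years.

    Hypothetical helper: the input format is an assumption.
    """
    years = re.search(r"(\d+)\s*year", text)
    months = re.search(r"(\d+)\s*month", text)
    total = 0.0
    if years:
        total += int(years.group(1))
    if months:
        total += int(months.group(1)) / 12
    return total


print(lease_to_years("61 years 04 months"))
print(lease_to_years("70 years"))  # 70.0
```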

## Label / Outputs
| No. | Name | Title | Type | Unit | Description |
|---|---|---|---|---|---|
| 11 | resale_price | Resale price | Numeric (General) | $ | - |


## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/singapore-hdb-resale/HDBresale.csv')
df.head()

## Data Exploration (Exploratory Data Analysis, EDA)

In [None]:
df

In [None]:
df.info()

In [None]:
df['month'] = pd.to_datetime(df['month'])
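
As a small illustration of what this conversion does, the sketch below parses a few made-up "YYYY-MM" strings. The explicit `format` argument is optional and is added here only to make the expected layout visible.

```python
import pandas as pd

# Made-up month strings in the same "YYYY-MM" layout as the dataset.
months = pd.Series(["2017-01", "2019-06", "2021-12"])

# Each string becomes a timestamp for the first day of that month.
parsed = pd.to_datetime(months, format="%Y-%m")

print(parsed.dt.year.tolist())   # [2017, 2019, 2021]
print(parsed.dt.month.tolist())  # [1, 6, 12]
```

Once the column is a datetime, plotting libraries place the points on a true time axis instead of treating each month as a separate label.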

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Resale prices vs Month of sales vs Flat type')
sns.scatterplot(data=df, x='month', y='resale_price', hue='flat_type');

The scatter plot suggests that resale prices stayed broadly stable from 2017 to 2021.

In [None]:
len(df['street_name'].unique())

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Ang Mo Kio resale price vs month of sale vs street name')
sns.scatterplot(data=df[df['town'] == 'ANG MO KIO'], x='month', y='resale_price', hue='street_name');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Yishun resale price vs month of sale vs street name')
sns.scatterplot(data=df[df['town'] == 'YISHUN'], x='month', y='resale_price', hue='street_name');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Flat type vs resale price vs town')
sns.scatterplot(data=df, x='flat_type',y='resale_price', hue='town');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Floor area vs resale price vs flat type')
sns.scatterplot(data=df, x='floor_area_sqm', y='resale_price', hue='flat_type');

In [None]:
df[df['floor_area_sqm'] > 200]

### Data Cleaning 

For a simpler model, I have chosen to drop street_name, remaining_lease, month and block.

In [None]:
df = df.drop(['street_name', 'remaining_lease', 'month', 'block'], axis=1)

In [None]:
df

### Dummy Variables

In [None]:
df = pd.get_dummies(df)
df
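
To show what `pd.get_dummies` does to the text columns, here is a minimal sketch on a toy frame. The rows are made up; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (made-up rows).
toy = pd.DataFrame({
    "town": ["ANG MO KIO", "YISHUN", "ANG MO KIO"],
    "floor_area_sqm": [67.0, 92.0, 110.0],
})

# Text columns are replaced by one indicator column per distinct value;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
# ['floor_area_sqm', 'town_ANG MO KIO', 'town_YISHUN']
```

A column with many distinct values produces many indicator columns, which is one practical reason to drop high-cardinality fields such as the street name before encoding.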

# 5. Modelling

In [None]:
sample_df = df.sample(frac=0.1, random_state=42)
sample_df

In [None]:
X = sample_df.drop('resale_price', axis=1)
y = sample_df['resale_price']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
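
The sketch below illustrates the split on a toy array: with `test_size=0.3`, 30% of the rows are held out, and a fixed `random_state` makes the split repeatable. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Repeating the call with the same random_state gives the same split.
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(np.array_equal(y_te, y_te_again))  # True
```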

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
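
As a small illustration of why the scaler is fitted on the training data only, the sketch below scales a toy training column and then applies the same mean and standard deviation to an unseen test value. The numbers and the name `toy_scaler` are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy floor areas: three training rows and one unseen test row.
train = np.array([[60.0], [90.0], [120.0]])
test = np.array([[150.0]])

toy_scaler = StandardScaler()
scaled_train = toy_scaler.fit_transform(train)  # learns mean and std here
scaled_test = toy_scaler.transform(test)        # reuses the training statistics

print(toy_scaler.mean_)     # [90.]
print(scaled_train.mean())  # approximately 0
print(scaled_train.std())   # approximately 1
print(scaled_test)          # (150 - 90) / std of the training column
```

Fitting on the test data as well would leak information about the held-out rows into the preprocessing step.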

## Importing Models

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from warnings import filterwarnings

In [None]:
filterwarnings('ignore')

In [None]:
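def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its test-set score, sorted ascending.

    For scikit-learn regressors, .score() returns the R^2 score; the
    column is named 'Accuracy' to match the rest of the notebook.
    """
    np.random.seed(42)

    model_scores = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)

    model_scores = pd.DataFrame(model_scores, index=['Accuracy'])
    model_scores = model_scores.transpose().sort_values('Accuracy')

    return model_scores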
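
Because the function stores the result of `.score()` under the label 'Accuracy', it helps to see what that number is. For scikit-learn regressors, `.score()` returns the R² score, which the sketch below checks on synthetic data (the data and coefficients are made up).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

toy_model = Ridge().fit(X_toy, y_toy)

score = toy_model.score(X_toy, y_toy)
r2 = r2_score(y_toy, toy_model.predict(X_toy))

print(np.isclose(score, r2))  # True
```

So a value such as 0.92 means the model explains about 92% of the variance in the target, not that 92% of predictions are correct.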

## Baseline models and scores

In [None]:
models = {'Ridge' : Ridge(),
         'Lasso': Lasso(),
         'ElasticNet': ElasticNet(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'SVR': SVR(),
         'DecisionTreeRegressor': DecisionTreeRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor(),
         'AdaBoostRegressor': AdaBoostRegressor()}

In [None]:
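# Use the scaled features prepared above; SVR, KNN and the linear models are scale-sensitive
baseline_model_scores_df = fit_and_score(models, scaled_X_train, scaled_X_test, y_train, y_test)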

In [None]:
baseline_model_scores_df.sort_values('Accuracy')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_df.T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

Based on the baseline scores, the following two models are taken forward for hyperparameter tuning:

    1. DecisionTreeRegressor 	0.880635
    2. RandomForestRegressor 	0.917609

## Hyperparameter Tuning via Random Search CV

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_gs_scores = {}
    model_gs_best_param = {}
    
    for name, model in models.items():
        gs_model = RandomizedSearchCV(model,
                                params[name],n_iter=10,
                                cv=5,
                                verbose=0)
        
        gs_model.fit(X_train,y_train)

        model_gs_scores[name] = gs_model.score(X_test,y_test)
        model_gs_best_param[name] = gs_model.best_params_

    model_gs_scores = pd.DataFrame(model_gs_scores, index=['Accuracy'])
    model_gs_scores = model_gs_scores.transpose().sort_values('Accuracy')
        
    return model_gs_scores, model_gs_best_param

### Random search CV model 1

In [None]:
models = {'RandomForestRegressor':RandomForestRegressor(),
         'DecisionTreeRegressor': DecisionTreeRegressor()}

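# 'squared_error' is the current name of the criterion formerly called 'mse'
# (renamed in scikit-learn 1.0; the old name was removed in later releases).
params = {'RandomForestRegressor': {'n_estimators' : [50,100,200],
                    'criterion' : ['squared_error'],
                    'oob_score' : [True,False]},
          'DecisionTreeRegressor': {'criterion': ['squared_error', 'friedman_mse'],
                                        'ccp_alpha': [0.0,0.1,0.5,0.8]}
          }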
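
A quick way to see how large these search spaces are is to count the combinations. The sketch below does that with the standard library. The criterion entries are placeholders, since only the number of options per parameter matters, and `n_iter = 10` mirrors the value used in `randomsearch_cv_scores`.

```python
from itertools import product

# Same shape as the grids above; criterion values are placeholders.
grids = {
    "RandomForestRegressor": {
        "n_estimators": [50, 100, 200],
        "criterion": ["<single criterion>"],
        "oob_score": [True, False],
    },
    "DecisionTreeRegressor": {
        "criterion": ["<criterion A>", "<criterion B>"],
        "ccp_alpha": [0.0, 0.1, 0.5, 0.8],
    },
}
n_iter = 10

# Number of distinct parameter combinations per model.
counts = {
    name: len(list(product(*grid.values())))
    for name, grid in grids.items()
}
print(counts)  # {'RandomForestRegressor': 6, 'DecisionTreeRegressor': 8}

for name, count in counts.items():
    print(name, "fewer combinations than n_iter:", count < n_iter)
```

Both spaces have fewer than 10 combinations. When every parameter is given as a list and the space is smaller than `n_iter`, scikit-learn's randomized search tries every combination (and warns about it), so here it behaves like an exhaustive grid search.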

In [None]:
model_gs_scores_1, model_gs_best_param_1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_1

In [None]:
model_gs_best_param_1

Since the scores barely improve with the random search CV, we will continue with a grid search CV on the Random Forest Regressor, as it seems to produce the best results.

## Hyperparameter Tuning via Grid Search CV

In [None]:
def gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_gs_scores = {}
    model_gs_best_param = {}
    
    for name, model in models.items():
        gs_model = GridSearchCV(model,
                                param_grid=params[name],
                                cv=5,
                                verbose=0)
        
        gs_model.fit(X_train,y_train)

        model_gs_scores[name] = gs_model.score(X_test,y_test)
        model_gs_best_param[name] = gs_model.best_params_

    model_gs_scores = pd.DataFrame(model_gs_scores, index=['Accuracy'])
    model_gs_scores = model_gs_scores.transpose().sort_values('Accuracy')
        
    return model_gs_scores, model_gs_best_param
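
For readers new to `GridSearchCV`, here is a minimal self-contained sketch on synthetic data. The data, the `max_depth` grid and the variable names are made up for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustration only).
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

# Every value in the grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 3, 5]},
    cv=5,
)
search.fit(X_toy, y_toy)

print(len(search.cv_results_["params"]))  # 3 candidates were tried
print(search.best_params_)                # the best max_depth found
print(search.best_score_)                 # mean cross-validated R^2 of that candidate
```

After fitting, the search object refits the best candidate on all the data it was given, so calling `.score()` or `.predict()` on it uses that refitted model.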

### Grid search CV model 1

In [None]:
models = {'RandomForestRegressor':RandomForestRegressor()}

# 'squared_error' is the current name of the criterion formerly called 'mse'
params = {'RandomForestRegressor': {'n_estimators' : [150,200,300],
                    'criterion' : ['squared_error'],
                    'oob_score' : [False]}
          }
          

In [None]:
model_gs_scores_1, model_gs_best_param_1 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_1

In [None]:
model_gs_best_param_1

### Grid search CV model 2

In [None]:
models = {'RandomForestRegressor':RandomForestRegressor()}

# 'squared_error' is the current name of the criterion formerly called 'mse'
params = {'RandomForestRegressor': {'n_estimators' : [130,150,180],
                    'criterion' : ['squared_error'],
                    'oob_score' : [False]}
          }

In [None]:
model_gs_scores_2, model_gs_best_param_2 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_2

In [None]:
model_gs_best_param_2

# 6. Model Evaluation

Now that the grid search CV is done, it is time to build the final model and evaluate it using the full dataset.

In [None]:
df.head()

In [None]:
X = df.drop('resale_price', axis = 1)
y = df['resale_price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
np.random.seed(42)
# 'squared_error' is the current name of the criterion formerly called 'mse'
model = RandomForestRegressor(criterion='squared_error', n_estimators=130, oob_score=False)
model.fit(X_train,y_train)

In [None]:
y_preds = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
r2 = r2_score(y_test,y_preds)
mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)
rmse / 2

In [None]:
print(f'R2 Score: {r2}')
print(f'Mean Absolute Error: {mae}')
print(f'Mean Square Error: {mse}')
print(f'Root Mean Square Error: {rmse}')

Using a Random Forest Regressor, we have built a model with an R² score (reported above as accuracy) of about 0.95 and a Root Mean Square Error of SGD 34,858, which we read as roughly plus or minus SGD 17,111 around the predicted price.