## Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import warnings

from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import Ridge, BayesianRidge, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

sb.set()
warnings.filterwarnings('ignore')



## Dataset Preparation

In [3]:
data = pd.read_csv("cleaned-Housing.csv")
train_data = pd.read_csv("train-Housing.csv")
test_data = pd.read_csv("test-Housing.csv")

We encode `storey_range` as integers because its categories have a natural order (lower to higher floors). The encoded values can then be used directly as a numeric predictor in our regressions.

In [3]:
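# Encode storey_range with a single encoder fitted on the full dataset,
# so that each label maps to the same integer in all three DataFrames.
# (Assumes the train and test files are splits of cleaned-Housing.csv.)
label_encoder = LabelEncoder()
categorical_cols = ['storey_range']
for col in categorical_cols:
    label_encoder.fit(data[col])
    data[col] = label_encoder.transform(data[col])
    train_data[col] = label_encoder.transform(train_data[col])
    test_data[col] = label_encoder.transform(test_data[col])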
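`LabelEncoder` assigns integer codes in sorted label order, so the codes depend on which labels the encoder saw when it was fitted. The toy sketch below illustrates this. The label strings are an assumed format, not taken from the dataset. It shows that refitting on a subset can assign different codes to the same label, while a single fitted encoder keeps them consistent.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical storey labels (assumed format, for illustration only)
full = ["01 TO 03", "04 TO 06", "07 TO 09", "10 TO 12"]
subset = ["04 TO 06", "10 TO 12"]  # e.g. a split that lacks some ranges

# Refitting on the subset renumbers the labels from zero
separate = LabelEncoder().fit_transform(subset)

# Fitting once on the full label set and reusing the encoder keeps codes stable
shared = LabelEncoder().fit(full)
consistent = shared.transform(subset)

print("classes:", list(shared.classes_))
print("refit on subset:", list(separate))
print("shared encoder: ", list(consistent))
```

Zero-padded labels like these sort lexicographically in floor order, which is what makes the integer codes usable as an ordinal feature.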

In [4]:
data.head()

| Unnamed: 0 | month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | resale_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 5 | 31.0 | IMPROVED | 1977 | 9000.0 | 1990 |
| 1 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 2 | 31.0 | IMPROVED | 1977 | 6000.0 | 1990 |
| 2 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 5 | 31.0 | IMPROVED | 1977 | 8000.0 | 1990 |
| 3 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 4 | 31.0 | IMPROVED | 1977 | 6000.0 | 1990 |
| 4 | 1990-01 | ANG MO KIO | 3 ROOM | 216 | ANG MO KIO AVE 1 | 2 | 73.0 | NEW GENERATION | 1976 | 47200.0 | 1990 |


We think four candidate variables may have a significant impact on `resale_price`: `floor_area_sqm`, `lease_commence_date`, `storey_range`, and `resale_year`.

In [5]:
candidates = ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']

In the following sections, we build five different machine learning models. Each model is trained on one candidate variable at a time, and then on all four together. This lets us evaluate how much each variable contributes to predicting `resale_price`. We review all the results together at the end of this notebook.
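The procedure described above (every model, fitted on each single predictor and on all predictors together) can be sketched as one generic loop. This is only an illustrative sketch. `evaluate_models` is a hypothetical helper, and the data is a synthetic stand-in with two made-up predictors, not the HDB dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_models(models, feature_sets, train_df, test_df, target):
    """Fit every model on every feature set and collect test-set metrics."""
    rows = []
    for model_name, make_model in models.items():
        for features in feature_sets:
            model = make_model()  # fresh, unfitted model for each run
            model.fit(train_df[features], train_df[target])
            y_pred = model.predict(test_df[features])
            rows.append({
                "model": model_name,
                "features": ", ".join(features),
                "mse": mean_squared_error(test_df[target], y_pred),
                "r2": r2_score(test_df[target], y_pred),
            })
    return pd.DataFrame(rows)


# Synthetic stand-in data (NOT the HDB dataset), only to make the sketch runnable
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "floor_area_sqm": rng.uniform(30, 150, n),
    "resale_year": rng.integers(1990, 2024, n),
})
toy["resale_price"] = (3000 * toy["floor_area_sqm"]
                       + 8000 * (toy["resale_year"] - 1990)
                       + rng.normal(0, 20000, n))
toy_train, toy_test = toy.iloc[:300], toy.iloc[300:]

toy_candidates = ["floor_area_sqm", "resale_year"]
# Each predictor on its own, then all predictors together
feature_sets = [[c] for c in toy_candidates] + [toy_candidates]
models = {
    "Linear": LinearRegression,
    "Ridge": lambda: Ridge(alpha=1.0),
    "Bayesian Ridge": BayesianRidge,
    "Gradient Boosting": lambda: GradientBoostingRegressor(random_state=0),
    "Random Forest": lambda: RandomForestRegressor(n_estimators=50, random_state=0),
}
results = evaluate_models(models, feature_sets, toy_train, toy_test, "resale_price")
print(results.pivot(index="features", columns="model", values="r2").round(3))
```

Collecting the metrics in a single DataFrame makes the side-by-side comparison easier than reading separate printed blocks. The fixed `random_state` values are an added assumption to make the tree-based results repeatable.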

## Linear Regression Model

In [6]:
def linear_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Print the performance
    print(f'Linear Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared Score: {r2}\n')
    print()

In [7]:
for var in candidates:
    linear_regression([var])
linear_regression(candidates)

Linear Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 10852322747.407
R-squared Score: 0.41058981333644384


Linear Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 13890149818.691706
R-squared Score: 0.2455996759425535


Linear Regression using ['storey_range']:
Mean Squared Error (MSE): 17999438190.340294
R-squared Score: 0.02241644756257466


Linear Regression using ['resale_year']:
Mean Squared Error (MSE): 11860717498.754131
R-squared Score: 0.355821985982242


Linear Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error (MSE): 4869618849.629878
R-squared Score: 0.7355217844192378




## Ridge Regression Model

In [8]:
def ridge_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the Ridge Regression model
    ridge_reg_model = Ridge(alpha=1.0)
    ridge_reg_model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred_ridge = ridge_reg_model.predict(X_test)

    # Evaluate the model performance
    mse_ridge = mean_squared_error(y_test, y_pred_ridge)
    r2_ridge = r2_score(y_test, y_pred_ridge)

    # Print out the model performance metrics for Ridge Regression
    print(f'Ridge Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse_ridge}')
    print(f'R-squared Score: {r2_ridge}\n')
    print()

In [9]:
for var in candidates:
    ridge_regression([var])
ridge_regression(candidates)

Ridge Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 10852322747.584032
R-squared Score: 0.41058981332682887


Ridge Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 13890149818.726665
R-squared Score: 0.24559967594065468


Ridge Regression using ['storey_range']:
Mean Squared Error (MSE): 17999438194.853462
R-squared Score: 0.022416447317455956


Ridge Regression using ['resale_year']:
Mean Squared Error (MSE): 11860717502.674099
R-squared Score: 0.3558219857693412


Ridge Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error (MSE): 4869618852.099425
R-squared Score: 0.7355217842851121




## Bayesian Ridge Regression Model

In [10]:
def bayesian_ridge_regression(predictor):  
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the model
    model = BayesianRidge()
    model.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Print the performance
    print(f'Bayesian Ridge Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared Score: {r2}\n')
    print()

In [11]:
for var in candidates:
    bayesian_ridge_regression([var])
bayesian_ridge_regression(candidates)

Bayesian Ridge Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 10852322917.217775
R-squared Score: 0.4105898041136993


Bayesian Ridge Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 13890149828.021973
R-squared Score: 0.24559967543580896


Bayesian Ridge Regression using ['storey_range']:
Mean Squared Error (MSE): 17999439435.067406
R-squared Score: 0.022416379959089627


Bayesian Ridge Regression using ['resale_year']:
Mean Squared Error (MSE): 11860717947.097328
R-squared Score: 0.3558219616318745


Bayesian Ridge Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error (MSE): 4869619217.818021
R-squared Score: 0.7355217644222429




## Gradient Boosting Regressor

In [12]:
def gradient_boosting_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the Gradient Boosting Regressor model
    gb_regressor = GradientBoostingRegressor()
    gb_regressor.fit(X_train, y_train)

    # Make predictions
    y_pred = gb_regressor.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Gradient Boosting Regression using {predictor}:')
    print(f"Mean Squared Error: {mse}")
    print(f'R-squared Score: {r2}\n')
    print()

In [13]:
for var in candidates:
    gradient_boosting_regression([var])
gradient_boosting_regression(candidates)

Gradient Boosting Regression using ['floor_area_sqm']:
Mean Squared Error: 10147424080.993206
R-squared Score: 0.44887419394511274


Gradient Boosting Regression using ['lease_commence_date']:
Mean Squared Error: 13351409455.82496
R-squared Score: 0.2748596846274547


Gradient Boosting Regression using ['storey_range']:
Mean Squared Error: 17571429926.24127
R-squared Score: 0.04566238640054632


Gradient Boosting Regression using ['resale_year']:
Mean Squared Error: 9755997498.846968
R-squared Score: 0.47013331240461653


Gradient Boosting Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error: 2138139236.7485445
R-squared Score: 0.8838736115781528




## Random Forest Regressor

In [14]:
def random_forest_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Train the random forest
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    
    # Predict the response for the test predictors
    y_pred = rf.predict(X_test)
    
    # Check the goodness of fit on the test data
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Random Forest Regression using {predictor}:')
    print(f"Mean Squared Error: {mse}")
    print(f'R-squared Score: {r2}\n')
    print()
    
    # Plotting the trees is left disabled: each tree has too many nodes to read,
    # so the plots give little insight.
    # (Re-enabling this needs `from sklearn.tree import plot_tree`.)
#     for i in range(min(2,len(rf.estimators_))):
#         plt.figure(figsize=(30,15), dpi=300)
#         plot_tree(rf.estimators_[i])
#         plt.show()

In [15]:
for var in candidates:
    random_forest_regression([var])
random_forest_regression(candidates)

Random Forest Regression using ['floor_area_sqm']:
Mean Squared Error: 10110927498.159582
R-squared Score: 0.45085639242936715


Random Forest Regression using ['lease_commence_date']:
Mean Squared Error: 13346982729.891596
R-squared Score: 0.2751001084905689


Random Forest Regression using ['storey_range']:
Mean Squared Error: 17571357488.52167
R-squared Score: 0.04566632063018761


Random Forest Regression using ['resale_year']:
Mean Squared Error: 9755938660.965277
R-squared Score: 0.4701365080013231


Random Forest Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error: 1745874211.190154
R-squared Score: 0.9051782674861425




## Insights

The variables `floor_area_sqm` and `resale_year` consistently achieve the highest single-predictor R-squared values (roughly 0.36 to 0.47) across all five models. This suggests that `floor_area_sqm` and `resale_year` are the two most relevant of our candidate factors in determining `resale_price`.

On the other hand, `lease_commence_date` has a mediocre R-squared value of about 0.25 to 0.28. It seems to have some impact on pricing, but not enough to be the determining factor.

Finally, `storey_range` has consistently very low R-squared values that never exceed 0.05. This suggests that, on its own, `storey_range` explains very little of the variation in pricing.

While the models built with one predictor at a time never reach an R-squared score of 0.5, rebuilding the same models with all four predictors at once yields significantly better predictions.

A simple linear regression model built with the four chosen predictors achieves an R-squared value of 0.736. The same approach using ridge regression and Bayesian ridge regression yields an essentially identical R-squared value of 0.736. While this score is already reasonably good, we wanted to improve our predictions further, so we moved on to more complex models: gradient boosting and random forest.
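One plausible reason the ridge result matches plain linear regression so closely is that a penalty of `alpha=1.0` is tiny compared with the scale of the unstandardized features summed over many rows. This explanation is an assumption, not something verified on the HDB data. The sketch below uses synthetic data to show the effect: the ridge coefficient is almost unchanged at `alpha=1.0` and only shrinks noticeably when the penalty is made very large.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic single-feature data (NOT the HDB dataset), for illustration only
rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(30, 150, size=(n, 1))
y = 3000 * X[:, 0] + rng.normal(0, 20000, n)

ols_coef = LinearRegression().fit(X, y).coef_[0]
ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_[0]
strong_ridge_coef = Ridge(alpha=1e7).fit(X, y).coef_[0]

# Relative difference between the OLS and ridge slopes at alpha=1.0
rel_gap = abs(ols_coef - ridge_coef) / abs(ols_coef)

print(f"OLS slope:             {ols_coef:.4f}")
print(f"Ridge slope (alpha=1): {ridge_coef:.4f}")
print(f"Ridge slope (alpha=1e7): {strong_ridge_coef:.4f}")
print(f"Relative gap at alpha=1: {rel_gap:.2e}")
```

If regularization were meant to matter here, standardizing the features (for example with the already imported `StandardScaler`) and tuning `alpha` would be the usual next step.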

For the gradient boosting model, using one predictor at a time shows a non-negligible reduction in prediction error compared to the previous three models. The model really shines when it uses all four predictors, achieving an R-squared value of 0.884. This suggests that gradient boosting suits the problem better than the linear models.

Finally, the random forest model achieves an R-squared value of 0.905, a slight improvement over gradient boosting. These results suggest that random forest and gradient boosting are both appropriate models for predicting HDB resale prices.

In [4]:
print("The average resale price is:",data['resale_price'].mean())
print("RMSE Random Forest:", 1740000000**0.5)

The average resale price is: 293643.45073320804
RMSE Random Forest: 41713.30722922842


It is important to note that although the mean squared error (MSE) of our models may seem alarmingly high, this is expected: MSE is measured in squared dollars, and housing prices, such as those for HDB flats, span a wide range.

The square root of the MSE gives the root mean squared error (RMSE), which is on the same scale as the prices themselves. Taking our random forest model's MSE as approximately 1,740,000,000 gives an RMSE of roughly 41,700. This means we can expect a typical error of around S$42,000 in our price predictions, against an average resale price of about S$293,600. Such an error might be a reasonable margin for certain applications, like a quick market estimate.
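As a small worked check of the RMSE reasoning, the sketch below takes the unrounded random forest MSE and the average resale price from the outputs printed earlier. It then expresses the RMSE as a share of the average price. The variable names are illustrative only.

```python
import math

# Values copied from the notebook outputs above
mse_random_forest = 1745874211.190154   # random forest, all four predictors
mean_resale_price = 293643.45073320804  # data['resale_price'].mean()

# RMSE is in dollars, the same unit as resale_price
rmse = math.sqrt(mse_random_forest)

# Typical error relative to the average price
relative_error = rmse / mean_resale_price

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE as a share of the average price: {relative_error:.1%}")
```

Reading the MSE from the computed metric, rather than typing a rounded constant, avoids small discrepancies in the reported RMSE.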