<h1>Modelling</h1>

In this notebook, we run several models on the datasets prepared earlier and evaluate each of them on MAE, RMSE, R² and MAPE to determine the best performer. We then select the best model and attempt hyperparameter tuning to see whether its performance can be improved further.
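As a quick illustration of what the four metrics measure, the sketch below computes them on a tiny set of made-up values. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the resale dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not taken from the resale dataset)
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

mae = mean_absolute_error(y_true, y_pred)                  # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))         # penalises large errors more heavily
r2 = r2_score(y_true, y_pred)                              # share of variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # average error as a percentage

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R²: {r2:.4f}, MAPE: {mape:.2f}%")
```

MAE and RMSE are in the same units as the target (here, resale price), while R² and MAPE are unit-free, which makes them easier to compare across datasets.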

In [15]:
import pandas as pd
import numpy as np
import os
import time
import requests
from dotenv import load_dotenv
from tqdm import tqdm

In [61]:
df_linear = pd.read_csv('dataset_for_linear.csv')
df_tree = pd.read_csv('dataset_for_tree.csv')

In [12]:
df_linear.head()

Unnamed: 0,floor_area_sqm,lease_commence_date,remaining_lease,resale_price,latitude,longitude,distance_to_nearest_mrt_km,num_mrt_within_1km,distance_to_nearest_hawker_km,num_hawkers_within_1km,...,flat_model_New Generation,flat_model_Premium Apartment,flat_model_Premium Apartment Loft,flat_model_Premium Maisonette,flat_model_Simplified,flat_model_Standard,flat_model_Terrace,flat_model_Type S1,flat_model_Type S2,storey_median
0,60.0,1986,70,255000.0,1.375097,103.837619,0.41994,1,0.18404,4,...,False,False,False,False,False,False,False,False,False,8.0
1,68.0,1981,65,275000.0,1.373922,103.855621,0.80565,1,0.1816,2,...,True,False,False,False,False,False,False,False,False,2.0
2,69.0,1980,64,285000.0,1.373549,103.838176,0.29234,1,0.15532,5,...,True,False,False,False,False,False,False,False,False,2.0
3,68.0,1979,63,290000.0,1.367761,103.855357,0.68736,1,0.1239,4,...,True,False,False,False,False,False,False,False,False,2.0
4,68.0,1980,64,290000.0,1.371626,103.857736,0.927,1,0.38548,2,...,True,False,False,False,False,False,False,False,False,8.0


In [13]:
df_linear.shape

(249289, 74)

In [14]:
df_tree.shape

(249289, 32)

<h1>Linear Modelling</h1>

In [17]:
X_linear = df_linear.drop(columns=['resale_price'])
y_linear = df_linear['resale_price']

In [18]:
# Train-test split
from sklearn.model_selection import train_test_split

X_train_lin, X_test_lin, y_train_lin, y_test_lin = train_test_split(
    X_linear, y_linear, test_size=0.2, random_state=42
)

<h2>Linear Regression</h2>

In [19]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_lin, y_train_lin)

LinearRegression()


In [20]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred_lin = lr.predict(X_test_lin)

rmse = np.sqrt(mean_squared_error(y_test_lin, y_pred_lin))
r2 = r2_score(y_test_lin, y_pred_lin)

print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

RMSE: 64827.83
R² Score: 0.8694


<h2>Ridge Regression</h2>

In [21]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_lin, y_train_lin)
y_pred_ridge = ridge.predict(X_test_lin)

rmse_ridge = np.sqrt(mean_squared_error(y_test_lin, y_pred_ridge))
r2_ridge = r2_score(y_test_lin, y_pred_ridge)

print(f"Ridge RMSE: {rmse_ridge:.2f}")
print(f"Ridge R²: {r2_ridge:.4f}")

Ridge RMSE: 64828.27
Ridge R²: 0.8694


<h2>Lasso Regression</h2>

In [22]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1, max_iter=10000)
lasso.fit(X_train_lin, y_train_lin)
y_pred_lasso = lasso.predict(X_test_lin)

rmse_lasso = np.sqrt(mean_squared_error(y_test_lin, y_pred_lasso))
r2_lasso = r2_score(y_test_lin, y_pred_lasso)

print(f"Lasso RMSE: {rmse_lasso:.2f}")
print(f"Lasso R²: {r2_lasso:.4f}")

Lasso RMSE: 64827.81
Lasso R²: 0.8694


<h2>Tuning alpha for Ridge and Lasso</h2>

In [23]:
from sklearn.linear_model import RidgeCV, LassoCV

ridgecv = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0])
ridgecv.fit(X_train_lin, y_train_lin)
print("Best Ridge alpha:", ridgecv.alpha_)

lassocv = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], max_iter=10000)
lassocv.fit(X_train_lin, y_train_lin)
print("Best Lasso alpha:", lassocv.alpha_)

Best Ridge alpha: 0.1
Best Lasso alpha: 0.1


<h2>Tree-Based Models - Splitting into Train and Test Sets</h2>

In [62]:
X = df_tree.drop(columns=['resale_price'])
y = df_tree['resale_price']

# Split train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

<h2>Random Forest</h2>

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
y_pred_rf = rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest RMSE: {rmse_rf:.2f}")
print(f"Random Forest R²: {r2_rf:.4f}")

<h2>XGBoost</h2>

In [25]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Instantiate and train
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred_xgb = xgb.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost RMSE: {rmse_xgb:.2f}")
print(f"XGBoost R²: {r2_xgb:.4f}")

XGBoost RMSE: 32330.00
XGBoost R²: 0.9675


<h2>LightGBM</h2>

In [26]:
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Train LightGBM
lgbm = LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
lgbm.fit(X_train, y_train)

# Predict and evaluate
y_pred_lgbm = lgbm.predict(X_test)
rmse_lgbm = np.sqrt(mean_squared_error(y_test, y_pred_lgbm))
r2_lgbm = r2_score(y_test, y_pred_lgbm)

print(f"LightGBM RMSE: {rmse_lgbm:.2f}")
print(f"LightGBM R²: {r2_lgbm:.4f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.014569 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3608
[LightGBM] [Info] Number of data points in the train set: 199431, number of used features: 31
[LightGBM] [Info] Start training from score 505875.452574
LightGBM RMSE: 32699.92
LightGBM R²: 0.9668


<h2>CatBoost</h2>

In [27]:
from catboost import CatBoostRegressor

# Train CatBoost (silent training)
cat = CatBoostRegressor(iterations=100, learning_rate=0.1, random_state=42, verbose=0)
cat.fit(X_train, y_train)

# Predict and evaluate
y_pred_cat = cat.predict(X_test)
rmse_cat = np.sqrt(mean_squared_error(y_test, y_pred_cat))
r2_cat = r2_score(y_test, y_pred_cat)

print(f"CatBoost RMSE: {rmse_cat:.2f}")
print(f"CatBoost R²: {r2_cat:.4f}")

CatBoost RMSE: 38440.37
CatBoost R²: 0.9541


<h1>Full Evaluation</h1>

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def evaluate_model(name, y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    print(f"\n{name} Performance:")
    print(f"MAE: {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"R² Score: {r2:.4f}")
    print(f"MAPE: {mape:.2f}%")

    return {
        "Model": name,
        "MAE": mae,
        "RMSE": rmse,
        "R2": r2,
        "MAPE": mape
    }

In [29]:
results = []
results.append(evaluate_model("Random Forest", y_test, y_pred_rf))
results.append(evaluate_model("XGBoost", y_test, y_pred_xgb))
results.append(evaluate_model("LightGBM", y_test, y_pred_lgbm))
results.append(evaluate_model("CatBoost", y_test, y_pred_cat))
results.append(evaluate_model("Linear Regression", y_test_lin, y_pred_lin))
results.append(evaluate_model("Ridge Regression", y_test_lin, y_pred_ridge))
results.append(evaluate_model("Lasso Regression", y_test_lin, y_pred_lasso))


Random Forest Performance:
MAE: 19677.79
RMSE: 28229.90
R² Score: 0.9752
MAPE: 3.94%

XGBoost Performance:
MAE: 23330.42
RMSE: 32330.00
R² Score: 0.9675
MAPE: 4.70%

LightGBM Performance:
MAE: 23713.73
RMSE: 32699.92
R² Score: 0.9668
MAPE: 4.79%

CatBoost Performance:
MAE: 27792.01
RMSE: 38440.37
R² Score: 0.9541
MAPE: 5.58%

Linear Regression Performance:
MAE: 50178.49
RMSE: 64827.83
R² Score: 0.8694
MAPE: 10.95%

Ridge Regression Performance:
MAE: 50177.75
RMSE: 64828.27
R² Score: 0.8694
MAPE: 10.95%

Lasso Regression Performance:
MAE: 50178.10
RMSE: 64827.81
R² Score: 0.8694
MAPE: 10.95%


In [30]:
results_df = pd.DataFrame(results)
display(results_df.sort_values("RMSE"))

,Model,MAE,RMSE,R2,MAPE
0,Random Forest,19677.792651,28229.900812,0.975227,3.944605
1,XGBoost,23330.415943,32330.002928,0.967508,4.70376
2,LightGBM,23713.731858,32699.919655,0.966761,4.792648
3,CatBoost,27792.013396,38440.373496,0.954066,5.575533
6,Lasso Regression,50178.101089,64827.813344,0.869358,10.947648
4,Linear Regression,50178.488533,64827.825191,0.869358,10.947667
5,Ridge Regression,50177.748955,64828.26718,0.869356,10.947799


<h1>Improving the Random Forest Model</h1>

Since the Random Forest model performed best, we will attempt to improve it by:
- Dropping features with lower importance
- Hyperparameter tuning
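The first item, dropping low-importance features, could be sketched as follows. This is a minimal example on synthetic data, not the notebook's own procedure: the names `X_demo`, `y_demo`, `rf_demo` and the `threshold` value of 0.05 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: two informative features and three pure-noise ones
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 5)),
    columns=["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"],
)
y_demo = 3.0 * X_demo["signal_a"] - 2.0 * X_demo["signal_b"] + rng.normal(scale=0.1, size=500)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

# Hypothetical cut-off: keep only features at or above this importance
threshold = 0.05
kept_features = importances[importances >= threshold].index.tolist()
X_reduced = X_demo[kept_features]

print(importances)
print("Kept features:", kept_features)
```

On the real data, the reduced feature set would need to be re-evaluated on the test split, since removing features does not always improve accuracy.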

<h2>Hyperparameter Tuning</h2>

We will first run a randomized search with cross-validation over typical hyperparameter values for Random Forest models. It samples 20 combinations from the grid to identify a promising region of the parameter space.

Subsequently, if necessary, we will run an exhaustive grid search with cross-validation once we are able to narrow down the parameter space, to save on computational resources.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setting hyperparams grid for randomized search
param_dist = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.5]
}

# Use RandomizedSearchCV to sample 20 combinations and identify promising ranges
rf_randomized = RandomForestRegressor(random_state=42, n_jobs=-1)
rand_search = RandomizedSearchCV(rf_randomized, param_dist, n_iter=20, scoring='neg_root_mean_squared_error', cv=3, random_state=42)
rand_search.fit(X_train, y_train)

print("Best hyperparameters found:", rand_search.best_params_)
print("Best RMSE (CV):", -rand_search.best_score_)

Best hyperparameters found: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 0.5, 'max_depth': None}
Best RMSE (CV): 27899.532443100543
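The optional follow-up grid search mentioned earlier could look roughly like the sketch below. It runs on small synthetic data so that it finishes quickly. The grid values are hypothetical, chosen only to sit around the best combination reported above, and `X_demo`, `y_demo` and `grid_search` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Narrow grid around a previously found combination (values are assumptions)
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [4, 5, 6],
    "min_samples_leaf": [1, 2, 3],
}

grid_search = GridSearchCV(
    RandomForestRegressor(max_features=0.5, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # same scoring as the randomized search
    cv=3,
)
grid_search.fit(X_demo, y_demo)

print("Best hyperparameters found:", grid_search.best_params_)
print("Best RMSE (CV):", -grid_search.best_score_)
```

Unlike the randomized search, `GridSearchCV` fits every combination in the grid (here 18 combinations across 3 folds), so keeping the grid small matters on a dataset of this size.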


<h1>Final Model with Best Performance</h1>

We will use the Random Forest model tuned with RandomizedSearchCV in the cell above.

In [65]:
best_rf = RandomForestRegressor(
    **rand_search.best_params_,
    random_state=42,
    n_jobs=-1
)
best_rf.fit(X_train, y_train)
y_pred_best_rf = best_rf.predict(X_test)

evaluate_model("Random Forest (Best)", y_test, y_pred_best_rf)

Random Forest (Best) MAE: 18983.65
Random Forest (Best) RMSE: 27101.33
Random Forest (Best) R²: 0.9772
Random Forest (Best) MAPE: 3.8104%


(18983.651973620024,
 np.float64(27101.331856137866),
 0.9771680901417941,
 np.float64(3.8103910536967955))

<h1>Saving Model for use in webapp</h1>

In [57]:
import joblib

# Save the model
joblib.dump(best_rf, 'best_rf_model.pkl')

['best_rf_model.pkl']
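On the web app side, the saved file can be read back with `joblib.load`. The sketch below shows a save-and-load round trip on a small stand-in model written to a temporary directory. The web app's actual loading code is not shown in this notebook, so the names `model`, `loaded_model` and `demo_rf_model.pkl` are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model (the real notebook saves best_rf instead)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary path, then load it back as a web app would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_rf_model.pkl")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), loaded_model.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

When loading, the input features need to be in the same column order used during training, and the scikit-learn version should match the one used to save the model.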

<h2>Gradient Boosting Regressor</h2>

In [41]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Initialize and fit model
gbr = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
gbr.fit(X_train, y_train)

# Evaluate
y_pred_gbr = gbr.predict(X_test)

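# Note: this redefines evaluate_model from the Full Evaluation section;
# this version prints one line per metric and returns a tuple instead of a dict
def evaluate_model(name, y_true, y_pred):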
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    print(f"{name} MAE: {mae:.2f}")
    print(f"{name} RMSE: {rmse:.2f}")
    print(f"{name} R²: {r2:.4f}")
    print(f"{name} MAPE: {mape:.4f}%")
    return mae, rmse, r2, mape

evaluate_model("Gradient Boosting", y_test, y_pred_gbr)

Gradient Boosting MAE: 22727.80
Gradient Boosting RMSE: 31410.50
Gradient Boosting R²: 0.9693
Gradient Boosting MAPE: 4.5996%


(22727.79938832143,
 np.float64(31410.503041698717),
 0.9693302132554812,
 np.float64(4.5996269593900845))