<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Singapore HDB Resale Price Predictions

--- 
**Primary Learning Objectives:**
1. Creating and iteratively refining a regression model
2. Using [Kaggle](https://www.kaggle.com/) to practice the modeling process
3. Providing business insights through reporting and presentation.

---

## Contents:
- [Model Tuning and Evaluation](#Model-Tuning-and-Evaluation)

In [1]:
import dill
import pandas as pd
import numpy as np

from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, PolynomialFeatures, StandardScaler
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_percentage_error, mean_squared_error

## Data and Preprocessor Import

In [2]:
# import clean data
X = pd.read_csv('../datasets/X.csv', index_col='Unnamed: 0')
y = pd.read_csv('../datasets/y.csv', index_col='Unnamed: 0').squeeze() # convert y into a Series from a Dataframe to prevent errors

In [3]:
# define folder path for models
folder_path = '../models/'

In [4]:
# load preprocessors (the context manager closes each file after loading)
with open(folder_path + 'preprocessor_A.sav', 'rb') as f:
    preprocessor_A = dill.load(f)
with open(folder_path + 'preprocessor_B.sav', 'rb') as f:
    preprocessor_B = dill.load(f)
with open(folder_path + 'preprocessor_C.sav', 'rb') as f:
    preprocessor_C = dill.load(f)

In [5]:
# load regression transformer (logs the resale price before fitting)
with open(folder_path + 'lr_log.sav', 'rb') as f:
    lr_log_model = dill.load(f)

## Model Tuning and Evaluation

In [6]:
# conduct train-test-split for model tuning and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#### Baseline Model
Establish the baseline as a dummy regressor that always predicts the median resale price of the training set. The median is used rather than the mean because the resale price distribution is skewed.

In [7]:
# baseline model
base_model = DummyRegressor(strategy='median')

base_model.fit(X_train, y_train)

In [8]:
#Model performance in terms of r2_score
print(f'Train R2_SCORE: {base_model.score(X_train, y_train)}')
print(f'5-Fold CV R2_SCORE: {cross_val_score(base_model, X_train, y_train).mean()}')
print(f'Test R2_SCORE: {base_model.score(X_test, y_test)}')

Train R2_SCORE: -0.03981185199003967
5-Fold CV R2_SCORE: -0.0394012087909922
Test R2_SCORE: -0.03512877630249989


In [9]:
#Model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, base_model.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(base_model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, base_model.predict(X_test)))}')

Train RMSE_SCORE: 146318.59646423583
5-Fold RMSE_SCORE: 146287.01885345049
Test RMSE_SCORE: 145232.8699783684


#### Observation:
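- The R2 score is about -0.04 and the RMSE is about $145k on the test set (about $146k on train and CV). A slightly negative R2 is expected: a constant prediction explains none of the variance, and because the median differs from the mean it does marginally worse than the mean-only reference that R2 is measured against.
- The models below are assessed against this baseline.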
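To illustrate why a median-predicting baseline lands slightly below zero, the sketch below compares `DummyRegressor` with `strategy='mean'` and `strategy='median'` on synthetic skewed data. The data, the variable names (`y_demo`, `X_demo`, `r2_by_strategy`) and the lognormal parameters are assumptions chosen for illustration; they are not the HDB dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic skewed "prices" (illustrative only, not the HDB data)
rng = np.random.default_rng(42)
y_demo = rng.lognormal(mean=13.0, sigma=0.35, size=5_000)
X_demo = np.zeros((y_demo.size, 1))  # DummyRegressor ignores the features

r2_by_strategy = {}
for strategy in ("mean", "median"):
    dummy = DummyRegressor(strategy=strategy).fit(X_demo, y_demo)
    r2_by_strategy[strategy] = r2_score(y_demo, dummy.predict(X_demo))

# Predicting the mean gives R2 of zero on the data it was fitted to;
# predicting the median gives a slightly negative R2 when the data are skewed.
print(r2_by_strategy)
```

On the fitting data, the gap between the two scores equals the squared difference between mean and median divided by the variance, so it grows with the skew.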

#### Model A (simple model without amenities)

In [10]:
# create pipeline to combine preprocessor with regressor
model_A = Pipeline(
    steps=[
        ('preproc', preprocessor_A),
        ('lr', LinearRegression())
    ]
)

In [11]:
# fit model
model_A.fit(X_train, y_train)

In [12]:
#Model performance in terms of r2_score
print(f'Train R2_SCORE: {model_A.score(X_train, y_train)}')
print(f'5-Fold CV R2_SCORE: {cross_val_score(model_A, X_train, y_train).mean()}')
print(f'Test R2_SCORE: {model_A.score(X_test, y_test)}')

Train R2_SCORE: 0.8524633474139972
5-Fold CV R2_SCORE: 0.8521813370398386
Test R2_SCORE: 0.8510185891710963


In [13]:
# Model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, model_A.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(model_A, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, model_A.predict(X_test)))}')

Train RMSE_SCORE: 55115.291381977644
5-Fold RMSE_SCORE: 55166.56291912706
Test RMSE_SCORE: 55097.73804994464


Observation:
- R2 score is about 0.85 and consistent across the train, CV and test sets.
- Overfitting or underfitting is therefore unlikely.
- RMSE improved to about $55k (from about $145k for the baseline).
- Model A performs better than the baseline model.
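The same six numbers (train, CV and test for both R2 and RMSE) are reported for every model in this notebook. One way to avoid repeating the print statements is a small helper; the sketch below is a hypothetical `report_scores` function, not part of the original notebook, demonstrated on synthetic data from `make_regression` rather than the HDB dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split


def report_scores(model, X_train, y_train, X_test, y_test, cv=5):
    """Return train, CV and test R2 and RMSE for a fitted estimator (hypothetical helper)."""
    def rmse(X, y):
        return float(np.sqrt(mean_squared_error(y, model.predict(X))))

    # cross_val_score clones the estimator, so the fitted model is left untouched
    cv_r2 = cross_val_score(model, X_train, y_train, cv=cv)
    cv_rmse = -cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return {
        "train_r2": model.score(X_train, y_train),
        "cv_r2": cv_r2.mean(),
        "test_r2": model.score(X_test, y_test),
        "train_rmse": rmse(X_train, y_train),
        "cv_rmse": cv_rmse.mean(),
        "test_rmse": rmse(X_test, y_test),
    }


# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_model = LinearRegression().fit(X_tr, y_tr)
scores = report_scores(demo_model, X_tr, y_tr, X_te, y_te)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions, a call such as `report_scores(model_A, X_train, y_train, X_test, y_test)` would return the same quantities that the two evaluation cells print.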

#### Model A2 (simple model without amenities, log-transformed resale price)

In [14]:
# create pipeline to combine preprocessor with regressor that logs resale price
model_A2 = Pipeline(
    steps=[
        ('preproc', preprocessor_A),
        ('regr', lr_log_model)
    ]
)

In [15]:
# fit model
model_A2.fit(X_train, y_train)

In [16]:
# Model performance in terms of r2_score
print(f'Train R2_SCORE: {model_A2.score(X_train, y_train)}')
print(f'5-Fold CV R2_SCORE: {cross_val_score(model_A2, X_train, y_train).mean()}')
print(f'Test R2_SCORE: {model_A2.score(X_test, y_test)}')

Train R2_SCORE: 0.8652814615512272
5-Fold CV R2_SCORE: 0.8648076814198437
Test R2_SCORE: 0.8648500432152446


In [17]:
# Model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, model_A2.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(model_A2, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, model_A2.predict(X_test)))}')

Train RMSE_SCORE: 52666.66593707455
5-Fold RMSE_SCORE: 52756.42006619719
Test RMSE_SCORE: 52477.8077528423


Observations:
- R2 score is about 0.865 and consistent across the train, CV and test sets.
- Overfitting or underfitting is unlikely.
- RMSE improved further to about $52k, so modelling the log of the resale price helps.
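The definition of `lr_log_model` is not shown in this notebook, since it is loaded from `lr_log.sav`. Assuming it follows the same `TransformedTargetRegressor` pattern that the regularised models use later on, a minimal stand-in might look like the sketch below. The name `lr_log_sketch` and the synthetic data are illustrative assumptions, not the saved model.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data where log(y) is linear in x (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(1_000, 1))
y_demo = np.exp(12.0 + 0.5 * X_demo[:, 0] + rng.normal(0, 0.1, size=1_000))

# Assumed stand-in for lr_log_model: fit on log(y), predict on the original scale
lr_log_sketch = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
lr_log_sketch.fit(X_demo, y_demo)

# The inner regressor is fitted in log space ...
slope = lr_log_sketch.regressor_.coef_[0]
# ... while predict() applies np.exp, so predictions come back in price units
preds = lr_log_sketch.predict(X_demo)

print(f"slope in log space: {slope:.3f}")
print(f"smallest prediction: {preds.min():.0f}")
```

Because predictions are mapped back through `np.exp`, they are always positive, and `score` and RMSE are computed on the original price scale, which keeps them comparable with Model A.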

#### Model B (model with amenities)

In [18]:
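# create pipeline to combine preprocessor with regressor that logs resale price, since that gave better results earlier
# note: lr_log_model is the same object used in model_A2, so fitting model_B refits it in place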
model_B = Pipeline(
    steps=[
        ('preproc', preprocessor_B),
        ('regr', lr_log_model)
    ]
)

In [19]:
# fit model
model_B.fit(X_train, y_train)

In [20]:
# model performance in terms of r2_score
print(f'Train R2_SCORE: {model_B.score(X_train, y_train)}')
print(f'5-Fold CV R2_SCORE: {cross_val_score(model_B, X_train, y_train).mean()}')
print(f'Test R2_SCORE: {model_B.score(X_test, y_test)}')

Train R2_SCORE: 0.8957748846896925
5-Fold CV R2_SCORE: 0.8951937911873458
Test R2_SCORE: 0.8966134424092185


In [21]:
# model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, model_B.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(model_B, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, model_B.predict(X_test)))}')

Train RMSE_SCORE: 46324.24363858044
5-Fold RMSE_SCORE: 46444.836006691636
Test RMSE_SCORE: 45898.63458917098


Observation:
- R2 score improved further to about 0.90, with the improvement consistent across the train, CV and test sets.
- RMSE improved further to about $46k.
- Model B performs better than the models above.

#### Model C (model with amenities and interaction)

In [22]:
# create pipeline to combine preprocessor with regressor
model_C = Pipeline(
    steps=[
        ('preproc', preprocessor_C),
        ('regr', lr_log_model)
    ]
)

In [23]:
# fit model
model_C.fit(X_train, y_train)

In [24]:
# Model performance in terms of r2_score
print(f'Train R2_SCORE: {model_C.score(X_train, y_train)}')
print(f'5-Fold CV R2_SCORE: {cross_val_score(model_C, X_train, y_train).mean()}')
print(f'Test R2_SCORE: {model_C.score(X_test, y_test)}')

Train R2_SCORE: 0.8959182732515444
5-Fold CV R2_SCORE: 0.8953152095246144
Test R2_SCORE: 0.8968291794637823


In [25]:
# Model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, model_C.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(model_C, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, model_C.predict(X_test)))}')

Train RMSE_SCORE: 46292.367191142745
5-Fold RMSE_SCORE: 46417.416925415615
Test RMSE_SCORE: 45850.72116834144


Observation:
- Marginal improvement in R2 score compared to Model B. R2 scores are consistent across the train, CV and test sets.
- Marginal improvement in RMSE (test RMSE of about $45.85k against about $45.90k for Model B).
- Model C performs slightly better than Model B. On balance, Model B would be the choice for production because of its lower complexity and lower risk of overfitting.
- To address the risk of overfitting, we will regularise Model C and examine the results.

#### Regularise Model C (model with amenities, interaction and regularisation)
We will apply L2 (Ridge), L1 (Lasso) and combined L1/L2 (ElasticNet) regularisation. Features are standardised first so that the penalty treats all coefficients on the same scale.
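As a rough illustration of how the two penalties differ, the sketch below fits `RidgeCV` and `LassoCV` on standardised synthetic data and inspects the selected `alpha_` and the number of coefficients set exactly to zero. The dataset from `make_regression` and all variable names are assumptions for illustration; the actual pipelines in this notebook also include `preprocessor_C` and the log-transformed target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of them informative (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=20, n_informative=5, noise=5.0, random_state=42
)

ridge = make_pipeline(StandardScaler(), RidgeCV()).fit(X_demo, y_demo)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42)).fit(X_demo, y_demo)

# The last pipeline step is the fitted regressor
ridge_coefs = ridge[-1].coef_
lasso_coefs = lasso[-1].coef_

# L2 shrinks coefficients towards zero; L1 can set some of them exactly to zero
n_zero_ridge = int(np.sum(ridge_coefs == 0))
n_zero_lasso = int(np.sum(lasso_coefs == 0))

print(f"RidgeCV alpha: {ridge[-1].alpha_}, zero coefficients: {n_zero_ridge}")
print(f"LassoCV alpha: {lasso[-1].alpha_:.4f}, zero coefficients: {n_zero_lasso}")
```

When the regressor is wrapped in a `TransformedTargetRegressor`, as in the cells that follow, the fitted inner estimator is reached through its `regressor_` attribute before reading `alpha_` or `coef_`.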

In [26]:
# create regressor with regularisation through Ridge (RidgeCV selects alpha by cross-validation)
ridge_log_model = TransformedTargetRegressor(
    regressor=RidgeCV(),
    func=np.log,
    inverse_func=np.exp
)

In [27]:
# create pipeline to join preprocessor and regressor and instantiate StandardScaler
model_CR = Pipeline(
    steps=[
        ('preproc', preprocessor_C),
        ('ss', StandardScaler()),
        ('regr', ridge_log_model),
    ]
)

In [28]:
#fit model
model_CR.fit(X_train, y_train)

In [29]:
# Model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, model_CR.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(model_CR, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, model_CR.predict(X_test)))}')

Train RMSE_SCORE: 46294.74926123444
5-Fold RMSE_SCORE: 46418.670425164455
Test RMSE_SCORE: 45852.62696180347


In [30]:
# create regressor with regularisation through Lasso (LassoCV selects alpha by cross-validation)
lasso_log_model = TransformedTargetRegressor(
    regressor=LassoCV(),
    func=np.log,
    inverse_func=np.exp
)

In [31]:
# create pipeline to join preprocessor and regressor
model_CR2 = Pipeline(
    steps=[
        ('preproc', preprocessor_C),
        ('ss', StandardScaler()),
        ('regr', lasso_log_model)
    ]
)

In [32]:
#fit model
model_CR2.fit(X_train, y_train)

In [33]:
# Model performance in terms of rmse
print(f'Train RMSE_SCORE: {np.sqrt(mean_squared_error(y_train, model_CR2.predict(X_train)))}')
print(f'5-Fold RMSE_SCORE: {-1 * (cross_val_score(model_CR2, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")).mean()}')
print(f'Test RMSE_SCORE: {np.sqrt(mean_squared_error(y_test, model_CR2.predict(X_test)))}')

Train RMSE_SCORE: 46756.108776539084
5-Fold RMSE_SCORE: 46891.213454505276
Test RMSE_SCORE: 46258.601692450524


In [34]:
# create regressor with regularisation through ElasticNet (alpha and l1_ratio are tuned with GridSearchCV below)
enet_log_model = TransformedTargetRegressor(
    regressor=ElasticNet(),
    func=np.log,
    inverse_func=np.exp
)

In [35]:
# create pipeline to join preprocessor and regressor
model_CR3 = Pipeline(
    steps=[
        ('preproc', preprocessor_C),
        ('ss', StandardScaler()),
        ('regr', enet_log_model)
    ]
)

In [36]:
# check the names of the parameters to set
model_CR3['regr'].get_params()

{'check_inverse': True,
 'func': <ufunc 'log'>,
 'inverse_func': <ufunc 'exp'>,
 'regressor__alpha': 1.0,
 'regressor__copy_X': True,
 'regressor__fit_intercept': True,
 'regressor__l1_ratio': 0.5,
 'regressor__max_iter': 1000,
 'regressor__positive': False,
 'regressor__precompute': False,
 'regressor__random_state': None,
 'regressor__selection': 'cyclic',
 'regressor__tol': 0.0001,
 'regressor__warm_start': False,
 'regressor': ElasticNet(),
 'transformer': None}

In [37]:
# Create dictionary of hyperparameters.
model_CR3_params = {
    'regr__regressor__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
    'regr__regressor__alpha':  [0.0001, 0.001, 0.01, 0.1, 1, 10],
    'regr__regressor__max_iter': [1000]
}

In [38]:
# Instantiate our GridSearchCV object
model_CR3_grid = GridSearchCV(
    estimator=model_CR3,
    param_grid=model_CR3_params,
    scoring='neg_mean_squared_error'
)

In [39]:
# Fit the GridSearchCV object to the data - Approximately 15mins!
model_CR3_grid.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


In [40]:
# examine best estimator hyperparameters
model_CR3_grid.best_estimator_

In [41]:
# examine cv results
pd.DataFrame(model_CR3_grid.cv_results_).head(20)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_regr__regressor__alpha | param_regr__regressor__l1_ratio | param_regr__regressor__max_iter | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.638054 | 0.607637 | 0.278484 | 0.044407 | 0.0001 | 0.1 | 1000 | {'regr__regressor__alpha': 0.0001, 'regr__regr... | -2130755000.0 | -2132674000.0 | -2092343000.0 | -2349391000.0 | -2117062000.0 | -2164445000.0 | 93587850.0 | 1 |
| 1 | 6.902224 | 0.49528 | 0.290026 | 0.043864 | 0.0001 | 0.3 | 1000 | {'regr__regressor__alpha': 0.0001, 'regr__regr... | -2133671000.0 | -2136271000.0 | -2095572000.0 | -2367485000.0 | -2121237000.0 | -2170847000.0 | 99369470.0 | 2 |
| 2 | 5.740573 | 1.288983 | 0.292414 | 0.034081 | 0.0001 | 0.5 | 1000 | {'regr__regressor__alpha': 0.0001, 'regr__regr... | -2136493000.0 | -2139914000.0 | -2098845000.0 | -2385728000.0 | -2125387000.0 | -2177273000.0 | 105219200.0 | 3 |
| 3 | 4.088149 | 1.601028 | 0.292807 | 0.053039 | 0.0001 | 0.7 | 1000 | {'regr__regressor__alpha': 0.0001, 'regr__regr... | -2139023000.0 | -2142952000.0 | -2101935000.0 | -2401715000.0 | -2129382000.0 | -2183001000.0 | 110290800.0 | 5 |
| 4 | 2.965467 | 0.332276 | 0.270192 | 0.039259 | 0.0001 | 0.9 | 1000 | {'regr__regressor__alpha': 0.0001, 'regr__regr... | -2141525000.0 | -2145358000.0 | -2104659000.0 | -2408436000.0 | -2132302000.0 | -2186456000.0 | 111898300.0 | 6 |
| 5 | 2.90797 | 0.222241 | 0.294822 | 0.048627 | 0.001 | 0.1 | 1000 | {'regr__regressor__alpha': 0.001, 'regr__regre... | -2141984000.0 | -2144825000.0 | -2104095000.0 | -2392474000.0 | -2129865000.0 | -2182648000.0 | 105893900.0 | 4 |
| 6 | 2.447844 | 0.327197 | 0.284761 | 0.047941 | 0.001 | 0.3 | 1000 | {'regr__regressor__alpha': 0.001, 'regr__regre... | -2158755000.0 | -2161552000.0 | -2120047000.0 | -2449857000.0 | -2147600000.0 | -2207562000.0 | 122033200.0 | 7 |
| 7 | 2.25483 | 0.137453 | 0.293781 | 0.041191 | 0.001 | 0.5 | 1000 | {'regr__regressor__alpha': 0.001, 'regr__regre... | -2176809000.0 | -2182179000.0 | -2138803000.0 | -2501101000.0 | -2167627000.0 | -2233304000.0 | 134732900.0 | 8 |
| 8 | 2.159638 | 0.106347 | 0.275594 | 0.024613 | 0.001 | 0.7 | 1000 | {'regr__regressor__alpha': 0.001, 'regr__regre... | -2190888000.0 | -2200451000.0 | -2153386000.0 | -2544974000.0 | -2182794000.0 | -2254499000.0 | 146088300.0 | 10 |
| 9 | 1.994256 | 0.182895 | 0.273781 | 0.040374 | 0.001 | 0.9 | 1000 | {'regr__regressor__alpha': 0.001, 'regr__regre... | -2205958000.0 | -2219268000.0 | -2167933000.0 | -2580575000.0 | -2196565000.0 | -2274060000.0 | 154181700.0 | 11 |


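Observation:
- In the grid search results shown, the best-ranked ElasticNet combination is alpha = 0.0001 with l1_ratio = 0.1, and the eight best-ranked combinations all use alpha <= 0.001, so the search favours very weak regularisation. The best mean CV MSE of about 2.16e9 corresponds to an RMSE of roughly $46.5k.
- RMSE scores for the regularised models are marginally worse than for the original Model C without regularisation (5-fold RMSE of about $46,419 for Ridge and $46,891 for Lasso, against $46,417 for Model C). This is unsurprising given that only a few features were used in the model. We will therefore use the original Model C for further evaluation.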
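Because the grid search is scored with `neg_mean_squared_error`, its `mean_test_score` values are negative MSEs and are not directly comparable with the RMSE figures reported for the other models. The sketch below shows one way to convert and rank them; it runs on synthetic data with a bare `ElasticNet` and a small grid, all of which are assumptions for illustration rather than the notebook's `model_CR3_grid`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

grid = GridSearchCV(
    estimator=ElasticNet(max_iter=10_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1], "l1_ratio": [0.1, 0.5, 0.9]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_demo, y_demo)

results = pd.DataFrame(grid.cv_results_)
# Square root of the mean CV MSE: close to, but not identical to, the mean of per-fold RMSEs
results["approx_rmse"] = np.sqrt(-results["mean_test_score"])

summary = results.sort_values("rank_test_score")[
    ["param_alpha", "param_l1_ratio", "approx_rmse", "rank_test_score"]
]
print(summary.head())
```

Sorting by `rank_test_score` also avoids reading the best parameters off the first rows of `cv_results_`, which are ordered by the parameter grid rather than by score.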