# Model Prediction Task 1

## Assumptions

1) The dataset provided represents the sale of all residential properties in Singapore.
2) The initial lease for all properties is 99 years.
3) The flat_types of 1 ROOM, 2 ROOM, 3 ROOM, 4 ROOM, 5 ROOM, EXECUTIVE, MULTI-GENERATION have ascending floor areas on average, i.e., an average 4 ROOM flat is bigger than an average 3 ROOM flat.
4) Each street_name can only be part of one town.
5) Block numbers have no impact on property prices. 
6) Buyers can finance their property purchase with either a bank loan or an HDB housing loan, with the latter's interest rate set at 0.1% above the prevailing Central Provident Fund Ordinary Account interest rate.
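
Assumptions 3 and 4 can be checked against the data rather than taken on faith. The sketch below shows one way to do so. It uses a small hypothetical DataFrame whose rows are made up; only the column names (`town`, `street_name`, `flat_type`, `floor_area_sqm`) mirror those used later in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the notebook.
df = pd.DataFrame({
    "town": ["ANG MO KIO", "ANG MO KIO", "BEDOK", "BEDOK"],
    "street_name": ["ANG MO KIO AVE 1", "ANG MO KIO AVE 1", "BEDOK NTH RD", "BEDOK NTH RD"],
    "flat_type": ["3 ROOM", "4 ROOM", "3 ROOM", "4 ROOM"],
    "floor_area_sqm": [67.0, 92.0, 68.0, 91.0],
})

# Assumption 4: every street_name maps to exactly one town.
towns_per_street = df.groupby("street_name")["town"].nunique()
streets_in_multiple_towns = towns_per_street[towns_per_street > 1]

# Assumption 3: the mean floor area increases along the flat_type order.
flat_type_order = ["1 ROOM", "2 ROOM", "3 ROOM", "4 ROOM", "5 ROOM", "EXECUTIVE", "MULTI-GENERATION"]
mean_area = df.groupby("flat_type")["floor_area_sqm"].mean()
mean_area = mean_area.reindex([t for t in flat_type_order if t in mean_area.index])
areas_ascending = bool(mean_area.is_monotonic_increasing)

print("Streets in more than one town:", list(streets_in_multiple_towns.index))
print("Mean floor area ascending by flat type:", areas_ascending)
```

On the real data, a non-empty `streets_in_multiple_towns` or a `False` value for `areas_ascending` would indicate that the corresponding assumption needs revisiting.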

## Load and Preprocess Data

In [1]:
import warnings
import json
import pandas as pd
from src.model import Model

warnings.filterwarnings(action="ignore", category=UserWarning)

model = Model(EDA=True)
model.preprocess_data()
df_pp = model.pp_data

## EDA

### Continuous Variables

#### Year

In [2]:
from jupyter_dash import JupyterDash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px

df_plot_data = df_pp.groupby("year").agg({"resale_price": "mean"}).reset_index()
fig = px.line(df_plot_data, x="year", y="resale_price", title="Mean resale prices over time.")

flat_type_options = list(df_pp.flat_type.unique())
flat_type_options.sort()
town_options = list(df_pp.town.unique())
town_options.sort()

# Initialise the Jupyter Dash app
app = JupyterDash(__name__)

# Define the app layout
app.layout = html.Div([
    html.Div([
        dcc.Dropdown(
            id="aggregation_type",
            options=[{"label": i.capitalize() + " resale price aggregation", "value": i} for i in ["mean", "median"] ],
            value="mean"
        )
    ], style={"display": "inline-block", "width": "32%", "padding": "0 0px"}),

    html.Div([
        dcc.Dropdown(
            id="flat_type",
            options=[{"label": i, "value": i} for i in ["All flat types"] + flat_type_options],
            value="All flat types"
        )
    ], style={"display": "inline-block", "width": "32%", "padding": "0 10px"}),

    html.Div([
        dcc.Dropdown(
            id="town",
            options=[{"label": i, "value": i} for i in ["All towns"] + town_options],
            value="All towns"
        )
    ], style={"display": "inline-block", "width": "32%", "padding": "0 0px"}),
    dcc.Graph(id="line-chart", figure=fig),
])

@app.callback(
    Output("line-chart", "figure"),
    [Input("aggregation_type", "value"),
     Input("flat_type", "value"),
     Input("town", "value")]
)
def update_chart(selected_aggregation_type, selected_flat_type, selected_town):
    if selected_flat_type == "All flat types":
        filtered_data = df_pp
    else:
        filtered_data = df_pp[df_pp["flat_type"] == selected_flat_type]
    if selected_town != "All towns":
        filtered_data = filtered_data[filtered_data["town"] == selected_town]
    filtered_data = filtered_data.groupby("year").agg({"resale_price": selected_aggregation_type}).reset_index()
    fig = px.line(filtered_data, x="year", y="resale_price", title=f"{selected_aggregation_type.capitalize()} resale prices over the years for {selected_flat_type} in {selected_town}.")
    return fig

if __name__ == "__main__":
    app.run_server(port=8050, debug=True)

Dash app running on http://127.0.0.1:8050/


#### Others

In [3]:
df_corr = pd.DataFrame()

for corr_coef in ["pearson", "spearman"]: # Measure of linearity | strength and direction of monotonicity 
    df_corr_ = df_pp[["year", "floor_area_sqm", "remaining_lease", "storey", "resale_price"]].corr(method=corr_coef)[["resale_price"]].rename({"resale_price": f"resale_price_{corr_coef}"}, axis=1).iloc[:-1]
    df_corr = pd.concat([df_corr, df_corr_], axis=1)

round(df_corr.sort_values(by="resale_price_pearson", ascending=False), 3)

| | resale_price_pearson | resale_price_spearman |
|---|---|---|
| floor_area_sqm | 0.625 | 0.656 |
| year | 0.605 | 0.598 |
| storey | 0.212 | 0.13 |
| remaining_lease | -0.013 | 0.036 |


#### Observations

1) Resale prices were observed to increase over the years both visually and statistically (Pearson and Spearman Rank correlation coefficients of 0.605 and 0.598 respectively). This observation is consistent with inflation. Therefore, a unit similar to one sold today is likely to sell for more in the future.
2) Floor area was observed to be positively correlated with resale prices, with Pearson and Spearman Rank correlation coefficients of 0.625 and 0.656 respectively. Therefore, resale prices can be expected to increase as the floor area of the property increases.
3) The remaining lease on the property at the time of sale showed almost no correlation with resale prices (Pearson -0.013, Spearman 0.036), and the floor level of the property showed only a weak positive correlation (Pearson 0.212, Spearman 0.130).

The continuous variables ```year``` and ```floor_area_sqm``` are likely to be important features for resale price prediction.
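
Both coefficients are reported because they measure different things: Pearson captures the strength of a linear relationship, while Spearman captures the strength of any monotonic relationship. A minimal sketch on hypothetical data (not taken from the dataset) shows how the two can diverge:

```python
import pandas as pd

# Hypothetical data: y increases monotonically, but not linearly, with x.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman is computed on ranks, so a perfectly monotonic relationship gives a coefficient of 1 even when Pearson falls short of 1 because the relationship is curved.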

### Categorical Variables

In [4]:
df_plot_2_data = df_pp.groupby("town").agg({"resale_price": "mean"}).reset_index()
fig2 = px.bar(df_plot_2_data, x="town", y="resale_price", title="Mean resale prices by town.")
cat_var_options = ["town", "flat_type", "flat_model", "street_name"]

# Initialise the Dash app
app2 = JupyterDash(__name__)

# Define the app layout
app2.layout = html.Div([
    html.Div([
        dcc.Dropdown(
            id="aggregation_type",
            options=[{"label": i.capitalize() + " resale price aggregation", "value": i} for i in ["mean", "median"] ],
            value="mean"
        )
    ], style={"display": "inline-block", "width": "32%", "padding": "0 0px"}),

    html.Div([
        dcc.Dropdown(
            id="cat_var",
            options=[{"label": f"Sales by {i}", "value": i} for i in cat_var_options],
            value="town"
        )
    ], style={"display": "inline-block", "width": "32%", "padding": "0 10px"}),
    dcc.Graph(id="line-chart", figure=fig2),
])

@app2.callback(
    Output("line-chart", "figure"),
    [Input("aggregation_type", "value"),
     Input("cat_var", "value")]
)
def update_chart(selected_aggregation_type, selected_cat_var):
    filtered_data = df_pp.groupby(selected_cat_var).agg({"resale_price": selected_aggregation_type}).reset_index()
    fig2 = px.bar(filtered_data, x=selected_cat_var, y="resale_price", title=f"{selected_aggregation_type.capitalize()} resale prices by {selected_cat_var}.")
    return fig2

if __name__ == "__main__":
    app2.run_server(port=8051, debug=True)

Dash app running on http://127.0.0.1:8051/


#### Flat Model and Flat Type

In [5]:
cat_col_pair = ["flat_model", "flat_type"]

pd.DataFrame(df_pp[cat_col_pair].value_counts()).reset_index().sort_values(by=cat_col_pair).set_index(cat_col_pair)

| flat_model | flat_type | count |
|---|---|---|
| 2-ROOM | 2 ROOM | 40 |
| ADJOINED FLAT | 3 ROOM | 2 |
| ADJOINED FLAT | 4 ROOM | 164 |
| ADJOINED FLAT | 5 ROOM | 602 |
| ADJOINED FLAT | EXECUTIVE | 317 |
| APARTMENT | EXECUTIVE | 32004 |
| DBSS | 2 ROOM | 1 |
| DBSS | 3 ROOM | 156 |
| DBSS | 4 ROOM | 549 |
| DBSS | 5 ROOM | 903 |


In [6]:
unique_flat_models = len(df_pp["flat_model"].unique())
unique_flat_types = len(df_pp["flat_type"].unique())

print(f"Number of unique flat models: {unique_flat_models}.\nNumber of unique flat types: {unique_flat_types}.")

Number of unique flat models: 20.
Number of unique flat types: 7.


#### Town and Street Name

In [7]:
unique_towns = len(df_pp["town"].unique())
unique_street_names = len(df_pp["street_name"].unique())

print(f"Number of unique towns: {unique_towns}.\nNumber of unique street names: {unique_street_names}.")

Number of unique towns: 27.
Number of unique street names: 568.


#### Observations

**Flat Model and Flat Type**
- Both variables were visually observed to have some influence on resale prices. However, the influence of flat type appeared to be more pronounced than flat model.
- The number of unique flat types of 7 is smaller than the number of unique flat models at 20. 
- It was observed that one flat model, e.g., 'IMPROVED' can be attributed to multiple flat types. 

As it is assumed that the flat type conveys more specific information on the size of the property (and the subsequent resale price) compared to the flat model, it is postulated that ```flat_type``` will be an important predictor of resale prices while ```flat_model``` is likely not an important predictor.

**Town and Street Name**
- Both variables were visually observed to have some influence on resale prices.
- It was observed that the 568 unique street names far outnumbered the 27 unique towns.

It was observed that resale prices were influenced by location. However, as it is assumed that each street name is part of one town, the much larger number of street names is likely to result in ```street_name``` being a much less important predictor of resale prices compared to ```town```.

## Model Evaluation

In [8]:
with open("artifacts/model_metadata.json", "rb") as f:
    model_metadata = json.load(f)
    eval_metrics = model_metadata["EVAL_METRICS"]
    feature_importances = model_metadata["FEATURE_IMPORTANCES"]

print(eval_metrics)

{'TRAIN_MAPE': 0.06279472569184753, 'TEST_MAPE': 0.05762537169215347}


The mean absolute percentage error (MAPE) on the test set was 5.8%, slightly lower than the 6.3% on the training set. This means that, averaged across the entire test set, the predicted resale price deviated from the actual resale price by 5.8%; individual predictions may be closer or further off.
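
For reference, MAPE is the mean of the absolute errors expressed as a fraction of the actual values. The sketch below computes it by hand on hypothetical prices; the actual evaluation code in `src.model` is not shown in this notebook and may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical prices, not taken from the dataset.
actual = [400_000, 500_000, 800_000]
predicted = [380_000, 525_000, 840_000]

score = mape(actual, predicted)
print(f"MAPE: {score:.1%}")
```

Each of the three hypothetical predictions is off by exactly 5% of its actual price, so the MAPE is 5% even though the absolute errors range from $20,000 to $40,000.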

## Model Interpretation

### Feature Explanation

The model predicts resale price based on the following input features:

1) year
    - The calendar year during which the sale occurred.
2) floor_area_sqm
    - The floor area of the property in square metres.
3) flat_type 
    - The flat type, e.g., 1 ROOM. It is assumed that the flat type has a direct impact on the floor area of the property.
4) town
    - The town that the property was located in, e.g., BUKIT TIMAH.
5) street_name
    - The street name that the property was located on, e.g., PINE CL. It is assumed that each street name can only be part of one town.
6) flat_model
    - The flat model, e.g., IMPROVED.
7) remaining_lease
    - The number of years remaining on the lease at the time of the sale. The initial lease is assumed to be 99 years.
8) storey
    - The approximate floor level of the property, averaged from the range provided. E.g., 04 TO 06 will be converted to a floor level of 5.
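
The two derived features, `storey` and `remaining_lease`, can be sketched as below. This is an illustration of the descriptions above, not the preprocessing code in `src.model`, which is not shown here. The function names and the `lease_start_year` parameter are hypothetical.

```python
def storey_midpoint(storey_range: str) -> float:
    """Convert a storey range such as '04 TO 06' to its midpoint."""
    low, high = (int(part) for part in storey_range.split(" TO "))
    return (low + high) / 2


def remaining_lease(sale_year: int, lease_start_year: int, initial_lease: int = 99) -> int:
    """Years left on the lease at the time of sale, assuming a 99-year initial lease."""
    return initial_lease - (sale_year - lease_start_year)


print(storey_midpoint("04 TO 06"))  # 5.0
print(remaining_lease(2019, 2012))  # 92
```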

### Feature Importance

#### Overall

In [9]:
df_feat_imp = pd.DataFrame(feature_importances, index=["Importance (%)"]).T.reset_index().rename({"index": "Feature"}, axis=1)
df_feat_imp

| | Feature | Importance (%) |
|---|---|---|
| 0 | year | 34.993531 |
| 1 | floor_area_sqm | 21.301286 |
| 2 | flat_type | 15.359727 |
| 3 | town | 14.34364 |
| 4 | street_name | 4.257389 |
| 5 | flat_model | 3.430442 |
| 6 | remaining_lease | 3.400329 |
| 7 | storey | 2.913657 |


In [10]:

feat_imp_plot = px.bar(df_feat_imp, x="Feature", y="Importance (%)")
feat_imp_plot.show()

In [11]:
top_4_feat_imp = df_feat_imp.iloc[0:4]["Importance (%)"].sum()
top_4_feat_imp

85.99818293528062

The top 4 features (year, floor_area_sqm, flat_type and town) together accounted for 86% of the total feature importance, with the remaining 14% split between street_name, flat_model, remaining_lease and storey. Of the top 4, year stood out as the most important feature at 35%, 13.7 percentage points higher than the next most important feature, floor_area_sqm, at 21.3%. flat_type and town had similar feature importances at 15.4% and 14.3% respectively.

#### Random Example

In [12]:
import shap

# Load model
model.evaluate("model.pkl")

# Initialise SHAP explainer
explainer = shap.Explainer(model.model)

# Calculate SHAP values
shap_values = explainer.shap_values(model.X_test)

# Extract SHAP values for a sample test set row and display the values alongside the data
sample_test_set_row_idx = 32
sample_test_set_row = model.X_test.iloc[sample_test_set_row_idx:sample_test_set_row_idx+1].T
sample_test_set_row["SHAP_VALUES"] = shap_values[sample_test_set_row_idx]
sample_test_set_row["SHAP_VALUES"] = round(sample_test_set_row["SHAP_VALUES"], 2)
sample_test_set_row = sample_test_set_row.sort_values(by="SHAP_VALUES", key=lambda x: abs(x), ascending=False)
sample_test_set_row = sample_test_set_row.rename({sample_test_set_row.columns[0]: "TEST_DATA"}, axis=1)
sample_test_set_row

| | TEST_DATA | SHAP_VALUES |
|---|---|---|
| year | 2019 | 110171.63 |
| storey | 20.0 | 38783.29 |
| floor_area_sqm | 96.0 | -21593.55 |
| flat_model | PREMIUM APARTMENT | 17904.82 |
| street_name | FERNVALE RD | 10822.86 |
| town | SENGKANG | -9045.65 |
| remaining_lease | 92 | 3858.23 |
| flat_type | 4 ROOM | -2144.01 |


In the test set sample above, the sale year had the greatest positive influence on the predicted resale price. This is consistent with expectations of inflation, as the sale year of 2019 is recent relative to the training data, which ended in 2018. The storey also had a positive influence on the resale price, as the value of 20 indicates that the flat is on a high floor, which generally commands a higher sale price. The floor area had the greatest negative influence on the resale price, as the floor area of 96 square metres is relatively small and smaller properties are generally less expensive.
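
SHAP values are additive: for a given row, the explainer's base value plus the sum of the row's SHAP values gives the model's output for that row. The base value is not displayed in this notebook, so the sketch below only computes the net shift away from it, using the SHAP values listed for the sample row.

```python
# SHAP values copied from the sample test set row above.
shap_contributions = {
    "year": 110171.63,
    "storey": 38783.29,
    "floor_area_sqm": -21593.55,
    "flat_model": 17904.82,
    "street_name": 10822.86,
    "town": -9045.65,
    "remaining_lease": 3858.23,
    "flat_type": -2144.01,
}

# Net amount by which the features move the prediction away from the base value.
net_shift = sum(shap_contributions.values())

largest_positive = max(shap_contributions, key=shap_contributions.get)
largest_negative = min(shap_contributions, key=shap_contributions.get)

print(f"Net shift from base value: {net_shift:,.2f}")
print(f"Largest positive: {largest_positive}, largest negative: {largest_negative}")
```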

## Model Selection

**Selection of a gradient boosted tree-based model**
- Linear models:
    - Advantages:
        1) Simple to interpret
        2) Computationally efficient to train
    - Disadvantages: 
        1) Assumes linearity between features and target
        2) May underfit complex data.

- Gradient boosted tree-based models:
    - Advantages:
        1) Can handle complex data and model non-linear relationships
        2) High predictive accuracy
        3) Provides feature importance insights
    - Disadvantages:
        1) Computationally more expensive to train compared to linear models
        2) Requires careful hyperparameter tuning to avoid overfitting
    
- Deep learning models:
    - Advantages:
        1) Can handle complex data and model non-linear relationships
        2) High predictive accuracy, potentially higher than that of gradient boosted tree-based models
    - Disadvantages:
        1) Limited interpretability
        2) Requires large datasets for training
        3) Requires careful hyperparameter tuning to avoid overfitting
        4) Much more computationally expensive to train compared to gradient boosted tree-based models, particularly if GPUs are unavailable 

Therefore, a gradient boosted tree-based model was chosen because all three of its listed advantages were requirements: the need to model non-linear relationships with high predictive accuracy precluded linear models, while the need for feature importances precluded deep learning models. In addition, the lack of GPUs for model training would make deep learning models even more challenging to train. While gradient boosted tree-based models are more computationally expensive to train than linear models, the time required to train a CatBoost model (a gradient boosted tree-based model) is reasonable, with a typical non-GPU-accelerated training duration of under one hour based on the training configurations of this study.

**Selection of CatBoost**

- CatBoost was selected as the gradient boosted tree-based model of choice for the following reasons:
    1) It provides feature importance scores for model interpretability
    2) Efficient handling of categorical features with target encoding and ordered boosting
        - Ordered boosting is necessary to prevent data leakage
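
The leakage concern can be illustrated with a simplified, pure-Python sketch of ordered target encoding: each row's category is encoded using only the targets of rows that came before it, so a row never sees its own target. This is an illustration of the idea only, not CatBoost's implementation; the function name, the `prior` value and the toy data are hypothetical.

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        running_sum = sums.get(category, 0.0)
        running_count = counts.get(category, 0)
        # Smoothed mean of the earlier targets; falls back to the prior when none exist.
        encoded.append((running_sum + prior_weight * prior) / (running_count + prior_weight))
        # Update the running statistics only after the current row has been encoded.
        sums[category] = running_sum + target
        counts[category] = running_count + 1
    return encoded


# Hypothetical toy data and prior, not taken from the dataset.
towns = ["BEDOK", "BEDOK", "YISHUN", "BEDOK"]
prices = [400_000.0, 500_000.0, 300_000.0, 600_000.0]

encoded = ordered_target_encode(towns, prices, prior=450_000.0)
print(encoded)
```

A naive target encoding would instead use the mean price of all BEDOK rows, including the row being encoded, which lets the target leak into its own feature.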

While CatBoost models are generally fast to train, they can still be slower to train than other gradient boosted tree-based models such as LightGBM. However, the training duration for CatBoost remains reasonable as outlined above.

# Model Prediction Task 2

**Factors and Considerations**

- Access to updated data
    - When building predictive models, it is important to continually have access to updated data so that the model may be retrained periodically to ensure that recent trends are captured in the prediction.
        - This is particularly important if the prediction has a temporal component, as observed in the toy example where the model correctly identified the sale year as an important feature in predicting resale prices. Since the model can only learn from what it has seen in the training data, providing updated data is critical in maintaining the accuracy of the predictions.

- Ease of model retraining
    - As the model will need to be retrained periodically, model retraining should take a reasonable amount of resources to prevent cost overruns.

- User understanding and acceptance of accuracy metrics
    - Efforts should be made to get users of the predictive model to understand and accept the accuracy metrics used for model evaluation.
        - In the toy example, MAPE was used as the main accuracy metric for model evaluation. However, if users preferred a metric such as the number of predicted prices within ±$5,000 of the actual resale price, effort would be needed to explain why MAPE is more suitable, e.g., a $5,000 error is a very small margin on a $1,000,000 resale price (0.5%) but a very large one on a $10,000 resale price (50%).

- Predictive model consumption
    - Model inference should take a reasonable amount of time to ensure a good user experience.
    - Efforts should be made to discuss with users their requirements on how the predictive model is to be consumed, with the projected manpower and technology costs taken into consideration.
        - In the toy example, if the users wished to generate resale price predictions on the go within a web app, a web-based front-end will need to be developed, with the predictive model hosted on the cloud and the necessary REST API(s) developed for model inference.
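
To make the earlier point about accuracy metrics concrete, the sketch below compares a fixed ±$5,000 band with the percentage error on two prices. The predicted prices are hypothetical, and the function names are for illustration only.

```python
def within_band(actual, predicted, band=5_000):
    """True if the prediction is within a fixed dollar band of the actual price."""
    return abs(actual - predicted) <= band


def pct_error(actual, predicted):
    """Absolute error as a fraction of the actual price."""
    return abs(actual - predicted) / actual


# (actual, predicted) pairs; both predictions are off by exactly $5,000.
cases = [(1_000_000, 995_000), (10_000, 5_000)]
results = [(within_band(a, p), pct_error(a, p)) for a, p in cases]

for (actual, predicted), (in_band, error) in zip(cases, results):
    print(f"actual={actual:,} predicted={predicted:,} within_band={in_band} pct_error={error:.1%}")
```

Both predictions count as hits under the fixed band, yet their percentage errors are 0.5% and 50%, which is the difference that a percentage-based metric such as MAPE captures.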