CS4001/4042 Assignment 1, Part B, Q4
---

Model degradation is a common issue when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiment. For instance, housing prices in Singapore have been increasing, and the Singapore government has introduced three rounds of cooling measures in recent years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution on which the models were trained. Recall that machine learning models often assume that the test distribution is similar to the training distribution. When this assumption is violated, model performance is adversely affected. In the last part of this assignment, we will investigate to what extent model degradation has occurred.




---



Your co-investigators used a linear regression model to rapidly test several combinations of train/test splits and shared their findings with you in a brief report, attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [None]:
!pip install alibi-detect

In [9]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from alibi_detect.cd import TabularDrift

In [10]:
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

> Evaluate your model from B1 on data from year 2022 and report the test R2.

In [11]:
df = pd.read_csv('hdb_price_prediction.csv')

# This is the model from B1, but using year 2022 as test data.

df2020val = df[df['year']==2020]
df2022test = df[df['year']==2022]
df2019andbeforetrain = df[df['year']<=2019]

# Training the model from B1 and evaluating on year 2022 data

data_config = DataConfig(
    target=[
        "resale_price"
    ],  # target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols= ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"],
    categorical_cols=["month", "town", "flat_model_type", "storey_range"],
)

trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)

optimizer_config = OptimizerConfig(optimizer="Adam")

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # Number of nodes in each layer
    activation="ReLU"
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

tabular_model.fit(train=df2019andbeforetrain, validation=df2020val)
result = tabular_model.evaluate(df2022test)
pred_df = tabular_model.predict(df2022test)
tabular_model.save_model("Neural Networks and Deep Learning B4 2022")
loaded_model = TabularModel.load_from_checkpoint("Neural Networks and Deep Learning B4 2022")

2023-10-10 23:46:30,528 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
Global seed set to 42
2023-10-10 23:46:30,552 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-10 23:46:30,558 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-10 23:46:30,637 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-10 23:46:30,670 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-10 23:46:30,732 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at C:\Users\Gareth Thong\Neural Networks and Deep Learning B4\.lr_find_3dd2798d-753c-4431-adaa-74cd7f54f6a7.ckpt
Restored all states from the checkpoint file at C:\Users\Gareth Thong\Neural Networks and Deep Learning B4\.lr_find_3dd2798d-753c-4431-adaa-74cd7f54f6a7.ckpt
2023-10-10 23:46:36,204 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-10 23:46:36,205 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


2023-10-10 23:47:17,239 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-10 23:47:17,240 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


2023-10-10 23:47:20,509 - {pytorch_tabular.tabular_model:129} - INFO - Experiment Tracking is turned off
2023-10-10 23:47:20,514 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [12]:
from sklearn.metrics import mean_squared_error, r2_score

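# Calculate RMSE as the square root of the MSE
# (this avoids the `squared` argument, which is deprecated in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(pred_df['resale_price'], pred_df['resale_price_prediction']))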
print("RMSE is " + str(rmse))

# Calculate R^2
r2 = r2_score(pred_df['resale_price'], pred_df['resale_price_prediction'])
print("R2 is " + str(r2))

RMSE is 127542.29821209553
R2 is 0.4388472630514271


> Evaluate your model from B1 on data from year 2023 and report the test R2.

In [13]:
df2023test = df[df['year']==2023]

result = tabular_model.evaluate(df2023test)
pred_df = tabular_model.predict(df2023test)
tabular_model.save_model("Neural Networks and Deep Learning B4 2023")
loaded_model = TabularModel.load_from_checkpoint("Neural Networks and Deep Learning B4 2023")

2023-10-10 23:47:54,121 - {pytorch_tabular.tabular_model:129} - INFO - Experiment Tracking is turned off
2023-10-10 23:47:54,126 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [14]:
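# Calculate RMSE as the square root of the MSE
# (this avoids the `squared` argument, which is deprecated in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(pred_df['resale_price'], pred_df['resale_price_prediction']))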
print("RMSE is " + str(rmse))

# Calculate R^2
r2 = r2_score(pred_df['resale_price'], pred_df['resale_price_prediction'])
print("R2 is " + str(r2))

RMSE is 157166.677331193
R2 is 0.162125913544938


> Did model degradation occur for the deep learning model?


Yes, model degradation has occurred for the deep learning model. When tested on year 2021, the RMSE and R2 were 76696.923214814 and 0.7776187684416944 respectively. For years 2022 and 2023, the RMSE increased to 127542.29821209553 and 157166.677331193 respectively, while the R2 decreased to 0.4388472630514271 and 0.162125913544938 respectively. The larger error and worse model fit indicate model degradation, which is consistent with the trend reported for the linear regression model in Appendix A.
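
To make the comparison easier to read, the figures quoted above can be tabulated against the 2021 baseline. This is only a small bookkeeping sketch; the numbers are copied from the answer and the cell outputs, and the column names `rmse_change_vs_2021` and `r2_change_vs_2021` are my own.

```python
import pandas as pd

# Metrics copied from the answer and cell outputs above (not recomputed here).
metrics = pd.DataFrame(
    {
        "test_year": [2021, 2022, 2023],
        "rmse": [76696.923214814, 127542.29821209553, 157166.677331193],
        "r2": [0.7776187684416944, 0.4388472630514271, 0.162125913544938],
    }
).set_index("test_year")

# Change relative to the 2021 baseline: a positive RMSE change and a
# negative R2 change both point towards degradation.
metrics["rmse_change_vs_2021"] = metrics["rmse"] - metrics.loc[2021, "rmse"]
metrics["r2_change_vs_2021"] = metrics["r2"] - metrics.loc[2021, "r2"]

print(metrics.round(3))
```

Both change columns move monotonically in the unfavourable direction from 2021 to 2023.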



---



Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.
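
The difference between covariate shift and concept drift can be illustrated on synthetic data. The sketch below is an assumption-laden toy (a one-feature linear relationship, with made-up slopes and noise levels), not the HDB data. It fits a line on a reference period and then scores it on a sample where only P(X) moves and on a sample where only P(Y|X) changes. Label shift is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


n = 2000

# Reference period: y = 2x + noise (toy relationship, chosen arbitrarily).
x_ref = rng.normal(0.0, 1.0, n)
y_ref = 2.0 * x_ref + rng.normal(0.0, 0.5, n)
slope, intercept = np.polyfit(x_ref, y_ref, deg=1)


def predict(x):
    return slope * x + intercept


# Covariate shift: the feature distribution moves, P(Y|X) is unchanged.
x_cov = rng.normal(1.5, 1.0, n)
y_cov = 2.0 * x_cov + rng.normal(0.0, 0.5, n)

# Concept drift: the feature distribution is unchanged, P(Y|X) changes.
x_con = rng.normal(0.0, 1.0, n)
y_con = 0.5 * x_con + 1.5 + rng.normal(0.0, 0.5, n)

r2_ref = r2(y_ref, predict(x_ref))
r2_cov = r2(y_cov, predict(x_cov))
r2_con = r2(y_con, predict(x_con))

print(f"reference R2:       {r2_ref:.3f}")
print(f"covariate shift R2: {r2_cov:.3f}")
print(f"concept drift R2:   {r2_con:.3f}")
```

In this toy the covariate shift is harmless only because the fitted model matches the true relationship everywhere; with a misspecified model, covariate shift alone can also hurt. Under concept drift the R2 becomes negative, meaning the stale model does worse than predicting the mean.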

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [27]:
# Note to marker: I have not included "year" in the features to be consistent with the model trained previously.

X_ref = (df[df['year']<=2019]).drop(columns=['year', 'resale_price'])[:1000]
X_test = (df[df['year']==2023]).drop(columns=['year', 'resale_price'])[:1000]
categories_per_feature = {0:None, 1:None, 2:None, 3:None, 8:None, 11:None}

cd = TabularDrift(X_ref.to_numpy(), p_val=.05, categories_per_feature=categories_per_feature)
fpreds = cd.predict(X_test.to_numpy(), drift_type='feature')

# Column names in the same order as the columns of X_ref / X_test
feature_names = ["month", "town",
                 "full_address", "nearest_stn", "dist_to_nearest_stn",
                 "dist_to_dhoby", "degree_centrality", "eigenvector_centrality",
                 "flat_model_type", "remaining_lease_years", "floor_area_sqm", "storey_range"]
labels = ['No!', 'Yes!']

for f in range(cd.n_features):
    # Categorical features use the Chi-squared test; continuous features use K-S
    stat = 'Chi2' if f in categories_per_feature else 'K-S'
    fname = feature_names[f]
    is_drift = fpreds['data']['is_drift'][f]
    stat_val, p_val = fpreds['data']['distance'][f], fpreds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

month -- Drift? No! -- Chi2 0.000 -- p-value 1.000
town -- Drift? Yes! -- Chi2 667.474 -- p-value 0.000
full_address -- Drift? Yes! -- Chi2 1750.200 -- p-value 0.004
nearest_stn -- Drift? Yes! -- Chi2 617.871 -- p-value 0.000
dist_to_nearest_stn -- Drift? No! -- K-S 0.055 -- p-value 0.094
dist_to_dhoby -- Drift? Yes! -- K-S 0.218 -- p-value 0.000
degree_centrality -- Drift? No! -- K-S 0.029 -- p-value 0.783
eigenvector_centrality -- Drift? Yes! -- K-S 0.195 -- p-value 0.000
flat_model_type -- Drift? Yes! -- Chi2 77.586 -- p-value 0.000
remaining_lease_years -- Drift? Yes! -- K-S 0.271 -- p-value 0.000
floor_area_sqm -- Drift? Yes! -- K-S 0.134 -- p-value 0.000
storey_range -- Drift? Yes! -- Chi2 38.800 -- p-value 0.001
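
The cell above takes the first 1000 rows of each subset with `[:1000]`. If the CSV happens to be ordered chronologically, that slice may cover only a narrow time window, which would be one possible explanation for the `month` statistic of 0.000 (this is a guess, not something verified here). A random draw is closer to what the question asks for. The sketch below shows the idea with `DataFrame.sample` on a hypothetical stand-in dataframe; `toy` and `sample_features` are illustrative names, not part of the assignment code.

```python
import numpy as np
import pandas as pd

SEED = 42

# Hypothetical stand-in for the HDB dataframe (random values, illustration only).
rng = np.random.default_rng(SEED)
toy = pd.DataFrame(
    {
        "year": rng.choice([2017, 2018, 2019, 2023], size=20_000),
        "month": rng.integers(1, 13, size=20_000),
        "floor_area_sqm": rng.normal(95.0, 20.0, size=20_000),
    }
)


def sample_features(frame, mask, n=1000, seed=SEED):
    """Randomly draw n rows matching mask and drop the 'year' column."""
    return frame.loc[mask].drop(columns=["year"]).sample(n=n, random_state=seed)


X_ref_toy = sample_features(toy, toy["year"] <= 2019)
X_test_toy = sample_features(toy, toy["year"] == 2023)

print(X_ref_toy.shape, X_test_toy.shape)
print("distinct months in reference sample:", X_ref_toy["month"].nunique())
```

Fixing `random_state` keeps the draw reproducible, in line with the `SEED` used elsewhere in the notebook. Applying this to the real data would change the reported statistics, so the output above would need to be regenerated.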


> Assuming that the flurry of housing measures have made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?


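Concept drift possibly led to model degradation: if the housing measures changed P(Y|X), the relationship between the features and resale_price that the model learned from older data no longer holds for the newer data.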

> From your analysis via TabularDrift, which features contribute to this shift?


From my analysis, the features flagged as drifted (p-value below 0.05) are town, full_address, nearest_stn, dist_to_dhoby, eigenvector_centrality, flat_model_type, remaining_lease_years, floor_area_sqm and storey_range.
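
Not every flagged column is an input to the B1 model, since `full_address` and `nearest_stn` do not appear in its `DataConfig`. A short sketch that cross-checks the two lists (both copied from this notebook; the variable names are my own) narrows the answer down to the drifted features the model actually consumes.

```python
# Features flagged as drifted in the TabularDrift output above.
drifted = [
    "town", "full_address", "nearest_stn", "dist_to_dhoby",
    "eigenvector_centrality", "flat_model_type",
    "remaining_lease_years", "floor_area_sqm", "storey_range",
]

# Columns the B1 model was configured with (continuous_cols + categorical_cols).
model_features = [
    "dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality",
    "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm",
    "month", "town", "flat_model_type", "storey_range",
]

# Drifted features that are also model inputs, and model inputs not flagged.
drifted_model_inputs = [f for f in drifted if f in model_features]
stable_model_inputs = [f for f in model_features if f not in drifted]

print("drifted model inputs:", drifted_model_inputs)
print("model inputs not flagged:", stable_model_inputs)
```

Seven of the ten model inputs are flagged; `dist_to_nearest_stn`, `degree_centrality` and `month` are not.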

> Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


To address the model degradation and improve test R2 for year 2023, the model could be trained on more recent data using entries from the year 2022 and before as training data. This would take the cooling measures introduced by the Singapore government into account.

In [28]:

df2023testlastquestion = df[df['year']==2023]
df2022andbeforetrainlastquestion = df[df['year']<=2022]

data_config = DataConfig(
    target=[
        "resale_price"
    ],  # target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols= ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"],
    categorical_cols=["month", "town", "flat_model_type", "storey_range"],
)

trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)

optimizer_config = OptimizerConfig(optimizer="Adam")

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # Number of nodes in each layer
    activation="ReLU"
)

tabular_model_last_question = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

tabular_model_last_question.fit(train=df2022andbeforetrainlastquestion)
result_last_question = tabular_model_last_question.evaluate(df2023testlastquestion)
pred_df_last_question = tabular_model_last_question.predict(df2023testlastquestion)
tabular_model_last_question.save_model("Neural Networks and Deep Learning B4 last question")
loaded_model_last_question = TabularModel.load_from_checkpoint("Neural Networks and Deep Learning B4 last question")

2023-10-11 01:05:27,356 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
Global seed set to 42
2023-10-11 01:05:27,386 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-11 01:05:27,395 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-11 01:05:27,549 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-11 01:05:27,587 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-11 01:05:27,655 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at C:\Users\Gareth Thong\Neural Networks and Deep Learning B4\.lr_find_bb10184a-da0c-46e7-929a-0e211ab5958b.ckpt
Restored all states from the checkpoint file at C:\Users\Gareth Thong\Neural Networks and Deep Learning B4\.lr_find_bb10184a-da0c-46e7-929a-0e211ab5958b.ckpt
2023-10-11 01:05:31,987 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-11 01:05:31,987 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


2023-10-11 01:08:05,325 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-11 01:08:05,326 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


2023-10-11 01:08:07,268 - {pytorch_tabular.tabular_model:129} - INFO - Experiment Tracking is turned off
2023-10-11 01:08:07,279 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [29]:
# Calculate R^2
r2 = r2_score(pred_df_last_question['resale_price'], pred_df_last_question['resale_price_prediction'])
print("Improved test R2 for year 2023 is " + str(r2))

Improved test R2 for year 2023 is 0.5530593913505556
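
The retraining above uses one fixed split (train on 2022 and before, test on 2023). If the model were refreshed regularly, the same split could be wrapped in a small helper. The sketch below is hypothetical: `split_by_year` and the toy dataframe are illustrative names and data, not part of the assignment code.

```python
import pandas as pd


def split_by_year(frame, test_year, train_until=None, year_col="year"):
    """Train on rows up to train_until (default: the year before test_year); test on test_year."""
    if train_until is None:
        train_until = test_year - 1
    if train_until >= test_year:
        raise ValueError("training data must precede the test year")
    train = frame[frame[year_col] <= train_until]
    test = frame[frame[year_col] == test_year]
    return train, test


# Toy data standing in for the HDB dataframe.
toy = pd.DataFrame(
    {
        "year": [2019, 2020, 2021, 2022, 2022, 2023],
        "resale_price": [1, 2, 3, 4, 5, 6],
    }
)

train, test = split_by_year(toy, test_year=2023)
print(len(train), "training rows;", len(test), "test rows")
```

The guard against `train_until >= test_year` prevents test-period rows from leaking into training. Note also that the retrained model above is fitted without a validation set; holding out the most recent training year for validation would be one option to consider.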


### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| Year <= 2020 | 2021     | 0.76    |
| Year <= 2020 | **2022**     | 0.41    |
| Year <= 2020 | **2023**     | **0.10**   |



Similarly, a model trained on 2017 data can predict 2018-2021 well (with slight degradation in performance for 2021), but its performance drops drastically in 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |
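
The appendix describes a linear regression baseline with StandardScaler for continuous variables and OneHotEncoder for categorical variables. A minimal sketch of such a pipeline in scikit-learn is shown below. It is an assumption about how the baseline might be assembled, not the co-investigators' actual code, and it runs on synthetic data with made-up coefficients rather than the HDB dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in data: one continuous and one categorical feature.
toy = pd.DataFrame(
    {
        "floor_area_sqm": rng.uniform(60, 150, n),
        "town": rng.choice(["A", "B", "C"], n),
    }
)
town_offset = toy["town"].map({"A": 0.0, "B": 50_000.0, "C": 100_000.0})
toy["resale_price"] = (
    3_000.0 * toy["floor_area_sqm"] + town_offset + rng.normal(0, 5_000, n)
)

# Scale continuous columns and one-hot encode categorical columns.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["floor_area_sqm"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["town"]),
    ]
)
baseline = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

features = ["floor_area_sqm", "town"]
train, test = toy.iloc[:400], toy.iloc[400:]
baseline.fit(train[features], train["resale_price"])
r2 = r2_score(test["resale_price"], baseline.predict(test[features]))
print(f"toy test R2: {r2:.3f}")
```

With the real data, the train and test frames would instead be selected by year, as in the tables above. `handle_unknown="ignore"` keeps the pipeline from failing if a category appears only in the test year.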