CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often have a mix of numeric and categorical features, and this dataset is one example. To build models on such data, categorical features have to be encoded (e.g. one-hot) or embedded as learned vectors.
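To make the difference between encoding and embedding concrete, here is a minimal NumPy sketch on a toy column. The values are borrowed from the `town` feature, the embedding size of 2 is a hypothetical choice, and the embedding matrix is random here, whereas in a network it would be learned together with the other weights.

```python
import numpy as np

# Toy categorical column (values borrowed from the `town` feature).
towns = np.array(["BISHAN", "TAMPINES", "BISHAN", "QUEENSTOWN"])

# Integer-encode: map each distinct category to an index (sorted order).
categories, codes = np.unique(towns, return_inverse=True)

# One-hot encoding: one column per category, a single 1 per row.
one_hot = np.eye(len(categories))[codes]

# Embedding: each category code selects one row of a matrix.
# The matrix is random here; a network would train it like any other weight.
rng = np.random.default_rng(42)
embedding_dim = 2  # hypothetical size, chosen only for illustration
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))
embedded = embedding_matrix[codes]

print(categories, codes)
print(one_hot.shape, embedded.shape)
```

Rows that share a category (the two BISHAN rows) receive the same embedding vector, and the embedded width stays fixed no matter how many categories exist, unlike one-hot encoding.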

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [52]:
# !pip install pytorch_tabular[extra]


In [53]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)


> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [54]:
num_features = [
    "dist_to_nearest_stn",
    "dist_to_dhoby",
    "degree_centrality",
    "eigenvector_centrality",
    "remaining_lease_years",
    "floor_area_sqm",
]

cat_features = [
    "month",
    "town",
    "flat_model_type",
    "storey_range",
]

features = num_features + cat_features

targets = ["resale_price"]

df = pd.read_csv("hdb_price_prediction.csv")

df_train = df[df["year"] <= 2019]
df_val = df[df["year"] == 2020]
df_test = df[df["year"] == 2021]

train = df_train[features + targets]
val = df_val[features + targets]
test = df_test[features + targets]
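A quick sanity check for a year-based split like the one above is to confirm that the three subsets use disjoint years and that 2022 and 2023 never appear. The sketch below runs on a toy frame; `split_by_year` is a hypothetical helper, not part of the assignment code.

```python
import pandas as pd


def split_by_year(df, year_col="year"):
    """Split into train (<= 2019), validation (2020) and test (2021)."""
    train = df[df[year_col] <= 2019]
    val = df[df[year_col] == 2020]
    test = df[df[year_col] == 2021]
    return train, val, test


# Toy data with made-up prices; only the `year` column matters here.
toy = pd.DataFrame(
    {
        "year": [2017, 2019, 2020, 2021, 2022, 2023],
        "resale_price": [300000, 320000, 340000, 380000, 420000, 450000],
    }
)

train_toy, val_toy, test_toy = split_by_year(toy)

# Years used anywhere in the three subsets.
used_years = set(train_toy["year"]) | set(val_toy["year"]) | set(test_toy["year"])
print(sorted(used_years))
print(len(train_toy), len(val_toy), len(test_toy))
```

Rows from 2022 and 2023 fall through all three filters, so they are dropped without an explicit exclusion step.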


> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [55]:
data_config = DataConfig(
    target=targets,
    continuous_cols=num_features,
    categorical_cols=cat_features,
)

trainer_config = TrainerConfig(
    auto_lr_find=True,
    batch_size=1024,
    max_epochs=50,
)

model_config = CategoryEmbeddingModelConfig(task="regression", layers="50")

optimizer_config = OptimizerConfig(optimizer="Adam")

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)


2023-10-12 15:39:45,569 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
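As a rough picture of what "a feedforward network with 1 hidden layer of 50 neurons on embedded categories" computes, the NumPy sketch below runs one forward pass. It is not PyTorch Tabular's implementation: the category counts (other than 12 months), the embedding-size rule of thumb, the ReLU activation and the random weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_continuous = 6  # the six numeric features listed earlier
# Hypothetical category counts for month, town, flat_model_type, storey_range.
cardinalities = [12, 26, 40, 17]
# A common rule of thumb for embedding sizes (assumed, not read from the library).
embedding_dims = [min(50, (c + 1) // 2) for c in cardinalities]

# One embedding matrix per categorical feature (random stand-ins for learned weights).
embeddings = [rng.normal(size=(c, d)) for c, d in zip(cardinalities, embedding_dims)]

input_dim = n_continuous + sum(embedding_dims)
W1 = rng.normal(size=(input_dim, 50)) * 0.1  # hidden layer: 50 neurons
b1 = np.zeros(50)
W2 = rng.normal(size=(50, 1)) * 0.1          # output layer: one regression value
b2 = np.zeros(1)


def forward(x_cont, x_cat):
    """x_cont: (batch, 6) floats; x_cat: (batch, 4) integer category codes."""
    embedded = [emb[x_cat[:, i]] for i, emb in enumerate(embeddings)]
    x = np.concatenate([x_cont] + embedded, axis=1)
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation (assumed)
    return hidden @ W2 + b2


batch = 3
x_cont = rng.normal(size=(batch, n_continuous))
x_cat = np.column_stack([rng.integers(0, c, size=batch) for c in cardinalities])
output = forward(x_cont, x_cat)
print(input_dim, output.shape)
```

The point of the sketch is the data flow: each categorical code is looked up in its own embedding matrix, the results are concatenated with the continuous features, and a single hidden layer maps that vector to one predicted price.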


> Report the test RMSE error and the test R2 value that you obtained.



In [56]:
tabular_model.fit(
    train=train,
    validation=val,
    seed=SEED,
)

pred = tabular_model.predict(test)

y_test = np.array(pred["resale_price"])
y_pred = np.array(pred["resale_price_prediction"])

ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)

rmse = np.sqrt(ss_res / len(y_test))
r2 = 1 - (ss_res / ss_tot)

print(f"Test RMSE: {rmse}")
print(f"Test R2: {r2}")


Global seed set to 42
2023-10-12 15:39:45,635 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-12 15:39:45,638 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-12 15:39:45,707 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-12 15:39:45,733 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-12 15:39:45,781 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started
`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at e:\Dev\NTU-SC4001\PartB\.lr_find_cf77f3e4-0e3c-4b48-a4bd-3b82b6bca6ca.ckpt
Restored all states from the checkpoint file at e:\Dev\NTU-SC4001\PartB\.lr_find_cf77f3e4-0e3c-4b48-a4bd-3b82b6bca6ca.ckpt
2023-10-12 15:39:48,943 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-12 15:39:48,945 - {pytorch_tabular.tabular_model:582} - INFO - Training Started



2023-10-12 15:40:09,794 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-12 15:40:09,794 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model




Test RMSE: 76696.92723313595
Test R2: 0.7776187451396016
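For reference, the two metrics reported above are RMSE = sqrt(SS_res / n) and R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below wraps the same arithmetic in a hypothetical helper, `regression_metrics`, and checks it on toy values where the answers are known.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (rmse, r2) computed from residual and total sums of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(ss_res / len(y_true))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2


# Perfect predictions: zero error, R2 of 1.
rmse_perfect, r2_perfect = regression_metrics([1, 2, 3], [1, 2, 3])

# Always predicting the mean of the targets: R2 of 0.
rmse_mean, r2_mean = regression_metrics([1, 2, 3], [2, 2, 2])

print(rmse_perfect, r2_perfect)
print(rmse_mean, r2_mean)
```

An R2 of 0 corresponds to a model that always predicts the mean target, which gives a baseline for reading the test R2 above.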


> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [62]:
pred["error"] = abs(pred["resale_price"] - pred["resale_price_prediction"])
pred_sorted = pred.sort_values(by="error", ascending=False)
pred_sorted_25 = pred_sorted.head(25)
pred_sorted_25


| | dist_to_nearest_stn | dist_to_dhoby | degree_centrality | eigenvector_centrality | remaining_lease_years | floor_area_sqm | month | town | flat_model_type | storey_range | resale_price | resale_price_prediction | error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 92405 | 0.581977 | 2.309477 | 0.016807 | 0.047782 | 50.166667 | 88.0 | 11 | BUKIT MERAH | 3 ROOM, Standard | 01 TO 03 | 780000.0 | 364686.5 | 415313.5 |
| 90957 | 1.29254 | 10.763777 | 0.016807 | 0.000217 | 75.583333 | 144.0 | 6 | BUKIT BATOK | EXECUTIVE, Apartment | 10 TO 12 | 968000.0 | 614498.2 | 353501.75 |
| 112128 | 0.370873 | 12.479752 | 0.033613 | 0.000229 | 61.75 | 148.0 | 12 | TAMPINES | EXECUTIVE, Maisonette | 01 TO 03 | 998000.0 | 655593.6 | 342406.4375 |
| 90608 | 0.776182 | 6.297489 | 0.033613 | 0.015854 | 88.833333 | 120.0 | 12 | BISHAN | 5 ROOM, DBSS | 37 TO 39 | 1360000.0 | 1020589.0 | 339411.4375 |
| 106192 | 0.658035 | 3.807573 | 0.016807 | 0.008342 | 93.333333 | 109.0 | 12 | QUEENSTOWN | 4 ROOM, Premium Apartment Loft | 04 TO 06 | 968000.0 | 635938.8 | 332061.1875 |
| 91871 | 0.693391 | 2.058774 | 0.016807 | 0.047782 | 50.583333 | 88.0 | 6 | BUKIT MERAH | 3 ROOM, Standard | 01 TO 03 | 680888.0 | 358356.8 | 322531.25 |
| 93825 | 0.451637 | 2.594828 | 0.016807 | 0.103876 | 54.583333 | 118.0 | 8 | CENTRAL AREA | 5 ROOM, Adjoined flat | 16 TO 18 | 938000.0 | 617954.1 | 320045.875 |
| 92504 | 0.468378 | 2.365532 | 0.016807 | 0.047782 | 50.166667 | 88.0 | 12 | BUKIT MERAH | 3 ROOM, Standard | 01 TO 03 | 695000.0 | 376002.2 | 318997.75 |
| 105695 | 0.745596 | 3.720593 | 0.016807 | 0.008342 | 93.916667 | 97.0 | 6 | QUEENSTOWN | 4 ROOM, Premium Apartment Loft | 07 TO 09 | 930000.0 | 612467.8 | 317532.25 |
| 90432 | 0.827889 | 6.370404 | 0.033613 | 0.015854 | 88.916667 | 120.0 | 8 | BISHAN | 5 ROOM, DBSS | 25 TO 27 | 1280000.0 | 962978.1 | 317021.875 |
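The absolute error hides the direction of each miss. Keeping a signed residual next to it shows whether the worst predictions are too low or too high. The sketch below uses a toy frame with the same column names as `pred`; two rows reuse values from the output above and the other two are made up.

```python
import pandas as pd

# Toy stand-in for `pred`: rows 0 and 2 reuse values from the table above,
# rows 1 and 3 are invented to have small errors.
toy_pred = pd.DataFrame(
    {
        "resale_price": [780000.0, 500000.0, 968000.0, 300000.0],
        "resale_price_prediction": [364686.5, 520000.0, 614498.2, 310000.0],
    }
)

# Positive residual: the model predicted too low (under-prediction).
toy_pred["residual"] = toy_pred["resale_price"] - toy_pred["resale_price_prediction"]
toy_pred["error"] = toy_pred["residual"].abs()

# Rows with the largest absolute errors, and the share that are under-predictions.
top = toy_pred.nlargest(2, "error")
share_under = (top["residual"] > 0).mean()

print(top[["resale_price", "resale_price_prediction", "residual"]])
print(share_under)
```

Applied to the real `pred` frame with `nlargest(25, "error")`, the same two columns would show how many of the 25 worst predictions fall below the actual price.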


In [63]:
for cat in cat_features:
    print(f"{pred_sorted_25[cat].value_counts()}\n")


month
12    6
6     4
8     4
10    4
11    3
9     2
4     1
3     1
Name: count, dtype: int64

town
BUKIT MERAH     10
QUEENSTOWN       6
BISHAN           3
CENTRAL AREA     2
BUKIT BATOK      1
TAMPINES         1
ANG MO KIO       1
HOUGANG          1
Name: count, dtype: int64

flat_model_type
3 ROOM, Standard                  6
4 ROOM, Premium Apartment Loft    6
5 ROOM, Improved                  6
EXECUTIVE, Apartment              2
5 ROOM, DBSS                      2
5 ROOM, Adjoined flat             2
EXECUTIVE, Maisonette             1
Name: count, dtype: int64

storey_range
01 TO 03    7
07 TO 09    3
28 TO 30    3
10 TO 12    2
04 TO 06    2
16 TO 18    2
25 TO 27    2
37 TO 39    1
31 TO 33    1
13 TO 15    1
34 TO 36    1
Name: count, dtype: int64
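Raw counts among the 25 worst rows can mislead when a category is simply common in the test set. One way to account for this is to divide each category's share of the worst rows by its share of all test rows. The helper `overrepresentation` below is hypothetical, and the toy towns `A` and `B` are made up.

```python
import pandas as pd


def overrepresentation(all_rows, worst_rows, column):
    """Share of each category among the worst rows divided by its share overall.

    A ratio above 1 means the category appears among the worst rows more often
    than its overall frequency would suggest.
    """
    base = all_rows[column].value_counts(normalize=True)
    worst = worst_rows[column].value_counts(normalize=True)
    # Categories absent from the worst rows become NaN and are dropped.
    return (worst / base).dropna().sort_values(ascending=False)


# Toy data: town "A" is common overall, town "B" is rare.
all_rows = pd.DataFrame({"town": ["A"] * 8 + ["B"] * 2})
worst_rows = pd.DataFrame({"town": ["A", "B"]})

ratio = overrepresentation(all_rows, worst_rows, "town")
print(ratio)
```

With the notebook's variables, `overrepresentation(pred, pred_sorted_25, "town")` would give the ratio for each town, and the same call works for the other categorical columns.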



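Trends:
1. Large errors cluster late in the year: December accounts for 6 of the 25 largest errors, and months 8 to 12 together account for 19
2. Resale prices are less predictable in some towns: BUKIT MERAH (10) and QUEENSTOWN (6) account for 16 of the 25 largest errors
3. Errors concentrate in a few flat model types: "3 ROOM, Standard", "4 ROOM, Premium Apartment Loft" and "5 ROOM, Improved" each account for 6 of the 25
4. The lowest storey range, 01 TO 03, is the most frequent (7 of 25), although the remaining errors are spread across ranges up to 37 TO 39
5. In all ten rows displayed above, the actual resale price is higher than the prediction, i.e. the model under-predicts these flats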

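To reduce the errors, we can:
1. Train the model on more data, especially for the towns and flat model types that dominate the largest errors
2. Tune hyperparameters (e.g. the number and width of the hidden layers)
3. Apply early stopping, weight regularization and/or dropout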
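As an illustration of the early-stopping suggestion, here is a library-independent sketch of the usual patience rule: stop once the validation loss has failed to improve for `patience` consecutive epochs, and keep the weights from the best epoch. The function name, its parameters and the loss values are all hypothetical. In practice the training library's own early-stopping option would be used instead.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has not improved by more
    than `min_delta` for `patience` consecutive epochs. If that never happens,
    the last epoch is returned as the stop epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
losses = [10.0, 8.0, 7.5, 7.6, 7.7, 7.8, 7.4]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In this toy run the loss would have improved again at epoch 6, which shows the trade-off in choosing `patience`: a small value saves time but can stop before a later improvement.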