CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often mix numeric and categorical features, and this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
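As a rough illustration of what "encoded or embedded" means, the sketch below maps each category to an integer index and then looks up a vector for that index. The helper names (`build_vocab`, `encode`, `make_embedding_table`) are hypothetical and not part of PyTorch Tabular, and the vectors here are random placeholders; in a real model the embedding table is learned during training.

```python
import random


def build_vocab(values):
    """Map each distinct category to an integer index; 0 is reserved for unseen values."""
    return {cat: i for i, cat in enumerate(sorted(set(values)), start=1)}


def encode(values, vocab):
    """Replace each category by its index, falling back to 0 for unknown categories."""
    return [vocab.get(v, 0) for v in values]


def make_embedding_table(num_categories, dim, seed=42):
    """Create one placeholder vector per index (row 0 is for unseen categories)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_categories + 1)]


towns = ["BISHAN", "QUEENSTOWN", "CENTRAL AREA", "BISHAN"]
vocab = build_vocab(towns)
indices = encode(towns + ["UNSEEN TOWN"], vocab)
table = make_embedding_table(len(vocab), dim=3)
vectors = [table[i] for i in indices]

print(indices)  # [1, 3, 2, 1, 0]
```

Rows with the same category share the same vector, which is what lets the network learn one representation per town, flat model type, and so on.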

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular

Defaulting to user installation because normal site-packages is not writeable


In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

import math
from sklearn.metrics import mean_squared_error, r2_score
import torch



> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
train_df = df[df['year'] <= 2019]
val_df = df[df['year'] == 2020]
test_df = df[df['year'] == 2021]

# Remove the features not used to train the model
train_df = train_df.drop(columns = ['year', 'nearest_stn', 'full_address'])
val_df = val_df.drop(columns = ['year', 'nearest_stn', 'full_address'])
test_df = test_df.drop(columns = ['year', 'nearest_stn', 'full_address'])

> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [4]:
# TODO: Enter your code here

data_config = DataConfig(
    target=["resale_price"],
    continuous_cols=[
        "dist_to_nearest_stn",
        "dist_to_dhoby",
        "degree_centrality",
        "eigenvector_centrality",
        "remaining_lease_years",
        "floor_area_sqm",
    ],
    categorical_cols=["month", "town", "flat_model_type", "storey_range"],
)

trainer_config = TrainerConfig(batch_size=1024, max_epochs=50)

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # One hidden layer with 50 neurons
)

optimizer_config = OptimizerConfig(
    optimizer="Adam"
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

2023-10-13 01:03:05,656 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
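The task asks for the learning rate to be tuned automatically, but the `TrainerConfig` above does not request this explicitly. A minimal sketch of how it could be enabled, assuming the installed PyTorch Tabular version exposes an `auto_lr_find` flag on `TrainerConfig`:

```python
from pytorch_tabular.config import TrainerConfig

# Assumption: `auto_lr_find` is available on TrainerConfig in the installed
# version; when set, a learning-rate finder is run before training starts.
trainer_config = TrainerConfig(
    auto_lr_find=True,
    batch_size=1024,
    max_epochs=50,
)
```

The metrics reported in this notebook come from the configuration as originally written, so re-running with this variant would likely give different numbers.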


In [5]:
# Fit model
tabular_model.fit(train=train_df, validation=val_df)

Global seed set to 42
2023-10-13 01:03:05,848 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-13 01:03:05,853 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-13 01:03:05,943 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-13 01:03:05,989 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-13 01:03:06,081 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


`Trainer.fit` stopped: `max_epochs=50` reached.


2023-10-13 01:07:01,815 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-13 01:07:01,815 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


<pytorch_lightning.trainer.trainer.Trainer at 0x293d1d910>

In [6]:
torch.save(tabular_model, 'b1_model')

In [11]:
result = tabular_model.evaluate(test_df)
pred_df = tabular_model.predict(test_df)

> Report the test RMSE and the test R2 value that you obtained.



In [12]:
mse = result[0]['test_mean_squared_error']
rmse = math.sqrt(mse)

r2 = r2_score(y_true=pred_df['resale_price'], y_pred=pred_df['resale_price_prediction'])

print("Root Mean Squared Error: ", rmse)
print("R2 Score: ", r2)

Root Mean Squared Error:  306891.6452430727
R2 Score:  -2.5605117375142203


\# TODO: \<Enter your answer here\>

Root Mean Squared Error:  306891.6452430727 <br>
R2 Score:  -2.5605117375142203
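
A negative R2 means the model does worse on the test set than a constant prediction equal to the mean test price, since R2 = 1 - SS_res / SS_tot and SS_res exceeds SS_tot in that case. The sketch below works through both metrics by hand; the prices are illustrative toy numbers, not values from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative toy prices (not from hdb_price_prediction.csv).
y_true = [300000.0, 400000.0, 500000.0]
close_preds = [310000.0, 390000.0, 505000.0]     # small errors
low_preds = [100000.0, 120000.0, 150000.0]       # systematically far too low
mean_preds = [sum(y_true) / len(y_true)] * 3     # constant mean baseline

for name, preds in [("close", close_preds), ("low", low_preds), ("mean", mean_preds)]:
    print(name, "RMSE:", rmse(y_true, preds), "R2:", r2(y_true, preds))
```

Predictions that are systematically far too low give a negative R2, while the mean baseline scores exactly 0.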

> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [13]:
# TODO: Enter your code here
pred_df["error"] = (pred_df["resale_price"] - pred_df["resale_price_prediction"]).abs()
sorted_df = pred_df.sort_values(by=["error"], ascending=False)
sorted_df.head(25)

| index | month | town | dist_to_nearest_stn | dist_to_dhoby | degree_centrality | eigenvector_centrality | flat_model_type | remaining_lease_years | floor_area_sqm | storey_range | resale_price | resale_price_prediction | error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 106199 | 12 | QUEENSTOWN | 0.584731 | 3.882019 | 0.016807 | 0.008342 | 5 ROOM, Premium Apartment Loft | 93.333333 | 122.0 | 40 TO 42 | 1328000.0 | 173652.109375 | 1154348.0 |
| 90608 | 12 | BISHAN | 0.776182 | 6.297489 | 0.033613 | 0.015854 | 5 ROOM, DBSS | 88.833333 | 120.0 | 37 TO 39 | 1360000.0 | 244618.953125 | 1115381.0 |
| 93930 | 12 | CENTRAL AREA | 0.438348 | 2.506568 | 0.033613 | 0.121082 | 5 ROOM, Type S2 | 88.166667 | 107.0 | 46 TO 48 | 1280000.0 | 204889.765625 | 1075110.0 |
| 93931 | 12 | CENTRAL AREA | 0.352779 | 2.413099 | 0.033613 | 0.121082 | 5 ROOM, Type S2 | 88.083333 | 107.0 | 40 TO 42 | 1288000.0 | 224400.125 | 1063600.0 |
| 100836 | 6 | KALLANG/WHAMPOA | 0.998313 | 3.304953 | 0.016807 | 0.053004 | 3 ROOM, Terrace | 50.083333 | 210.0 | 01 TO 03 | 1268000.0 | 213496.90625 | 1054503.0 |
| 90483 | 9 | BISHAN | 0.767244 | 6.327956 | 0.033613 | 0.015854 | 5 ROOM, DBSS | 89.0 | 120.0 | 37 TO 39 | 1295000.0 | 243080.625 | 1051919.0 |
| 93929 | 12 | CENTRAL AREA | 0.352779 | 2.413099 | 0.033613 | 0.121082 | 5 ROOM, Type S2 | 88.083333 | 106.0 | 43 TO 45 | 1254000.0 | 202370.8125 | 1051629.0 |
| 93904 | 11 | CENTRAL AREA | 0.401367 | 2.445314 | 0.033613 | 0.121082 | 5 ROOM, Type S2 | 88.333333 | 106.0 | 40 TO 42 | 1261000.0 | 229243.78125 | 1031756.0 |
| 90432 | 8 | BISHAN | 0.827889 | 6.370404 | 0.033613 | 0.015854 | 5 ROOM, DBSS | 88.916667 | 120.0 | 25 TO 27 | 1280000.0 | 249152.75 | 1030847.0 |
| 101087 | 9 | KALLANG/WHAMPOA | 0.987682 | 3.383526 | 0.016807 | 0.053004 | 3 ROOM, Terrace | 49.833333 | 241.0 | 01 TO 03 | 1235000.0 | 210541.546875 | 1024458.0 |


\# TODO: \<Enter your answer here\>

These rows all have a degree_centrality of either 0.016807 or 0.033613. Printing `sorted_df['degree_centrality'].value_counts()` gives the distribution over the test set:

| degree_centrality | count |
| -------- | ------- |
| 0.016807 | 24499 |
| 0.033613 | 1993 |
| 0.008403 | 1923 |
| 0.025210 | 642 |

Most of the data has a degree_centrality of 0.016807 or 0.033613. Hence this feature may make little real difference to the predictions, and we can try reducing the weight associated with degree_centrality. We could also obtain more data with a wider variety of degree_centrality values and retrain the model.

Additionally, most of the rows with the highest error have floor_area_sqm >= 100 and remaining_lease_years >= 80, which could mean that these two features are not that indicative of the actual resale price.
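
One way to check whether a feature value is genuinely over-represented among the worst predictions, rather than just common everywhere, is to compare its share in the high-error rows with its share in the whole test set. The sketch below does this for degree_centrality, using the counts from the table above and the values of the ten high-error rows visible in the output; the variable names are illustrative.

```python
# Counts over the whole test set, copied from the value_counts() table above.
overall_counts = {0.016807: 24499, 0.033613: 1993, 0.008403: 1923, 0.025210: 642}

# degree_centrality of the ten high-error rows visible in the output above.
worst_values = [0.016807, 0.033613, 0.033613, 0.033613, 0.016807,
                0.033613, 0.033613, 0.033613, 0.033613, 0.016807]


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


worst_counts = {}
for value in worst_values:
    worst_counts[value] = worst_counts.get(value, 0) + 1

overall_share = shares(overall_counts)
worst_share = shares(worst_counts)

# A ratio above 1 means the value appears more often among the largest
# errors than its overall frequency would suggest.
lift = {value: worst_share.get(value, 0.0) / overall_share[value]
        for value in overall_counts}

for value, ratio in sorted(lift.items()):
    print(f"{value:.6f}: {ratio:.2f}x")
```

A ratio well above 1 points to a value worth investigating further, while a ratio near or below 1 suggests the value is simply common in the data. The same comparison can be repeated for categorical columns such as town or flat_model_type.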

In [17]:
sorted_df['degree_centrality'].value_counts()

degree_centrality
0.016807    24499
0.033613     1993
0.008403     1923
0.025210      642
Name: count, dtype: int64