CS4001/4042 Assignment 1, Part B, Q1
---

Real world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
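
As a rough illustration of what "encoded or embedded" means, the sketch below maps each category to an integer index and then looks up a small vector for it. The town names come from this dataset; the embedding size of 3, the random initial values and the variable names are assumptions made for the example. In a real network the vectors would be learned during training rather than left random.

```python
import random

random.seed(42)

# A few values of the categorical feature `town`.
towns = ["ANG MO KIO", "YISHUN", "BISHAN", "ANG MO KIO"]

# Step 1: label-encode, giving each distinct category an integer index.
vocab = {name: idx for idx, name in enumerate(sorted(set(towns)))}
encoded = [vocab[name] for name in towns]

# Step 2: embed, keeping one small vector per category.
# EMBED_DIM = 3 is an arbitrary choice for illustration.
EMBED_DIM = 3
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]
embedded = [embedding_table[idx] for idx in encoded]

print(vocab)
print(encoded)
```

Rows with the same town share the same vector, which is what lets the network reuse what it learns about a category across samples.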

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




---



In [2]:
!pip install pytorch_tabular[extra]



In [3]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)



> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [4]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/hdb_price_prediction.csv')
# Split by year: 2019 and earlier for training, 2020 for validation, 2021 for testing.
# Rows from 2022 and 2023 match none of these filters, so they are excluded.
train_data = df[df['year'] <= 2019]
validation_data = df[df['year'] == 2020]
test_data = df[df['year'] == 2021]

# 'year' is only needed for the split, so drop it afterwards.
train_data = train_data.drop(columns=['year'])
validation_data = validation_data.drop(columns=['year'])
test_data = test_data.drop(columns=['year'])

test_data

Unnamed: 0,month,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price
87370,1,ANG MO KIO,170 ANG MO KIO AVENUE 4,Yio Chu Kang,1.276775,8.339960,0.016807,0.002459,"2 ROOM, Improved",64.083333,45.0,01 TO 03,211000.0
87371,1,ANG MO KIO,170 ANG MO KIO AVENUE 4,Yio Chu Kang,1.276775,8.339960,0.016807,0.002459,"2 ROOM, Improved",64.083333,45.0,07 TO 09,225000.0
87372,1,ANG MO KIO,331 ANG MO KIO AVENUE 1,Ang Mo Kio,0.884872,6.981730,0.016807,0.006243,"3 ROOM, New Generation",59.000000,68.0,04 TO 06,260000.0
87373,1,ANG MO KIO,534 ANG MO KIO AVENUE 10,Ang Mo Kio,0.677246,8.333056,0.016807,0.006243,"3 ROOM, New Generation",58.166667,68.0,04 TO 06,265000.0
87374,1,ANG MO KIO,561 ANG MO KIO AVENUE 10,Ang Mo Kio,0.922047,8.009223,0.016807,0.006243,"3 ROOM, New Generation",58.083333,68.0,01 TO 03,265000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
116422,12,YISHUN,502B YISHUN STREET 51,Khatib,0.954699,13.018048,0.016807,0.000968,"5 ROOM, Improved",95.083333,112.0,13 TO 15,720000.0
116423,12,YISHUN,877 YISHUN STREET 81,Khatib,0.475885,12.738721,0.016807,0.000968,"EXECUTIVE, Apartment",65.083333,142.0,01 TO 03,738000.0
116424,12,YISHUN,824 YISHUN STREET 81,Khatib,0.408137,12.745325,0.016807,0.000968,"EXECUTIVE, Maisonette",65.000000,146.0,04 TO 06,755000.0
116425,12,YISHUN,348A YISHUN AVENUE 11,Yishun,0.733238,14.183095,0.016807,0.000382,"5 ROOM, DBSS",90.916667,112.0,10 TO 12,848000.0


> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set `batch_size` to 1024 and `max_epochs` to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [5]:
# TODO: Enter your code here
data_config = DataConfig(
    target=["resale_price"],
    continuous_cols=["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"],
    categorical_cols=["month", "town", "flat_model_type", "storey_range"],
)

trainer_config = TrainerConfig(
    batch_size=1024,
    max_epochs=50,
    auto_lr_find=True
)

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # 1 hidden layer with 50 neurons
)

optimizer_config = OptimizerConfig(optimizer="Adam")

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config
)

# Train on the 2019-and-earlier data, validating on 2020.
tabular_model.fit(train=train_data, validation=validation_data)

# Evaluate on the 2021 test set and keep the predictions for the error analysis.
result = tabular_model.evaluate(test_data)[0]
pred_df = tabular_model.predict(test_data)

test_mse = result['test_mean_squared_error']
test_rmse = np.sqrt(test_mse)
print(f"Test RMSE: {test_rmse}")

test_target_values = test_data["resale_price"]
pred_values = pred_df["resale_price_prediction"]
# R^2 = 1 - MSE / Var(y); np.var uses the population variance, matching the 1/n in the MSE.
test_r2 = 1 - (test_mse / np.var(test_target_values))
print(f"Test R^2: {test_r2}")

2023-10-14 10:09:55,232 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
INFO:pytorch_tabular.tabular_model:Experiment Tracking is turned off
INFO:lightning_fabric.utilities.seed:Global seed set to 42
2023-10-14 10:09:55,263 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
INFO:pytorch_tabular.tabular_model:Preparing the DataLoaders
2023-10-14 10:09:55,271 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
INFO:pytorch_tabular.tabular_datamodule:Setting up the datamodule for regression task
2023-10-14 10:09:55,501 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
INFO:pytorch_tabular.tabular_model:Preparing the Model: CategoryEmbeddingModel
2023-10-14 10:09:55,545 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
INFO:pytorch_tabular.tabular_model:Preparing the Trainer

Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=100` reached.
INFO:pytorch_lightning.tuner.lr_finder:Learning rate set to 0.5754399373371567
INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/.lr_find_1b2bd3f5-75cf-4a71-ab1a-26e88cf0dc2c.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Restored all states from the checkpoint file at /content/.lr_find_1b2bd3f5-75cf-4a71-ab1a-26e88cf0dc2c.ckpt
2023-10-14 10:10:03,565 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
INFO:pytorch_tabular.tabular_model:Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-14 10:10:03,587 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
INFO:pytorch_tabular.tabular_model:Training Started



2023-10-14 10:10:36,967 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
INFO:pytorch_tabular.tabular_model:Training the model completed
2023-10-14 10:10:36,971 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
INFO:pytorch_tabular.tabular_model:Loading the best model



Test RMSE: 76696.95086507677
Test R^2: 0.7776186080988713


> Report the test RMSE and the test R^2 value that you obtained.



Test RMSE: approximately 76,697  
Test R^2: approximately 0.7776
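
The R^2 in the code above is computed as 1 - MSE / Var(y). This matches the usual definition 1 - SS_res / SS_tot because the numerator and the denominator are both divided by the same n (`np.var` returns the population variance by default). The following is a small check on made-up numbers, not values from the dataset:

```python
import math

# Toy targets and predictions, invented for this check.
y_true = [200.0, 300.0, 400.0, 500.0]
y_pred = [220.0, 280.0, 430.0, 470.0]

n = len(y_true)
mean_y = sum(y_true) / n

ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

mse = ss_res / n
var_y = ss_tot / n  # population variance (ddof=0), the np.var default

r2_from_sums = 1 - ss_res / ss_tot
r2_from_mse = 1 - mse / var_y
rmse = math.sqrt(mse)

print(r2_from_sums, r2_from_mse, rmse)
```

The two R^2 values agree, so reporting R^2 from the test MSE and the variance of the test targets is consistent with the textbook formula.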

> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [6]:
# Work on a copy so the added columns do not touch a slice of the original df
test_data = test_data.copy()

# Add the prediction, the signed error (actual - predicted) and its magnitude
test_data['predicted_price'] = pred_values
test_data['error'] = test_target_values - pred_values
test_data['abs_error'] = test_data['error'].abs()

# Find the top 25 test samples with the largest absolute errors
top_25_errors = test_data.nlargest(25, 'abs_error')
top_25_errors

Unnamed: 0,month,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,predicted_price,error,abs_error
92405,11,BUKIT MERAH,46 SENG POH ROAD,Tiong Bahru,0.581977,2.309477,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,780000.0,364686.6,415313.4375,415313.4375
90957,6,BUKIT BATOK,288A BUKIT BATOK STREET 25,Bukit Batok,1.29254,10.763777,0.016807,0.000217,"EXECUTIVE, Apartment",75.583333,144.0,10 TO 12,968000.0,614498.3,353501.6875,353501.6875
112128,12,TAMPINES,156 TAMPINES STREET 12,Tampines,0.370873,12.479752,0.033613,0.000229,"EXECUTIVE, Maisonette",61.75,148.0,01 TO 03,998000.0,655593.7,342406.3125,342406.3125
90608,12,BISHAN,273B BISHAN STREET 24,Bishan,0.776182,6.297489,0.033613,0.015854,"5 ROOM, DBSS",88.833333,120.0,37 TO 39,1360000.0,1020588.0,339411.5,339411.5
106192,12,QUEENSTOWN,89 DAWSON ROAD,Queenstown,0.658035,3.807573,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.333333,109.0,04 TO 06,968000.0,635938.8,332061.1875,332061.1875
91871,6,BUKIT MERAH,17 TIONG BAHRU ROAD,Tiong Bahru,0.693391,2.058774,0.016807,0.047782,"3 ROOM, Standard",50.583333,88.0,01 TO 03,680888.0,358356.8,322531.25,322531.25
93825,8,CENTRAL AREA,4 TANJONG PAGAR PLAZA,Tanjong Pagar,0.451637,2.594828,0.016807,0.103876,"5 ROOM, Adjoined flat",54.583333,118.0,16 TO 18,938000.0,617954.2,320045.75,320045.75
92504,12,BUKIT MERAH,49 KIM PONG ROAD,Tiong Bahru,0.468378,2.365532,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,695000.0,376002.3,318997.71875,318997.71875
105695,6,QUEENSTOWN,91 DAWSON ROAD,Queenstown,0.745596,3.720593,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.916667,97.0,07 TO 09,930000.0,612467.8,317532.25,317532.25
90432,8,BISHAN,275A BISHAN STREET 24,Bishan,0.827889,6.370404,0.033613,0.015854,"5 ROOM, DBSS",88.916667,120.0,25 TO 27,1280000.0,962978.0,317022.0,317022.0
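
One way to look for a trend is to count how often each category value appears among the worst predictions. The sketch below does this for the `flat_model_type` values copied from the ten rows displayed above; the variable names are illustrative.

```python
from collections import Counter

# flat_model_type values copied from the ten rows displayed above.
worst_flat_types = [
    "3 ROOM, Standard",
    "EXECUTIVE, Apartment",
    "EXECUTIVE, Maisonette",
    "5 ROOM, DBSS",
    "4 ROOM, Premium Apartment Loft",
    "3 ROOM, Standard",
    "5 ROOM, Adjoined flat",
    "3 ROOM, Standard",
    "4 ROOM, Premium Apartment Loft",
    "5 ROOM, DBSS",
]

# Tally the categories, most frequent first.
counts = Counter(worst_flat_types)
for flat_type, count in counts.most_common():
    print(f"{count:2d}  {flat_type}")
```

On the full dataframe the same tally should be available through `top_25_errors['flat_model_type'].value_counts()`, and repeating it for `town` or `storey_range` shows whether the errors cluster there as well.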


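The error is calculated as (actual - predicted), and every error in the rows shown is positive. This means that the model is underpredicting the price in each of its worst cases. Most of these flats are fairly large: all are 3 ROOM or larger, with floor areas of 88 sqm or more, and all sold for well above the predicted price. This suggests that the distribution of the data is contributing to the problem, since expensive flats of this kind sit in the upper tail of the price range. One way to reduce these errors is data normalization, applied to the continuous features and possibly to the target as well.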
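
One concrete form of normalization is to train on the logarithm of `resale_price` and convert the predictions back afterwards, which compresses the long upper tail where the largest errors sit. The sketch below shows only the transform and its inverse on four prices taken from the tables above. The function names are hypothetical, and whether this actually reduces the error on this dataset would have to be checked by retraining.

```python
import math

# Resale prices taken from rows shown earlier in this notebook.
prices = [211000.0, 265000.0, 780000.0, 1360000.0]


def to_log_target(price):
    """Map a price to log space before training."""
    return math.log1p(price)


def from_log_target(value):
    """Map a prediction in log space back to dollars."""
    return math.expm1(value)


log_prices = [to_log_target(p) for p in prices]
restored = [from_log_target(v) for v in log_prices]

# The spread shrinks from a ratio of about 6.4x to a gap of under 2 log units.
print(max(prices) / min(prices))
print(max(log_prices) - min(log_prices))
```

Because the inverse transform recovers the original prices, RMSE and R^2 can still be reported in dollars after training in log space.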