CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often mix numeric and categorical features, and this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
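As a rough illustration of what "encoded" means here, the sketch below maps a toy list of town names to integer indices and to one-hot vectors using plain Python. The toy list and the variable names are assumptions made for illustration; PyTorch Tabular handles this step internally, and this is not its implementation.

```python
# Toy values for one categorical feature (hypothetical sample, not read from the CSV).
towns = ["ANG MO KIO", "YISHUN", "BISHAN", "YISHUN"]

# Build a vocabulary: each distinct category gets an integer index.
vocab = {name: idx for idx, name in enumerate(sorted(set(towns)))}

# Integer encoding: the kind of input an embedding layer looks up.
indices = [vocab[t] for t in towns]

# One-hot encoding: one binary column per category.
one_hot = [[1 if vocab[t] == j else 0 for j in range(len(vocab))] for t in towns]

print(vocab)
print(indices)
print(one_hot)
```

An embedding replaces each one-hot row with a learned dense vector, which keeps the input small when a feature has many categories.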

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [None]:
!pip install pytorch_tabular[extra]



In [None]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)



> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.
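A minimal sketch of this year-based split on a toy frame, assuming the data has a `year` column as the real CSV appears to. The toy values are made up; the point is that rows from 2022 and 2023 end up in none of the three sets.

```python
import pandas as pd

# Toy frame standing in for hdb_price_prediction.csv (values are hypothetical).
toy = pd.DataFrame({
    "year": [2017, 2019, 2020, 2021, 2022, 2023],
    "resale_price": [300000.0, 320000.0, 350000.0, 400000.0, 450000.0, 470000.0],
})

train_toy = toy[toy["year"] <= 2019]   # 2019 and before
val_toy = toy[toy["year"] == 2020]     # 2020 only
test_toy = toy[toy["year"] == 2021]    # 2021 only

# Rows from 2022 and 2023 match none of the filters above.
used = pd.concat([train_toy, val_toy, test_toy])
print(len(train_toy), len(val_toy), len(test_toy))
print(sorted(int(y) for y in used["year"].unique()))
```

Splitting by time rather than at random means the model is always evaluated on years it has not seen.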



In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
df = pd.read_csv('/content/drive/MyDrive/programming_assignment/hdb_price_prediction.csv')
df.drop(columns=['full_address', 'nearest_stn'], inplace=True)

# TODO: Enter your code here
# Split chronologically by year; 2022 and 2023 match none of the filters and are excluded.
train = df[df['year'] <= 2019].drop(columns=['year'])
val = df[df['year'] == 2020].drop(columns=['year'])
test = df[df['year'] == 2021].drop(columns=['year'])

> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set `batch_size` to 1024 and `max_epochs` to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [None]:
# TODO: Enter your code here
num_col_names = [
    'dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality',
    'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm',
]
cat_col_names = ['month', 'town', 'flat_model_type', 'storey_range']

data_config = DataConfig(
    target=[
        "resale_price"
    ],  # target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)
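optimizer_config = OptimizerConfig(optimizer="Adam")

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # One hidden layer with 50 neurons (layer sizes are hyphen-separated)
    activation="LeakyReLU",  # Activation applied after each layer
    learning_rate=1e-3,  # Initial value only; replaced by the LR finder since auto_lr_find=True
)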

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=train, validation=val)
result = tabular_model.evaluate(test)
pred_df = tabular_model.predict(test)
# tabular_model.save_model("examples/basic")
# loaded_model = TabularModel.load_from_checkpoint("examples/basic")
print(result)
pred_df

2023-10-08 08:52:36,952 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
INFO:lightning_fabric.utilities.seed:Global seed set to 42
2023-10-08 08:52:36,977 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-08 08:52:36,982 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-08 08:52:37,108 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-08 08:52:37,137 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=100` reached.
INFO:pytorch_lightning.tuner.lr_finder:Learning rate set to 0.5754399373371567
INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/.lr_find_32bd4243-8fc5-4299-b388-10f016f5a897.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Restored all states from the checkpoint file at /content/.lr_find_32bd4243-8fc5-4299-b388-10f016f5a897.ckpt
2023-10-08 08:52:40,701 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-08 08:52:40,705 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
2023-10-08 08:53:30,037 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-08 08:53:30,041 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model

[{'test_loss': 5050981376.0, 'test_mean_squared_error': 5050981376.0}]


Unnamed: 0,month,town,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,resale_price_prediction
87370,1,ANG MO KIO,1.276775,8.339960,0.016807,0.002459,"2 ROOM, Improved",64.083333,45.0,01 TO 03,211000.0,195177.703125
87371,1,ANG MO KIO,1.276775,8.339960,0.016807,0.002459,"2 ROOM, Improved",64.083333,45.0,07 TO 09,225000.0,219794.296875
87372,1,ANG MO KIO,0.884872,6.981730,0.016807,0.006243,"3 ROOM, New Generation",59.000000,68.0,04 TO 06,260000.0,287758.750000
87373,1,ANG MO KIO,0.677246,8.333056,0.016807,0.006243,"3 ROOM, New Generation",58.166667,68.0,04 TO 06,265000.0,282527.000000
87374,1,ANG MO KIO,0.922047,8.009223,0.016807,0.006243,"3 ROOM, New Generation",58.083333,68.0,01 TO 03,265000.0,262030.343750
...,...,...,...,...,...,...,...,...,...,...,...,...
116422,12,YISHUN,0.954699,13.018048,0.016807,0.000968,"5 ROOM, Improved",95.083333,112.0,13 TO 15,720000.0,573646.437500
116423,12,YISHUN,0.475885,12.738721,0.016807,0.000968,"EXECUTIVE, Apartment",65.083333,142.0,01 TO 03,738000.0,565859.937500
116424,12,YISHUN,0.408137,12.745325,0.016807,0.000968,"EXECUTIVE, Maisonette",65.000000,146.0,04 TO 06,755000.0,638194.250000
116425,12,YISHUN,0.733238,14.183095,0.016807,0.000382,"5 ROOM, DBSS",90.916667,112.0,10 TO 12,848000.0,646337.187500


> Report the test RMSE error and the test R2 value that you obtained.



\# TODO: \<Enter your answer here\>

The test RMSE is approximately 71,070 and the test R2 is approximately 0.809.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(pred_df['resale_price'], pred_df['resale_price_prediction']))
r2 = r2_score(pred_df['resale_price'], pred_df['resale_price_prediction'])

print(f'RMSE: {rmse}')
print(f'R^2: {r2}')

RMSE: 71070.25607074641
R^2: 0.8090507278177188
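For reference, the two metrics can be written out by hand. The sketch below is a plain-Python version under the usual definitions (RMSE as the square root of the mean squared residual, R2 as 1 - SS_res / SS_tot). The helper names and the toy values are assumptions for illustration. It also cross-checks that the square root of the logged `test_mean_squared_error` lands close to the RMSE printed above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Cross-check against the test MSE reported by tabular_model.evaluate(test).
logged_test_mse = 5050981376.0
rmse_from_log = math.sqrt(logged_test_mse)
print(rmse_from_log)

# Toy values (hypothetical) to exercise both helpers.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

The small gap between `rmse_from_log` and the scikit-learn value is plausibly due to the logged MSE being stored at lower floating-point precision.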


> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [None]:
# TODO: Enter your code here
# Absolute error per test sample, then the 25 rows with the largest error
pred_df['absolute_diff'] = (pred_df['resale_price'] - pred_df['resale_price_prediction']).abs()
top25_largesterrors = pred_df.sort_values(by='absolute_diff', ascending=False).head(25)
top25_largesterrors

Unnamed: 0,month,town,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,resale_price_prediction,absolute_diff
92405,11,BUKIT MERAH,0.581977,2.309477,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,780000.0,406482.5,373517.5
90957,6,BUKIT BATOK,1.29254,10.763777,0.016807,0.000217,"EXECUTIVE, Apartment",75.583333,144.0,10 TO 12,968000.0,631970.8,336029.1875
112128,12,TAMPINES,0.370873,12.479752,0.033613,0.000229,"EXECUTIVE, Maisonette",61.75,148.0,01 TO 03,998000.0,668921.5,329078.5
90608,12,BISHAN,0.776182,6.297489,0.033613,0.015854,"5 ROOM, DBSS",88.833333,120.0,37 TO 39,1360000.0,1045015.0,314985.4375
90521,10,BISHAN,0.947205,6.663943,0.033613,0.015854,"5 ROOM, Improved",69.583333,121.0,07 TO 09,988000.0,689537.4,298462.5625
114254,9,WOODLANDS,1.915461,16.660245,0.016807,2.4e-05,"EXECUTIVE, Maisonette",75.083333,141.0,10 TO 12,800000.0,501543.4,298456.59375
92442,11,BUKIT MERAH,0.686789,2.664024,0.016807,0.047782,"5 ROOM, Improved",90.333333,113.0,16 TO 18,1165000.0,867848.7,297151.3125
98379,12,HOUGANG,0.899849,8.828235,0.016807,0.001507,"EXECUTIVE, Apartment",63.666667,142.0,04 TO 06,873000.0,585980.3,287019.6875
92340,10,BUKIT MERAH,0.451387,2.128424,0.016807,0.047782,"5 ROOM, Improved",90.75,114.0,34 TO 36,1245000.0,961102.9,283897.0625
91871,6,BUKIT MERAH,0.693391,2.058774,0.016807,0.047782,"3 ROOM, Standard",50.583333,88.0,01 TO 03,680888.0,401848.9,279039.0625


In [None]:
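# For contrast: the 25 test samples with the smallest absolute errors
top25_smallesterrors = pred_df.sort_values(by='absolute_diff', ascending=True).head(25)
top25_smallesterrors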

Unnamed: 0,month,town,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,resale_price_prediction,absolute_diff
96233,2,GEYLANG,0.439622,5.707583,0.033613,0.011178,"2 ROOM, Standard",51.5,46.0,04 TO 06,210000.0,210009.765625,9.765625
88172,10,ANG MO KIO,0.701473,8.162365,0.016807,0.006243,"3 ROOM, New Generation",57.916667,85.0,01 TO 03,300000.0,300010.46875,10.46875
90305,6,BISHAN,0.905374,6.363997,0.016807,0.013555,"4 ROOM, Model A",70.166667,103.0,01 TO 03,540000.0,540015.1875,15.1875
100475,1,KALLANG/WHAMPOA,0.213585,2.726393,0.016807,0.053004,"4 ROOM, Improved",52.083333,88.0,07 TO 09,450000.0,450016.5,16.5
103182,4,PUNGGOL,0.57786,12.730192,0.016807,9e-05,"4 ROOM, Model A",92.083333,92.0,01 TO 03,415000.0,415024.9375,24.9375
88881,4,BEDOK,0.149219,9.119765,0.016807,0.000698,"4 ROOM, New Generation",58.583333,91.0,01 TO 03,390000.0,389972.625,27.375
91842,6,BUKIT MERAH,0.513485,4.680263,0.016807,0.018783,"3 ROOM, New Generation",54.333333,73.0,10 TO 12,368000.0,367971.84375,28.15625
112668,8,TOA PAYOH,0.922948,5.333597,0.016807,0.004897,"3 ROOM, Standard",45.916667,57.0,07 TO 09,246000.0,245967.734375,32.265625
97175,3,HOUGANG,1.104764,9.434844,0.016807,0.001507,"4 ROOM, Model A",76.083333,101.0,04 TO 06,395000.0,394963.0625,36.9375
92030,8,BUKIT MERAH,0.506883,2.297155,0.016807,0.047782,"3 ROOM, Improved",50.5,63.0,10 TO 12,345888.0,345925.53125,37.53125


\# TODO: \<Enter your answer here\>

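In the rows shown, the largest errors fall mostly in the later months of the year (8 to 12), and in each of them the model predicts well below the actual resale price. To reduce these errors, more training data could be collected for these under-represented cases, or the existing data could be resampled so that they carry more weight during training.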
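One way to check a trend like this rather than reading it off by eye is to group the largest-error rows by month and look at the sign of the error. The sketch below uses the ten largest-error rows displayed above (month, actual price, predicted price). The column `signed_error` and the variables `late_share` and `under_share` are names introduced here for illustration. On the real data the same steps would be applied to all 25 rows of `top25_largesterrors`.

```python
import pandas as pd

# The ten largest-error rows shown in the output above.
rows = pd.DataFrame({
    "month": [11, 6, 12, 12, 10, 9, 11, 12, 10, 6],
    "resale_price": [780000.0, 968000.0, 998000.0, 1360000.0, 988000.0,
                     800000.0, 1165000.0, 873000.0, 1245000.0, 680888.0],
    "resale_price_prediction": [406482.5, 631970.8, 668921.5, 1045015.0, 689537.4,
                                501543.4, 867848.7, 585980.3, 961102.9, 401848.9],
})

# Positive signed error means the model predicted below the actual price.
rows["signed_error"] = rows["resale_price"] - rows["resale_price_prediction"]

# Share of these rows in months 8 to 12, and share that are underpredictions.
late_share = float((rows["month"] >= 8).mean())
under_share = float((rows["signed_error"] > 0).mean())

# Count and mean signed error per month.
per_month = rows.groupby("month")["signed_error"].agg(["count", "mean"])
print(per_month)
print(late_share, under_share)
```

Grouping by other columns, such as `town` or `flat_model_type`, would show whether the errors also cluster there.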