CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
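As a minimal illustration of the difference between encoding and embedding, the sketch below maps a toy categorical column to integer indices, then to one-hot vectors and to rows of a small embedding table. The category values, the embedding size of 2 and the random initialisation are assumptions made for the example; in a real model the embedding table is learned during training.

```python
import random

# Toy categorical column (illustrative values, not taken from the dataset).
towns = ["BEDOK", "YISHUN", "BEDOK", "TAMPINES"]

# Step 1: ordinal encoding. Map each distinct category to an integer index.
vocab = {cat: idx for idx, cat in enumerate(sorted(set(towns)))}
indices = [vocab[t] for t in towns]

# Step 2a: one-hot encoding. One binary column per category.
one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in indices]

# Step 2b: embedding lookup. Each category owns a small dense vector.
# The table is random here; a neural network would learn these values.
rng = random.Random(42)
embedding_dim = 2  # assumed size, chosen only for this example
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]
embedded = [embedding_table[idx] for idx in indices]

print(indices)
print(one_hot)
print(embedded)
```

Rows with the same category share the same embedding vector, which is what lets the network learn one representation per category.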

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [None]:
!pip install pytorch_tabular

In [None]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

import math
from sklearn.metrics import mean_squared_error, r2_score

> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [None]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
train_df = df[df['year'] <= 2019]
val_df = df[df['year'] == 2020]
test_df = df[df['year'] == 2021]

# Remove the features not used to train the model
train_df = train_df.drop(columns = ['year', 'nearest_stn', 'full_address'])
val_df = val_df.drop(columns = ['year', 'nearest_stn', 'full_address'])
test_df = test_df.drop(columns = ['year', 'nearest_stn', 'full_address'])

> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [None]:
# TODO: Enter your code here

data_config = DataConfig(
    target=["resale_price"],
    continuous_cols=[
        "dist_to_nearest_stn",
        "dist_to_dhoby",
        "degree_centrality",
        "eigenvector_centrality",
        "remaining_lease_years",
        "floor_area_sqm",
    ],
    categorical_cols=["month", "town", "flat_model_type", "storey_range"],
)

trainer_config = TrainerConfig(
    auto_lr_find=True,  # Run the learning rate finder before training
    batch_size=1024,
    max_epochs=50,
)

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # One hidden layer with 50 neurons
)

optimizer_config = OptimizerConfig(
    optimizer="Adam"
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

In [None]:
# Fit model
tabular_model.fit(train=train_df, validation=val_df)

In [None]:
result = tabular_model.evaluate(test_df)
pred_df = tabular_model.predict(test_df)
list(pred_df['resale_price'])

In [None]:
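# Test MSE as reported by evaluate(); RMSE is its square root
mse = result[0]['test_mean_squared_error']
rmse = math.sqrt(mse)

# R2 computed from the actual and predicted prices in the prediction dataframe
r2 = r2_score(y_true=list(pred_df['resale_price']), y_pred=list(pred_df['resale_price_prediction']))

print("Root Mean Squared Error: ", rmse)
print("R2 Score: ", r2)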
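For reference, the two metrics can be computed directly from their definitions. The following is a small sketch on made-up numbers (not results from the model); the function names `rmse_from_scratch` and `r2_from_scratch` are hypothetical helpers introduced only for this illustration.

```python
import math

def rmse_from_scratch(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_from_scratch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy prices (illustrative values only).
y_true = [300.0, 400.0, 500.0, 600.0]
y_pred = [310.0, 390.0, 520.0, 580.0]

print(rmse_from_scratch(y_true, y_pred))
print(r2_from_scratch(y_true, y_pred))
```

Note that R2 is not symmetric in its arguments: the actual prices must be passed as `y_true`, because SS_tot is measured around their mean.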

> Report the test RMSE and the test R2 value that you obtained.



\# TODO: \<Enter your answer here\>

> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [None]:
# TODO: Enter your code here
pred_df["error"] = (pred_df["resale_price"] - pred_df["resale_price_prediction"]).abs()
sorted_df = pred_df.sort_values(by=["error"], ascending=False)
sorted_df.head(25)
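One way to look for a trend in the worst predictions is to count how often each feature value appears among them. The sketch below does this on a few made-up rows; the row values and the helper `most_common_values` are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Toy rows standing in for the worst predictions (values are made up).
worst_rows = [
    {"town": "A", "storey_range": "40 TO 42", "error": 300.0},
    {"town": "A", "storey_range": "40 TO 42", "error": 280.0},
    {"town": "A", "storey_range": "01 TO 03", "error": 250.0},
    {"town": "B", "storey_range": "40 TO 42", "error": 240.0},
]

def most_common_values(rows, columns):
    """For each column, return the most frequent value and its share of rows."""
    summary = {}
    for col in columns:
        value, count = Counter(row[col] for row in rows).most_common(1)[0]
        summary[col] = (value, count / len(rows))
    return summary

summary = most_common_values(worst_rows, ["town", "storey_range"])
for col, (value, share) in summary.items():
    print(f"{col}: {value!r} appears in {share:.0%} of the worst rows")
```

With a pandas dataframe, calling `value_counts()` on a column of `sorted_df.head(25)` gives the same kind of count. A value that dominates the worst rows but not the full test set is a candidate explanation for the errors.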

\# TODO: \<Enter your answer here\> <br />
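All 25 rows share the same degree_centrality value of 0.016807, so the largest errors are concentrated in flats with this one value. <br />
One way to reduce these errors is to lower the model's reliance on degree_centrality, for example by regularising the network or by retraining without this feature and comparing the validation error, rather than adjusting an individual learned weight by hand.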