# Question B1 (15 marks)

Real world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular[extra]

Collecting pytorch_tabular[extra]
  Downloading pytorch_tabular-1.1.0-py2.py3-none-any.whl.metadata (21 kB)
Collecting scikit-learn>=1.3.0 (from pytorch_tabular[extra])
  Downloading scikit_learn-1.5.2-cp311-cp311-win_amd64.whl.metadata (13 kB)
Collecting pytorch-lightning<2.2.0,>=2.0.0 (from pytorch_tabular[extra])
  Downloading pytorch_lightning-2.1.4-py3-none-any.whl.metadata (21 kB)
Collecting omegaconf>=2.3.0 (from pytorch_tabular[extra])
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting torchmetrics<1.3.0,>=0.10.0 (from pytorch_tabular[extra])
  Downloading torchmetrics-1.2.1-py3-none-any.whl.metadata (20 kB)
Collecting tensorboard!=2.5.0,>2.2.0 (from pytorch_tabular[extra])
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pytorch-tabnet==4.1 (from pytorch_tabular[extra])
  Downloading pytorch_tabnet-4.1.0-py3-none-any.whl.metadata (15 kB)
Collecting einops<0.8.0,>=0.6.0 (from pytorch_tabular[extra])
  Downloading einops-0.7

  You can safely remove it manually.


In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)
from sklearn.metrics import mean_squared_error, r2_score

1.Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# Step 2: Train-Validation-Test Split
# Using entries from year 2019 and before as training data, year 2020 as validation data, and year 2021 as test data
df = df[df['year'].isin([2019, 2020, 2021])]
train_data = df[df['year'] <= 2019]
val_data = df[df['year'] == 2020]
test_data = df[df['year'] == 2021]

# Drop unused columns
train_data = train_data.drop(columns=['full_address', 'nearest_stn', 'year'])
val_data = val_data.drop(columns=['full_address', 'nearest_stn', 'year'])
test_data = test_data.drop(columns=['full_address', 'nearest_stn', 'year'])

# YOUR CODE HERE

2.Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to be 1024 and set max_epoch as 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [4]:
# YOUR CODE HERE

# Step 3: Configuring Data
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
categorical_cols = ['month', 'town', 'flat_model_type', 'storey_range']

data_config = DataConfig(
    target=['resale_price'],
    continuous_cols=continuous_cols,
    categorical_cols=categorical_cols,
)

# Step 4: Trainer Configuration
trainer_config = TrainerConfig(
    auto_lr_find=True,
    batch_size=1024,
    max_epochs=50,
)

# Step 5: Model Configuration
model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",
    activation="ReLU",
    dropout=0.1,
)

# Step 6: Optimizer Configuration
optimizer_config = OptimizerConfig(
    optimizer="Adam"
)

# Step 7: Initialize the Model
model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

# Step 8: Train the Model
model.fit(train=train_data, validation=val_data)

# Step 9: Evaluate the Model on Test Data
test_result = model.evaluate(test=test_data)

Seed set to 42


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Missing logger folder: C:\Users\Vaishob\Downloads\lightning_logs
C:\Users\Vaishob\anaconda3\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
C:\Users\Vaishob\anaconda3\Lib\site-packages\pytorch_lightning\loops\fit_loop.py:293: The number of training batches (22) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
C:\Users\Vaishob\anaconda3\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at C:\Users\Vaishob\Downloads\.lr_find_ba145bdf-4b84-49b9-9614-dc1d537366b6.ckpt
C:\Users\Vaishob\anaconda3\Lib\site-packages\lightning_fabric\utilities\cloud_io.py:56: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` 

Output()

Output()

  return torch.load(f, map_location=map_location)
C:\Users\Vaishob\anaconda3\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


In [11]:
model.save_model("deep_learning_model_b1.h5")

3.Report the test RMSE error and the test R2 value that you obtained.



In [5]:
# YOUR CODE & RESULT HERE

# Calculate RMSE and R2 on Test Set
y_true = test_data['resale_price'].values
y_pred = model.predict(test_data.drop(columns=['resale_price']))

rmse = mean_squared_error(y_true, y_pred, squared=False)
r2 = r2_score(y_true, y_pred)

print(f'Test RMSE: {rmse}')
print(f'Test R2: {r2}')

Test RMSE: 82664.68508837765
Test R2: 0.7416655778912187




4.Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [6]:
# YOUR CODE & RESULT HERE

# Step 10: Print Top 25 Test Samples with Largest Errors
test_data['predicted_resale_price'] = y_pred
test_data['error'] = abs(test_data['resale_price'] - test_data['predicted_resale_price'])
sorted_test_data = test_data.sort_values(by='error', ascending=False)
print(sorted_test_data.head(25))

        month             town  dist_to_nearest_stn  dist_to_dhoby  \
105372      2       QUEENSTOWN             0.570988       4.922054   
105869      8       QUEENSTOWN             0.554599       4.841933   
92405      11      BUKIT MERAH             0.581977       2.309477   
106192     12       QUEENSTOWN             0.658035       3.807573   
105702      6       QUEENSTOWN             0.245207       4.709043   
92442      11      BUKIT MERAH             0.686789       2.664024   
106057     10       QUEENSTOWN             0.584731       3.882019   
100836      6  KALLANG/WHAMPOA             0.998313       3.304953   
105695      6       QUEENSTOWN             0.745596       3.720593   
92504      12      BUKIT MERAH             0.468378       2.365532   
90957       6      BUKIT BATOK             1.292540      10.763777   
105696      6       QUEENSTOWN             0.658035       3.807573   
114389     10        WOODLANDS             0.419275      16.945885   
92340      10      B

Observing the top 25 samples with the largest errors shows that the model struggles particularly with certain towns like Queenstown and Bukit Merah, as well as flats with larger floor areas or those on lower floors.

Possible trends include:
- Flats in popular towns (e.g., Queenstown, Bukit Merah) with high demand may have more volatile prices that are difficult to predict.
- Flats on lower floors tend to have larger errors, possibly because the price variation across floors is not well captured.
- Large floor area or executive flats also have larger errors, indicating the model might struggle to generalize well for larger properties.
 
Suggestion: To reduce errors, I would consider adding more relevant features such as the age of the building, proximity to amenities like schools or parks, or incorporating external factors like economic indicators. Additionally, I would consider increasing model complexity, tuning hyperparameters further, or using more advanced models like Gradient Boosting or XGBoost to help capture non-linear relationships better.