# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, a feedforward neural network in which the categorical features get learnable embeddings. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tabular data. We will use only the deeptabular component of this library, through the TabMlp network.
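As a rough illustration of what "learnable embeddings" means for a categorical feature, the sketch below uses plain NumPy rather than the library itself. The table size, the embedding dimension, and the toy batch are all hypothetical; in a real model the table entries would be updated by gradient descent instead of staying at their random initial values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one categorical feature with 4 levels,
# each mapped to a 3-dimensional vector.
n_categories, embed_dim = 4, 3
embedding_table = rng.normal(size=(n_categories, embed_dim))  # trainable in a real model

# A batch of 5 rows: an integer-encoded category plus 2 continuous features.
cat_idx = np.array([0, 2, 2, 3, 1])
continuous = rng.normal(size=(5, 2))

# An embedding lookup is row indexing into the table.
cat_embedded = embedding_table[cat_idx]  # shape (5, 3)

# The MLP input is the embeddings concatenated with the continuous features.
mlp_input = np.concatenate([cat_embedded, continuous], axis=1)
print(mlp_input.shape)  # (5, 5)
```

Rows 1 and 2 share the same category index, so they receive the same embedding vector; only their continuous features differ.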

In [1]:
!pip install pytorch-widedeep

Defaulting to user installation because normal site-packages is not writeable
Collecting torch (from pytorch-widedeep)
  Using cached torch-2.0.1-cp39-none-macosx_11_0_arm64.whl (55.8 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1
    Uninstalling torch-1.13.1:
      Successfully uninstalled torch-1.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytorch-tabnet 4.0 requires torch<2.0,>=1.2, but you have torch 2.0.1 which is incompatible.
Successfully installed torch-2.0.1


In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

import torch
from pytorch_widedeep.models import Wide
import math
from sklearn.metrics import mean_squared_error, r2_score


>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
train_df = df[df['year'] <= 2020]
test_df = df[df['year'] >= 2021]

In [4]:
train = train_df.to_numpy()
test = test_df.to_numpy()
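The year-based split can be sanity-checked with a few assertions. The sketch below uses a small hypothetical DataFrame in place of `hdb_price_prediction.csv`; the column names `year` and `resale_price` follow the notebook, while the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for hdb_price_prediction.csv
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2022],
    "resale_price": [300000, 320000, 410000, 450000, 500000],
})

toy_train = toy[toy["year"] <= 2020]
toy_test = toy[toy["year"] >= 2021]

# Every row lands in exactly one split (this would fail if `year` had missing values).
assert len(toy_train) + len(toy_test) == len(toy)
# The two splits share no rows.
assert toy_train.index.intersection(toy_test.index).empty
# All training rows come strictly before all test rows in time.
assert toy_train["year"].max() < toy_test["year"].min()

print(len(toy_train), len(toy_test))  # 3 2
```

The same three assertions could be run on `train_df` and `test_df` against `df`.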

>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [5]:
# TODO: Enter your code here

# Continuous and categorical feature columns, and the regression target
continuous_cols = [
    "dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality",
    "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm",
]
cat_embed_cols = ["month", "town", "flat_model_type", "storey_range"]
target = train_df["resale_price"].values

# Fit the preprocessor on the training data only, then transform it
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols,
    continuous_cols=continuous_cols,
)
X_tab = tab_preprocessor.fit_transform(train_df)

# TabMlp with two hidden layers of 200 and 100 neurons
tabmlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    cat_embed_activation=None,
    continuous_cols=continuous_cols,
)

# Wrap the deeptabular component and train it with the RMSE objective
model = WideDeep(deeptabular=tabmlp)

trainer = Trainer(model, objective="root_mean_squared_error", num_workers=0)
trainer.fit(X_tab=X_tab, target=target, n_epochs=100, batch_size=64)

epoch 1: 100%|██████████| 1366/1366 [00:20<00:00, 65.14it/s, loss=2.33e+5]
epoch 2: 100%|██████████| 1366/1366 [00:18<00:00, 73.50it/s, loss=9.83e+4] 
epoch 3: 100%|██████████| 1366/1366 [00:09<00:00, 146.75it/s, loss=8.6e+4] 
epoch 4: 100%|██████████| 1366/1366 [00:08<00:00, 163.83it/s, loss=7.97e+4]
epoch 5: 100%|██████████| 1366/1366 [00:09<00:00, 151.73it/s, loss=7.61e+4]
epoch 6: 100%|██████████| 1366/1366 [00:07<00:00, 181.96it/s, loss=7.36e+4]
epoch 7: 100%|██████████| 1366/1366 [00:07<00:00, 184.34it/s, loss=7.22e+4]
epoch 8: 100%|██████████| 1366/1366 [00:07<00:00, 186.53it/s, loss=7.09e+4]
epoch 9: 100%|██████████| 1366/1366 [00:07<00:00, 176.97it/s, loss=6.98e+4]
epoch 10: 100%|██████████| 1366/1366 [00:08<00:00, 167.55it/s, loss=6.89e+4]
epoch 11: 100%|██████████| 1366/1366 [00:09<00:00, 147.68it/s, loss=6.83e+4]
epoch 12: 100%|██████████| 1366/1366 [00:09<00:00, 150.59it/s, loss=6.76e+4]
epoch 13: 100%|██████████| 1366/1366 [00:10<00:00, 133.31it/s, loss=6.69e+4]
epoch 14:

>Report the test RMSE and the test R2 value that you obtained.

In [6]:
# TODO: Enter your code here

X_tab_test = tab_preprocessor.transform(test_df)
preds = trainer.predict(X_tab=X_tab_test)

predict: 100%|██████████| 1128/1128 [00:02<00:00, 434.16it/s]


In [7]:
# Compare the predictions with the true resale prices of the test set
y_test = test_df['resale_price'].values
mse = mean_squared_error(y_true=y_test, y_pred=preds)
rmse = math.sqrt(mse)
r2 = r2_score(y_true=y_test, y_pred=preds)

print("Root Mean Squared Error: ", rmse)
print("R2 Score: ", r2)

Root Mean Squared Error:  96488.27015764269
R2 Score:  0.6747273327009404
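For reference, the two metrics can be written out by hand. The sketch below is a pure-Python version on made-up numbers, using the standard definitions RMSE = sqrt(SS_res / n) and R2 = 1 - SS_res / SS_tot; the helper name `rmse_and_r2` and the toy values are illustrative and not part of the assignment.

```python
import math

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared errors of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the targets from their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return math.sqrt(ss_res / n), 1.0 - ss_res / ss_tot

# Made-up prices for illustration only
y_true = [300000.0, 400000.0, 500000.0]
y_pred = [310000.0, 390000.0, 520000.0]
toy_rmse, toy_r2 = rmse_and_r2(y_true, y_pred)
print(toy_rmse, toy_r2)
```

Because both metrics share SS_res, they are linked by RMSE^2 = (1 - R2) * Var(y). With the values reported above, this implies a standard deviation of test-set resale prices of roughly 169,000, so the model's typical error is a little over half of the spread of the prices themselves.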
