# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easier to work with multimodal deep-learning problems that combine images, text, and tabular data. We use only the deeptabular component of the library, through the TabMlp network.

In [None]:
!pip install pytorch-widedeep

In [5]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score
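
As background for the embedding-based models in B1 and B2, the idea of "learnable embeddings for categorical features" can be sketched with plain NumPy. This is only an illustration of the lookup-and-concatenate step, not the library's implementation: the column sizes, the random table, and the names `embedding_table`, `cat_codes`, and `cont_feats` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one categorical column with 4 levels, embedded in 3 dimensions.
n_categories, embed_dim = 4, 3
# In a real model this table is a trainable parameter; here it is just random numbers.
embedding_table = rng.normal(size=(n_categories, embed_dim))

cat_codes = np.array([0, 2, 2, 3])                      # label-encoded categories for 4 rows
cont_feats = np.array([[0.5], [1.2], [-0.3], [0.0]])    # one continuous feature per row

# Embedding lookup: each category code selects one row of the table.
embedded = embedding_table[cat_codes]                   # shape (4, 3)

# The MLP input is the embedded categoricals concatenated with the continuous features.
mlp_input = np.concatenate([embedded, cont_feats], axis=1)  # shape (4, 4)
print(mlp_input.shape)
```

Rows that share a category code receive the same embedding vector, which is what lets the network learn one representation per category level.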

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [6]:
df = pd.read_csv('hdb_price_prediction.csv')

df2021andaftertest = df[df['year']>=2021]
df2020andbeforetrain = df[df['year']<=2020]
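
A quick sanity check of a year-based split can be sketched on a toy frame. The data below is made up for illustration, and `cutoff_year`, `toy_train`, and `toy_test` are hypothetical names; only the `year` and `resale_price` column names come from the dataset used above.

```python
import pandas as pd

# Toy stand-in for hdb_price_prediction.csv (values are invented).
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2022],
    "resale_price": [300000.0, 320000.0, 410000.0, 450000.0, 500000.0],
})

cutoff_year = 2020  # last year that belongs to the training set

train_mask = toy["year"] <= cutoff_year
toy_train = toy[train_mask]
toy_test = toy[~train_mask]

# Every row lands in exactly one split, and the splits do not overlap in time.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["year"].max() <= cutoff_year
assert toy_test["year"].min() > cutoff_year
```

Using one mask and its negation guarantees a partition. One difference to be aware of: rows with a missing `year` would fall into the test set here, whereas the two separate comparisons in the cell above would drop them from both sets.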

>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [7]:
# For questions B1 and B2, the following features should be used:    
# - Numeric / Continuous features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm 
# - Categorical features: month, town, flat_model_type, storey_range

# Regression target: resale price of the training rows.
target = df2020andbeforetrain["resale_price"].values

cat_embed_cols = ["month", "town", "flat_model_type", "storey_range"]
continuous_cols = [
    "dist_to_nearest_stn",
    "dist_to_dhoby",
    "degree_centrality",
    "eigenvector_centrality",
    "remaining_lease_years",
    "floor_area_sqm",
]

# Fit the preprocessor on the training data only, then transform it.
tab_preprocessor = TabPreprocessor(cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_tab = tab_preprocessor.fit_transform(df2020andbeforetrain)

tab_mlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
)

model = WideDeep(deeptabular=tab_mlp)

trainer = Trainer(model, objective="root_mean_squared_error", num_workers=0)
trainer.fit(
    X_tab=X_tab,
    target=target,
    n_epochs=100,
    batch_size=64,
)

X_tab_te = tab_preprocessor.transform(df2021andaftertest)
preds = trainer.predict(X_tab=X_tab_te)

epoch 1: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:13<00:00, 102.09it/s, loss=2.22e+5]
epoch 2: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:14<00:00, 92.78it/s, loss=9.66e+4]
epoch 3: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 88.70it/s, loss=8.58e+4]
epoch 4: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:17<00:00, 80.17it/s, loss=7.99e+4]
epoch 5: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:16<00:00, 81.34it/s, loss=7.63e+4]
epoch 6: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:16<00:00, 84.00it/s, loss=7.33e+4]
epoch 7: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:17<00:00, 77.51it/s, loss=7.11e+4]
epoch 8: 100%|██████████████████████████████████████████████████████████| 1366/1366 [00:16<00:00, 81.32it/s, loss=7e+4]
epoch 9: 100%|██████████████████████████

epoch 67: 100%|███████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 89.00it/s, loss=5.7e+4]
epoch 68: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 89.29it/s, loss=5.65e+4]
epoch 69: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 89.73it/s, loss=5.59e+4]
epoch 70: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 88.85it/s, loss=5.53e+4]
epoch 71: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 89.01it/s, loss=5.47e+4]
epoch 72: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 89.08it/s, loss=5.38e+4]
epoch 73: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 87.84it/s, loss=5.33e+4]
epoch 74: 100%|██████████████████████████████████████████████████████| 1366/1366 [00:15<00:00, 88.39it/s, loss=5.29e+4]
epoch 75: 100%|█████████████████████████
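
The `1366/1366` counter in the progress bars above is the number of mini-batches per epoch. A minimal sketch of the arithmetic, assuming the final partial batch is kept and no rows are held out for validation (both are assumptions about the trainer's defaults, not something the log confirms); `n_batches`, `min_rows`, and `max_rows` are hypothetical helper names:

```python
import math

batch_size = 64            # from the trainer.fit call
batches_per_epoch = 1366   # from the progress bars

def n_batches(n_rows: int, batch_size: int) -> int:
    """Number of mini-batches when the final partial batch is kept."""
    return math.ceil(n_rows / batch_size)

# Range of training-set sizes consistent with 1366 batches of size 64.
min_rows = (batches_per_epoch - 1) * batch_size + 1
max_rows = batches_per_epoch * batch_size
print(min_rows, max_rows)
```

Under those assumptions, the training set holds between 87,361 and 87,424 rows, which can be compared against `len(df2020andbeforetrain)`.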

>Report the test RMSE and the test R2 value that you obtained.

In [8]:
from sklearn.metrics import mean_squared_error, r2_score

y_test = df2021andaftertest["resale_price"].values

# Take the square root explicitly: the `squared=False` argument is deprecated
# in scikit-learn 1.4 and removed in later releases.
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("The test RMSE value is " + str(rmse))

r2value = r2_score(y_test, preds)
print("The test R2 value is " + str(r2value))

The test RMSE value is 97224.1841396755
The test R2 value is 0.6697467159129393
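
To make the two reported metrics concrete, here is a minimal NumPy sketch of how RMSE and R2 are defined, evaluated on three made-up prices. The helper names `rmse_manual` and `r2_manual` and the toy values are hypothetical; the notebook itself relies on scikit-learn for the actual numbers.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    return float(1.0 - ss_res / ss_tot)

# Invented toy prices, only to exercise the formulas.
y_true = [300000.0, 400000.0, 500000.0]
y_pred = [310000.0, 390000.0, 520000.0]

print(rmse_manual(y_true, y_pred))
print(r2_manual(y_true, y_pred))
```

Read this way, the reported test R2 of about 0.67 says the model's squared error on the 2021-and-after data is roughly a third of the error made by always predicting the mean test price, while the RMSE of about 97,224 is in the same units as `resale_price`.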
