# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which the categorical features get learnable embeddings. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network.
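
To make "learnable embeddings" concrete, the following is a minimal pure-Python sketch, not the library's implementation. Each level of a categorical feature indexes a row of an embedding table, and the looked-up vector is concatenated with the continuous features before entering the MLP. The level names, the embedding size, and the helper `encode_row` are hypothetical; in a real network the table entries are parameters updated by gradient descent rather than fixed random numbers.

```python
import random

rng = random.Random(42)

# Hypothetical categorical feature with three levels, embedded in 2 dimensions.
levels = ["low", "mid", "high"]
embed_dim = 2
level_to_index = {level: i for i, level in enumerate(levels)}

# Embedding table: one vector per level (random initial values stand in
# for weights that a real model would learn during training).
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in levels]


def encode_row(category, continuous):
    """Look up the category's embedding and append the continuous features."""
    return embedding_table[level_to_index[category]] + list(continuous)


row = encode_row("mid", [0.5, 1.2])
print(len(row))  # embed_dim + number of continuous features
```

The resulting vector is what the first linear layer of the MLP would receive for that row.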

In [1]:
# !pip install pytorch-widedeep
# !pip install --upgrade torch --user

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score



>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]



>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous
features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)
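
TabPreprocessor can take each categorical feature together with an embedding dimension, so a dimension has to be chosen per column. As a sketch of one option, the heuristic below (a rule of thumb used by the fastai library, not something this assignment prescribes) derives the dimension from the column's cardinality. The function name `embedding_dim` and the cardinalities are hypothetical placeholders, not values taken from the HDB dataset.

```python
def embedding_dim(n_categories: int) -> int:
    """fastai-style rule of thumb: grows sub-linearly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_categories ** 0.56))


# Hypothetical cardinalities, for illustration only.
cardinalities = {"month": 12, "town": 26, "flat_model_type": 40, "storey_range": 17}

# Same (column_name, embed_dim) tuple layout that cat_embed_cols uses.
cat_embed_cols = [(name, embedding_dim(n)) for name, n in cardinalities.items()]
print(cat_embed_cols)
```

Compared with using the cardinality itself as the dimension, a rule like this keeps the embedding narrower than a one-hot encoding of the same column.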

In [9]:
# TODO: Enter your code here
# Each categorical feature is paired with its embedding dimension:
# cat_embed_cols = [(column_name, embed_dim), ...]
# Here the number of unique values in each column is used as its embedding dimension.
categorical_names = ['month', 'town', 'flat_model_type', 'storey_range']
categorical_cols = [(col, df[col].nunique()) for col in categorical_names]
continuous_cols = [
    'dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality',
    'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm',
]

tab_preprocessor = TabPreprocessor(cat_embed_cols=categorical_cols, continuous_cols=continuous_cols)

X_tab = tab_preprocessor.fit_transform(train_data)

#TabMlp(column_idx, cat_embed_input=None, cat_embed_dropout=0.1, use_cat_bias=False, cat_embed_activation=None, continuous_cols=None, cont_norm_layer='batchnorm', embed_continuous=False, cont_embed_dim=32, cont_embed_dropout=0.1, use_cont_bias=True, cont_embed_activation=None, mlp_hidden_dims=[200, 100], mlp_activation='relu', mlp_dropout=0.1, mlp_batchnorm=False, mlp_batchnorm_last=False, mlp_linear_first=False)

# Reference code adapted from https://pytorch-widedeep.readthedocs.io/en/latest/quick_start.html
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx, 
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[200, 100]
)

model = WideDeep(deeptabular=tab_mlp)

# Train with the RMSE cost function; R2 is tracked on the training data (no validation set is passed)
trainer = Trainer(model, cost_fn="rmse", num_workers=0, metrics=[R2Score])
trainer.fit(
    X_tab=X_tab,
    target=train_data['resale_price'].values,
    n_epochs=100,
    batch_size=64,
)












epoch 1: 100%|██████████| 1366/1366 [00:19<00:00, 69.96it/s, loss=2.03e+5, metrics={'r2': -1.6901}]
epoch 2: 100%|██████████| 1366/1366 [00:21<00:00, 62.35it/s, loss=8.14e+4, metrics={'r2': 0.6858}]
epoch 3: 100%|██████████| 1366/1366 [00:23<00:00, 59.13it/s, loss=7.21e+4, metrics={'r2': 0.7667}]
epoch 4: 100%|██████████| 1366/1366 [00:28<00:00, 48.33it/s, loss=6.88e+4, metrics={'r2': 0.7893}]
epoch 5: 100%|██████████| 1366/1366 [00:22<00:00, 61.10it/s, loss=6.69e+4, metrics={'r2': 0.8012}]
epoch 6: 100%|██████████| 1366/1366 [00:21<00:00, 63.71it/s, loss=6.56e+4, metrics={'r2': 0.8091}]
epoch 7: 100%|██████████| 1366/1366 [00:26<00:00, 52.09it/s, loss=6.47e+4, metrics={'r2': 0.814}] 
epoch 8: 100%|██████████| 1366/1366 [00:24<00:00, 56.71it/s, loss=6.39e+4, metrics={'r2': 0.8184}]
epoch 9: 100%|██████████| 1366/1366 [00:29<00:00, 46.05it/s, loss=6.34e+4, metrics={'r2': 0.8209}]
epoch 10: 100%|██████████| 1366/1366 [00:28<00:00, 48.16it/s, loss=6.27e+4, metrics={'r2': 0.8247}]
epoch 11

>Report the test RMSE and the test R2 value that you obtained.
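
Both metrics compare predictions against the ground truth, but they differ in one important way: RMSE is symmetric in its two arguments, while R2 normalises by the variance of the ground truth and therefore changes if the arguments are swapped. The sketch below illustrates this in pure Python with hypothetical toy numbers; the helper names `rmse` and `r2` are illustrative and are not the scikit-learn functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken over y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values, not taken from the HDB dataset.
y_true = [300.0, 450.0, 500.0, 650.0]
y_pred = [400.0, 450.0, 480.0, 520.0]

print("RMSE:", rmse(y_true, y_pred), "swapped:", rmse(y_pred, y_true))
print("R2:", r2(y_true, y_pred), "swapped:", r2(y_pred, y_true))
```

When predictions vary less than the targets, as they do here, the swapped call divides by a smaller total sum of squares and reports a much lower R2, so the ground truth should always be passed first.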

In [12]:
# TODO: Enter your code here
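from sklearn.metrics import mean_squared_error, r2_score

X_tab_test = tab_preprocessor.transform(test_data)
preds = trainer.predict(X_tab=X_tab_test, batch_size=64)
y_test = test_data['resale_price'].values

# scikit-learn metrics take (y_true, y_pred). RMSE is symmetric, but R^2 is not,
# so the ground truth must come first. The R^2 value recorded below was produced
# with the arguments swapped; rerun this cell to refresh it.
# np.sqrt is used instead of squared=False, which newer scikit-learn versions no longer accept.
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
print("R^2:", r2_score(y_test, preds))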

predict: 100%|██████████| 1128/1128 [00:08<00:00, 128.55it/s]

RMSE: 105928.49407185947
R^2: 0.524054047867564



