# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model. This creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will make use of a library called Pytorch-WideDeep. This library makes it easy to work with multimodal deep-learning problems combining images, text, and tables. We will just be utilizing the deeptabular component of this library through the TabMlp network:

In [1]:
!pip install pytorch-widedeep

Collecting pytorch-widedeep
  Obtaining dependency information for pytorch-widedeep from https://files.pythonhosted.org/packages/17/f4/48f8d4c527baea10808b822fd3c00260f2b3b453937f2ef54bc464da1b88/pytorch_widedeep-1.3.2-py3-none-any.whl.metadata
  Downloading pytorch_widedeep-1.3.2-py3-none-any.whl.metadata (10 kB)
Collecting gensim (from pytorch-widedeep)
  Obtaining dependency information for gensim from https://files.pythonhosted.org/packages/0c/a7/2dd786427bedd2c3dc6c74b70e1e53c6c180a7da0a686c61c2ab17f6fc63/gensim-4.3.2-cp38-cp38-macosx_10_9_x86_64.whl.metadata
  Downloading gensim-4.3.2-cp38-cp38-macosx_10_9_x86_64.whl.metadata (8.5 kB)
Collecting spacy (from pytorch-widedeep)
  Obtaining dependency information for spacy from https://files.pythonhosted.org/packages/2c/f5/4aacdbc74b0bfbb485a63a2b1d2982c2fde53702b7cd8b19d9db2ae7bb18/spacy-3.6.1-cp38-cp38-macosx_10_9_x86_64.whl.metadata
  Downloading spacy-3.6.1-cp38-cp38-macosx_10_9_x86_64.whl.metadata (25 kB)
Collecting opencv-contr

In [1]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [2]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
df_train = df[df['year'] <= 2020]
df_test = df[df['year'] >= 2021]

>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous
features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [3]:
# TODO: Enter your code here
categorical_columns = ['month', 'town', 'flat_model_type', 'storey_range']
continuous_columns = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
target = df_train['resale_price'].values

tab_preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_columns, continuous_cols=continuous_columns
)

X_tab = tab_preprocessor.fit_transform(df_train)

tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_columns,
    mlp_hidden_dims=[200, 100]
)

model = WideDeep(deeptabular=tab_mlp)
trainer = Trainer(model, cost_function="root_mean_squared_error", metrics=[R2Score], num_workers=0)
trainer.fit(
    X_tab=X_tab,
    target=target,
    n_epochs=100,
    batch_size=64,
)

epoch 1: 100%|██████████| 1366/1366 [00:12<00:00, 105.63it/s, loss=2.31e+5, metrics={'r2': -2.2184}]
epoch 2: 100%|██████████| 1366/1366 [00:12<00:00, 108.86it/s, loss=1e+5, metrics={'r2': 0.5272}]   
epoch 3: 100%|██████████| 1366/1366 [00:12<00:00, 107.38it/s, loss=8.65e+4, metrics={'r2': 0.6581}]
epoch 4: 100%|██████████| 1366/1366 [00:12<00:00, 111.60it/s, loss=7.98e+4, metrics={'r2': 0.713}] 
epoch 5: 100%|██████████| 1366/1366 [00:12<00:00, 113.21it/s, loss=7.65e+4, metrics={'r2': 0.739}] 
epoch 6: 100%|██████████| 1366/1366 [00:11<00:00, 116.83it/s, loss=7.42e+4, metrics={'r2': 0.7554}]
epoch 7: 100%|██████████| 1366/1366 [00:11<00:00, 115.81it/s, loss=7.27e+4, metrics={'r2': 0.765}] 
epoch 8: 100%|██████████| 1366/1366 [00:11<00:00, 117.65it/s, loss=7.12e+4, metrics={'r2': 0.7747}]
epoch 9: 100%|██████████| 1366/1366 [00:11<00:00, 115.73it/s, loss=7.02e+4, metrics={'r2': 0.7815}]
epoch 10: 100%|██████████| 1366/1366 [00:11<00:00, 116.87it/s, loss=6.87e+4, metrics={'r2': 0.7905}

>Report the test RMSE and the test R2 value that you obtained.

In [4]:
from sklearn.metrics import mean_squared_error, r2_score

X_tab_test = tab_preprocessor.transform(df_test)
y_pred = trainer.predict(X_tab=X_tab_test, batch_size=64)

y_test = df_test['resale_price'].values
rmse = mean_squared_error(y_pred, y_test, squared=False)

r2 = r2_score(y_pred, y_test)

print(f"RMSE: {rmse}")
print(f"R2: {r2}")

predict: 100%|██████████| 1128/1128 [00:02<00:00, 451.74it/s]


RMSE: 96887.43209364318
R2: 0.6272553772048437
