# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network.
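To make "learnable embeddings feeding a feedforward network" concrete, here is a minimal NumPy sketch of one forward pass. It is an illustration only, not the TabMlp implementation: the feature counts, embedding widths, and random weights are hypothetical, and the hidden widths of 200 and 100 simply mirror the ones requested later in this question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 categorical features and 3 continuous features.
n_categories = [5, 3]   # number of distinct values per categorical feature
embed_dims = [4, 2]     # embedding width chosen for each categorical feature
n_continuous = 3

# One lookup table per categorical feature; in a real model these rows are trained.
embedding_tables = [rng.normal(size=(n, d)) for n, d in zip(n_categories, embed_dims)]

def init_layer(n_in, n_out):
    """Return a small random weight matrix and a zero bias vector."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

input_dim = sum(embed_dims) + n_continuous
W1, b1 = init_layer(input_dim, 200)
W2, b2 = init_layer(200, 100)
W_out, b_out = init_layer(100, 1)

def forward(x_cat, x_cont):
    # Replace each integer category code with its embedding row.
    embedded = [table[x_cat[:, i]] for i, table in enumerate(embedding_tables)]
    # Concatenate all embeddings with the continuous features.
    x = np.concatenate(embedded + [x_cont], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)   # hidden layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0.0)   # hidden layer 2 (ReLU)
    return h @ W_out + b_out           # single regression output per row

x_cat = np.array([[0, 2], [4, 1]])               # integer-encoded categories
x_cont = rng.normal(size=(2, n_continuous))      # continuous features
preds = forward(x_cat, x_cont)
print(preds.shape)  # (2, 1)
```

During training, the rows of each embedding table are updated by gradient descent together with the layer weights, which is what makes the embeddings "learnable".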

In [1]:
!pip install pytorch-widedeep

Collecting pytorch-widedeep
  Downloading pytorch_widedeep-1.6.3-py3-none-any.whl.metadata (10 kB)
Collecting spacy (from pytorch-widedeep)
  Downloading spacy-3.8.2-cp311-cp311-win_amd64.whl.metadata (27 kB)
Collecting opencv-contrib-python (from pytorch-widedeep)
  Downloading opencv_contrib_python-4.10.0.84-cp37-abi3-win_amd64.whl.metadata (20 kB)
Collecting imutils (from pytorch-widedeep)
  Downloading imutils-0.5.4.tar.gz (17 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting torchvision>=0.15.0 (from pytorch-widedeep)
  Downloading torchvision-0.19.1-cp311-cp311-win_amd64.whl.metadata (6.1 kB)
Collecting fastparquet>=0.8.1 (from pytorch-widedeep)
  Downloading fastparquet-2024.5.0-cp311-cp311-win_amd64.whl.metadata (4.3 kB)
Collecting transformers (from pytorch-widedeep)
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)

In [12]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd
import torch
torch.manual_seed(SEED)

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score
from torch.optim import Adam
from sklearn.metrics import mean_squared_error, r2_score

1. Divide the dataset ('hdb_price_prediction.csv') into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [9]:
df = pd.read_csv('hdb_price_prediction.csv')

# YOUR CODE HERE

# Step 2: Train-Test Split
# Using entries from year 2020 and before as training data, and entries from 2021 and after as test data
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]

# Drop unused columns
train_data = train_data.drop(columns=['full_address', 'nearest_stn', 'year'])
test_data = test_data.drop(columns=['full_address', 'nearest_stn', 'year'])

2. Refer to the documentation of Pytorch-WideDeep (https://pytorch-widedeep.readthedocs.io/en/latest/index.html) and perform the following tasks:
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)
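The preprocessing step is fitted on the training data and then reused unchanged on the test data. The idea behind that can be sketched in plain Python. This is a conceptual illustration, not TabPreprocessor's actual code: the helper names `fit_vocab` and `encode`, the toy town values, and the convention of reserving index 0 for categories unseen during fitting are all assumptions of this sketch.

```python
def fit_vocab(values):
    """Map each distinct training value to an integer index, starting at 1."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return vocab

def encode(values, vocab):
    """Encode values with a fitted vocabulary; unseen values map to 0."""
    return [vocab.get(v, 0) for v in values]

# Hypothetical toy columns standing in for a categorical feature such as 'town'.
train_towns = ["ANG MO KIO", "BEDOK", "ANG MO KIO", "YISHUN"]
test_towns = ["BEDOK", "PUNGGOL"]

vocab = fit_vocab(train_towns)      # learned from the training split only
print(encode(train_towns, vocab))   # [1, 2, 1, 3]
print(encode(test_towns, vocab))    # [2, 0]  ("PUNGGOL" was never seen in training)
```

Fitting on the training split alone keeps the integer codes, and therefore the embedding rows they index, consistent between training and prediction, and avoids leaking information from the test period.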

In [10]:
# Step 3: Configuring Features
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
categorical_cols = ['month', 'town', 'flat_model_type', 'storey_range']

# Step 4: Tab Preprocessing
preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_cols,
    continuous_cols=continuous_cols
)

# Fit and transform training data, transform test data
X_train = preprocessor.fit_transform(train_data.drop(columns=['resale_price']))
X_test = preprocessor.transform(test_data.drop(columns=['resale_price']))
y_train = train_data['resale_price'].values
y_test = test_data['resale_price'].values

# Step 5: Create the TabMlp Model
model = TabMlp(
    column_idx=preprocessor.column_idx,
    cat_embed_input=preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[200, 100],
    mlp_activation='relu'
)

# Step 6: Create WideDeep Model
wide_deep_model = WideDeep(deeptabular=model)

# Step 7: Trainer Configuration
trainer = Trainer(
    model=wide_deep_model,
    objective='rmse',
    metrics=[R2Score()],
    optimizers=Adam(wide_deep_model.parameters()),
    num_workers=0
)

# Step 8: Train the Model
trainer.fit(X_tab=X_train, target=y_train, n_epochs=100, batch_size=64)

# Step 9: Predict on the Test Data (the metrics are computed in part 3)
y_pred = trainer.predict(X_tab=X_test)

epoch 1: 100%|██████████| 1366/1366 [00:16<00:00, 80.42it/s, loss=1.87e+5, metrics={'r2': -1.3086}]
epoch 2: 100%|██████████| 1366/1366 [00:17<00:00, 77.83it/s, loss=1.01e+5, metrics={'r2': 0.4743}]
epoch 3: 100%|██████████| 1366/1366 [00:18<00:00, 74.25it/s, loss=8.26e+4, metrics={'r2': 0.6563}]
epoch 4: 100%|██████████| 1366/1366 [00:17<00:00, 79.15it/s, loss=6.85e+4, metrics={'r2': 0.7795}]
epoch 5: 100%|██████████| 1366/1366 [00:17<00:00, 77.24it/s, loss=6.25e+4, metrics={'r2': 0.8228}]
epoch 6: 100%|██████████| 1366/1366 [00:17<00:00, 76.53it/s, loss=6e+4, metrics={'r2': 0.8381}]   
epoch 7: 100%|██████████| 1366/1366 [00:18<00:00, 72.18it/s, loss=5.89e+4, metrics={'r2': 0.8447}]
epoch 8: 100%|██████████| 1366/1366 [00:17<00:00, 79.70it/s, loss=5.79e+4, metrics={'r2': 0.8501}]
epoch 9: 100%|██████████| 1366/1366 [00:17<00:00, 76.98it/s, loss=5.73e+4, metrics={'r2': 0.853}] 
epoch 10: 100%|██████████| 1366/1366 [00:18<00:00, 74.46it/s, loss=5.66e+4, metrics={'r2': 0.8564}]
epoch 11

epoch 80: 100%|██████████| 1366/1366 [00:18<00:00, 71.92it/s, loss=4.76e+4, metrics={'r2': 0.8989}]
epoch 81: 100%|██████████| 1366/1366 [00:14<00:00, 91.57it/s, loss=4.76e+4, metrics={'r2': 0.8989}]
epoch 82: 100%|██████████| 1366/1366 [00:15<00:00, 87.34it/s, loss=4.74e+4, metrics={'r2': 0.8999}]
epoch 83: 100%|██████████| 1366/1366 [00:15<00:00, 87.39it/s, loss=4.75e+4, metrics={'r2': 0.8991}] 
epoch 84: 100%|██████████| 1366/1366 [00:15<00:00, 89.68it/s, loss=4.74e+4, metrics={'r2': 0.8999}]
epoch 85: 100%|██████████| 1366/1366 [00:16<00:00, 80.98it/s, loss=4.74e+4, metrics={'r2': 0.9}]   
epoch 86: 100%|██████████| 1366/1366 [00:17<00:00, 79.76it/s, loss=4.75e+4, metrics={'r2': 0.8992}] 
epoch 87: 100%|██████████| 1366/1366 [00:15<00:00, 86.44it/s, loss=4.75e+4, metrics={'r2': 0.8997}]
epoch 88: 100%|██████████| 1366/1366 [00:16<00:00, 85.18it/s, loss=4.73e+4, metrics={'r2': 0.9002}] 
epoch 89: 100%|██████████| 1366/1366 [00:15<00:00, 89.98it/s, loss=4.72e+4, metrics={'r2': 0.9003

3. Report the test RMSE and the test R2 value that you obtained.
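For reference, both metrics can be computed by hand. The sketch below uses the standard definitions (RMSE as the square root of the mean squared residual, R2 as one minus the ratio of residual to total sum of squares). The helper names and the toy prices are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy prices: every prediction is off by exactly 20,000.
y_true = [300_000, 450_000, 500_000, 650_000]
y_hat = [320_000, 430_000, 520_000, 630_000]

print(rmse(y_true, y_hat))  # 20000.0
print(r2(y_true, y_hat))    # about 0.9744
```

RMSE is expressed in the same units as the target (here, the resale price), while R2 is unitless and compares the model against always predicting the mean of the evaluated set.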

In [13]:
# YOUR CODE & RESULT HERE
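# Calculate RMSE and R2 on Test Set
# Taking np.sqrt of the MSE works across scikit-learn versions
# (the squared=False argument was deprecated and later removed)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))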
r2 = r2_score(y_test, y_pred)

print(f'TabMlp Model Test RMSE: {rmse}')
print(f'TabMlp Model Test R2: {r2}')

TabMlp Model Test RMSE: 102276.53768557492
TabMlp Model Test R2: 0.6345309827209406


