# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network.

In [1]:
%pip install pytorch-widedeep

Note: you may need to restart the kernel to use updated packages.


In [2]:
SEED = 42

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import torch
torch.manual_seed(SEED)

import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer

def set_seed(seed=SEED):
    """Re-seed the Python, NumPy, and PyTorch RNGs for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here

# Divide dataset into training and test sets
train_df = df[df['year'] <= 2020]
test_df = df[df['year'] >= 2021]

# Sanity Check: Get unique values 
train_unique_years = ', '.join(map(str, train_df['year'].unique()))
test_unique_years = ', '.join(map(str, test_df['year'].unique()))

# Sanity Check: Print the formatted unique values
print(f"Unique years in train_df: {train_unique_years}")
print(f"Unique years in test_df: {test_unique_years}")

Unique years in train_df: 2017, 2018, 2019, 2020
Unique years in test_df: 2021, 2022, 2023


>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)
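
The first task relies on fitting the preprocessor on the training rows only and then reusing the fitted statistics for the test rows. The sketch below illustrates that idea for a single continuous column in plain Python, assuming zero-mean, unit-variance scaling (what scikit-learn's `StandardScaler` computes). The helper names `fit_scaler` and `apply_scaler` and the toy values are hypothetical and are not part of Pytorch-WideDeep.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std

def apply_scaler(values, mean, std):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical toy column, not taken from the HDB dataset
train_area = [60.0, 80.0, 100.0, 120.0]
test_area = [90.0, 150.0]

# Fit on the training column, then reuse the same statistics for the test column
mean, std = fit_scaler(train_area)
train_scaled = apply_scaler(train_area, mean, std)
test_scaled = apply_scaler(test_area, mean, std)
```

Reusing the training statistics keeps information about the test period out of the preprocessing step, which mirrors calling `fit_transform` on the training set and `transform` on the test set.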

In [4]:
# TODO: Enter your code here

# Utilize Pytorch-WideDeep Library

set_seed()

# TabPreprocessor: Preprocess the data 
# Define the 'column set up'
cat_embed_cols = [
    "month",
    "town",
    "flat_model_type",
    "storey_range",
]
continuous_cols = [
    "dist_to_nearest_stn",
    "dist_to_dhoby",
    "degree_centrality",
    "eigenvector_centrality",
    "remaining_lease_years",
    "floor_area_sqm",
]

# Prepare the data
tab_preprocessor = TabPreprocessor(cat_embed_cols=cat_embed_cols, 
                                    continuous_cols=continuous_cols, 
                                    cols_to_scale=continuous_cols)
X_tab_train = tab_preprocessor.fit_transform(train_df)
X_tab_test = tab_preprocessor.transform(test_df)

# TabMlp: Build the TabMlp model
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[200, 100]
)

# Create the WideDeep model with TabMlp as the deeptabular component
model = WideDeep(deeptabular=tab_mlp)

# Trainer: Train the model
# Getting the target values
target_train = train_df['resale_price'].values
target_test = test_df['resale_price'].values

# Train and validate
trainer = Trainer(model=model, objective="rmse", num_workers=0, seed=SEED)
trainer.fit(
    X_tab=X_tab_train,
    target=target_train,
    n_epochs=100,
    batch_size=64, 
    val_split=0.3, # 70-30 split
)

# train_df shape: (87370, 14)
# A 70-30 split gives 61159 rows for training and 26211 rows for validation.
# With a batch size of 64, that is ceil(61159 / 64) = 956 training batches
# and ceil(26211 / 64) = 410 validation batches per epoch.

epoch 1: 100%|██████████| 956/956 [00:10<00:00, 94.20it/s, loss=2.49e+5] 
valid: 100%|██████████| 410/410 [00:02<00:00, 171.47it/s, loss=8.42e+4]
epoch 2: 100%|██████████| 956/956 [00:09<00:00, 100.94it/s, loss=8.68e+4]
valid: 100%|██████████| 410/410 [00:02<00:00, 175.60it/s, loss=6.24e+4]
epoch 3: 100%|██████████| 956/956 [00:10<00:00, 94.77it/s, loss=7.82e+4] 
valid: 100%|██████████| 410/410 [00:02<00:00, 173.26it/s, loss=5.75e+4]
epoch 4: 100%|██████████| 956/956 [00:10<00:00, 94.56it/s, loss=7.47e+4] 
valid: 100%|██████████| 410/410 [00:02<00:00, 148.81it/s, loss=5.55e+4]
epoch 5: 100%|██████████| 956/956 [00:12<00:00, 77.18it/s, loss=7.3e+4]  
valid: 100%|██████████| 410/410 [00:03<00:00, 134.80it/s, loss=5.44e+4]
epoch 6: 100%|██████████| 956/956 [00:14<00:00, 66.09it/s, loss=7.11e+4]
valid: 100%|██████████| 410/410 [00:02<00:00, 141.31it/s, loss=5.38e+4]
epoch 7: 100%|██████████| 956/956 [00:11<00:00, 82.86it/s, loss=6.98e+4]
valid: 100%|██████████| 410/410 [00:02<00:00, 139.70
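
The batch counts shown in the progress bars follow from the row counts quoted in the cell comments. A minimal sketch of that arithmetic, assuming the final partial batch is kept (so the count is rounded up) and that the 30% validation share is exact, is given below. The helper name `n_batches` is hypothetical.

```python
import math

BATCH_SIZE = 64

def n_batches(n_rows, batch_size=BATCH_SIZE):
    """Number of batches per pass when the last, partial batch is kept."""
    return math.ceil(n_rows / batch_size)

n_train_total = 87370                # rows in train_df, per the cell comments
n_val = n_train_total * 3 // 10      # 30% held out for validation
n_fit = n_train_total - n_val        # remaining 70% used for fitting

print(n_fit, n_val)                          # 61159 26211
print(n_batches(n_fit), n_batches(n_val))    # 956 410
print(n_batches(72183))                      # 1128 (rows in test_df, per the prediction cell)
```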

>Report the test RMSE and the test R2 value that you obtained.

In [6]:
# TODO: Enter your code here

set_seed()

# make predictions on the test set
preds = trainer.predict(X_tab=X_tab_test).ravel()

# test_df shape: (72183, 14)
# 1128 prediction batches, since ceil(72183 / 64) = 1128

# Calculate the Root Mean Squared Error (RMSE) between the predicted and actual target values
rmse = np.sqrt(mean_squared_error(target_test, preds))

# Calculate the R-squared (R²) score, which measures the goodness of fit of the model
r2 = r2_score(target_test, preds)

# Print the RMSE and R² scores to evaluate the model's performance on the test dataset
print()
print('-------------------------------')
print('Model Performance on Test Set:')
print(f"Test RMSE: {round(rmse,4)} (4 d.p.)")
print(f"Test R²  : {round(r2,4)}      (4 d.p.)")
print('-------------------------------')

predict: 100%|██████████| 1128/1128 [00:02<00:00, 409.97it/s]


-------------------------------
Model Performance on Test Set:
Test RMSE: 118457.3092 (4 d.p.)
Test R²  : 0.5097      (4 d.p.)
-------------------------------
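
For reference, the two reported metrics can be written out directly. The sketch below computes RMSE and R² in plain Python using the standard definitions (R² = 1 − SS_res / SS_tot). The toy values are hypothetical and unrelated to the HDB results above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(round(rmse(y_true, y_pred), 4))  # 0.7071
print(round(r2(y_true, y_pred), 4))    # 0.9
```

RMSE is in the units of the target (here, resale price), while R² is unitless and compares the model's squared error with that of always predicting the mean of the true values.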



