# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which the categorical features get learnable embeddings. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network.

In [1]:
!pip install pytorch-widedeep

Collecting pytorch-widedeep
  Downloading pytorch_widedeep-1.3.2-py3-none-any.whl (21.8 MB)
Collecting einops (from pytorch-widedeep)
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
Collecting torchmetrics (from pytorch-widedeep)
  Downloading torchmetrics-1.2.0-py3-none-any.whl (805 kB)
Collecting fastparquet>=0.8.1 (from pytorch-widedeep)
  Downloading fastparquet-2023.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting cramjam>=2.3 (from fastparquet>=0.8.1->pytorch-wid

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score



>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/hdb_price_prediction.csv')

# Split by year: 2020 and earlier for training, 2021 onward for testing
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]
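
The year-based split can be sanity-checked on a tiny frame before running it on the full file. The sketch below uses made-up rows (the `toy` frame and `CUTOFF_YEAR` are illustrative names, not part of the assignment) and checks that the two subsets are disjoint and together cover every row.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from hdb_price_prediction.csv
toy = pd.DataFrame(
    {
        "year": [2018, 2020, 2021, 2023],
        "resale_price": [310000.0, 420000.0, 455000.0, 530000.0],
    }
)

CUTOFF_YEAR = 2020  # last year that belongs to the training set

train_toy = toy[toy["year"] <= CUTOFF_YEAR]
test_toy = toy[toy["year"] > CUTOFF_YEAR]

# Every row should land in exactly one of the two subsets
n_overlap = len(train_toy.index.intersection(test_toy.index))
print(len(train_toy), len(test_toy), n_overlap)
```

Because the test rows all come from later years than the training rows, this is a temporal hold-out rather than a random split, so a gap between training and test scores may reflect price drift over time as well as overfitting.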

>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [5]:
import torch
from pytorch_widedeep.initializers import Normal
from pytorch_widedeep.callbacks import EarlyStopping

target = ["resale_price"]

# Lists of continuous and categorical variables
continuous_var = ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"]
categorical_var = ["month", "town", "flat_model_type", "storey_range"]

# (column, embedding_dim) pairs: each embedding size is set to the number of
# distinct values the column takes in the training data
categorical_tup = []
for variable in categorical_var:
    categorical_tup.append((variable, int(train_data[variable].unique().shape[0])))

preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_tup,
    continuous_cols=continuous_var,
    cols_to_scale=continuous_var
)

X_tab = preprocessor.fit_transform(train_data)

target = train_data[target].values

col_names = list(train_data.columns)
col_idx = {i:j for j,i in enumerate(col_names)}
print(col_idx)

# Two linear layers in the MLP, with 200 and 100 neurons respectively
hidden_dimensions = [200, 100]

deep_tabular_model = TabMlp(
    column_idx=preprocessor.column_idx,
    cat_embed_input=preprocessor.cat_embed_input,
    continuous_cols=continuous_var,
    mlp_hidden_dims=hidden_dimensions
)

print(deep_tabular_model)

{'month': 0, 'year': 1, 'town': 2, 'full_address': 3, 'nearest_stn': 4, 'dist_to_nearest_stn': 5, 'dist_to_dhoby': 6, 'degree_centrality': 7, 'eigenvector_centrality': 8, 'flat_model_type': 9, 'remaining_lease_years': 10, 'floor_area_sqm': 11, 'storey_range': 12, 'resale_price': 13}
TabMlp(
  (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(
    (cat_embed): DiffSizeCatEmbeddings(
      (embed_layers): ModuleDict(
        (emb_layer_month): Embedding(13, 12, padding_idx=0)
        (emb_layer_town): Embedding(27, 26, padding_idx=0)
        (emb_layer_flat_model_type): Embedding(44, 43, padding_idx=0)
        (emb_layer_storey_range): Embedding(18, 17, padding_idx=0)
      )
      (embedding_dropout): Dropout(p=0.1, inplace=False)
    )
    (cont_norm): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (encoder): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=10
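
The embedding shapes in the printout above follow from how `categorical_tup` was built: each column's embedding dimension equals its number of distinct training values, and each table has one more row than that. The sketch below reproduces this arithmetic in plain Python. The helper names and toy values are hypothetical, and the "+1 row" rule is an assumption read off the printed `padding_idx=0`, not taken from the library's documentation.

```python
# Toy stand-ins for two categorical columns (illustrative values only)
toy_columns = {
    "month": list(range(1, 13)),
    "storey_range": ["01 TO 03", "04 TO 06", "07 TO 09", "04 TO 06"],
}


def embed_tuples(columns):
    """Build (column, embedding_dim) pairs with dim = number of distinct values."""
    return [(name, len(set(values))) for name, values in columns.items()]


def expected_table_shape(n_unique, embed_dim):
    """Assumed embedding-table shape: one extra row, index 0, kept for padding."""
    return (n_unique + 1, embed_dim)


for name, dim in embed_tuples(toy_columns):
    print(name, expected_table_shape(dim, dim))
```

With 12 distinct months this gives a 13 x 12 table, matching `Embedding(13, 12, padding_idx=0)` in the printed model. Setting the dimension equal to the cardinality is the choice made in this notebook; smaller embedding sizes are also common for high-cardinality columns.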



In [6]:
# Adam optimizer for the deeptabular component
deep_optimizer = torch.optim.Adam(deep_tabular_model.parameters(), lr=0.01)
optimizer = {"deeptabular": deep_optimizer}

# Early-stopping callback. Note: it monitors `val_loss` by default, and no
# validation data is passed to `fit` below, so it may have no effect here.
callbacks = [EarlyStopping]

# R2 is reported alongside the loss during training
metric = [R2Score]

# Initialise the deeptabular component from a Normal distribution
initializer = {"deeptabular": Normal}


model = WideDeep(wide=None,
                deeptabular=deep_tabular_model,
                deeptext=None,
                deepimage=None,
                deephead=None,
                head_hidden_dims=None,
                head_activation='relu',
                head_dropout=0.1,
                head_batchnorm=False,
                head_batchnorm_last=False,
                head_linear_first=True,
                enforce_positive=False,
                enforce_positive_activation='softplus',
                pred_dim=1,
                with_fds=False)

trainer = Trainer(model=model,
                  objective="root_mean_squared_error",
                  optimizers=optimizer,
                  lr_schedulers=None,
                  initializers=initializer,
                  callbacks=callbacks,
                  metrics=metric,
                  verbose=1,
                  num_workers=0)

trainer.fit(X_tab=X_tab,
            target=target,
            n_epochs=100,
            batch_size=64)

preds = trainer.predict(X_tab=X_tab)

epoch 1: 100%|██████████| 1366/1366 [00:12<00:00, 108.80it/s, loss=9.34e+4, metrics={'r2': 0.3578}]
epoch 2: 100%|██████████| 1366/1366 [00:11<00:00, 117.13it/s, loss=6.08e+4, metrics={'r2': 0.8356}]
epoch 3: 100%|██████████| 1366/1366 [00:15<00:00, 88.35it/s, loss=5.83e+4, metrics={'r2': 0.8477}]
epoch 4: 100%|██████████| 1366/1366 [00:13<00:00, 104.90it/s, loss=5.73e+4, metrics={'r2': 0.8513}]
epoch 5: 100%|██████████| 1366/1366 [00:11<00:00, 116.33it/s, loss=5.63e+4, metrics={'r2': 0.8579}]
epoch 6: 100%|██████████| 1366/1366 [00:12<00:00, 111.23it/s, loss=5.49e+4, metrics={'r2': 0.8653}]
epoch 7: 100%|██████████| 1366/1366 [00:12<00:00, 110.70it/s, loss=5.42e+4, metrics={'r2': 0.8684}]
epoch 8: 100%|██████████| 1366/1366 [00:12<00:00, 110.12it/s, loss=5.42e+4, metrics={'r2': 0.8676}]
epoch 9: 100%|██████████| 1366/1366 [00:11<00:00, 115.83it/s, loss=5.28e+4, metrics={'r2': 0.8746}]
epoch 10: 100%|██████████| 1366/1366 [00:11<00:00, 115.02it/s, loss=5.16e+4, metrics={'r2': 0.8806}]


>Report the test RMSE and the test R2 value that you obtained.

In [8]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Preprocess the test data
X_test = preprocessor.transform(test_data)

# Extract the true target values from the test data
target_test = test_data['resale_price'].values

# Make predictions on the test data
test_preds = trainer.predict(X_tab=X_test)

# Calculate RMSE and R2 for the test predictions
test_rmse = np.sqrt(mean_squared_error(target_test, test_preds))
test_r2 = r2_score(target_test, test_preds)

print("Test RMSE:", test_rmse)
print("Test R2:", test_r2)

predict: 100%|██████████| 1128/1128 [00:03<00:00, 310.49it/s]


Test RMSE: 100919.06071344888
Test R2: 0.6441680590167506
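
As a cross-check on the reported numbers, both metrics can be computed by hand from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. The sketch below uses made-up prices, and the helper names `rmse` and `r2` are illustrative. Both inputs are flattened first, since subtracting a `(N, 1)` array from a `(N,)` array would otherwise broadcast to an `N x N` matrix.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, with both inputs flattened to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values for illustration; the predictions are deliberately shaped (N, 1)
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_pred = np.array([[310_000.0], [440_000.0], [500_000.0], [640_000.0]])

print("RMSE:", rmse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```

Applied to `target_test` and `test_preds` from the cell above, these helpers should agree with the scikit-learn values up to floating-point error.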
