## Question 3: Include numerical/categorical data with a language model

Hello Hugging Face! I have been using transformers to gauge whether product reviews are
good or bad, and so far I have gotten decent results. I also have information about how many
products the user has purchased in the past, how many times users have viewed that
review, and a few other variables. Can we combine the textual data with the numerical and
categorical data? Can you provide an example of how this could be done? I have a feeling that
this might produce the best results.

In [None]:
## Download the packages
! pip install transformers datasets torch numpy

In [4]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch

In [5]:
# Example data
data = {
    'reviews': ["This product is great!", "Not satisfied with this product."],
    'number_of_products': [3, 1],
    'number_of_times': [5, 2],
    'type_of_product': ["Jeans", "Shirts"]
}

dataset = Dataset.from_dict(data)

## Solutions

### Approach 1: Concatenating numerical and categorical data as string with textual data

In [8]:
# Combine numerical and categorical data as text
def combine_data(row):
    return {'concat_review': (
        f"This user has purchased {row['number_of_products']} products. "
        f"This review was viewed {row['number_of_times']} times. "
        f"The product that has been reviewed belongs to the category {row['type_of_product']}. "
        f"Review of the product: {row['reviews']}"
    )}

In [9]:
# Map the combine_data function to the dataset
dataset = dataset.map(combine_data, remove_columns=['reviews', 'number_of_products', 'number_of_times', 'type_of_product'])
print(dataset['concat_review'][0])

Map: 100%|██████████| 2/2 [00:00<00:00, 174.38 examples/s]

This user has purchased 3 products. This review was viewed 5 times. The product that has been reviewed belongs to the category Jeans. Review of the product: This product is great!
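For larger datasets, the same template can be applied to whole batches at once. The following is a minimal sketch, assuming the column names from the example data; `combine_data_batched` is a hypothetical helper name, not part of the original notebook. It takes a dict of column lists, which is the shape `Dataset.map` passes to the function when `batched=True`.

```python
# Batched variant of combine_data: receives a dict of column lists
# and returns a dict holding one new column of combined strings.
# Column names are assumed to match the example data above.
def combine_data_batched(batch):
    combined = [
        f"This user has purchased {n_products} products. "
        f"This review was viewed {n_views} times. "
        f"The product that has been reviewed belongs to the category {category}. "
        f"Review of the product: {review}"
        for n_products, n_views, category, review in zip(
            batch["number_of_products"],
            batch["number_of_times"],
            batch["type_of_product"],
            batch["reviews"],
        )
    ]
    return {"concat_review": combined}


# Plain-dict usage, so the helper can be tried without loading a Dataset
example_batch = {
    "reviews": ["This product is great!", "Not satisfied with this product."],
    "number_of_products": [3, 1],
    "number_of_times": [5, 2],
    "type_of_product": ["Jeans", "Shirts"],
}
result = combine_data_batched(example_batch)
print(result["concat_review"][1])
```

With a `Dataset`, the helper would be passed as `dataset.map(combine_data_batched, batched=True, remove_columns=[...])`, listing the original columns to drop.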





### Approach 2: Concatenating numerical and categorical features with textual features

In [12]:
# Example data
data = {
    'reviews': ["This product is great!", "Not satisfied with this product."],
    'number_of_products': [3, 1],
    'number_of_times': [5, 2],
    'type_of_product': ["Jeans", "Shirts"]
}

dataset = Dataset.from_dict(data)

In [13]:
model = AutoModel.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [14]:
# Map each distinct product type to an integer index
product_types = {product: idx for idx, product in enumerate(sorted(set(dataset['type_of_product'])))}
num_classes = len(product_types)  # number of distinct product types

In [15]:
def tokenization(row):
    # truncation=True is required for max_length to take effect
    return tokenizer(row["reviews"], padding=True, truncation=True, max_length=10)

def extract_numerical_features(row):
    return {'num_feature': [row['number_of_products'], row['number_of_times']]}


def extract_categorical_features(row):
    cat_features = np.zeros(num_classes, dtype=int)  # One-hot vector, initialized with zeros
    cat_features[product_types[row['type_of_product']]] = 1
    return {'cat_feature': cat_features}


In [16]:
new_dataset = dataset.map(extract_numerical_features, remove_columns=['number_of_products', 'number_of_times'])
new_dataset = new_dataset.map(extract_categorical_features, remove_columns=['type_of_product'])
new_dataset = new_dataset.map(tokenization, remove_columns=['reviews'], batched=True)



Map: 100%|██████████| 2/2 [00:00<00:00, 231.40 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 182.29 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 32.17 examples/s]
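The numerical features above are raw counts, whereas transformer embeddings tend to have much smaller magnitudes, so it may help to standardize the counts before concatenating them. The following is a minimal sketch of column-wise standardization, assuming the two numerical columns from the example data; `eps` is a hypothetical small constant added for numerical safety.

```python
import torch

# Raw counts from the example data: [number_of_products, number_of_times]
num_features = torch.tensor([[3.0, 5.0], [1.0, 2.0]])

# Column-wise mean and (population) standard deviation. In practice these
# would be computed on the training split only and reused at inference time.
mean = num_features.mean(dim=0, keepdim=True)
std = num_features.std(dim=0, unbiased=False, keepdim=True)

# eps (hypothetical small constant) guards against division by zero
eps = 1e-8
scaled = (num_features - mean) / (std + eps)
print(scaled)
```

After scaling, each column has zero mean and unit variance, which keeps the extra features in a range comparable to the text embedding.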


In [17]:
class ReviewClassifierModel(torch.nn.Module):
    def __init__(self, model, cat_num_dims, num_classes):
        """
        Initialize the ReviewClassifierModel.

        Args:
            model (torch.nn.Module): A pre-trained transformer encoder, e.g. loaded with AutoModel.
            cat_num_dims (int): Combined number of numerical and categorical feature dimensions.
            num_classes (int): The number of output classes.

        """
        super().__init__()

        # Store the pre-trained transformer encoder
        self.model = model

        # Get the text embedding dimension
        text_dim = self.model.config.hidden_size

        # Create a linear classifier layer
        self.classifier = torch.nn.Linear(text_dim + cat_num_dims, num_classes)

    def forward(self, cat_num_vector, input_ids, attention_mask=None):
        """
        Forward pass of the model.

        Args:
            cat_num_vector (torch.Tensor): Additional numerical/categorical data.
            input_ids (torch.Tensor): Input token IDs.
            attention_mask (torch.Tensor, optional): Attention mask.

        Returns:
            torch.Tensor: Model logits.

        """
        # Pass input through the transformer to get contextual token embeddings
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

        # Extract the [CLS] token embeddings
        cls_embeds = outputs.last_hidden_state[:, 0, :]
        
        # Concatenate transformer output with categorical and numerical features
        # (cast the integer features to the embedding dtype first)
        concat = torch.cat((cls_embeds, cat_num_vector.to(cls_embeds.dtype)), dim=-1)

        # Pass through the classifier
        output = self.classifier(concat)

        return output
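
The concatenation step in `forward` can be checked in isolation without downloading a model. The following is a minimal sketch, assuming a hidden size of 768 (the value for `distilbert-base-uncased`) and using random values as a stand-in for the real [CLS] embeddings; it only verifies shapes.

```python
import torch

batch_size, text_dim = 2, 768  # 768 is the hidden size of distilbert-base-uncased

# Stand-in for the [CLS] embeddings (random values, for shape checking only)
cls_embeds = torch.randn(batch_size, text_dim)

# Two numerical features followed by a two-way one-hot vector per review
cat_num_vector = torch.tensor([[3, 5, 1, 0],
                               [1, 2, 0, 1]])

# Cast the integer features to the embedding dtype before concatenating
concat = torch.cat((cls_embeds, cat_num_vector.to(cls_embeds.dtype)), dim=-1)

# The linear head must accept text_dim + 4 input features
classifier = torch.nn.Linear(text_dim + cat_num_vector.shape[-1], 2)
logits = classifier(concat)

print(concat.shape)  # torch.Size([2, 772])
print(logits.shape)  # torch.Size([2, 2])
```

This is why the classifier layer is built with `text_dim + cat_num_dims` inputs: the extra features simply extend the text embedding along the last dimension.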

In [18]:
new_dataset = new_dataset.with_format("torch")
# cat_num_dims = 2 numerical features + 2 one-hot product-type dimensions
review_classifier = ReviewClassifierModel(model, cat_num_dims=2+2, num_classes=2)

In [19]:
cat_num_feature = torch.concat((new_dataset['num_feature'], new_dataset['cat_feature']), dim=-1)
review_classifier(cat_num_feature, new_dataset['input_ids'], new_dataset['attention_mask'])


torch.Size([2, 768])
torch.Size([2, 4])


tensor([[ 0.2992, -0.0446],
        [ 0.2632,  0.0151]], grad_fn=<AddmmBackward0>)
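
The output above contains raw logits, one row per review. The following is a minimal sketch of how they could be turned into probabilities and a training loss; the `labels` tensor is hypothetical (the example data has no labels), with 1 standing for a good review and 0 for a bad one.

```python
import torch

# Logits copied from the output above (one row per review)
logits = torch.tensor([[0.2992, -0.0446],
                       [0.2632, 0.0151]])

# Hypothetical labels, not part of the original data: 1 = good, 0 = bad
labels = torch.tensor([1, 0])

# Softmax turns each row of logits into class probabilities
probs = torch.softmax(logits, dim=-1)

# cross_entropy expects raw logits and integer class labels
loss = torch.nn.functional.cross_entropy(logits, labels)

print(probs)
print(loss.item())
```

During training, the logits would come from `review_classifier` directly, and `loss.backward()` followed by an optimizer step would update the classifier layer (and the transformer, unless its parameters are frozen). Since the classifier layer is randomly initialized, the predictions are not meaningful before training.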