# Embedding for Sentiment Analysis 

Now that we know how word embedding works, we'll apply it to a supervised problem of sentiment analysis. The idea is to classify the comments left by users according to the number of stars they gave the Disneyland resort park in their reviews.

## Data Preprocessing

### Import Data 

1. Import the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import tiktoken
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split
from torchinfo import summary

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

Using mps device


2. Copy the link below and read the file it contains with `pandas`.

* https://go.aws/314bBDq

In [None]:
# Import dataset with Pandas 
all_comments = pd.read_csv("https://go.aws/314bBDq", encoding="utf-8")
all_comments

Unnamed: 0,user_id,review,stars,date_format,time_of_day,hour_of_day,day_of_week,review_format,review_lang,month_year,review_len,review_nb_words
0,efb62a167fee5cf3678b24427de8e31f,"Génial, fabuleux, exceptionnel ! J'aimerais qu...",5,2017-09-29 18:17:00,18:17,18,Ven,génial fabuleux exceptionnel j aimerais qu...,french,2017-09,115,19
1,e3be4f9c9e0b9572bfb2a5f88497bb14,,2,2017-09-29 17:29:00,17:29,17,Ven,,,2017-09,0,0
2,1b8e5760162d867e9b9ca80f645bdc60,"Toujours aussi magic, féerique !",5,2017-09-29 16:46:00,16:46,16,Ven,toujours aussi magic féerique,french,2017-09,32,4
3,fa330e5891a1bb486c3e9bf95c098726,,5,2017-09-29 15:52:00,15:52,15,Ven,,,2017-09,0,0
4,c1a693206aee1a2412d4bd9e45b80ec5,,3,2017-09-29 15:29:00,15:29,15,Ven,,,2017-09,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
299630,299be03d0583edfb9625a7947fbc631a,,5,2012-11-11 11:46:00,11:46,11,Dim,,,2012-11,0,0
299631,39b4e66e3b78d4f8ce60a6b4801b862d,,5,2012-11-11 11:46:00,11:46,11,Dim,,,2012-11,0,0
299632,924eb5ec58470cd00c16060e6ee3c316,,5,2012-11-11 11:46:00,11:46,11,Dim,,,2012-11,0,0
299633,5b484e48319355c12a941577d74a5839,,5,2012-11-11 11:45:00,11:45,11,Dim,,,2012-11,0,0


3. We will need the reviews in French. Filter the reviews so that they are in the right language. For this you need to find a column that gives you that information.

4. Keep only the `review` & `stars` columns.

In [12]:
# Keep only the French reviews
# and the two columns we're interested in
dataset = all_comments.loc[all_comments["review_lang"]=="french", ["review", "stars"]]
dataset

Unnamed: 0,review,stars
0,"Génial, fabuleux, exceptionnel ! J'aimerais qu...",5
2,"Toujours aussi magic, féerique !",5
11,En vacances en région parisienne nous nous som...,2
12,Tropbeaufinalpleinlesyeuxoreil,5
23,L'univers Disney reste merveilleux. Toutefois ...,4
...,...,...
295057,Toujours aussi magique même si à la fin du séj...,5
295549,Séjour au top!!! Mes enfants les plus heureux ...,5
298475,"Magnifique un monde parfait <span class=""""""""_4...",5
298832,Oui j'ai aimé car j'adore disney et tout ce q...,4


### Preprocessing

We will now go through a preprocessing phase. The goal is to convert the character strings into sequences of tokens represented by integers.

1. Use the tiktoken library in order to tokenize each sentence based on the `cl100k_base` tokenizer.

In [13]:
tokenizer = tiktoken.get_encoding("cl100k_base")

dataset_tokenized = [tokenizer.encode(text) for text in dataset["review"]]

# print the first ten tokens of the first tokenized sentence
dataset_tokenized[0][:10]

[38, 10610, 532, 11, 9765, 1130, 2249, 11, 4788, 8301]

2. In order to build the data loader, we need all sequences to be of the same length. Calculate the max and average sequence length, and decide which length you want all sequences to adopt.

In [14]:
len(dataset_tokenized)

8474

In [None]:
# How are sequence lengths distributed?
seq_lens = [len(seq) for seq in dataset_tokenized]
print("avg seq len",np.mean(seq_lens))
print("max seq len",np.max(seq_lens))
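One way to choose the common length is to check how many reviews a given cut-off would cover without truncation, rather than looking at the maximum alone. The sketch below is plain Python; the helper name `coverage_length` and the toy lengths are assumptions made for illustration, not values from the dataset.

```python
import math

def coverage_length(seq_lens, coverage=0.95):
    """Smallest length that fits `coverage` of the sequences without truncation."""
    ordered = sorted(seq_lens)
    # Nearest-rank quantile: position of the sequence sitting at the requested share
    rank = math.ceil(coverage * len(ordered)) - 1
    return ordered[rank]

# Toy lengths (hypothetical): most sequences are short, a few are very long
toy_lens = [5, 8, 12, 20, 31, 40, 55, 90, 150, 400]

print(coverage_length(toy_lens, 0.8))  # 90
print(coverage_length(toy_lens, 1.0))  # 400, i.e. the maximum
```

With a long-tailed distribution, padding everything to the maximum wastes most positions on padding, which is why a cut-off well below the maximum is often a reasonable choice.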

In [17]:
def pad_sequences(sequences, max_length=100):
    return [seq[:max_length] + [0] * (max_length - len(seq)) for seq in sequences]

dataset_tokenized = pad_sequences(dataset_tokenized)
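Note that id 0 is used here as the padding value, while in `cl100k_base` it is also a regular token: the decoded texts in the error analysis further down show the padded positions rendered as `!`. A possible way to keep track of which positions are real is to return a mask alongside the padded sequences. This is only a sketch; `pad_with_mask` is a hypothetical helper, not part of the notebook.

```python
def pad_with_mask(sequences, max_length=100, pad_id=0):
    """Truncate or pad each sequence to `max_length` and return a 0/1 mask of real tokens."""
    padded, masks = [], []
    for seq in sequences:
        kept = seq[:max_length]          # truncate long sequences
        n_pad = max_length - len(kept)   # number of padding positions to add
        padded.append(kept + [pad_id] * n_pad)
        masks.append([1] * len(kept) + [0] * n_pad)
    return padded, masks

padded, masks = pad_with_mask([[5, 6, 7], [1, 2, 3, 4, 5, 6]], max_length=4)
print(padded)  # [[5, 6, 7, 0], [1, 2, 3, 4]]
print(masks)   # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The mask can later be used to ignore padded positions when pooling or when decoding token ids back to text.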

3. Form a torch dataset object based on the token sequences and labels, and split the data into a train and validation set.

In [18]:
# Define a custom PyTorch dataset class for Disney reviews
class DisneyDataset(Dataset):
    """
    A custom dataset class for Disney reviews.

    This class is used to convert text data (already tokenized) and their corresponding labels
    into a PyTorch Dataset object, which can be easily loaded into a DataLoader.
    """

    def __init__(self, texts, labels):
        """
        Initializes the dataset by storing texts and labels as PyTorch tensors.

        Args:
        - texts (list or numpy array): Tokenized text data, where each text has been converted 
                                       into a sequence of word indices (integer tokens).
        - labels (list or numpy array): The corresponding labels for each text (e.g., sentiment scores or star ratings).
        """
        # Convert text sequences to a PyTorch tensor (long type since they are indices)
        self.texts = torch.tensor(texts, dtype=torch.long)

        # Convert labels to a PyTorch tensor (float32 for compatibility with loss functions)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        """
        Returns the total number of samples in the dataset.

        This method is required for PyTorch datasets as it allows DataLoader to determine
        how many batches it needs.
        """
        return len(self.texts)

    def __getitem__(self, idx):
        """
        Retrieves a single data point (text and label) from the dataset based on an index.

        Args:
        - idx (int): Index of the sample to retrieve.

        Returns:
        - tuple: A tuple containing:
            - self.texts[idx]: The tokenized text at index `idx`.
            - self.labels[idx]: The corresponding label for that text.
        """
        return self.texts[idx], self.labels[idx]
    
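# Convert the labels to a NumPy array so that tensor creation
# does not depend on the filtered DataFrame index
labels = dataset["stars"].to_numpy()

# Example usage: Creating a dataset instance
disney_dataset = DisneyDataset(dataset_tokenized, labels)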

# Split dataset into training (80%) and validation (20%)
train_size = int(0.8 * len(disney_dataset))
val_size = len(disney_dataset) - train_size
train_dataset, val_dataset = random_split(disney_dataset, [train_size, val_size])

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

In [19]:
text, label = next(iter(train_loader))

print(label)
print(text)

tensor([5., 5., 5., 2., 4., 5., 4., 4., 4., 4., 5., 5., 5., 5., 4., 5., 5., 4.,
        5., 1., 4., 4., 3., 4., 4., 3., 5., 5., 4., 5., 5., 4.])
tensor([[  5479,   1880,   4983,  ...,      0,      0,      0],
        [   806,     14,    806,  ...,      0,      0,      0],
        [ 66932,  57038,  81621,  ...,      0,      0,      0],
        ...,
        [ 30854,   3355,    729,  ..., 100164,    758,  27530],
        [    19,  49301,   1522,  ...,      0,      0,      0],
        [ 30854,  36731,  12584,  ...,    324,    662,  34447]])


## Build the embedding based prediction model

Now that the data is duly tokenized, let's create a prediction model based on the embedding layer.

1. The first question you need to ask yourself is what kind of prediction problem are we dealing with? The target variable represents the number of stars associated with each comment.

Treating this as a regression problem seems relevant for two reasons:
- The target variable is ordinal, so star values can be compared and ordered.
- The model learns to place tokens on a single quantitative dimension (as opposed to 5 dimensions in the case of classification), so observations associated with each number of stars benefit the training for all star values.
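The first reason can be illustrated with a small worked example: a squared-error loss penalizes a prediction according to its distance from the true rating, whereas a 0/1 classification error treats every wrong answer the same. The two helper functions below are hypothetical and only serve the illustration.

```python
def squared_error(pred, true):
    """Regression-style penalty: grows with the distance to the true rating."""
    return (pred - true) ** 2

def zero_one_error(pred, true):
    """Classification-style penalty: any wrong class costs the same."""
    return 0 if pred == true else 1

true_stars = 5
for pred in (4, 1):
    print(pred, squared_error(pred, true_stars), zero_one_error(pred, true_stars))
# 4 1 1
# 1 16 1
```

Predicting 1 star for a 5-star review costs 16 times more than predicting 4 stars under the squared error, while both mistakes count equally as misclassifications.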

2. Build a prediction model based on your choice

In [20]:
# The vocabulary size is read from the tokenizer (`tokenizer.n_vocab`) after the class definition.
# It is the number of tokens known to the tokenizer, not the number of unique words in the dataset,
# and it sets the number of rows of the embedding layer.

# Define a neural network model for text regression
class TextRegressor(nn.Module):
    """
    A simple text regression model using embeddings and pooling.

    This model takes tokenized text as input and predicts a continuous value (e.g., sentiment score or rating).
    """

    def __init__(self, vocab_size, embed_dim):
        """
        Initializes the model layers.

        Args:
        - vocab_size (int): The number of unique words in the vocabulary.
        - embed_dim (int): The size of each word's embedding vector.

        The model consists of:
        1. An Embedding layer that converts tokenized words into dense vectors.
        2. A Pooling layer that reduces the sequence length by averaging word embeddings.
        3. A Fully Connected (Linear) layer that maps the pooled embeddings to the output value.
        """
        super(TextRegressor, self).__init__()

        # Embedding layer: Maps token indices to dense vector representations
        # padding_idx=0 keeps the vector for index 0 at zero and excludes it from gradient updates
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        # Adaptive Average Pooling: Computes the average of the word embeddings along the sequence length
        # This helps reduce variable-length text into a fixed-size representation
        self.pooling = nn.AdaptiveAvgPool1d(1)

        # Fully Connected (Linear) layer: Maps the fixed-size vector to a single output value
        self.fc = nn.Linear(embed_dim, 1)

    def forward(self, text):
        """
        Defines the forward pass of the model.

        Args:
        - text (Tensor): A batch of tokenized text (word indices).

        Returns:
        - Tensor: The predicted output (e.g., a continuous score or rating).
        """
        # Convert input word indices into dense embeddings
        embedded = self.embedding(text)

        # Permute to match the expected shape for pooling: (batch, channels, sequence_length)
        # Then, apply average pooling to reduce sequence length to 1
        pooled = self.pooling(embedded.permute(0, 2, 1)).squeeze(2)

        # Pass the pooled embeddings through the linear layer to produce the prediction
        return self.fc(pooled)
    
vocab_size = tokenizer.n_vocab

# Create an instance of the model
model = TextRegressor(vocab_size=vocab_size, embed_dim=16)
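One consequence of averaging over the full padded length is worth spelling out: padded positions contribute zero vectors but still count in the denominator, so short reviews end up with pooled vectors of smaller magnitude. The plain-Python sketch below compares a plain mean with a masked mean on a toy sequence; both helpers and the toy values are assumptions made for illustration, not part of the model above.

```python
def mean_pool(vectors):
    """Average over every position, padding included."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def masked_mean_pool(vectors, mask):
    """Average over real tokens only (mask value 1), ignoring padded positions."""
    dim = len(vectors[0])
    n_real = max(sum(mask), 1)  # guard against an all-padding sequence
    return [sum(v[d] * m for v, m in zip(vectors, mask)) / n_real for d in range(dim)]

# Toy sequence: 2 real tokens followed by 2 padded positions (zero vectors)
embedded = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

print(mean_pool(embedded))               # [1.0, 1.5]
print(masked_mean_pool(embedded, mask))  # [2.0, 3.0]
```

Whether a masked mean actually improves this model is an open question to test; the sketch only shows how the two averages differ.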

3. Print out the structure of the model

In [22]:
print(model)

# Print model summary, using the batch of token ids drawn earlier as example input
summary(model, input_data=text)  # input shape: (batch_size, sequence_length)

TextRegressor(
  (embedding): Embedding(100277, 16, padding_idx=0)
  (pooling): AdaptiveAvgPool1d(output_size=1)
  (fc): Linear(in_features=16, out_features=1, bias=True)
)


Layer (type:depth-idx)                   Output Shape              Param #
TextRegressor                            [32, 1]                   --
├─Embedding: 1-1                         [32, 100, 16]             1,604,432
├─AdaptiveAvgPool1d: 1-2                 [32, 16, 1]               --
├─Linear: 1-3                            [32, 1]                   17
Total params: 1,604,449
Trainable params: 1,604,449
Non-trainable params: 0
Total mult-adds (M): 51.34
Input size (MB): 0.03
Forward/backward pass size (MB): 0.41
Params size (MB): 6.42
Estimated Total Size (MB): 6.85
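The parameter counts in the summary can be checked by hand: the embedding layer holds one vector of size `embed_dim` per token id, and the linear layer holds `embed_dim` weights plus one bias. A short arithmetic check, assuming 4 bytes per float32 parameter:

```python
vocab_size, embed_dim = 100277, 16

embedding_params = vocab_size * embed_dim  # one 16-dimensional vector per token id
linear_params = embed_dim * 1 + 1          # 16 weights + 1 bias
total_params = embedding_params + linear_params

size_mb = total_params * 4 / 1e6           # float32 parameters, 4 bytes each

print(embedding_params)   # 1604432
print(total_params)       # 1604449
print(round(size_mb, 2))  # 6.42
```

Almost all parameters sit in the embedding table, which is sized by the tokenizer vocabulary rather than by the tokens that actually occur in the reviews.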

4. Prepare and run the training loop for 50 epochs.

In [None]:
# Define the loss function
# This function measures how well the model's predictions match the actual values.
# Mean Squared Error (MSE) is commonly used for regression problems.
criterion = nn.MSELoss()

# Define the optimizer
# The optimizer updates the model's weights to minimize the loss function.
# Adam is an adaptive optimization algorithm that adjusts learning rates during training.
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train(model, train_loader, val_loader, criterion, optimizer, epochs=100):
    """
    Function to train a PyTorch model with training and validation datasets.
    
    Parameters:
    model: The neural network model to train.
    train_loader: DataLoader for the training dataset.
    val_loader: DataLoader for the validation dataset.
    criterion: Loss function (e.g., Mean Squared Error for regression).
    optimizer: Optimization algorithm (e.g., Adam, SGD).
    epochs: Number of training epochs (default=100).
    
    Returns:
    history: Dictionary containing loss and metric for both training and validation.
    """
    
    # Dictionary to store training & validation loss and metric (RMSE) over epochs
    history = {'train_loss': [], 'val_loss': [], 'train_metric': [], 'val_metric': []}

    for epoch in range(epochs):  # Loop over the number of epochs
        model.train()  # Set model to training mode
        total_loss, train_metric = 0, 0  # Initialize accumulated loss and metric
        
        # Training loop
        for inputs, labels in train_loader:
            optimizer.zero_grad()  # Reset gradients before each batch
            outputs = model(inputs).squeeze() # Forward pass
            loss = criterion(outputs, labels)  # Compute loss
            loss.backward()  # Backpropagation (compute gradients)
            optimizer.step()  # Update model parameters
            
            total_loss += loss.item()  # Accumulate batch loss
        
        # Compute average loss and metric (RMSE, the square root of the MSE loss) for training
        train_loss = total_loss / len(train_loader)
        train_metric = train_loss ** 0.5

        # Validation phase (without gradient computation)
        model.eval()  # Set model to evaluation mode
        val_loss, val_metric = 0, 0
        with torch.no_grad():  # No need to compute gradients during validation
            for inputs, labels in val_loader:
                outputs = model(inputs).squeeze()  # Forward pass
                loss = criterion(outputs, labels)  # Compute loss
                val_loss += loss.item()  # Accumulate validation loss
        
        # Compute average loss and metric (RMSE, the square root of the MSE loss) for validation
        val_loss = val_loss / len(val_loader)
        val_metric = val_loss ** 0.5
        
        # Store metrics in history dictionary
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['train_metric'].append(train_metric)
        history['val_metric'].append(val_metric)
        
        # Print training progress
        print(f"Epoch [{epoch+1}/{epochs}], Train Loss: {train_loss:.4f}, Train metric: {train_metric:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val metric: {val_metric:.4f}")
    
    return history  # Return training history

# Train the model using the training function with defined parameters
history = train(model,
                train_loader=train_loader,
                val_loader=val_loader,
                criterion=criterion,
                optimizer=optimizer,
                epochs=50)

Epoch [1/50], Train Loss: 10.7488, Train metric: 3.2785, Val Loss: 7.9421, Val metric: 2.8182
Epoch [2/50], Train Loss: 6.8957, Train metric: 2.6260, Val Loss: 5.9919, Val metric: 2.4478
Epoch [3/50], Train Loss: 5.8676, Train metric: 2.4223, Val Loss: 5.3846, Val metric: 2.3205
Epoch [4/50], Train Loss: 5.2974, Train metric: 2.3016, Val Loss: 4.8717, Val metric: 2.2072
Epoch [5/50], Train Loss: 4.7605, Train metric: 2.1819, Val Loss: 4.3760, Val metric: 2.0919
Epoch [6/50], Train Loss: 4.2482, Train metric: 2.0611, Val Loss: 3.9112, Val metric: 1.9777
Epoch [7/50], Train Loss: 3.7650, Train metric: 1.9404, Val Loss: 3.4780, Val metric: 1.8649
Epoch [8/50], Train Loss: 3.3210, Train metric: 1.8224, Val Loss: 3.0939, Val metric: 1.7590
Epoch [9/50], Train Loss: 2.9286, Train metric: 1.7113, Val Loss: 2.7653, Val metric: 1.6629
Epoch [10/50], Train Loss: 2.5929, Train metric: 1.6102, Val Loss: 2.4912, Val metric: 1.5783
Epoch [11/50], Train Loss: 2.3139, Train metric: 1.5211, Val Loss: 2
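The metric reported in the log is the square root of the loss averaged over batches. When the last batch is smaller than the others, a plain average of per-batch MSE values gives that batch slightly too much weight. The sketch below shows a size-weighted alternative on toy numbers; `epoch_rmse` and the values are hypothetical and not taken from the training run.

```python
import math

def epoch_rmse(batch_mse, batch_sizes):
    """RMSE over an epoch from per-batch MSE values, weighting each batch by its size."""
    total_sq_err = sum(m * n for m, n in zip(batch_mse, batch_sizes))
    return math.sqrt(total_sq_err / sum(batch_sizes))

# Toy epoch: two full batches and one small, noisy last batch
batch_mse = [1.0, 1.0, 4.0]
batch_sizes = [32, 32, 8]

unweighted = math.sqrt(sum(batch_mse) / len(batch_mse))
weighted = epoch_rmse(batch_mse, batch_sizes)

print(round(unweighted, 4))  # 1.4142
print(round(weighted, 4))    # 1.1547
```

With thousands of samples and a batch size of 32, the difference is usually small, so the simpler average remains a reasonable approximation.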

## Error analysis

Error analysis consists of focusing on the observations in the training and validation sets that the model predicted worst. This often reveals potential inconsistencies in the data and helps identify improvement opportunities for our model.

1. Create a function that, based on a data loader, the tokenizer, and the model, builds a DataFrame containing:
    - the predicted value
    - the true label of the observation
    - the tokenized input
    - the text input

   Apply this function to both the train loader and the val loader.

In [25]:
# Function to evaluate the model and get worst predictions
def evaluate_worst_predictions(model, dataloader, tokenizer, device="cpu"):
    model.eval()  # Set model to evaluation mode
    model.to(device)  # Move the model once, before iterating over the batches
    
    all_predictions = []
    all_labels = []
    all_errors = []
    all_inputs = []

    with torch.no_grad():  # No gradients needed during evaluation
        for batch in dataloader:
            inputs, labels = batch  # Assuming (inputs, labels) in DataLoader
            inputs, labels = inputs.to(device), labels.to(device)

            # Forward pass
            outputs = model(inputs)

            # Turn raw outputs into predictions and per-sample errors
            if outputs.shape[-1] > 1:  # Multi-class classification
                preds = torch.argmax(outputs, dim=1)
                errors = (preds != labels).float()  # Misclassified observations
            else:  # Regression
                preds = outputs.squeeze(-1)  # Drop only the output dimension, even for a batch of one
                errors = torch.abs(preds - labels)  # Absolute error

            # Save results
            all_predictions.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            all_errors.extend(errors.cpu().numpy())
            all_inputs.extend(inputs.cpu().numpy())

    # Convert to DataFrame for analysis
    df_results = pd.DataFrame({
        "True_Label": all_labels,
        "Predicted": all_predictions,
        "Error": all_errors,
        "Inputs": all_inputs,
        # Decode each row of token ids back to text (padding ids are decoded too)
        "Text": [tokenizer.decode(ids.tolist()) for ids in all_inputs]
    })

    # Sort by highest error (worst predictions)
    df_results_sorted = df_results.sort_values(by="Error", ascending=False)

    return df_results_sorted

# Example usage:
worst_predictions_val = evaluate_worst_predictions(model, val_loader, tokenizer, device=device)
worst_predictions_train = evaluate_worst_predictions(model, train_loader, tokenizer, device=device)


2. Display the first ten rows of each dataframe to get an idea of the worst predicted data points. Is there anything that raises questions?

In [26]:
worst_predictions_train.head(10)

Unnamed: 0,True_Label,Predicted,Error,Inputs,Text
4264,1.0,4.531943,3.531943,"[34, 1826, 653, 842, 69596, 4809, 588, 4618, 2...",C est un endroit merveilleux!!!!!!!!!!!!!!!!!!...
759,1.0,4.362868,3.362868,"[47696, 708, 404, 11, 0, 0, 0, 0, 0, 0, 0, 0, ...","Bonsoir,!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!..."
6366,1.0,4.270876,3.270876,"[1305, 12416, 14707, 1174, 389, 264, 81621, 40...","Très bien , on a passé de bon temps!!!!!!!!!!!..."
594,1.0,4.218419,3.218419,"[53, 969, 3904, 55455, 264, 13510, 2307, 0, 0,...",Vraiment rien a dire super!!!!!!!!!!!!!!!!!!!!...
5982,1.0,4.085513,3.085513,"[56948, 40574, 978, 481, 86323, 1522, 8047, 11...","Une agréable journée passée, la rencontre avec..."
4320,1.0,3.893484,2.893484,"[73, 364, 1955, 285, 264, 834, 3520, 4363, 514...",j 'etais a disney land le 26.03.2015 avec ma f...
4989,1.0,3.813063,2.813063,"[1966, 39904, 829, 86323, 3869, 20028, 1208, 7...",On passe sa journée à faire la queue. Le temps...
3335,1.0,3.805321,2.805321,"[1378, 8301, 11, 4983, 2428, 11, 58482, 261, 2...","Exceptionnel, magique, féerique.<br/> On est d..."
586,1.0,3.80228,2.80228,"[83, 897, 23008, 5019, 11083, 594, 40751, 5019...",trop cher pour mes ressources pourtant aimerai...
6109,1.0,3.797496,2.797496,"[1737, 220, 17, 73, 5544, 342, 312, 1892, 72, ...",En 2jrs g reussi a faire peu de manège trop d'...


In [27]:
worst_predictions_val.head(10)

Unnamed: 0,True_Label,Predicted,Error,Inputs,Text
956,3.0,-1.037281,4.037281,"[4643, 13281, 272, 1826, 25945, 23008, 5019, 6...","79 € c est très cher pour un parc, surtout qua..."
1411,1.0,4.647778,3.647778,"[1844, 72006, 13612, 321, 11, 6316, 39892, 863...","Un beau soleil, une belle journée … mais une i..."
962,1.0,4.437809,3.437809,"[30854, 36731, 6502, 35597, 69003, 978, 1744, ...",Je suis pas encore arrivé que je deteste déjà....
373,1.0,4.219281,3.219281,"[1951, 2249, 49301, 220, 975, 1880, 220, 868, ...","Deux jours 14 et 15, juillet hôtel cheyenne ri..."
767,1.0,4.205205,3.205205,"[66, 96287, 4983, 2428, 0, 3625, 60404, 1880, ...",c'était magique! les enfants et nous même en o...
814,1.0,4.15559,3.15559,"[1951, 40970, 665, 40970, 58482, 261, 2428, 12...",De moins en moins féerique!!! Dommage c était ...
975,1.0,4.149827,3.149827,"[2356, 3197, 1826, 2267, 1892, 978, 14465, 367...",Le plan est faussé Je suis à Disney Village!!!...
1561,1.0,4.148654,3.148654,"[30854, 308, 17771, 6502, 5363, 978, 514, 3990...",Je n'est pas regardé le passe Disney avant de ...
168,1.0,4.108095,3.108095,"[32960, 13510, 1744, 4864, 11457, 2852, 55398,...",Et dire que je venais jusqu'à 7 fois par an av...
1678,1.0,4.099079,3.099079,"[40, 4835, 80664, 306, 75831, 951, 83229, 70, ...",Ils semblent créer des règlements à la tête de...
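The first row above shows a prediction of about -1.04, which lies outside the 1 to 5 star range: a linear output layer is unbounded. If a star rating is needed rather than a raw score, one simple post-processing option is to clip and round the prediction. The helper `to_star_rating` below is a hypothetical sketch, not part of the notebook.

```python
def to_star_rating(prediction, low=1, high=5):
    """Clip a raw regression output to the valid range and round it to a whole star."""
    clipped = min(max(prediction, low), high)
    return int(round(clipped))

print(to_star_rating(-1.037281))  # 1
print(to_star_rating(4.647778))   # 5
print(to_star_rating(3.2))        # 3
```

Clipping only changes how predictions are reported; it does not fix the underlying errors, which still need to be investigated in the data and the model.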


3. Calculate the mean error for each category of the target. Also calculate the number of samples belonging to each category. What do you think?

In [28]:
print("Train set prediction error by star review")
worst_predictions_train.groupby("True_Label")["Error"].mean()

Train set prediction error by star review


True_Label
1.0    1.022648
2.0    0.671437
3.0    0.482908
4.0    0.392095
5.0    0.373207
Name: Error, dtype: float32

In [29]:
print("Train set star distribution")
worst_predictions_train["True_Label"].value_counts()

Train set star distribution


True_Label
5.0    3919
4.0    1214
3.0     797
1.0     454
2.0     395
Name: count, dtype: int64
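The star distribution above is heavily skewed towards 5 stars. A useful reference point when judging the model is the error of a constant predictor that always outputs the mean rating: its RMSE equals the standard deviation of the labels. The sketch below computes it from the training counts shown above; the idea of using this baseline is an addition for illustration.

```python
import math

# Training set star counts, copied from the output above
star_counts = {5: 3919, 4: 1214, 3: 797, 1: 454, 2: 395}

n_samples = sum(star_counts.values())
mean_stars = sum(star * n for star, n in star_counts.items()) / n_samples

# RMSE of always predicting the mean = standard deviation of the labels
variance = sum(n * (star - mean_stars) ** 2 for star, n in star_counts.items()) / n_samples
baseline_rmse = math.sqrt(variance)

print(n_samples)                # 6779
print(round(mean_stars, 2))     # 4.14
print(round(baseline_rmse, 2))  # 1.23
```

A model whose RMSE is not clearly below this value has learned little beyond the average rating, and the low share of 1 and 2 star reviews is consistent with the larger errors observed on those categories.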

In [30]:
print("Validation set prediction error by star review")
worst_predictions_val.groupby("True_Label")["Error"].mean()

Validation set prediction error by star review


True_Label
1.0    1.842363
2.0    0.957422
3.0    0.765634
4.0    0.600415
5.0    0.553516
Name: Error, dtype: float32

In [31]:
print("Validation set star distribution")
worst_predictions_val["True_Label"].value_counts()

Validation set star distribution


True_Label
5.0    962
4.0    324
3.0    213
1.0    104
2.0     92
Name: count, dtype: int64