# Sentiment Analysis
***

In [1]:
import pandas as pd
import numpy as np
import string
import time
import matplotlib.pyplot as plt
from torchinfo import summary
import re
from collections import Counter
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
from torchmetrics import Accuracy

## 1. Introduction

## 2. Device Agnostic Code
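On Apple-silicon Macs, GPU acceleration through the `mps` backend can be considerably faster than the CPU for deep learning workloads, especially with large models and batch sizes. On machines with an NVIDIA GPU (typically Windows or Linux), the `cuda` backend is used instead. If neither is available, the code falls back to the CPU.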

In [2]:
device = torch.device(
    "mps"  # Apple GPU (macOS)
    if torch.backends.mps.is_available()
    else "cuda"  # NVIDIA GPU (Windows/Linux)
    if torch.cuda.is_available()
    else "cpu"  # No GPU available
)

## 3. Loading Data

The dataset used in this project (retrieved from [Kaggle - IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)) includes:

- **review**: Review comments in text.
- **sentiment**: Whether the review is positive or negative.

In [3]:
df = pd.read_csv("_datasets/IMDB_Dataset.csv")

In [4]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
print("=" * 50)
print(f"Shape of the dataset: {df.shape}")
print("=" * 50)
print(f"Count of null values: {df.isnull().sum().sum()}")

Shape of the dataset: (50000, 2)
Count of null values: 0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


## 3. Data Preprocessing
1. Text Cleaning
    - Lowercase all letters
    - Remove HTML tags
    - Remove URLs
    - Remove emojis and other non-ASCII characters
    - Remove punctuation
    - Collapse extra whitespace
2. Tokenisation
3. Building Vocabulary and Mapping Tokens to Indices

### Text Cleaning

In [7]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [8]:
def clean_text(col: pd.Series) -> pd.Series:
    col = col.str.lower()  # Lowercase
    col = col.str.replace(r"<.*?>", "", regex=True)  # Strip HTML tags such as <br />
    col = col.str.replace(r"http\S+|www\.\S+", "", regex=True)  # Strip URLs
    col = col.str.replace(r"[^\x00-\x7F]+", "", regex=True)  # Strip emojis / non-ASCII
    col = col.str.replace(
        "[{}]".format(re.escape(string.punctuation)), "", regex=True
    )  # Strip punctuation
    col = col.str.replace(
        r"\s+", " ", regex=True
    ).str.strip()  # Collapse whitespace runs into a single space
    return col
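
To see what each step does, the same pipeline can be applied to a single string with the standard `re` module. This is a minimal sketch rather than part of the notebook: the helper name `clean_one` is hypothetical, and it is only meant to mirror `clean_text` on one review.

```python
import re
import string

# Same character class as in clean_text: every ASCII punctuation mark.
PUNCT_PATTERN = "[{}]".format(re.escape(string.punctuation))


def clean_one(text: str) -> str:
    """Apply the cleaning steps to one review (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)  # HTML tags
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # non-ASCII characters
    text = re.sub(PUNCT_PATTERN, "", text)  # punctuation
    return re.sub(r"\s+", " ", text).strip()  # extra whitespace


sample = "A wonderful little production. <br /><br />The filming"
cleaned = clean_one(sample)
print(cleaned)  # a wonderful little production the filming
```

Because tags are replaced with an empty string, two words separated only by a tag (for example `great<br />movie`) are merged into one token. Replacing tags with a space instead is a possible variation.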

In [9]:
df["clean_text"] = clean_text(df["review"])
df.head()

|   | review | sentiment | clean_text |
|---|--------|-----------|------------|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... |
| 3 | Basically there's a family where a little boy ... | negative | basically theres a family where a little boy j... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... |


In [10]:
df["n_tokens"] = df["clean_text"].apply(lambda text: len(text.split()))

In [11]:
df.head()

|   | review | sentiment | clean_text | n_tokens |
|---|--------|-----------|------------|----------|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | 301 |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | 156 |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | 162 |
| 3 | Basically there's a family where a little boy ... | negative | basically theres a family where a little boy j... | 129 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | 222 |


### Tokenisation
Split every cleaned review on whitespace and flatten the result into a single list of tokens (words).

In [12]:
all_words = [token for text in df["clean_text"] for token in text.split()]

In [13]:
print(all_words[:10])

['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']


### Building Vocabulary and Mapping Tokens to Indices
`Counter` gives the frequency of each word, and its `most_common()` method returns the words sorted by descending frequency (passing `n` keeps only the `n` most frequent words).
We then assign a unique index to each word to build the mapping (`word2index`), reserving index 0 for the padding token (`<PAD>`) and index 1 for the unknown token (`<UNK>`).

**Padding**:
- Padding is the process of adding special tokens (usually represented as `<PAD>`) to sequences so that all sequences in a batch have the same length.
- This is necessary because neural networks, especially in libraries like PyTorch, require inputs to be in tensors of consistent shape.

*Example*:
- Original:
    - ["i", "loved", "this", "movie"]
- After padding to length 6:
    - ["i", "loved", "this", "movie", "`<PAD>`", "`<PAD>`"]

**Unknown**:
- `<UNK>` stands for '*unknown token*', serving as a placeholder for any token (word) in the input text that does not exist in the vocabulary.

*Example*:
- Vocabulary:
    - { "the":2, "movie":3, "`<PAD>`":0, "`<UNK>`":1 }
- Input:
    - "the plot was amazing" -> ["the", "`<UNK>`", "`<UNK>`", "`<UNK>`"]
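
The two examples above can be reproduced in a few lines. This is a toy sketch with a four-entry vocabulary; `toy_vocab`, `encode`, and `pad` are illustrative names, not part of the notebook.

```python
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "movie": 3}


def encode(text, vocab):
    """Map each whitespace-separated word to its index, or to <UNK> if missing."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]


def pad(sequence, length, pad_value=0):
    """Right-pad a sequence with pad_value up to the requested length."""
    return sequence + [pad_value] * max(0, length - len(sequence))


encoded = encode("the plot was amazing", toy_vocab)
padded = pad(encoded, 6, toy_vocab["<PAD>"])
print(encoded)  # [2, 1, 1, 1]
print(padded)   # [2, 1, 1, 1, 0, 0]
```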

In [14]:
word_counts = Counter(all_words)

In [15]:
all_words_sorted = word_counts.most_common()

In [16]:
word2index = {word: i for i, (word, counts) in enumerate(all_words_sorted, start=2)}
word2index["<PAD>"] = 0
word2index["<UNK>"] = 1

In [17]:
print(f"Number of unique words: {len(word2index) - 2}")

Number of unique words: 221659
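
A vocabulary this large means a correspondingly large embedding table. As mentioned above, `most_common(n)` can cap the vocabulary. The sketch below measures what share of token occurrences the top `n` words would still cover. It uses a toy token list and a hypothetical helper `vocab_coverage`, so the numbers say nothing about the IMDB data itself.

```python
from collections import Counter


def vocab_coverage(tokens, top_n):
    """Return (vocabulary size, share of token occurrences covered by the top_n words)."""
    counts = Counter(tokens)
    kept = counts.most_common(top_n)
    covered = sum(count for _, count in kept)
    return len(kept), covered / len(tokens)


toy_tokens = "the movie was great the plot was thin the end".split()
size, coverage = vocab_coverage(toy_tokens, top_n=2)
print(size, coverage)  # 2 0.5
```

Words outside the capped vocabulary would simply map to `<UNK>` through `text_to_int`.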


In [18]:
# Convert text to int sequences
def text_to_int(text, word2index):
    return [word2index.get(word, word2index["<UNK>"]) for word in text.split()]


def pad_or_truncate(text, max_len, pad_value=0):
    if len(text) >= max_len:  # Long enough -> truncate to max_len
        return text[:max_len]
    else:  # Too short -> pad on the right
        return text + [pad_value] * (max_len - len(text))

In [19]:
SEQUENCE_LENGTH = 200

all_review_seq = [
    pad_or_truncate(text_to_int(text, word2index), SEQUENCE_LENGTH, word2index["<PAD>"])
    for text in df["clean_text"]
]

In [20]:
print(all_review_seq[9])

[45, 22, 37, 207, 7378, 8971, 2183, 22, 76, 37, 11, 17, 45, 22, 23, 183, 39, 168, 92, 22, 76, 110, 11, 17, 540, 53, 56, 1546, 411, 44430, 1206, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


### Convert Labels to Integer

In [21]:
labels = df["sentiment"].map({"positive": 1, "negative": 0})
labels.value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

## 4. Preparing DataLoaders
### Train Test Splitting
The data is split into 80% training, 10% validation, and 10% test sets: 20% is held out first, and that hold-out is then split in half.

In [22]:
RANDOM_SEED = 42

seq_train, seq_sub, y_train, y_sub = train_test_split(
    all_review_seq, labels, test_size=0.2, random_state=RANDOM_SEED, stratify=labels
)

seq_val, seq_test, y_val, y_test = train_test_split(
    seq_sub, y_sub, test_size=0.5, random_state=RANDOM_SEED, stratify=y_sub
)
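
The 80/10/10 proportions follow from holding out 20% first and then halving that hold-out. A quick arithmetic check, assuming the held-out share is rounded up when it is not a whole number (the helper `split_sizes` is hypothetical and uses integer percentages to avoid floating-point rounding):

```python
def ceil_percent(n, percent):
    """Integer ceiling of n * percent / 100."""
    return (n * percent + 99) // 100


def split_sizes(n_samples, holdout_percent=20, test_percent_of_holdout=50):
    """Return (train, validation, test) sizes for the two-stage split."""
    n_holdout = ceil_percent(n_samples, holdout_percent)
    n_test = ceil_percent(n_holdout, test_percent_of_holdout)
    n_val = n_holdout - n_test
    n_train = n_samples - n_holdout
    return n_train, n_val, n_test


sizes = split_sizes(50_000)
print(sizes)  # (40000, 5000, 5000)
```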

### Creating IMDB Datasets in Tensor

In [23]:
class IMDBDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index], self.labels[index]

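PyTorch's `torch.tensor()` constructor works with lists, tuples, or NumPy arrays rather than pandas Series, so the label Series are converted to NumPy arrays with `.values` before being passed to the dataset.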

In [24]:
BATCH_SIZE = 64
train_ds = IMDBDataset(seq_train, y_train.values)
val_ds = IMDBDataset(seq_val, y_val.values)
test_ds = IMDBDataset(seq_test, y_test.values)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False)

## 5. Neural Network Model Architectures
### Long Short-Term Memory (LSTM)
An LSTM architecture for sentiment analysis typically includes the following components:
- **Embedding Layer**:
    - Converts input token indices to dense vector embeddings.
    - Inputs shape: `(batch_size, sequence_length)`
    - Outputs shape: `(batch_size, sequence_length, embedding_size)`
- **LSTM Layer**:
    - Processes the embedded sequence to capture temporal dependencies.
    - Can be either single or multi-layer LSTM.
- **Classification Head**:
    - Usually a fully-connected (linear) layer projecting the hidden state(s) from LSTM to the output classes.
    - Often preceded by dropout for regularisation.
- **Activation and Loss**:
    - For binary sentiment, the output logit goes through a sigmoid activation and is trained with binary cross-entropy loss.
    - For multi-class sentiment, a softmax activation is used with cross-entropy loss.

In [25]:
class LSTMClassifier(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_size,
        hidden_size,
        output_size,
        n_layers=1,
        is_bidirectional=False,
        dropout_rate=0.5,
    ) -> None:
        super().__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.is_bidirectional = is_bidirectional

        self.embedding = nn.Embedding(
            num_embeddings=vocab_size, embedding_dim=embedding_size
        )
        self.lstm = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            num_layers=n_layers,
            bidirectional=is_bidirectional,
            dropout=dropout_rate if n_layers > 1 else 0,
            batch_first=True,
        )
        self.dropout = nn.Dropout(dropout_rate)
        directional_factor = 2 if is_bidirectional else 1
        self.fc = nn.Linear(
            in_features=hidden_size * directional_factor,
            out_features=output_size,
        )

    def forward(self, text):  # text shape: (batch_size, seq_len)
        embedded = self.embedding(text)  # (batch_size, seq_len, embedding_size)
        lstm_out, (hidden, cell) = self.lstm(embedded)

        # If bidirectional, concatenate last forward and backward hidden states
        if self.lstm.bidirectional:
            hidden = torch.cat(
                (hidden[-2, :, :], hidden[-1, :, :]), dim=1
            )  # (batch_size, hidden_size * 2)
        else:
            hidden = hidden[-1, :, :]  # (batch_size, hidden_size)

        dropped = self.dropout(hidden)
        out = self.fc(dropped)
        return out

### Transformer

In [26]:
class TransformerClassifier(nn.Module):
    def __init__(
        self,
        vocab_size,
        embed_dim,
        nhead,
        num_layers,
        hidden_dim,
        max_len,
        num_classes=1,
        dropout_rate=0.2,
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        # Positional Encoding module
        self.pos_encoding = nn.Parameter(
            self._get_positional_encoding(max_len, embed_dim), requires_grad=False
        )

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=nhead,
            dim_feedforward=hidden_dim,
            dropout=dropout_rate,
            batch_first=True,
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )

        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(embed_dim, num_classes)

    def _get_positional_encoding(self, seq_len, d_model):
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
        i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)
        angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model)
        angle_rads = pos * angle_rates
        encoding = torch.zeros(seq_len, d_model)
        encoding[:, 0::2] = torch.sin(angle_rads[:, 0::2])
        encoding[:, 1::2] = torch.cos(angle_rads[:, 1::2])
        return encoding.unsqueeze(0)  # (1, seq_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x) + self.pos_encoding[:, : x.size(1), :]
        mask = x == 0  # Assuming 0 is the PAD index
        out = self.transformer_encoder(embedded, src_key_padding_mask=mask)
        pooled = out.mean(dim=1)  # mean pooling over sequence
        return self.fc(self.dropout(pooled))
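
`_get_positional_encoding` builds a fixed sinusoidal encoding in which the dimension pair $(2k, 2k+1)$ at position `pos` holds $\sin(pos / 10000^{2k/d})$ and $\cos(pos / 10000^{2k/d})$. The plain-Python sketch below recomputes the same values without tensors so they can be inspected by hand; `positional_encoding` is an illustrative name.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position table as nested lists of shape (seq_len, d_model)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # position 0: [0.0, 1.0, 0.0, 1.0]
```

Every position gets a distinct pattern, and because the table is fixed it adds no trainable parameters.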


In [27]:
# Hyperparameters
VOCAB_SIZE = len(word2index)
EMBEDDING_SIZE = 100
HIDDEN_SIZE = 128
OUTPUT_SIZE = 1  # For binary sentiment analysis (1 or 0)
N_LAYERS = 2
IS_BIDIRECTIONAL = True
DROPOUT_RATE = 0.5
N_EPOCHS = 5
LEARNING_RATE = 1e-3
N_HEAD = 4

In [28]:
lstm = LSTMClassifier(
    vocab_size=VOCAB_SIZE,
    embedding_size=EMBEDDING_SIZE,
    hidden_size=HIDDEN_SIZE,
    output_size=OUTPUT_SIZE,
    n_layers=N_LAYERS,
    is_bidirectional=IS_BIDIRECTIONAL,
    dropout_rate=DROPOUT_RATE,
)

transformer = TransformerClassifier(
    vocab_size=VOCAB_SIZE,
    embed_dim=EMBEDDING_SIZE,
    nhead=N_HEAD,
    num_layers=N_LAYERS,
    hidden_dim=HIDDEN_SIZE,
    max_len=SEQUENCE_LENGTH,
    dropout_rate=DROPOUT_RATE,
)

In [29]:
print(
    summary(
        lstm,
        input_size=(
            BATCH_SIZE,
            SEQUENCE_LENGTH,
        ),  # (batch_size, sequence_length)
        dtypes=[torch.long],
        verbose=0,
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"],
        device=device,
    )
)

Layer (type (var_name))                  Input Shape          Output Shape         Param #              Trainable
LSTMClassifier (LSTMClassifier)          [64, 200]            [64, 1]              --                   True
├─Embedding (embedding)                  [64, 200]            [64, 200, 100]       22,166,100           True
├─LSTM (lstm)                            [64, 200, 100]       [64, 200, 256]       630,784              True
├─Dropout (dropout)                      [64, 256]            [64, 256]            --                   --
├─Linear (fc)                            [64, 256]            [64, 1]              257                  True
Total params: 22,797,141
Trainable params: 22,797,141
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 9.49
Input size (MB): 0.10
Forward/backward pass size (MB): 36.45
Params size (MB): 91.19
Estimated Total Size (MB): 127.75
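
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and two bias vectors (PyTorch keeps separate input and hidden biases). In a bidirectional stack, every layer after the first receives the concatenated forward and backward states. A short check using the hyperparameters above (`lstm_param_count` is a hypothetical helper):

```python
def lstm_param_count(input_size, hidden_size, n_layers, bidirectional):
    """Count the weights and biases of a stacked (bi)LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(n_layers):
        # 4 gates x (input weights + recurrent weights + 2 bias vectors)
        per_direction = 4 * (
            layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
        total += directions * per_direction
        layer_input = hidden_size * directions  # input to the next layer
    return total


vocab_size = 221_659 + 2  # unique words plus <PAD> and <UNK>
embedding_params = vocab_size * 100
lstm_params = lstm_param_count(100, 128, n_layers=2, bidirectional=True)
fc_params = 2 * 128 * 1 + 1  # weights plus bias

print(embedding_params, lstm_params, fc_params)  # 22166100 630784 257
print(embedding_params + lstm_params + fc_params)  # 22797141
```

The embedding table accounts for about 97% of the parameters, so the vocabulary size dominates the model size.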


## Evaluation Metrics

In [30]:
accuracy = Accuracy(
    task="binary",
    num_classes=2,
).to(device)
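
In the training loop below, logits are turned into class predictions by thresholding the sigmoid at 0.5, which is the same as checking whether the logit is non-negative. A minimal plain-Python sketch of binary accuracy under that rule (`binary_accuracy` is an illustrative helper, not the torchmetrics implementation):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_accuracy(logits, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    preds = [int(sigmoid(z) >= threshold) for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


logits = [2.3, -0.7, 0.1, -1.5]
labels = [1, 0, 0, 0]
print(binary_accuracy(logits, labels))  # 0.75
```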

## Loss Function
### Binary Cross-Entropy

In [31]:
criterion = nn.BCEWithLogitsLoss()
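
`nn.BCEWithLogitsLoss` applies the sigmoid and the binary cross-entropy in one numerically stable step, which is why the models return raw logits. For a logit $z$ and label $y \in \{0, 1\}$, the per-sample loss is $-[y \log \sigma(z) + (1 - y) \log(1 - \sigma(z))]$. The sketch below compares that definition with an algebraically equivalent stable form in plain Python; both function names are illustrative.

```python
import math


def bce_naive(z, y):
    """Binary cross-entropy computed from the sigmoid probability."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Equivalent form that avoids taking the log of a value near zero."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


for z, y in [(0.8, 1), (-1.2, 0), (0.0, 1)]:
    print(z, y, round(bce_naive(z, y), 6), round(bce_with_logits(z, y), 6))
```

The two forms agree for moderate logits, while the second remains finite for very large positive or negative logits where the naive version would hit `log(0)`.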

## Optimiser
An optimiser in neural networks is used to adjust the parameters (weights and biases) of a model during training to minimise the loss. Optimisers are essential for enabling neural networks to learn from data: without them, the model would not improve over time.

In [32]:
lstm_optim = optim.Adam(lstm.parameters(), lr=LEARNING_RATE)
transformer_optim = optim.Adam(transformer.parameters(), lr=LEARNING_RATE)

## Training and Evaluation
1. Iterate through the epochs.
1. For each epoch, iterate through the training batches, perform the training steps, and accumulate the training loss and evaluation metrics.
1. For each epoch, iterate through the validation batches, perform the validation steps, and accumulate the validation loss and evaluation metrics.
1. Store the results.

### Training Steps
1. Forward pass
    - Pass inputs through the model to obtain predictions.
1. Calculate loss and evaluation metrics per batch
    - Measure how far the predictions deviate from the true labels using a loss function.
    - Compute evaluation metrics (e.g., accuracy, F1 Score, etc.) for the current batch.
1. Zero the gradients
    - Clear the gradients from the previous iteration to prevent accumulation across batches. (In `train_step` this is done at the start of each batch, before the forward pass, which has the same effect.)
1. Backward pass
    - Compute gradients of the loss with respect to the model's parameters via backpropagation.
1. Optimiser step
    - Update the parameters $\theta$ using the computed gradients. In its simplest form (plain gradient descent) the update is:
    $$
        \theta \leftarrow \theta - \eta \dfrac{\partial \mathcal{L}}{\partial \theta}
    $$
    where $\eta$ is the learning rate.
1. Average training loss and evaluation metrics
    - Calculate the mean loss and metric values across all batches in the epoch.
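
The update rule can be traced on a one-parameter problem. The sketch below minimises the toy loss $\mathcal{L}(\theta) = (\theta - 3)^2$ with plain gradient descent. The loss, starting point, and learning rate are arbitrary choices for illustration, and the Adam optimiser used in this notebook rescales the step per parameter rather than applying this rule directly.

```python
def loss_gradient(theta):
    """Derivative of the toy loss (theta - 3)^2 with respect to theta."""
    return 2.0 * (theta - 3.0)


theta = 0.0
learning_rate = 0.1
history = [theta]

for _ in range(50):
    # theta <- theta - eta * dL/dtheta
    theta = theta - learning_rate * loss_gradient(theta)
    history.append(theta)

print(round(history[1], 4))  # 0.6 after the first step
print(round(theta, 4))  # close to the minimiser theta = 3
```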

In [33]:
def train_step(
    model: nn.Module,
    data_loader: DataLoader,
    criterion: nn.Module,
    optimiser: optim.Optimizer,
    accuracy: Accuracy,
    device: torch.device,
):
    model.train()
    accuracy.reset()
    train_loss = 0.0
    total_samples = 0

    for texts, labels in data_loader:
        texts, labels = texts.to(device), labels.to(device).unsqueeze(1)

        # Clear gradients left over from the previous batch
        optimiser.zero_grad()

        # Forward pass
        y_logits = model(texts)  # Shape (batch_size, 1)

        # Calculate loss
        loss = criterion(y_logits, labels)
        train_loss += loss.item() * texts.size(0)
        total_samples += texts.size(0)

        # Calculate accuracy
        y_preds = (torch.sigmoid(y_logits) >= 0.5).int()
        accuracy.update(y_preds, labels.int())

        # Loss backward for backpropagation (computing gradients)
        loss.backward()

        # Optimiser step to apply gradients and update parameters
        optimiser.step()
    avg_train_loss = train_loss / total_samples
    train_acc = accuracy.compute().item() * 100
    return avg_train_loss, train_acc

In [34]:
def validation_step(
    model: nn.Module,
    data_loader: DataLoader,
    criterion: nn.Module,
    accuracy: Accuracy,
    device: torch.device,
):
    model.eval()
    accuracy.reset()
    val_loss = 0.0
    total_samples = 0
    with torch.inference_mode():
        for texts, labels in data_loader:
            texts, labels = texts.to(device), labels.to(device).unsqueeze(-1)

            # 1. Forward Pass
            y_logits = model(texts)

            # 2. Calculate loss
            loss = criterion(y_logits, labels)
            val_loss += loss.item() * texts.size(0)
            total_samples += texts.size(0)

            # 3. Calculate accuracy
            y_preds = torch.sigmoid(y_logits) >= 0.5
            accuracy.update(y_preds, labels.int())

    avg_val_loss = val_loss / total_samples
    val_acc = accuracy.compute().item() * 100
    return avg_val_loss, val_acc
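
Both step functions weight each batch loss by the batch size before averaging. This matters when the last batch is smaller: with 5,000 validation reviews and a batch size of 64, the final batch holds only 8 samples. The sketch below uses made-up batch losses to show how the result differs from a plain mean of batch means.

```python
batch_losses = [0.50, 0.50, 0.90]  # made-up mean loss per batch
batch_sizes = [64, 64, 8]  # the last batch is smaller

# Sample-weighted average, as in train_step and validation_step
weighted = sum(
    loss * size for loss, size in zip(batch_losses, batch_sizes)
) / sum(batch_sizes)

# An unweighted mean of batch means over-counts the small batch
unweighted = sum(batch_losses) / len(batch_losses)

print(round(weighted, 4), round(unweighted, 4))  # 0.5235 0.6333
```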


### Model Training and Evaluation Pipeline

In [None]:
def train_and_validate(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    criterion: nn.Module,
    optimiser: optim.Optimizer,
    device: torch.device,
    total_epochs: int,
):
    model.to(device)
    epochs_range = range(1, total_epochs + 1)
    train_results = {"Loss": [], "Accuracy": []}
    val_results = {"Loss": [], "Accuracy": []}

    start_time = time.time()

    for epoch in epochs_range:
        train_loss, train_acc = train_step(
            data_loader=train_loader,
            model=model,
            criterion=criterion,
            optimiser=optimiser,
            accuracy=accuracy,
            device=device,
        )
        train_results["Loss"].append(train_loss)
        train_results["Accuracy"].append(train_acc)

        val_loss, val_acc = validation_step(
            data_loader=val_loader,
            model=model,
            criterion=criterion,
            accuracy=accuracy,
            device=device,
        )
        val_results["Loss"].append(val_loss)
        val_results["Accuracy"].append(val_acc)

        print(
            f"Epoch: {epoch}/{total_epochs} - "
            f"Train Loss: {train_loss:.4f}  Train Accuracy: {train_acc:.2f}% | "
            f"Val Loss: {val_loss:.4f}  Val Accuracy: {val_acc:.2f}%"
        )
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Training and validation completed in {elapsed_time:.2f} seconds.")
    return train_results, val_results

In [36]:
# # Save the model if needed
# torch.save(model.state_dict(), "sentiment_lstm.pt")

In [37]:
print("Training LSTM...")
lstm_train_results, lstm_test_results = train_and_validate(
    lstm,
    train_loader,
    val_loader,
    criterion,
    lstm_optim,
    device,
    N_EPOCHS,
)

Training LSTM...
Epoch: 1/5 - Train Loss: 0.6422  Train Accuracy: 62.94% | Val Loss: 0.5566  Val Accuracy: 71.12%

Epoch: 2/5 - Train Loss: 0.4633  Train Accuracy: 78.73% | Val Loss: 0.4101  Val Accuracy: 81.86%

Epoch: 3/5 - Train Loss: 0.3332  Train Accuracy: 85.98% | Val Loss: 0.3667  Val Accuracy: 84.74%

Epoch: 4/5 - Train Loss: 0.2553  Train Accuracy: 89.91% | Val Loss: 0.3505  Val Accuracy: 85.00%

Epoch: 5/5 - Train Loss: 0.1924  Train Accuracy: 92.88% | Val Loss: 0.3403  Val Accuracy: 86.14%

Training and validation completed in 345.68 seconds.


In [38]:
print("Training Transformer...")
transformer_train_results, transformer_test_results = train_and_validate(
    transformer,
    train_loader,
    val_loader,
    criterion,
    transformer_optim,
    "cpu",
    N_EPOCHS,
)

Training Transformer...


Epoch: 1/5 - Train Loss: 0.4950  Train Accuracy: 74.44% | Val Loss: 0.4165  Val Accuracy: 82.08%

Epoch: 2/5 - Train Loss: 0.3266  Train Accuracy: 86.30% | Val Loss: 0.3569  Val Accuracy: 85.34%

Epoch: 3/5 - Train Loss: 0.2406  Train Accuracy: 90.49% | Val Loss: 0.3596  Val Accuracy: 85.96%

Epoch: 4/5 - Train Loss: 0.1730  Train Accuracy: 93.62% | Val Loss: 0.4298  Val Accuracy: 85.88%

Epoch: 5/5 - Train Loss: 0.1175  Train Accuracy: 95.99% | Val Loss: 0.4669  Val Accuracy: 85.66%

Training and validation completed in 1080.34 seconds.
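
Reading the logs above, the Transformer's validation loss is lowest at epoch 2 and rises afterwards while its training loss keeps falling, which suggests overfitting. The LSTM's validation loss is still decreasing at epoch 5. A small sketch that picks the epoch with the lowest validation loss from the printed values (the helper `best_epoch` is hypothetical and not part of the notebook):

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Validation losses copied from the training logs above
lstm_val_losses = [0.5566, 0.4101, 0.3667, 0.3505, 0.3403]
transformer_val_losses = [0.4165, 0.3569, 0.3596, 0.4298, 0.4669]

print(best_epoch(lstm_val_losses))         # 5
print(best_epoch(transformer_val_losses))  # 2
```

The same rule could be used to decide which epoch's weights to keep, for example by saving a checkpoint whenever the validation loss improves.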
