# Sentiment-140 Classification

In this notebook, we will explore the application of transformers to sentiment analysis of tweets. Our goal is to classify each tweet as having either positive or negative sentiment. We will first consider the performance of a pre-trained BERT model, fine-tuned on the Sentiment-140 dataset, and later we'll develop and train our own transformer architecture from scratch. Let's dive right in!

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification
from data_utils.SentimentDataset import SentimentDataset

# this automatically reloads the libraries so you can update them dynamically
%load_ext autoreload
%autoreload 2

## Outline

1. [Data Preparation](#loading-our-data)
2. [Fine-tuning a Pre-trained BERT Model](#fine-tuning-bert)
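3. [Creating our own Transformer architecture from scratch](#creating-our-own-transformer-architecture-from-scratch)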

## Loading our data

Let's start by preparing our dataset. We have implemented the dataset class for you in `data_utils/SentimentDataset.py`; feel free to check the implementation. However, before we can use the `SentimentDataset` class to create our train and test data objects, we need to pre-process the raw data.

You can download the raw data from [here](https://www.kaggle.com/datasets/kazanova/sentiment140). If you examine the raw data file, you can see that it contains a lot of information we do not need, such as the time of each tweet and the usernames. For our sentiment analysis, we only need the tweets and the ground truth labels. To preprocess the data file, paste the original CSV file into the root project directory without changing the file name and run the script `preprocess_dataset.py`. This should create a new CSV file called `dataset.csv`. (We do not include the data files in the GitHub repository because they are simply too big.) Then, you're ready to create a dataset and dataloader below.
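To make the preprocessing step more concrete, here is a minimal sketch of what such a script could do. It is not the contents of `preprocess_dataset.py`: the function name, the output header, and the in-memory sample rows are hypothetical, and the sketch assumes the raw file has no header row, uses the column order `target, id, date, flag, user, text`, and encodes negative as 0 and positive as 4.

```python
import csv
import io

# Assumed column order of the raw Sentiment140 CSV (the file has no header row).
RAW_COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def preprocess_rows(raw_file, out_file):
    """Keep only the tweet text and a binary label (0 = negative, 1 = positive)."""
    reader = csv.reader(raw_file)
    writer = csv.writer(out_file)
    writer.writerow(["text", "label"])  # hypothetical output header
    for row in reader:
        record = dict(zip(RAW_COLUMNS, row))
        # Assumption: the raw file encodes negative as "0" and positive as "4".
        label = 1 if record["target"] == "4" else 0
        writer.writerow([record["text"], label])


# Tiny in-memory stand-in for the real file.
raw = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","alice","so tired today"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","bob","loving this weather"\n'
)
out = io.StringIO()
preprocess_rows(raw, out)
print(out.getvalue())
```

With the real file you would pass opened file handles instead of the `io.StringIO` objects.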

In [None]:
train_data = SentimentDataset('dataset.csv', training_set=True)
test_data = SentimentDataset('dataset.csv', training_set=False)

# train_subset_indices = np.random.choice(len(train_data), size=1000, replace=False)
# train_sampler = SubsetRandomSampler(train_subset_indices)

# test_subset_indices = np.random.choice(len(test_data), size=100, replace=False)
# test_sampler = SubsetRandomSampler(test_subset_indices)

# train_loader = DataLoader(train_data, batch_size=32, sampler=train_sampler)
# test_loader = DataLoader(test_data, batch_size=32, sampler=test_sampler)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)

Let's examine a sample tweet from the dataset and print its corresponding ground truth label. Label 1 corresponds to positive sentiment, while label 0 stands for negative sentiment. You can run the cell below multiple times to see different tweets.

In [None]:
tweets, labels = next(iter(train_loader))
print("Sample tweet: ", tweets[0])
print("Ground truth label: ", labels[0].numpy())

## Fine-tuning BERT

First, let's examine the performance of a pre-trained BERT model on the sentiment analysis task. BERT (original paper [here](https://arxiv.org/abs/1810.04805)) is a language representation model released by Google in 2018 that uses a Bidirectional Transformer architecture (shown below). 

![BERT](BERT.png)

It can be easily fine-tuned on downstream tasks such as sentiment analysis by appending a classification layer to the original BERT architecture. This is already done automatically by importing the `BertForSequenceClassification` from the Hugging Face library called `transformers`. The newly added classification layer is untrained (as you'll see in the warning that appears when you run the following cell) and so we have to fine-tune on the Sentiment-140 data.

In [None]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
pretrained_model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

In [None]:
# Declare constants
MAX_SEQ_LENGTH = 512 
VOCAB_SIZE = 30522
N_LAYERS = 12
N_HEADS = 12
EMB_SIZE = 768
INTERMEDIATE_SIZE = EMB_SIZE * 4
DROPOUT = 0.1
N_CLASSES = 2
LAYER_NORM_EPS = 1e-12
PAD_TOKEN_ID = 0  # [PAD] has ID 0 in the bert-base-uncased vocabulary (103 is [MASK])

Below, we define a `preprocess` function that takes in the tweets and tokenizes them using the WordPiece algorithm, so that they can be processed by our model.

In [None]:
def preprocess(tweets):

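    encoded_batch = tokenizer.batch_encode_plus(
        tweets,
        add_special_tokens=True,
        max_length=MAX_SEQ_LENGTH,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        truncation=True,       # cut off anything longer than MAX_SEQ_LENGTH
        return_attention_mask=True,
        return_tensors='pt'
    )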

    input_ids = encoded_batch['input_ids']
    attention_masks = encoded_batch['attention_mask']

    return input_ids, attention_masks
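To give some intuition for what WordPiece does to a single word, the sketch below implements the greedy longest-match-first splitting it uses at tokenization time. This is a toy illustration, not the Hugging Face implementation: the function name and the six-entry vocabulary are hypothetical, and the real tokenizer additionally lowercases the text, splits on punctuation, adds the special tokens, and maps the pieces to IDs.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has about 30k entries.
toy_vocab = {"play", "##ing", "##ed", "un", "##happy", "happy"}
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("unhappy", toy_vocab))  # ['un', '##happy']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```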

Next, we need to define a training and evaluation loop.

In [None]:
def train(model, train_loader, val_loader, optimizer, criterion, device,
          num_epochs):

    # Place model on device
    model = model.to(device)

    loss_history = []
    acc_history = []

    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        # Use tqdm to display a progress bar during training
        with tqdm(total=len(train_loader),
                  desc=f'Epoch {epoch + 1}/{num_epochs}',
                  position=0,
                  leave=True) as pbar:
            
            for inputs, labels in train_loader:

                input_ids, attention_masks = preprocess(inputs)
                # Move inputs and labels to device
                input_ids = input_ids.to(device)
                attention_masks = attention_masks.to(device)
                labels = labels.to(device)

                # Zero out gradients
                optimizer.zero_grad()

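                # Compute the logits and loss
                outputs = model(input_ids=input_ids, attention_mask=attention_masks)
                # Hugging Face models return an object with a .logits field;
                # a plain nn.Module may return the logits tensor directly
                logits = getattr(outputs, "logits", outputs)
                loss = criterion(logits, labels)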

                # Backpropagate the loss
                loss.backward()

                # Update the weights
                optimizer.step()

                # Update the progress bar
                pbar.update(1)
                pbar.set_postfix(loss=loss.item())

                loss_history.append(loss.item())

                    
        avg_loss, accuracy = evaluate(model, val_loader, criterion, device)
        print(
            f'Test set: Average loss = {avg_loss:.4f}, Accuracy = {accuracy:.4f}'
        )
        acc_history.append(accuracy)

    return loss_history, acc_history


def evaluate(model, test_loader, criterion, device):

    model.eval()  # Set model to evaluation mode

    with torch.no_grad():
        total_loss = 0.0
        num_correct = 0
        num_samples = 0

        for inputs, labels in test_loader:

            input_ids, attention_masks = preprocess(inputs)
            # Move inputs and labels to device
            input_ids = input_ids.to(device)
            attention_masks = attention_masks.to(device)
            labels = labels.to(device)

            # Compute the logits and loss
            outputs = model(input_ids=input_ids, attention_mask=attention_masks)
            # Accept both Hugging Face outputs (.logits) and plain logits tensors
            logits = getattr(outputs, "logits", outputs)
            loss = criterion(logits, labels)
            total_loss += loss.item()

            # Compute the accuracy
            predictions = torch.argmax(logits, dim=1)
            num_correct += (predictions == labels).sum().item()
            num_samples += len(labels)

    # Compute the average loss and accuracy
    avg_loss = total_loss / len(test_loader)
    accuracy = num_correct / num_samples

    return avg_loss, accuracy

Make sure you are running the model on CUDA; the cell below should print `cuda`. Fine-tuning BERT on a CPU is very slow.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

We will fine-tune all the weights in the BERT model (including the pre-trained weights and the newly initialized classifier weights). An alternative approach would be to freeze the pre-trained weights and only fine-tune the classifier weights. You can do that by passing `pretrained_model.classifier.parameters()` to the optimizer instead of all the weights of the model.

In [None]:
optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

loss_history, acc_history = train(pretrained_model, train_loader, test_loader, optimizer, criterion, device, num_epochs=3)
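The classifier-only alternative mentioned above can be sketched on a small stand-in model. `TinyClassifier` and its `backbone` attribute are hypothetical; with `BertForSequenceClassification` the pre-trained part lives in `pretrained_model.bert`. Setting `requires_grad = False` is an optional extra step that also skips computing gradients for the frozen weights.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: a 'pre-trained' backbone plus a new classifier head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.backbone(x)))


toy_model = TinyClassifier()

# Freeze the backbone so no gradients are computed or stored for it.
for param in toy_model.backbone.parameters():
    param.requires_grad = False

# Only the classifier parameters are handed to the optimizer.
toy_optimizer = torch.optim.AdamW(toy_model.classifier.parameters(), lr=1e-3)

backbone_before = toy_model.backbone.weight.clone()
classifier_before = toy_model.classifier.weight.clone()

# One training step on random stand-in data.
toy_inputs = torch.randn(4, 8)
toy_labels = torch.tensor([0, 1, 0, 1])
toy_loss = nn.CrossEntropyLoss()(toy_model(toy_inputs), toy_labels)
toy_loss.backward()
toy_optimizer.step()

print(torch.equal(backbone_before, toy_model.backbone.weight))      # backbone unchanged
print(torch.equal(classifier_before, toy_model.classifier.weight))  # classifier updated
```

Training only the head is much cheaper per step, but it usually reaches lower accuracy than fine-tuning the whole model.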

Next, let's plot our training loss and test accuracy history that documents the fine-tuning process.

In [None]:
plt.plot(loss_history)
plt.xlabel('Iterations')
plt.ylabel('Cross-entropy Loss')
plt.title('BERT Fine-tuning, Loss history')
plt.show()

In [None]:
plt.plot(acc_history)
plt.xlabel('Epochs')
plt.ylabel('Test accuracy')
plt.title('BERT Fine-tuning, Accuracy history')
plt.show()

## Creating our own Transformer architecture from scratch

That was fun, right? But maybe slightly too easy, don't you think? Let's try to implement our own transformer architecture now, without the help of pre-trained models. We will base our architecture on the original paper that introduced transformers, [*Attention is All You Need*](https://arxiv.org/abs/1706.03762).

## Transformers

Transformers are a type of neural network architecture primarily used in the field of natural language processing (NLP). Unlike previous models that processed inputs sequentially (like RNNs), transformers process whole sequences of data in parallel, which significantly speeds up training.

### Why They Are Important

They have been highly successful in a variety of NLP tasks like translation, text summarization, and sentiment analysis due to their ability to handle long-range dependencies in text.

### Components of Transformers

1. Input Embeddings
2. Positional Encoding
3. Multi-Head Attention
4. Layer Normalization and Residual Connections
5. Feed-Forward Networks
6. Output Layer

## 1. Input Embeddings

The input embedding layer converts tokens (words or subwords) into vectors of continuous numbers that the model can process.

### Concept of Embeddings

Embeddings are vector representations of text, where words with **similar meanings** have **similar representations**. These vectors are learned and adjusted during the training of the model to optimize performance on a specific task.

### Importance of Embeddings

Embeddings capture the semantic properties of words, which means they convert the words/tokens into a form that a neural network can work with. This allows the model to **understand and process language** by **quantifying the similarities between different words or phrases**.

### Embeddings in Transformers

1. **Tokenization**:
   - **Process**: Text input is split into tokens (words or subwords), which are then converted into numerical IDs that correspond to entries in the embedding table.
   - **Role in Model**: This step is critical because it translates human-readable text into a format that can be mathematically manipulated by the model.

2. **Lookup Table**:
   - **Process**: Each token ID is used to retrieve its corresponding embedding vector from an embedding matrix (or table) that the model learns during training.
   - **Role in Model**: This embedding matrix acts as the foundational input layer of the transformer, providing a dense and informative representation for each token.

3. **Dimensionality**:
   - **Process**: Embeddings are usually vectors of fixed size (e.g., 256, 512 dimensions), regardless of the vocabulary size.
   - **Role in Model**: Higher dimensions generally capture more detailed semantic information about each token, but also increase model complexity and computational cost.

### Positional Embeddings

If we only turn words into embeddings, the model won't know their order in the sentence. We need to preserve this positional information somehow.

- **Concept**: Since transformers process words in parallel rather than sequentially, positional encodings are added to embeddings to give the model information about the order of words in a sentence.
- **Why Necessary**: Helps the model understand how the position of a word in a sentence affects its meaning.
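As an illustration, here is a minimal sketch of the fixed sine/cosine positional encoding from *Attention is All You Need*. Note that this is a swapped-in variant for illustration: the exercise later in this notebook uses a learned position embedding table instead. The function name and the sizes are hypothetical, and the sketch assumes an even embedding size.

```python
import math
import torch


def sinusoidal_positions(max_len, emb_size):
    """Fixed sine/cosine position encodings of shape (max_len, emb_size)."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / emb_size)
    div_term = torch.exp(
        torch.arange(0, emb_size, 2, dtype=torch.float32) * (-math.log(10000.0) / emb_size)
    )
    pe = torch.zeros(max_len, emb_size)
    pe[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return pe


pe = sinusoidal_positions(max_len=16, emb_size=8)
token_emb = torch.randn(2, 16, 8)   # (batch, sequence, embedding), random stand-in
with_positions = token_emb + pe     # broadcasts over the batch dimension
print(with_positions.shape)
```

Because every position gets a different vector, two identical tokens at different positions end up with different inputs to the first attention layer.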

### Training and Adjustments

**Learning Process**:
- Embeddings are not static; they are adjusted during the training process to minimize the model's prediction error. This learning happens through backpropagation, which adjusts the embeddings to better capture the relationships and contexts in which words appear.

**Impact on Performance**:
- The quality of embeddings significantly affects the model's performance. Better embeddings lead to a better understanding of language nuances, such as synonyms, antonyms, and different contexts.

### Toy Example

In [None]:
import torch
import torch.nn as nn

# Vocabulary size and embedding dimension
vocab_size = 1000  # Number of unique tokens
embedding_dim = 64  # Size of each embedding vector

# Create an embedding layer
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Example token IDs
token_ids = torch.LongTensor([10, 200, 500])

# Get embeddings for token IDs
embeddings = embedding_layer(token_ids)
print(embeddings)

This example initializes an embedding layer with random values which then can be trained to learn meaningful representations through a task like sentiment analysis or machine translation.

### Your Turn

Now, let's implement the `EmbeddingLayer` class. This class will be responsible for converting tokens into dense vectors that the model can understand. We will use PyTorch's `nn.Embedding` module to create the embedding layer.

1. token embedding: vocab -> embedding (take a look at the `padding_idx` argument of `nn.Embedding`).
2. positional embedding: position -> embedding
3. add them together
4. layer normalization
5. dropout

In [None]:
class EmbeddingLayer(nn.Module):

    def __init__(self, vocab_size, emb_size, pad_token_id, max_seq_length, layer_norm_eps=1e-12, dropout=0.1):
        super().__init__()

        # TODO: Initialize word embeddings
        self.token_embedding_table = None
        self.position_embedding_table = None
        self.ln = None
        self.dropout = None

        # sets self.position_ids to a tensor containing values from 0 to max_seq_length - 1, reshaped to (1, max_seq_length)
        # register_buffer stores it with the module without making it a trainable parameter, so it does NOT get updated during training
        self.register_buffer("position_ids", torch.arange(max_seq_length).expand((1, -1)))

    def forward(self, input_ids):
        # TODO: look up the token and position embeddings, add them, then apply layer norm and dropout
        emb = None

        return emb
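If you get stuck, the following is one possible way to fill in the five steps listed above. It is a sketch under a different, hypothetical class name (`ReferenceEmbeddingLayer`) so that it does not overwrite your own attempt, and it is not necessarily the intended solution.

```python
import torch
import torch.nn as nn


class ReferenceEmbeddingLayer(nn.Module):
    """One possible solution sketch; not necessarily the intended one."""

    def __init__(self, vocab_size, emb_size, pad_token_id, max_seq_length,
                 layer_norm_eps=1e-12, dropout=0.1):
        super().__init__()
        # padding_idx keeps the padding token's vector at zero and excludes it from updates
        self.token_embedding_table = nn.Embedding(vocab_size, emb_size, padding_idx=pad_token_id)
        self.position_embedding_table = nn.Embedding(max_seq_length, emb_size)
        self.ln = nn.LayerNorm(emb_size, eps=layer_norm_eps)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("position_ids", torch.arange(max_seq_length).expand((1, -1)))

    def forward(self, input_ids):
        seq_length = input_ids.size(1)
        position_ids = self.position_ids[:, :seq_length]        # (1, T)
        tok_emb = self.token_embedding_table(input_ids)         # (B, T, C)
        pos_emb = self.position_embedding_table(position_ids)   # (1, T, C)
        emb = self.dropout(self.ln(tok_emb + pos_emb))          # add, normalize, drop out
        return emb


# Small hypothetical sizes, just to check the shapes.
layer = ReferenceEmbeddingLayer(vocab_size=100, emb_size=16, pad_token_id=0, max_seq_length=32)
sample_ids = torch.tensor([[5, 7, 9, 0, 0], [3, 4, 0, 0, 0]])
print(layer(sample_ids).shape)  # (batch, sequence, embedding)
```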


### Components of Transformers



3. **Multi-Head Attention**
   - **Concept**: Allows the model to focus on different parts of the input sentence simultaneously. It computes attention scores, representing how much focus to place on other parts of the input for each word.
   - **Why Necessary**: Essential for understanding contexts and dependencies between words in sentences, regardless of their distance from each other.

4. **Layer Normalization and Residual Connections**
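   - **Concept**: Two techniques used within transformers to stabilize training. Layer normalization rescales each token's activations to zero mean and unit variance (followed by a learned scale and shift), and residual connections add a sublayer's input to its output so that gradients can propagate through deep networks without vanishing.
   - **Why Necessary**: Improves training efficiency and model performance by mitigating the vanishing gradient problem.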

5. **Feed-Forward Networks**
   - **Concept**: Each layer of the transformer includes a feed-forward neural network which applies further transformations to the output of the attention mechanism.
   - **Why Necessary**: Adds additional capabilities to the transformer to modify the attention output before passing it to the next layer.

6. **Output Layer**
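   - **Concept**: In the original transformer, the final decoder outputs are passed through a linear layer followed by a softmax to predict the next word in the sequence. Our encoder-only classifier instead pools the representation of the first token and passes it through a linear layer to produce one logit per sentiment class.
   - **Why Necessary**: Converts the complex representations formed by the transformer into usable output, such as the next word in a sentence or, in our case, a sentiment label.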
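The attention computation described in item 3 can be sketched for a single head as follows. The function name, tensor sizes, and mask are hypothetical; multi-head attention runs several such computations in parallel on different slices of the embedding and concatenates the results.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, key_mask=None):
    """q, k, v: (batch, seq, dim); key_mask: (batch, seq), 1 = real token, 0 = padding."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # (batch, seq, seq)
    if key_mask is not None:
        # Padding keys get -inf so their softmax weight becomes exactly zero.
        scores = scores.masked_fill(key_mask.unsqueeze(1) == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(1, 4, 8)                 # one sequence of 4 tokens, random stand-in
mask = torch.tensor([[1, 1, 1, 0]])      # the last token is padding
out, weights = scaled_dot_product_attention(x, x, x, mask)
print(weights[0])  # each row sums to 1; the last column is all zeros
```

Each row of `weights` says how strongly one token attends to every other token, which is what lets the model relate words regardless of their distance.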


In [None]:
class MultiHeadAttention(nn.Module):

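    def __init__(self, emb_size=EMB_SIZE, n_heads=N_HEADS, dropout=DROPOUT, layer_norm_eps=LAYER_NORM_EPS):
        super().__init__()
        assert emb_size % n_heads == 0, "emb_size must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_size = emb_size // n_heads
        self.query = nn.Linear(emb_size, emb_size)
        self.key = nn.Linear(emb_size, emb_size)
        self.value = nn.Linear(emb_size, emb_size)
        self.dropout = nn.Dropout(dropout)

        self.final_linear = nn.Linear(emb_size, emb_size)
        self.ln = nn.LayerNorm(emb_size, eps=layer_norm_eps)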


    def forward(self, emb, att_mask):
        B, T, C = emb.shape  # batch size, sequence length, embedding size   
    
        q = self.query(emb).view(B, T, self.n_heads, self.head_size).transpose(1, 2)
        k = self.key(emb).view(B, T, self.n_heads, self.head_size).transpose(1, 2)
        v = self.value(emb).view(B, T, self.n_heads, self.head_size).transpose(1, 2)
        
        weights = q @ k.transpose(-2, -1) * self.head_size ** -0.5

        # set the pad tokens to -inf so that they equal zero after softmax
        if att_mask is not None:
            att_mask = (att_mask > 0).unsqueeze(1).repeat(1, att_mask.size(1), 1).unsqueeze(1)
            weights = weights.masked_fill(att_mask == 0, float('-inf'))

        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)
        
        attention = weights @ v
        attention = attention.transpose(1, 2).contiguous().view(B, T, C)   

        out = self.final_linear(attention)
        out = self.dropout(out)
        out = out + emb
        out = self.ln(out)
        
        return out


In [None]:
class PositionWiseFeedForward(nn.Module):

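    def __init__(self, emb_size=EMB_SIZE, intermediate_size=INTERMEDIATE_SIZE, dropout=DROPOUT, layer_norm_eps=LAYER_NORM_EPS):
        super().__init__()
        self.fc1 = nn.Linear(emb_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, emb_size)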
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(emb_size, eps=layer_norm_eps)

    def forward(self, att_out):
        x = self.fc1(att_out)
        x = self.gelu(x)
        x = self.fc2(x)
        x = self.dropout(x)
        x = x + att_out
        out = self.ln(x)

        return out
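Both the attention block and the feed-forward block end with the same "add and normalize" step: the sublayer's input is added back to its output, and the sum is layer-normalized. The short sketch below isolates that step on random data; the `nn.Linear` stand-in and the sizes are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8) * 5 + 2        # (batch, seq, emb), deliberately off-center
sublayer = nn.Linear(8, 8)              # stand-in for attention or the feed-forward net
ln = nn.LayerNorm(8)

out = ln(x + sublayer(x))               # residual connection, then layer normalization

print(out.mean(dim=-1))                 # close to 0 for every token
print(out.var(dim=-1, unbiased=False))  # close to 1 for every token
```

The statistics are computed per token over the embedding dimension, so the normalization does not depend on the batch size or on the other tokens in the sequence.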

In [None]:
class Transformer(nn.Module):

    def __init__(self):
        super().__init__()
        self.attn = MultiHeadAttention()
        self.ff = PositionWiseFeedForward()

    def forward(self, emb, att_mask):
        att_out = self.attn(emb, att_mask)
        out = self.ff(att_out)
        return out

In [None]:
class Encoder(nn.Module):

    def __init__(self, n_layers=N_LAYERS):
        super().__init__()
        self.layer = nn.ModuleList([Transformer() for _ in range(n_layers)])

    def forward(self, emb, att_mask):
        for bert_layer in self.layer:
            emb = bert_layer(emb, att_mask)
        return emb

In [None]:
class Pooler(nn.Module):
    def __init__(self, emb_size=EMB_SIZE):
        super().__init__()
        self.dense = nn.Linear(emb_size, emb_size)
        self.tanh = nn.Tanh()

    def forward(self, encoder_out):
        pool_first_token = encoder_out[:, 0]
        out = self.dense(pool_first_token)
        out = self.tanh(out)
        return out

In [None]:
class Model(nn.Module):
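    def __init__(self):
        super().__init__()
        # The unused config argument is dropped; the notebook constants are passed in instead
        self.embedding = EmbeddingLayer(VOCAB_SIZE, EMB_SIZE, PAD_TOKEN_ID, MAX_SEQ_LENGTH,
                                        layer_norm_eps=LAYER_NORM_EPS, dropout=DROPOUT)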
        self.encoder = Encoder()
        self.pooler = Pooler()

    def forward(self, input_ids, att_mask):
        emb = self.embedding(input_ids)
        out = self.encoder(emb, att_mask)
        pooled_out = self.pooler(out)
        return out, pooled_out

In [None]:
class TransformerSequenceClassification(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Model()
        self.dropout = nn.Dropout(DROPOUT)
        self.classifier = nn.Linear(EMB_SIZE, N_CLASSES)

    def forward(self, input_ids, attention_mask=None):
        _, pooled_out = self.model(input_ids, attention_mask)
        pooled_out = self.dropout(pooled_out)
        logits = self.classifier(pooled_out)
         
        return logits

In [None]:
our_model = TransformerSequenceClassification()

optimizer = torch.optim.AdamW(our_model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

loss_history, acc_history = train(our_model, train_loader, test_loader, optimizer, criterion, device, num_epochs=3)


In [None]:
plt.plot(loss_history)
plt.xlabel('Iterations')
plt.ylabel('Cross-entropy Loss')
plt.title('BERT From Scratch, Loss history')
plt.show()

In [None]:
plt.plot(acc_history)
plt.xlabel('Epochs')
plt.ylabel('Test accuracy')
plt.title('BERT From Scratch, Accuracy history')
plt.show()