# Transformer Encoder Based Sentiment Analysis

This notebook implements a Transformer-based model for sentiment analysis using the SST-2 (Stanford Sentiment Treebank) dataset. We'll go through each component of the model, the data preparation process, and the training procedure.


## Importing Required Libraries

These libraries provide the necessary tools for building and training our model:
- `torch`: The main PyTorch library for tensor computations and neural networks.
- `torch.nn` and `torch.nn.functional`: Contain neural network layers and activation functions.
- `math`: Provides `sqrt` and `log`, used for attention scaling and the positional encodings.
- `transformers`: Hugging Face's library for working with transformer models; here we only use its `AutoTokenizer`.
- `datasets`: Hugging Face's library for easily loading and processing datasets.
- `DataLoader`: PyTorch's data loading utility.
- `tqdm`: Used for progress bars during training.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from transformers import AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

## Self-Attention Mechanism

This class implements the core self-attention mechanism:
1. It computes attention scores by multiplying Q (query) with the transpose of K (key).
2. The scores are divided by the square root of the key dimension `d_k`.
3. Softmax is applied to get attention weights.
4. The weights are used to compute a weighted sum of V (values).

In [3]:
class SelfAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k

    def forward(self, Q, K, V):
        # Q, K, V shape: (batch_size, seq_len, d_k)

        # Compute attention scores
        # (batch_size, seq_len, d_k) @ (batch_size, d_k, seq_len) -> (batch_size, seq_len, seq_len)
        attn_scores = torch.bmm(Q, K.transpose(1, 2))

        # Scale attention scores
        attn_scores = attn_scores / math.sqrt(self.d_k)

        # Compute attention weights
        # (batch_size, seq_len, seq_len)
        attn_weights = F.softmax(attn_scores, dim=-1)

        # Apply attention weights to values
        # (batch_size, seq_len, seq_len) @ (batch_size, seq_len, d_k) -> (batch_size, seq_len, d_k)
        output = torch.bmm(attn_weights, V)

        return output
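
As a quick sanity check of the four steps, the sketch below re-implements them as a standalone function and confirms the output shapes. The function name `scaled_dot_product_attention` and the toy tensor sizes are illustrative assumptions, not part of the notebook.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Illustrative stand-alone version of the steps above; returns output and weights."""
    d_k = Q.size(-1)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                         # each row sums to 1
    return torch.bmm(weights, V), weights

torch.manual_seed(0)
Q = torch.randn(2, 5, 8)  # toy sizes: batch=2, seq_len=5, d_k=8
K = torch.randn(2, 5, 8)
V = torch.randn(2, 5, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # torch.Size([2, 5, 8])
print(weights.shape)  # torch.Size([2, 5, 5])
```

Because the softmax is taken over the last dimension, every query position distributes a total weight of 1 across the key positions.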

## Multi-Head Attention

This class implements multi-head attention:
1. It creates separate projections for Q, K, and V.
2. The input is split into multiple heads.
3. Self-attention is applied to each head separately.
4. The results from all heads are concatenated and passed through a final linear layer.

In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

        self.attention = SelfAttention(self.d_k)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        batch_size, seq_len, _ = x.size()

        # Linear transformations and split into h heads
        # (batch_size, seq_len, d_model) -> (batch_size, h, seq_len, d_k)
        Q = self.W_Q(x).view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)
        K = self.W_K(x).view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)
        V = self.W_V(x).view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)

        # Apply attention to each head
        # (batch_size, h, seq_len, d_k)
        attn_outputs = []
        for i in range(self.h):
            attn_output = self.attention(Q[:, i], K[:, i], V[:, i])
            attn_outputs.append(attn_output)

        # Concatenate attention outputs from all heads
        # (batch_size, seq_len, d_model)
        attn_output = torch.cat(attn_outputs, dim=-1)

        # Final linear transformation
        # (batch_size, seq_len, d_model) -> (batch_size, seq_len, d_model)
        output = self.W_O(attn_output)

        return output
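
The explicit loop over heads is easy to read, but the same result can be computed with batched matrix multiplications. The sketch below uses random tensors with made-up toy sizes in place of the projected Q, K, and V, and checks that the two routes agree.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, h, seq_len, d_k = 2, 4, 6, 8  # toy sizes chosen for illustration
Q = torch.randn(batch_size, h, seq_len, d_k)
K = torch.randn(batch_size, h, seq_len, d_k)
V = torch.randn(batch_size, h, seq_len, d_k)

# Per-head loop, mirroring the class above
per_head = []
for i in range(h):
    scores = torch.bmm(Q[:, i], K[:, i].transpose(1, 2)) / math.sqrt(d_k)
    per_head.append(torch.bmm(F.softmax(scores, dim=-1), V[:, i]))
looped = torch.cat(per_head, dim=-1)  # (batch_size, seq_len, h * d_k)

# Batched alternative: matmul broadcasts over the leading (batch, head) dimensions
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
batched = torch.matmul(F.softmax(scores, dim=-1), V)  # (batch_size, h, seq_len, d_k)
batched = batched.transpose(1, 2).contiguous().view(batch_size, seq_len, h * d_k)

print(torch.allclose(looped, batched, atol=1e-5))
```

The `transpose(1, 2)` followed by `view` places the heads side by side along the last dimension, which is the same layout that `torch.cat(..., dim=-1)` produces.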

## Transformer Block

This class represents a single Transformer block:
1. It applies multi-head self-attention to the input.
2. The output is passed through a feed-forward neural network.
3. Each sub-layer is wrapped in a residual connection followed by layer normalization.

In [5]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, h)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)

        # Self-attention
        attention_output = self.attention(x)

        # Add & Norm
        x = self.layer_norm1(x + attention_output)

        # Feed-forward
        ff_output = self.feed_forward(x)

        # Add & Norm
        x = self.layer_norm2(x + ff_output)

        return x
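
To see what one "Add & Norm" step does to the activations, here is a minimal sketch. The `nn.Linear` layer is only a stand-in for a sub-layer (attention or feed-forward), and the width of 16 is an arbitrary toy value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16                            # toy width, for illustration only
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the feed-forward net

x = torch.randn(2, 5, d_model)          # (batch_size, seq_len, d_model)
y = layer_norm(x + sublayer(x))         # residual add, then normalize

print(y.shape)
print(y.mean(dim=-1).abs().max().item())            # close to 0
print(y.var(dim=-1, unbiased=False).mean().item())  # close to 1
```

With freshly initialized `LayerNorm` parameters (scale 1, shift 0), each token's feature vector comes out with approximately zero mean and unit variance, regardless of the scale of the sub-layer output.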

## Positional Encoding

This class implements positional encoding:
1. It creates a matrix of positional encodings.
2. The encodings are based on sine and cosine functions of different frequencies.
3. These encodings are added to the input embeddings to provide position information.

In [6]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()

        # Create positional encoding matrix
        # Shape: (1, max_len, d_model)
        pe = torch.zeros(1, max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)

        # Register pe as a buffer, which is a tensor that's part of the module
        # but not a parameter. This allows it to be saved and loaded with the model,
        # and moved to the correct device automatically.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        # Add positional encoding to the input
        return x + self.pe[:, :x.size(1)]
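
The table built above follows PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). As a worked example, the plain-Python sketch below computes the same values for a tiny table; the function name `positional_encoding` is illustrative, and it assumes an even `d_model`.

```python
import math

def positional_encoding(max_len, d_model):
    """Plain-Python version of the sinusoidal table, assuming d_model is even."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):  # i is the even feature index
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and later feature pairs vary more slowly with position because their divisor is larger.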

## Encoder

This is the main Encoder class:
1. It embeds the input tokens and adds positional encoding.
2. The input is passed through multiple Transformer blocks.
3. The output is averaged across the sequence dimension.
4. A final linear layer produces class logits.

In [7]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, h, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([TransformerBlock(d_model, h) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x shape: (batch_size, seq_len)

        # Embedding
        # (batch_size, seq_len) -> (batch_size, seq_len, d_model)
        x = self.embedding(x)

        # Add positional encoding
        x = self.pos_encoding(x)

        # Pass through transformer blocks
        for layer in self.layers:
            x = layer(x)

        # Global average pooling
        # (batch_size, seq_len, d_model) -> (batch_size, d_model)
        x = x.mean(dim=1)

        # Classification
        # (batch_size, d_model) -> (batch_size, num_classes)
        x = self.fc(x)

        return x
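
Note that `x.mean(dim=1)` averages over every position, including padding. One possible variation, which this notebook does not use, is to average over real tokens only with the help of the tokenizer's `attention_mask`. The helper name `masked_mean` and the toy values below are assumptions for illustration.

```python
import torch

def masked_mean(hidden, attention_mask):
    """Average hidden states over real tokens only (mask: 1 = token, 0 = padding)."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                    # (batch, d_model)
    counts = mask.sum(dim=1).clamp(min=1.0)                # avoid division by zero
    return summed / counts

# Toy input: the last position plays the role of padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

print(masked_mean(hidden, attention_mask))  # tensor([[2., 3.]])
print(hidden.mean(dim=1))                   # plain mean also averages the padding row
```

Using it in the model would require passing the mask into `Encoder.forward`, which is a change to the interface shown above.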

## Data Preparation

This section prepares the SST-2 dataset:
1. We use a pre-trained tokenizer from the 'distilbert-base-cased' model.
2. The SST-2 dataset is loaded using the `load_dataset` function.
3. A tokenization function is defined and applied to the dataset.
4. The dataset is formatted for PyTorch and split into training and validation sets.

In [8]:
# Load a pre-trained tokenizer. We use the tokenizer of DistilBERT, a smaller, faster version of BERT
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')

# Load the SST-2 (Stanford Sentiment Treebank) dataset from the GLUE benchmark
# This dataset contains movie reviews labeled as positive or negative
raw_datasets = load_dataset("glue", "sst2")

# Define a function to tokenize our input sentences
def tokenize_function(examples):
    # Tokenize the sentences, add padding to the maximum length, and truncate if necessary
    # max_length=128 means we'll use sequences of up to 128 tokens
    return tokenizer(examples['sentence'], padding='max_length', truncation=True, max_length=128)

# Apply the tokenization function to our dataset
# This processes the entire dataset, tokenizing each sentence
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Remove unnecessary columns from the dataset
# We only need the tokenized input and the label
tokenized_datasets = tokenized_datasets.remove_columns(['sentence', 'idx'])

# Rename the 'label' column to 'labels' (the Hugging Face convention; the training loop reads batch['labels'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# Set the format of the datasets to PyTorch tensors
tokenized_datasets.set_format('torch')

# Create data loaders for training and evaluation
# DataLoader handles batching and shuffling of the data
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=32)
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=32)
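
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a rough plain-Python sketch. The helper `pad_or_truncate`, the token ids, and the pad id of 0 are all assumptions for illustration; the real tokenizer also takes care of special tokens when truncating, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper mimicking padding='max_length' with truncation=True."""
    clipped = token_ids[:max_length]            # naive truncation
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad      # pad on the right
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad_or_truncate([101, 7, 8, 102], max_length=6)  # made-up token ids
print(ids)   # [101, 7, 8, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every example ends up with the same length, which is what lets the `DataLoader` stack them into a single `(batch_size, 128)` tensor.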

## Model Initialization

Here we initialize our Encoder model:
1. We check if a GPU is available and set the device accordingly.
2. The model is created with specific hyperparameters and moved to the appropriate device.

In [9]:
# Check if a GPU is available, and set the device accordingly
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize our Encoder model with specific hyperparameters
model = Encoder(
    vocab_size=tokenizer.vocab_size,  # Size of the tokenizer's vocabulary
    d_model=256,  # Dimension of the model's hidden states
    num_layers=4,  # Number of transformer blocks
    h=8,  # Number of attention heads
    num_classes=2  # Number of output classes (positive/negative)
).to(device)  # Move the model to the appropriate device (GPU if available)
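
To get a feel for the model's size, a small helper such as the hypothetical `count_parameters` below can be applied to `model`. So that the sketch runs on its own, it is applied here to a single feed-forward sub-layer with the same `d_model=256`.

```python
import torch.nn as nn

d_model = 256  # same width as the model above
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_model * 4),
    nn.ReLU(),
    nn.Linear(d_model * 4, d_model),
)

def count_parameters(module):
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# (256 * 1024 + 1024) + (1024 * 256 + 256) weights and biases
print(count_parameters(feed_forward))  # 525568
```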

## Training Loop

This section implements the training loop:
1. We define an Adam optimizer and set the number of epochs.
2. For each epoch, we iterate over the training data, compute loss, and update the model parameters.
3. After each epoch, we evaluate the model on the validation set and print the accuracy.
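
The loss used in step 2 is cross-entropy computed directly from the logits. As a small worked example with made-up logits and labels, the sketch below shows that `F.cross_entropy` equals the mean negative log-probability of the correct class.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5], [0.1, 1.2]])  # made-up scores: 2 samples, 2 classes
labels = torch.tensor([0, 1])                    # made-up ground-truth classes

loss = F.cross_entropy(logits, labels)

# Same quantity by hand: mean negative log-probability of the correct class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual.item())
```

This is why the model's final layer outputs raw logits with no softmax: the loss function applies the log-softmax internally.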

In [10]:
# Initialize the Adam optimizer with a learning rate of 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 5  # Number of times to iterate over the entire dataset

# Training Loop
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    total_loss = 0

    # Iterate over batches in the training data
    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()  # Reset gradients

        # Move input data to the appropriate device
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids)  # Forward pass
        loss = F.cross_entropy(outputs, labels)  # Compute loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update model parameters

        total_loss += loss.item()  # Accumulate loss for reporting

    # Compute and print average loss for the epoch
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

    # Evaluation
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0

    # Disable gradient computation for evaluation
    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids)  # Forward pass
            _, predicted = torch.max(outputs.data, 1)  # Get predicted class

            total += labels.size(0)  # Count total number of samples
            correct += (predicted == labels).sum().item()  # Count correct predictions

    # Compute and print accuracy
    accuracy = 100 * correct / total
    print(f"Validation Accuracy: {accuracy:.2f}%")

Epoch 1/5: 100%|██████████| 2105/2105 [01:48<00:00, 19.36it/s]
Epoch 1/5, Average Loss: 0.4719
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 74.55it/s]
Validation Accuracy: 77.18%

Epoch 2/5: 100%|██████████| 2105/2105 [01:54<00:00, 18.37it/s]
Epoch 2/5, Average Loss: 0.2657
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 72.90it/s]
Validation Accuracy: 80.62%

Epoch 3/5: 100%|██████████| 2105/2105 [01:48<00:00, 19.48it/s]
Epoch 3/5, Average Loss: 0.1826
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 77.49it/s]
Validation Accuracy: 77.06%

Epoch 4/5: 100%|██████████| 2105/2105 [01:50<00:00, 19.11it/s]
Epoch 4/5, Average Loss: 0.1291
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 78.88it/s]
Validation Accuracy: 80.85%

Epoch 5/5: 100%|██████████| 2105/2105 [01:50<00:00, 19.00it/s]
Epoch 5/5, Average Loss: 0.0995
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 78.69it/s]
Validation Accuracy: 79.36%
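
In this run the training loss keeps falling (0.4719 to 0.0995) while validation accuracy moves between roughly 77% and 81%, which suggests the model is starting to overfit. One common response, not implemented in this notebook, is to keep the weights from the epoch with the best validation accuracy. The sketch below only picks that epoch from the numbers printed above; saving the weights (for example with `torch.save(model.state_dict(), ...)` inside the loop) would be an additional step.

```python
# Validation accuracies printed above, one per epoch
val_accuracies = [77.18, 80.62, 77.06, 80.85, 79.36]

# Index of the highest accuracy, converted to a 1-based epoch number
best_epoch = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i]) + 1
print(f"Best epoch: {best_epoch} ({val_accuracies[best_epoch - 1]:.2f}%)")
# Best epoch: 4 (80.85%)
```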





## Model Testing

Finally, we test our trained model:
1. We define a `predict_sentiment` function that takes a sentence and returns the predicted sentiment.
2. We test the model on a few example sentences and print the results.

This completes our implementation of a Transformer-based sentiment analysis model. The model learns to classify movie reviews as positive or negative based on the text content.

In [11]:
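def predict_sentiment(text):
    model.eval()  # Make sure the model runs in inference mode

    # Tokenize input text and move to appropriate device.
    # Note: training inputs were padded to 128 tokens, but this call adds no padding,
    # so the mean pooling here averages over real tokens only.
    input_ids = tokenizer.encode(text, return_tensors='pt').to(device)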

    with torch.no_grad():  # Disable gradient computation
        output = model(input_ids)  # Forward pass
        prediction = torch.argmax(output, dim=1)  # Get predicted class

    # Return "Positive" for class 1, "Negative" for class 0
    return "Positive" if prediction.item() == 1 else "Negative"

# Test sentences to evaluate our model
test_sentences = [
    "This movie is great!",
    "I didn't enjoy the film at all.",
    "The acting was superb and the plot was engaging.",
    "A complete waste of time and money."
]

print("\nTesting the model:")
for sentence in test_sentences:
    sentiment = predict_sentiment(sentence)
    print(f"Sentence: '{sentence}'")
    print(f"Predicted sentiment: {sentiment}\n")


Testing the model:
Sentence: 'This movie is great!'
Predicted sentiment: Positive

Sentence: 'I didn't enjoy the film at all.'
Predicted sentiment: Positive

Sentence: 'The acting was superb and the plot was engaging.'
Predicted sentiment: Positive

Sentence: 'A complete waste of time and money.'
Predicted sentiment: Positive
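
In the output above, the two negative sentences are also labeled Positive. Since `argmax` returns only a hard label, it hides how close each decision was. A minimal sketch of a helper that also reports the softmax probability follows; the name `label_with_confidence` and the example logits are made up for illustration.

```python
import torch
import torch.nn.functional as F

def label_with_confidence(logits):
    """Hypothetical helper: map a (1, 2) logit tensor to (label, probability)."""
    probs = F.softmax(logits, dim=1)
    confidence, prediction = probs.max(dim=1)
    label = "Positive" if prediction.item() == 1 else "Negative"
    return label, confidence.item()

example_logits = torch.tensor([[0.2, 0.3]])  # made-up logits, nearly tied
label, confidence = label_with_confidence(example_logits)
print(label, round(confidence, 3))
```

In `predict_sentiment`, the `output` tensor could be passed to such a helper to show which predictions are near the 50% boundary.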



# Vectorized Version

In [12]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from transformers import AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, _ = x.size()

        # Vectorized computation of Q, K, V for all heads at once
        Q = self.W_Q(x).view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)
        K = self.W_K(x).view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)
        V = self.W_V(x).view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)

        # Compute attention scores for all heads in parallel
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply softmax to get attention weights
        attn_weights = F.softmax(attn_scores, dim=-1)

        # Compute attention output for all heads in parallel
        attn_output = torch.matmul(attn_weights, V)

        # Reshape and apply final linear transformation
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.W_O(attn_output)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, h)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attention_output = self.attention(x)
        x = self.layer_norm1(x + attention_output)
        ff_output = self.feed_forward(x)
        x = self.layer_norm2(x + ff_output)
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(1, max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, h, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([TransformerBlock(d_model, h) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for layer in self.layers:
            x = layer(x)
        x = x.mean(dim=1)
        x = self.fc(x)
        return x

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
raw_datasets = load_dataset("glue", "sst2")

def tokenize_function(examples):
    return tokenizer(examples['sentence'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['sentence', 'idx'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets.set_format('torch')

train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=32)
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=32)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Encoder(
    vocab_size=tokenizer.vocab_size,
    d_model=256,
    num_layers=4,
    h=8,
    num_classes=2
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Validation Accuracy: {accuracy:.2f}%")

def predict_sentiment(text):
    input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
    with torch.no_grad():
        output = model(input_ids)
        prediction = torch.argmax(output, dim=1)
    return "Positive" if prediction.item() == 1 else "Negative"

test_sentences = [
    "This movie is great!",
    "I didn't enjoy the film at all.",
    "The acting was superb and the plot was engaging.",
    "A complete waste of time and money."
]

print("\nTesting the model:")
for sentence in test_sentences:
    sentiment = predict_sentiment(sentence)
    print(f"Sentence: '{sentence}'")
    print(f"Predicted sentiment: {sentiment}\n")

Epoch 1/5: 100%|██████████| 2105/2105 [01:25<00:00, 24.54it/s]
Epoch 1/5, Average Loss: 0.4836
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 76.09it/s]
Validation Accuracy: 77.41%

Epoch 2/5: 100%|██████████| 2105/2105 [01:23<00:00, 25.12it/s]
Epoch 2/5, Average Loss: 0.2672
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 70.40it/s]
Validation Accuracy: 79.47%

Epoch 3/5: 100%|██████████| 2105/2105 [01:24<00:00, 25.05it/s]
Epoch 3/5, Average Loss: 0.1833
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 75.86it/s]
Validation Accuracy: 79.13%

Epoch 4/5: 100%|██████████| 2105/2105 [01:23<00:00, 25.16it/s]
Epoch 4/5, Average Loss: 0.1308
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 75.36it/s]
Validation Accuracy: 78.67%

Epoch 5/5: 100%|██████████| 2105/2105 [01:23<00:00, 25.17it/s]
Epoch 5/5, Average Loss: 0.0977
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 74.93it/s]
Validation Accuracy: 78.90%

Testing the model:
Sentence: 'This movie is great!'
Predicted sentiment: Positive

Sentence: 'I didn't enjoy the film at all.'
Predicted sentiment: Positive

Sentence: 'The acting was superb and the plot was engaging.'
Predicted sentiment: Positive

Sentence: 'A complete waste of time and money.'
Predicted sentiment: Negative






# Using PyTorch Modules

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

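# Note: MultiHeadAttention and TransformerBlock below wrap nn.MultiheadAttention for comparison,
# but the Encoder in this cell uses nn.TransformerEncoderLayer and does not call them.
class MultiHeadAttention(nn.Module):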
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.multihead_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):
        return self.multihead_attn(x, x, x)[0]

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.layer_norm1(x + self.attention(x))
        x = self.layer_norm2(x + self.feed_forward(x))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.dropout = nn.Dropout(p=0.1)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, dim_feedforward=d_model * 4, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        x = self.transformer_encoder(x)
        x = x.mean(dim=1)
        x = self.fc(x)
        return x

# Data preparation
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
raw_datasets = load_dataset("glue", "sst2")

def tokenize_function(examples):
    return tokenizer(examples['sentence'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['sentence', 'idx'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets.set_format('torch')

train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=32)
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=32)

# Model initialization
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Encoder(
    vocab_size=tokenizer.vocab_size,
    d_model=256,
    num_layers=4,
    num_heads=8,
    num_classes=2
).to(device)

# Training settings
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 5

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

    # Evaluation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Validation Accuracy: {accuracy:.2f}%")

# Model testing
def predict_sentiment(text):
    input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
    with torch.no_grad():
        output = model(input_ids)
        prediction = torch.argmax(output, dim=1)
    return "Positive" if prediction.item() == 1 else "Negative"

test_sentences = [
    "This movie is great!",
    "I didn't enjoy the film at all.",
    "The acting was superb and the plot was engaging.",
    "A complete waste of time and money."
]

print("\nTesting the model:")
for sentence in test_sentences:
    sentiment = predict_sentiment(sentence)
    print(f"Sentence: '{sentence}'")
    print(f"Predicted sentiment: {sentiment}\n")

Epoch 1/5: 100%|██████████| 2105/2105 [01:33<00:00, 22.58it/s]
Epoch 1/5, Average Loss: 0.5441
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 83.46it/s]
Validation Accuracy: 77.18%

Epoch 2/5: 100%|██████████| 2105/2105 [01:32<00:00, 22.71it/s]
Epoch 2/5, Average Loss: 0.3762
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 80.37it/s]
Validation Accuracy: 78.67%

Epoch 3/5: 100%|██████████| 2105/2105 [01:32<00:00, 22.70it/s]
Epoch 3/5, Average Loss: 0.3018
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 86.38it/s]
Validation Accuracy: 79.93%

Epoch 4/5: 100%|██████████| 2105/2105 [01:32<00:00, 22.70it/s]
Epoch 4/5, Average Loss: 0.2510
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 82.11it/s]
Validation Accuracy: 80.28%

Epoch 5/5: 100%|██████████| 2105/2105 [01:32<00:00, 22.69it/s]
Epoch 5/5, Average Loss: 0.2153
Evaluating: 100%|██████████| 28/28 [00:00<00:00, 88.28it/s]
Validation Accuracy: 79.70%

Testing the model:
Sentence: 'This movie is great!'
Predicted sentiment: Positive

Sentence: 'I didn't enjoy the film at all.'
Predicted sentiment: Positive

Sentence: 'The acting was superb and the plot was engaging.'
Predicted sentiment: Positive

Sentence: 'A complete waste of time and money.'
Predicted sentiment: Negative
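
The built-in `nn.TransformerEncoder` also accepts a `src_key_padding_mask`, which the cell above does not pass. The sketch below, using toy sizes and random inputs, illustrates that with the mask in place the outputs at real-token positions no longer depend on what sits in the padded positions. In the notebook such a mask could presumably be derived from the tokenizer's `attention_mask` (positions where it is 0); that wiring is an assumption and is not shown here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 16, 4  # toy sizes for illustration
layer = nn.TransformerEncoderLayer(
    d_model, num_heads, dim_feedforward=d_model * 4,
    dropout=0.0,            # disabled so the comparison below is deterministic
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(1, 5, d_model)
padding_mask = torch.tensor([[False, False, False, True, True]])  # True = padding

out_a = encoder(x, src_key_padding_mask=padding_mask)

# Change only the padded positions; real-token outputs should be unaffected
x_changed = x.clone()
x_changed[:, 3:] = torch.randn(1, 2, d_model)
out_b = encoder(x_changed, src_key_padding_mask=padding_mask)

print(out_a.shape)
print(torch.allclose(out_a[:, :3], out_b[:, :3], atol=1e-5))
```

The mask only stops real tokens from attending to padding; the final `x.mean(dim=1)` in `Encoder.forward` would still average over padded positions unless it is also masked.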




