# Transformer Model Implementation

Based on the paper "Attention Is All You Need" (https://arxiv.org/abs/1706.03762)

## 1. Big Picture
- The Transformer is a sequence-to-sequence model introduced for machine translation
- Unlike earlier sequence models, it uses no recurrence and no convolution
- It relies entirely on "attention" to model relationships between words

## 2. Model Structure
- Has an encoder (reads the input sequence) and a decoder (generates the output sequence)
- In the paper, each is a stack of 6 identical layers
- Key innovation: Multi-Head Attention
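
Multi-Head Attention runs several copies of scaled dot-product attention in parallel. A minimal sketch of that single building block is shown here; the function name `scaled_dot_product_attention` and the toy tensor sizes are assumptions for illustration, and masking and dropout are left out.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Each row becomes a probability distribution over key positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 8)   # (batch, query_len, d_k)
k = torch.randn(2, 7, 8)   # (batch, key_len, d_k)
v = torch.randn(2, 7, 16)  # (batch, key_len, d_v)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 7])
```

Each query position gets a weighted average of the value vectors, with weights that sum to 1 across the key positions.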

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import math

## 3. Positional Encoding
- Problem: Attention alone has no notion of word order
- Solution: Add a position-dependent vector to each word embedding
- Uses sine and cosine functions of different frequencies to build these vectors
- The paper's authors hypothesized that this makes it easy for the model to attend by relative position
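
For reference, the sinusoidal encoding from the paper can be written as follows, where $pos$ is the position in the sequence and $i$ indexes pairs of embedding dimensions. The `div_term` in the code of this section computes the factor $10000^{-2i/d_{\text{model}}}$ in log space.

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, which is the reason the paper gives for expecting relative positions to be easy to learn.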

In [None]:
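class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position encodings to (seq_len, batch_size, d_model) inputs."""

    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for each even dimension index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0).transpose(0, 1)  # (max_len, 1, d_model), broadcasts over the batch
        # Buffer: saved with the model and moved by .to(device), but not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model)
        return x + self.pe[:x.size(0), :]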

## 4. Transformer Model

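This model uses only the encoder stack (`nn.TransformerEncoder`) followed by a linear output layer, rather than the full encoder-decoder architecture. It incorporates:
- Multi-Head Attention: Lets the model attend to different parts of the input simultaneously
- Position-wise Feed-Forward Networks: Apply the same two-layer network to every position independently
- Positional Encoding: Injects word-order information into the embeddings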

In [None]:
class TransformerModel(nn.Module):
    def __init__(self, ntoken, d_model, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model)
        encoder_layers = nn.TransformerEncoderLayer(d_model, nhead, nhid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)
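        # Despite the names, `encoder` is the token embedding and `decoder` is the
        # output projection to vocabulary logits (not a Transformer decoder).
        self.encoder = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, ntoken)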

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

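    def forward(self, src, src_mask):
        # src: (seq_len, batch_size) token ids; src_mask: (seq_len, seq_len)
        src = self.encoder(src) * math.sqrt(self.d_model)  # scale embeddings as in the paper
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)  # (seq_len, batch_size, ntoken) logits
        return output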

## 5. Model Setup

Here we set up the model with specific hyperparameters and run a single forward pass on random example input.

In [None]:
# Hyperparameters
ntokens = 10000  # size of vocabulary
emsize = 200  # embedding dimension
nhid = 200  # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # the number of heads in the multiheadattention models
dropout = 0.2  # the dropout value

# Create the model
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout)

# Example input
src = torch.randint(0, ntokens, (10, 32))  # (sequence_length=10, batch_size=32)
src_mask = torch.zeros((10, 10), dtype=torch.bool)  # all False: no position is masked

# Forward pass
output = model(src, src_mask)
print(output.shape)  # Should be (10, 32, ntokens)
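
The `src_mask` above is all `False`, so every position can attend to every other position. For next-word prediction one would typically use a causal mask instead, so that a position cannot see later positions. The following is a minimal sketch; the helper name `causal_mask` is hypothetical, and it follows the PyTorch convention that `True` in a boolean attention mask means "not allowed to attend".

```python
import torch

def causal_mask(size):
    """Boolean (size, size) mask; True marks positions that must NOT be attended to."""
    # Entries strictly above the diagonal correspond to future positions
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

mask = causal_mask(4)
print(mask)
```

Row `i` of the mask is `False` for columns `0..i` and `True` for the rest, so position `i` only attends to itself and earlier positions. A mask built this way, with `size` equal to the sequence length, could be passed in place of `src_mask`.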

## 6. Training Loop

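Here's a basic training loop. It reuses the single random batch from the previous section and uses the input tokens as their own targets, so it only checks that the forward and backward passes run. In practice, you'd need a proper dataset and data loader.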

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

for epoch in range(10):  # Loop over epochs
    model.train()  # Set model to training mode
    for batch in range(100):  # Placeholder: reuses the same random batch 100 times
        optimizer.zero_grad()  # Reset gradients
        output = model(src, src_mask)  # Forward pass
        # Flatten to (seq_len * batch_size, ntokens) vs. (seq_len * batch_size,)
        loss = criterion(output.view(-1, ntokens), src.view(-1))  # Placeholder target: the input itself
        loss.backward()  # Backpropagate
        optimizer.step()  # Update model weights

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')
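
The loop above uses the input tokens as their own targets. For language modeling, the target at each position is usually the next token. A minimal sketch of that shift is given below, with assumed toy sizes and random logits standing in for the model output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len, batch_size = 100, 10, 32  # toy sizes (assumptions)

# Sample one extra time step so inputs and targets can be offset by one position
tokens = torch.randint(0, vocab_size, (seq_len + 1, batch_size))
inputs = tokens[:-1]   # positions 0 .. seq_len-1
targets = tokens[1:]   # positions 1 .. seq_len (the next token at each step)

# Stand-in for model(inputs, mask): logits of shape (seq_len, batch_size, vocab_size)
logits = torch.randn(seq_len, batch_size, vocab_size)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(inputs.shape, targets.shape, loss.item())
```

With real data, `tokens` would come from a tokenized corpus rather than `torch.randint`, and this setup is normally combined with a causal attention mask so the model cannot look at the token it is asked to predict.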

## 7. Why It's Cool
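- Processes all positions of a sequence in parallel during training, unlike recurrent models that step through one word at a time
- Connects every pair of words directly through attention (helps with long-range dependencies)
- Produces attention patterns that we can visualize and inspect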
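
As a rough illustration of the last point, `nn.MultiheadAttention` can return its attention weights alongside the output. The sketch below uses an untrained layer and assumed toy sizes, so the weights carry no meaning yet; it only shows where the numbers for a heatmap would come from.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads, seq_len, batch_size = 16, 2, 5, 3  # toy sizes (assumptions)
attn = nn.MultiheadAttention(embed_dim, num_heads)  # expects (seq_len, batch_size, embed_dim)
x = torch.randn(seq_len, batch_size, embed_dim)

# Self-attention: query, key and value are all x
with torch.no_grad():
    out, weights = attn(x, x, x, need_weights=True)

print(out.shape)      # (seq_len, batch_size, embed_dim)
print(weights.shape)  # (batch_size, seq_len, seq_len), averaged over heads
```

`weights[b, i, j]` is how strongly position `i` attends to position `j` in batch element `b`, so each `weights[b]` can be plotted as a `seq_len` by `seq_len` heatmap.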

## 8. Results
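- At the time of publication, achieved state-of-the-art BLEU scores on the WMT 2014 English-to-German (28.4) and English-to-French (41.8) translation tasks
- Trained at a fraction of the cost of the previous best models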