<a href="https://colab.research.google.com/github/xchen2763/TorchLeet/blob/main/torch/hard/transformer/transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem: Build a Transformer Model from Scratch

## Objective
Implement a **Transformer model** in PyTorch for sequence processing and prediction. The model should include an embedding layer, a Transformer encoder, and an output projection layer.

## Tasks

1. Implement Positional Encoding to inject sequence order into embeddings  
Create sinusoidal positional encodings that are added to input embeddings to provide order information.

2. Implement Multi-Head Self Attention mechanism  
Apply attention in parallel across multiple heads to capture different representation subspaces.

3. Linear projection of queries, keys, and values  
Use a single linear layer to project the input into one concatenated tensor that is then split into Q, K, and V.

4. Scaled dot-product attention  
Compute attention scores as the dot product of queries and keys divided by sqrt(head_dim), apply softmax over the key dimension, and use the resulting weights to average the values.

5. Output projection after head concatenation  
Concatenate the outputs of all heads and project back to the original embedding dimension.

6. Implement FeedForward layer used within Transformer blocks  
Build a two-layer MLP with a ReLU activation in between to process each token independently.

7. Connect components in a TransformerEncoderLayer with proper layer normalization and residual connections  
Apply residual connections and layer normalization around the attention and feedforward sublayers.
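
For reference, the quantities named in tasks 1 and 4 are usually written as below. The symbols are notation introduced here, not taken from the notebook: $pos$ is the token position, $i$ indexes pairs of embedding dimensions, $d_{model}$ is the embedding size, and $d_k$ is the per-head dimension (`head_dim` in the code).

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The `div_term` expression in `PositionalEncoding` computes the factor $10000^{-2i/d_{model}}$ through `exp` and `log`, so multiplying it by the position gives the argument of the sine and cosine above.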


## Requirements

- Support padded input sequences for variable-length data.
- Ensure the model handles batched inputs with correct tensor shapes.
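
One common way to meet the padding requirement is to mask padded key positions before the softmax. The sketch below is only an illustration under stated assumptions: `masked_attention` and its `key_padding_mask` argument are hypothetical names, not part of the notebook's classes, and the mask convention (True marks padding) is a choice made here.

```python
import math
import torch


def masked_attention(q, k, v, key_padding_mask=None):
    """Scaled dot-product attention that ignores padded key positions.

    q, k, v: (B, num_heads, T, head_dim)
    key_padding_mask: (B, T) bool tensor, True where the token is padding
                      (hypothetical argument, not part of the notebook's API).
    """
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))  # (B, H, T, T)
    if key_padding_mask is not None:
        # Broadcast the mask over heads and query positions: (B, 1, 1, T)
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # padded keys receive weight 0
    return weights @ v, weights


# Two sequences padded to length 4; the second has only 2 real tokens.
lengths = torch.tensor([4, 2])
pad_mask = torch.arange(4)[None, :] >= lengths[:, None]  # (B, T), True = padding

torch.manual_seed(0)
q = k = v = torch.randn(2, 2, 4, 8)  # (B, num_heads, T, head_dim)
out, weights = masked_attention(q, k, v, pad_mask)
print(out.shape)         # torch.Size([2, 2, 4, 8])
print(weights[1, 0, 0])  # the last two entries are zero
```

Outputs at padded query positions are still computed, so they would also need to be excluded from any pooling or loss. A sequence made up entirely of padding would yield NaN rows and needs separate handling.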


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math

In [6]:
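class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)  # (max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        # Frequencies 10000^(-2i / d_model) for each even index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (requires even d_model)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), broadcasts over the batch
        self.register_buffer('pe', pe)  # follows the module's device, not trained

    def forward(self, x):
        # x: (B, T, d_model)
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len]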


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0

        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)


    def forward(self, x):
        B, T, D = x.shape
        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(B, T, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # Each is (B, num_heads, T, head_dim)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn_weights = torch.softmax(scores, dim=-1)
        attn_output = attn_weights @ v  # (B, num_heads, T, head_dim)
        attn_output = attn_output.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(attn_output)


class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim):
        ...

    def forward(self, x):
        ...


class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        ...

    def forward(self, x):
        ...

class TransformerModel(nn.Module):
    def __init__(self, input_dim, embed_dim, num_heads, num_layers, ff_dim, output_dim):
        ...

    def forward(self, x):
        ...
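
The three stubbed classes can be filled in several ways. The following is a minimal sketch of one option, not the notebook's reference solution. The assumptions are: a post-norm residual layout, `nn.Linear` as the embedding (the inputs here are continuous values rather than token ids), and mean pooling over the sequence so that each sample yields one prediction. Compact copies of `PositionalEncoding` and `MultiHeadSelfAttention` are included only so that the block runs on its own.

```python
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):  # compact copy of the class above
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        return x + self.pe[:, : x.size(1)]


class MultiHeadSelfAttention(nn.Module):  # compact copy of the class above
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, embed_dim // num_heads
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, D = x.shape
        qkv = self.qkv_proj(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, num_heads, T, head_dim)
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        return self.out_proj((weights @ v).transpose(1, 2).reshape(B, T, D))


class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim):
        super().__init__()
        # Two-layer MLP applied to every token independently
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim)
        )

    def forward(self, x):
        return self.net(x)


class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ff = FeedForward(embed_dim, ff_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Assumption: post-norm layout, i.e. LayerNorm(x + sublayer(x))
        x = self.norm1(x + self.attn(x))
        return self.norm2(x + self.ff(x))


class TransformerModel(nn.Module):
    def __init__(self, input_dim, embed_dim, num_heads, num_layers, ff_dim, output_dim):
        super().__init__()
        self.embed = nn.Linear(input_dim, embed_dim)  # assumption: Linear as the embedding
        self.pos = PositionalEncoding(embed_dim)
        self.layers = nn.ModuleList(
            [TransformerEncoderLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        )
        self.out = nn.Linear(embed_dim, output_dim)

    def forward(self, x):
        x = self.pos(self.embed(x))  # (B, T, embed_dim)
        for layer in self.layers:
            x = layer(x)
        # Assumption: mean-pool over the sequence to get one prediction per sample
        return self.out(x.mean(dim=1))  # (B, output_dim)


model = TransformerModel(input_dim=1, embed_dim=16, num_heads=2, num_layers=2, ff_dim=64, output_dim=1)
print(model(torch.rand(4, 10, 1)).shape)  # torch.Size([4, 1])
```

The output shape `(B, output_dim)` matches the shape of the target `y` built in the next cell, which is why some form of pooling over the sequence dimension is needed. Taking only the first or last token would be an equally valid choice.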


In [None]:
# Generate synthetic data
torch.manual_seed(42)
seq_length = 10
num_samples = 100
input_dim = 1
X = torch.rand(num_samples, seq_length, input_dim)  # Random sequences, shape (num_samples, seq_length, input_dim)
y = torch.sum(X, dim=1)  # Target is the sum of each sequence, shape (num_samples, input_dim)

# Initialize the model, loss function, and optimizer
input_dim = 1
embed_dim = 16
num_heads = 2
num_layers = 2
ff_dim = 64
output_dim = 1

model = TransformerModel(input_dim, embed_dim, num_heads, num_layers, ff_dim, output_dim)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Training loop
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    predictions = model(X)
    loss = criterion(predictions, y)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")

Epoch [100/1000], Loss: 1.5771
Epoch [200/1000], Loss: 0.8907
Epoch [300/1000], Loss: 0.6074
Epoch [400/1000], Loss: 0.3587
Epoch [500/1000], Loss: 0.1986
Epoch [600/1000], Loss: 0.1157
Epoch [700/1000], Loss: 0.0762
Epoch [800/1000], Loss: 0.0629
Epoch [900/1000], Loss: 0.0575
Epoch [1000/1000], Loss: 0.0379


In [None]:
# Testing on new data
X_test = torch.rand(2, seq_length, input_dim)
with torch.no_grad():
    predictions = model(X_test)
    print(f"Predictions for {X_test.tolist()}: {predictions.tolist()}")

Predictions for [[[0.6648573279380798], [0.6041934490203857], [0.3187063932418823], [0.9813531041145325], [0.09837877750396729], [0.3223891258239746], [0.3124500513076782], [0.36122316122055054], [0.8705818057060242], [0.4751177430152893]], [[0.569571316242218], [0.05407053232192993], [0.16180634498596191], [0.8140731453895569], [0.34717607498168945], [0.6788632273674011], [0.11463749408721924], [0.21608346700668335], [0.7405895590782166], [0.8521053194999695]]]: [[5.141801834106445], [5.020108699798584]]
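
The printed predictions are easier to judge next to the true sums. The sketch below recomputes the targets from the two test sequences shown above, with values copied from that output and rounded to four decimals, so the results are approximate.

```python
import torch

# Values copied from the printed output above (rounded to 4 decimals)
X_test = torch.tensor([
    [0.6649, 0.6042, 0.3187, 0.9814, 0.0984, 0.3224, 0.3125, 0.3612, 0.8706, 0.4751],
    [0.5696, 0.0541, 0.1618, 0.8141, 0.3472, 0.6789, 0.1146, 0.2161, 0.7406, 0.8521],
]).unsqueeze(-1)                                  # (2, seq_length, 1)
predictions = torch.tensor([[5.1418], [5.0201]])  # (2, 1)

targets = X_test.sum(dim=1)                       # same rule used to build y
abs_error = (predictions - targets).abs()
for t, p, e in zip(targets.flatten(), predictions.flatten(), abs_error.flatten()):
    print(f"target={t.item():.3f}  prediction={p.item():.3f}  abs_error={e.item():.3f}")
```

The true sums are roughly 5.01 and 4.55, so the first prediction is off by about 0.13 and the second by about 0.47. Both predictions sit close to 5, the expected sum of ten uniform values, which suggests checking error on held-out data rather than relying on the training loss alone.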
