# The Transformer Architecture

## Introduction

The Transformer architecture, introduced in the paper *"Attention Is All You Need"* (Vaswani et al., 2017), revolutionized the field of Natural Language Processing (NLP) by relying solely on attention mechanisms, dispensing with recurrence and convolutions entirely. This design allows far more parallelization during training and handles long-range dependencies better than recurrent models.

In this notebook, we will:

1. **Explain the core components of the Transformer:**
   - Multi-Head Self-Attention
   - Positional Encoding
   - Feedforward Layers

2. **Show a simplified implementation in PyTorch.**

**Prerequisites:**

- Familiarity with PyTorch tensors and modules.
- Basic understanding of attention mechanisms.
- Knowledge of linear transformations and embeddings.

---

## 1. The Transformer: A High-Level Overview

A Transformer consists of a stack of identical layers in both the **encoder** and **decoder**. Each encoder layer primarily includes:

1. **Multi-Head Self-Attention:**  
   Allows the model to focus on different parts of the input sequence, computing attention multiple times in parallel.

2. **Feedforward Layers:**  
   A fully connected network applied to each position, expanding the representational capacity.

3. **Add & Norm Steps:**  
   Each sub-layer is wrapped in a residual connection followed by layer normalization, so its output is \( \text{LayerNorm}(x + \text{Sublayer}(x)) \).

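The decoder has a similar structure but adds a second attention sub-layer (cross-attention) that attends to the encoder outputs. Its self-attention is masked so that each position can attend only to itself and earlier positions, which prevents the model from seeing future tokens during training.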
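
The decoder masking described above can be sketched with a lower-triangular matrix. This is a minimal illustration, assuming the convention used by the attention code later in this notebook (1 = may attend, 0 = blocked); the variable names are illustrative only.

```python
import torch

seq_len = 4

# Entry (i, j) is 1 when position i may attend to position j, i.e. when j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Uniform raw scores, so the effect of the mask is easy to read off.
scores = torch.zeros(seq_len, seq_len)
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(causal_mask)
print(weights)  # row i spreads its weight evenly over positions 0..i
```

Blocked entries receive a score of `-inf`, so they get exactly zero weight after the softmax.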

**Key Innovation:**  
The Transformer does not rely on recurrence or convolutions. Instead, it uses **attention** to directly relate every token in the input sequence to every other token. This allows parallel processing and better handles long-range dependencies.

---

## 2. Multi-Head Self-Attention

### 2.1 Self-Attention Recap

Self-attention takes a sequence of embeddings (the input tokens embedded into vector space) and transforms them into a new sequence of the same length. For each position, it computes a weighted combination of the values at all positions in the sequence (including its own), where the weights determine how much each token influences this one.

**Formula:**

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

- \( Q \) (Query): Transformed representation of the token we’re focusing on.
- \( K \) (Key): Transformed representation of the context tokens.
- \( V \) (Value): Transformed representation used to produce the final output.
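- \( d_k \): Dimensionality of the keys. Dividing by \( \sqrt{d_k} \) keeps the dot products from growing with the dimension and pushing the softmax into regions with very small gradients.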
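
The formula above translates almost line by line into code. The following is a minimal single-head sketch; the function name, the random projection matrices, and the toy sizes are assumptions made for illustration.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (L, L) similarity scores
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
x = torch.randn(5, 8)  # 5 tokens, embedding size 8

# In self-attention, Q, K and V are all projections of the same sequence.
W_q, W_k, W_v = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)
out, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)

print(out.shape)      # torch.Size([5, 4])
print(weights.shape)  # torch.Size([5, 5])
```

Row `i` of `weights` says how strongly token `i` draws on every token in the sequence, itself included.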

### 2.2 Multi-Head Attention

Instead of computing just one attention function, the Transformer uses **multiple heads**. Each head projects the queries, keys, and values into a different subspace and performs attention in parallel. The results are concatenated and projected again, allowing the model to focus on different types of relationships simultaneously.

**Formula for Multi-Head:**

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

where each head \( \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \).

**Benefits:**

- Different attention heads can capture different relations or types of information.
- More expressive power than single-head attention, at a similar total cost when each head works in a subspace of size \( d_{model} / h \).
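
In practice, the per-head projections are commonly implemented as one large linear layer whose output is reshaped into heads, and `Concat` is the inverse reshape. The sketch below shows only that reshaping step, assuming each head takes a contiguous slice of `d_model // num_heads` features; the sizes are arbitrary.

```python
import torch

B, L, d_model, num_heads = 2, 5, 16, 4
head_dim = d_model // num_heads  # 4 features per head

x = torch.randn(B, L, d_model)   # stand-in for a projected Q, K or V

# Split: (B, L, d_model) -> (B, num_heads, L, head_dim)
heads = x.view(B, L, num_heads, head_dim).transpose(1, 2)

# Head i holds the feature slice [i * head_dim, (i + 1) * head_dim) of every token.
assert torch.equal(heads[:, 1], x[:, :, head_dim:2 * head_dim])

# Merge (the Concat step): (B, num_heads, L, head_dim) -> (B, L, d_model)
merged = heads.transpose(1, 2).contiguous().view(B, L, d_model)
assert torch.equal(merged, x)

print(heads.shape)  # torch.Size([2, 4, 5, 4])
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, which `view` cannot reshape directly.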

---

## 3. Positional Encoding

The Transformer has no inherent notion of sequence order since it does not use recurrence or convolution. To represent the position of each token within the sequence, we add a **positional encoding** to the input embeddings.

**Key Idea:**

- Use trigonometric functions of different frequencies:
  
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

- The positional encoding is added to the token embeddings, giving the model access to positional information. The authors hypothesized that the sinusoidal form may also help the model extrapolate to sequences longer than those seen during training.

**Properties:**

- Positions that are closer together produce similar positional encodings.
- For any fixed offset \( k \), \( PE_{pos+k} \) is a linear function of \( PE_{pos} \), which makes it easier for the model to learn to attend based on relative positions.
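
The relative-position property can be checked numerically: shifting a position by `k` rotates each (sin, cos) pair by an angle that depends only on `k`, not on the position itself. The sketch below is an illustration under the two formulas above; the helper name `sinusoidal_pe` and the chosen sizes are assumptions.

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Return the (max_len, d_model) encoding table and its frequencies (even d_model)."""
    position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    freqs = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(position * freqs)
    pe[:, 1::2] = torch.cos(position * freqs)
    return pe, freqs

pe, freqs = sinusoidal_pe(max_len=50, d_model=8)
pos, k = 7, 5

# Rotate every (sin, cos) pair of PE[pos] by the angle freqs * k (angle-addition identities).
sin_p, cos_p = pe[pos, 0::2], pe[pos, 1::2]
sin_k, cos_k = torch.sin(freqs * k), torch.cos(freqs * k)
shifted_sin = sin_p * cos_k + cos_p * sin_k
shifted_cos = cos_p * cos_k - sin_p * sin_k

assert torch.allclose(shifted_sin, pe[pos + k, 0::2])
assert torch.allclose(shifted_cos, pe[pos + k, 1::2])

# Similarity between encodings for a near and a far pair of positions.
near = (pe[7] @ pe[8]).item()
far = (pe[7] @ pe[27]).item()
print(near, far)
```

In this small example the adjacent pair has the larger dot product, in line with the first property, although the similarity does not decrease strictly monotonically with distance.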

---

## 4. Feedforward Layers

Each position in the sequence is then processed by a two-layer fully connected feedforward network, applied identically and independently to each position. The original paper uses a ReLU non-linearity between the two layers; GELU is a common choice in later models.

**Formula:**

$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

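- Expands the dimensionality in the intermediate representation (the original paper uses \( d_{model} = 512 \) and an inner dimension of 2048), increasing the representational capacity.
- Applied position-wise: the same weights \( W_1, W_2 \) are used independently at every token position.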

**Benefits:**

- Adds more depth and non-linearity to the model.
- Helps the model capture more complex transformations of the input representations.
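
To make "applied to each position" concrete, the sketch below checks that running the network on a whole sequence gives the same result as running it on every token separately. The `nn.Sequential` stand-in and the small sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, dim_ff = 8, 32

# FFN(x) = max(0, x W1 + b1) W2 + b2
ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)

whole = ffn(x)  # all positions in one call
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(whole, per_position, atol=1e-5))
```

Because no information flows between positions here, mixing across tokens happens only in the attention sub-layer.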

---

## 5. Example Implementation in PyTorch

Below is a simplified example of a few core Transformer components: Multi-Head Attention, Positional Encoding, and a feedforward layer. This is not a full Transformer implementation, but it highlights the core building blocks.

### 5.1 Setup

In [6]:
import torch
import torch.nn as nn
import math

### 5.2 Positional Encoding Module

In [7]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        
        # Create a matrix of shape (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        # Slice div_term so that an odd d_model (one fewer cosine column) also works
        pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
        
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len, :]
        return x

### 5.3 Multi-Head Self-Attention

In [8]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        self.out = nn.Linear(d_model, d_model)
        
    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        # mask: broadcastable to (batch, heads, seq_len, seq_len); 1 = attend, 0 = block
        B, L, D = x.size()
        
        # Linear projections
        Q = self.W_q(x).view(B, L, self.num_heads, self.head_dim).transpose(1,2)  # (B, heads, L, head_dim)
        K = self.W_k(x).view(B, L, self.num_heads, self.head_dim).transpose(1,2)
        V = self.W_v(x).view(B, L, self.num_heads, self.head_dim).transpose(1,2)
        
        # Scaled dot product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, heads, L, L)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attn_weights = torch.softmax(scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)  # (B, heads, L, head_dim)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1,2).contiguous().view(B, L, D)
        
        # Final linear projection
        output = self.out(attn_output)
        return output, attn_weights
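
One caveat when passing a `mask` to the module above: if the mask blocks every key for some query, that whole row of scores becomes `-inf` and the softmax returns NaN. The sketch below reproduces this on a tiny example. The `nan_to_num` workaround is an assumption shown for illustration, not part of the code above.

```python
import torch

scores = torch.zeros(2, 3)
mask = torch.tensor([[1, 1, 0],
                     [0, 0, 0]])  # second query has nothing it may attend to

masked = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)  # first row is valid, second row is all NaN

# One possible workaround: treat fully masked rows as "attend to nothing".
safe_weights = torch.nan_to_num(weights, nan=0.0)
print(safe_weights)
```

A causal mask never triggers this, since every position can at least attend to itself. It mainly matters for padding masks.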

### 5.4 Feedforward Layer

In [9]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, dim_ff=2048):
        super(PositionwiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, dim_ff)
        self.fc2 = nn.Linear(dim_ff, d_model)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return self.fc2(self.relu(self.fc1(x)))

### 5.5 Putting It All Together (Single Encoder Layer)

An encoder layer in a Transformer consists of:

1. Multi-Head Self-Attention + Add & Norm
2. Feedforward + Add & Norm

In [10]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dim_ff=2048):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ff = PositionwiseFeedForward(d_model, dim_ff)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
    def forward(self, x, mask=None):
        # Self-attention block
        attn_output, _ = self.self_attn(x, mask)
        x = self.norm1(x + attn_output)  # Residual + Norm
        
        # Feedforward block
        ff_output = self.ff(x)
        x = self.norm2(x + ff_output)     # Residual + Norm
        
        return x
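
The "Residual + Norm" step can be inspected in isolation. The sketch below uses random tensors as stand-ins for the sub-layer input and output (an assumption for illustration) and checks that, with freshly initialized `nn.LayerNorm` parameters, every token vector ends up with mean close to 0 and variance close to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)             # input to the sub-layer
sublayer_out = torch.randn(2, 5, d_model)  # stand-in for attention or FFN output

y = norm(x + sublayer_out)  # post-norm: LayerNorm(x + Sublayer(x))

# Statistics are computed per token, over the feature dimension.
token_mean = y.mean(dim=-1)
token_var = y.var(dim=-1, unbiased=False)
print(token_mean.abs().max().item(), token_var.mean().item())
```

Normalization runs over the feature dimension only, so each token is normalized independently of the batch and of the other tokens.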

### 5.6 Example Usage

In [11]:
# Example: Embedding dimension (d_model) = 128, 8 heads, FF dim = 512
d_model = 128
num_heads = 8
dim_ff = 512
seq_len = 10
batch_size = 2

# Dummy input: (batch, seq_len, d_model)
x = torch.randn(batch_size, seq_len, d_model)
mask = None  # no mask for simplicity

# Initialize encoder layer and positional encoding
pe = PositionalEncoding(d_model)
encoder_layer = TransformerEncoderLayer(d_model, num_heads, dim_ff)

# Add positional encoding
x = pe(x)

# Pass through the encoder layer
output = encoder_layer(x, mask)
print("Output shape:", output.shape)  # (2, 10, 128)

Output shape: torch.Size([2, 10, 128])



---

## 6. Analysis and Further Steps

- **Parallelization:**  
  Since the Transformer relies on attention rather than recurrence, it can process all tokens in a sequence simultaneously, greatly speeding up training.

- **Long-Range Dependencies:**  
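  With self-attention, every token can attend to any other token in a single step, so the path between two distant tokens has constant length rather than growing with their distance as in an RNN. This makes long-range dependencies easier to learn.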

- **Scalability:**  
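  Transformers scale well with data and model size, which led to breakthroughs in language models (e.g., BERT, GPT). Sequence length is the main bottleneck: the cost of full self-attention grows quadratically with the number of tokens, which motivates the efficient attention variants mentioned below.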
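
One caveat to the points above is that the attention score tensor has shape `(batch, heads, L, L)`, so its size grows with the square of the sequence length. The back-of-the-envelope sketch below makes this visible; the helper name and the default of 8 heads are assumptions for illustration.

```python
def attention_score_entries(seq_len, num_heads=8, batch_size=1):
    """Number of entries in the (batch, heads, L, L) attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len

for seq_len in (512, 1024, 2048):
    entries = attention_score_entries(seq_len)
    mib = entries * 4 / 2**20  # 4 bytes per float32 entry
    print(f"L={seq_len}: {entries} entries, {mib:.0f} MiB in float32")
```

Doubling the sequence length quadruples the number of scores per layer.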

**Further Steps:**

- Implement a full Transformer encoder and decoder stack.
- Experiment with different positional encodings (learned, sinusoidal).
- Explore how masking is used in the decoder to prevent access to future tokens.
- Integrate advanced concepts like relative positional encodings, rotary embeddings, or efficient attention variants.
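
As a possible starting point for the first item, PyTorch ships `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`, which can serve as a reference when building a stack by hand. The sketch below is a hedged usage example with arbitrary hyperparameters. Note that the built-in layers use the opposite boolean-mask convention to this notebook's code: `True` marks positions that may *not* be attended to.

```python
import torch
import torch.nn as nn

d_model, num_heads, dim_ff, num_layers = 128, 8, 512, 2

layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=num_heads,
    dim_feedforward=dim_ff,
    dropout=0.1,
    batch_first=True,  # inputs are (batch, seq_len, d_model)
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

x = torch.randn(2, 10, d_model)

# Causal mask in PyTorch's convention: True above the diagonal = blocked.
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)

out = encoder(x, mask=causal)
print(out.shape)  # torch.Size([2, 10, 128])
```

Comparing the outputs and parameter shapes of the built-in layer with the hand-written `TransformerEncoderLayer` is a useful sanity check when extending this notebook.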

**Remember:** The Transformer architecture and its variants now underpin many state-of-the-art NLP models and have even found applications in vision (ViT) and speech processing. Mastering the fundamentals of Multi-Head Self-Attention, Positional Encoding, and Feedforward Layers is key to understanding modern deep learning models in NLP and beyond.

---

## References

- Vaswani et al. (2017), *"Attention Is All You Need"*, [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)
- The Annotated Transformer: [http://nlp.seas.harvard.edu/2018/04/03/attention.html](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- PyTorch Documentation: [https://pytorch.org/docs/stable/nn.html](https://pytorch.org/docs/stable/nn.html)
