# 📌 **Transformers from Scratch - A Complete Guide**

This notebook provides a **detailed breakdown** of Transformer models from a **pure mathematical and coding perspective**. We will cover:

✔ **Mathematical foundations of self-attention**  
✔ **Multi-head attention, positional encoding, and feedforward layers**  
✔ **How to build a Transformer step-by-step (without external libraries like Hugging Face)**  
✔ **Pure PyTorch and Keras implementations**  

---

## 🔹 **1. Introduction to Transformers**
Traditional **Recurrent Neural Networks (RNNs)** process words **sequentially**, making them **slow and difficult to parallelize**.  
Transformers **solve this problem** by using **self-attention**, which allows them to process all words **at the same time**.

✔ **Process entire sequences in parallel** (unlike RNNs)  
✔ **Capture long-range dependencies efficiently**  
✔ **Used in models like GPT, BERT, and T5**  

---

## 🧮 **2. Mathematical Foundation of Transformers**

### 🔹 **1. The Self-Attention Mechanism**
Self-attention is the **core of Transformers**. It allows the model to weigh different words **based on their relevance** to the current word.

#### **Step 1: Compute Query, Key, and Value Matrices**
Each word in the sentence is **transformed into three vectors**:

$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$

where:  
- **\( X \)** = input sentence as word embeddings  
- **\( W_Q, W_K, W_V \)** = weight matrices for **Query, Key, and Value**  
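
As a rough illustration of the projection step, here is a minimal NumPy sketch. The sizes (`seq_len`, `d_model`, `d_k`) and the random weights are assumptions chosen for the example; in a trained model the weight matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))   # one row per token embedding
W_Q = rng.normal(size=(d_model, d_k))     # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token gets its own query, key, and value vector
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)  # each is (seq_len, d_k)
```

Row `t` of `Q`, `K`, and `V` is the query, key, and value vector for token `t`.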

#### **Step 2: Compute Attention Scores**
We calculate the similarity between each word using **dot-product attention**:

$$
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
$$

where:  
- **\( d_k \)** = dimension of the key (and query) vectors, used for scaling  
- **Softmax** ensures attention weights sum to **1**.  

This means that **important words get more weight** while unimportant words contribute less.
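
The formula above can be sketched in a few lines of NumPy. The function names (`softmax`, `scaled_dot_product_attention`) and the random inputs are assumptions for illustration, not part of any library.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum first for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row becomes a probability distribution
    return weights @ V, weights          # weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each row sums to 1, up to floating-point error
print(out.shape)             # same shape as V
```

Each output row is a weighted average of the rows of `V`, with the weights taken from the matching row of `weights`.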

---

### 🔹 **2. Multi-Head Attention**
Instead of a **single attention head**, Transformers use **multiple attention heads** to focus on **different parts** of the sentence.

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W_O
$$

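Each **head** applies self-attention independently, using its own learned projections of \( Q \), \( K \), and \( V \) (typically of size embedding dimension divided by \( h \)). This helps the model capture different **semantic relationships**.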
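
One common way to implement this is to project once to the full embedding size and then split the result into `num_heads` slices. The NumPy sketch below follows that approach. The helper names and toy sizes are assumptions for illustration, and it assumes the embedding size is divisible by the number of heads.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    head_dim = d_model // num_heads  # assumes d_model % num_heads == 0

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return M.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    heads = softmax(scores) @ V                            # (heads, seq, head_dim)
    # Concatenate the heads back into (seq_len, d_model), then mix with W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)  # (5, 8)
```

With `num_heads=1` this reduces to ordinary single-head attention followed by the output projection.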

---

### 🔹 **3. Positional Encoding**
Transformers **do not have recurrence**, so they need a way to **encode word order**.  
Positional Encoding is added to the embeddings:

$$
PE(i, 2j) = \sin \left(\frac{i}{10000^{2j/d}}\right), \quad PE(i, 2j+1) = \cos \left(\frac{i}{10000^{2j/d}}\right)
$$

where **\( i \)** is the word position, **\( j \)** indexes the sine/cosine dimension pairs, and **\( d \)** is the embedding dimension.
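
A minimal NumPy sketch of this encoding is shown below. The function name and the sizes are assumptions for illustration, and the code assumes the embedding dimension `d` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    positions = np.arange(max_len)[:, None]   # i, shape (max_len, 1)
    pair_index = np.arange(d // 2)[None, :]   # j, shape (1, d/2)
    angles = positions / (10000 ** (2 * pair_index / d))

    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2j
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2j+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is added to the token embeddings row by row, so each position receives a distinct pattern.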

---

### 🔹 **4. Feedforward Neural Network (FFN)**
Each Transformer layer also includes a **fully connected network**:

$$
\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2
$$

This helps the model **learn complex patterns** beyond attention.
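
The FFN formula can be written directly in NumPy. In this sketch the hidden size `d_ff` and the random weights are assumptions for illustration. Note that the same weights are applied to every position independently.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU(x W_1 + b_1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32  # toy sizes

x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (4, 8)
```

Because no mixing across positions happens here, running the FFN on a single row gives the same result as the matching row of the full output.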

---

## 🚀 **3. Implementing a Transformer from Scratch (PyTorch)**

In [6]:
# 📦 Import basic PyTorch libraries
import torch
import torch.nn as nn
import math


In [7]:
# ✅ 1. Define Self-Attention Mechanism
class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Query, Key, and Value weight matrices
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)

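    def forward(self, x):
        batch_size, seq_length, embed_dim = x.shape

        # Project, then split into heads: (batch, num_heads, seq_length, head_dim)
        def split_heads(t):
            return t.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute Query, Key, Value
        Q = split_heads(self.W_q(x))
        K = split_heads(self.W_k(x))
        V = split_heads(self.W_v(x))

        # Compute Attention Scores (scaled per head)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attention_weights = torch.nn.functional.softmax(scores, dim=-1)

        # Apply attention weights, then merge heads back to (batch, seq_length, embed_dim)
        output = torch.matmul(attention_weights, V)
        output = output.transpose(1, 2).reshape(batch_size, seq_length, embed_dim)
        return self.W_o(output)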

# ✅ 2. Define Transformer Encoder Layer
class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, hidden_dim):
        super().__init__()
        self.self_attention = SelfAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-Attention + Residual Connection
        attn_output = self.self_attention(x)
        x = self.norm1(x + attn_output)

        # Feedforward + Residual Connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x

# ✅ 3. Define Full Transformer Model
class Transformer(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.positional_encoding = nn.Parameter(torch.zeros(1, 500, embed_dim))  # Learned positional embeddings (max length = 500), used here instead of the sinusoidal formula
        self.encoder_layers = nn.ModuleList([TransformerEncoderLayer(embed_dim, num_heads, hidden_dim) for _ in range(num_layers)])
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x) + self.positional_encoding[:, :x.size(1), :]
        for layer in self.encoder_layers:
            x = layer(x)
        return self.fc(x)

# ✅ 4. Initialize and Print Model Summary
vocab_size = 10000  # Example vocab size
model = Transformer(vocab_size)
print(model)


Transformer(
  (embedding): Embedding(10000, 128)
  (encoder_layers): ModuleList(
    (0-1): 2 x TransformerEncoderLayer(
      (self_attention): SelfAttention(
        (W_q): Linear(in_features=128, out_features=128, bias=True)
        (W_k): Linear(in_features=128, out_features=128, bias=True)
        (W_v): Linear(in_features=128, out_features=128, bias=True)
        (W_o): Linear(in_features=128, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (ffn): Sequential(
        (0): Linear(in_features=128, out_features=256, bias=True)
        (1): ReLU()
        (2): Linear(in_features=256, out_features=128, bias=True)
      )
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
  )
  (fc): Linear(in_features=128, out_features=10000, bias=True)
)


## 🚀 **4. Implementing a Transformer Encoder (Keras)**

In [8]:
# 📦 Import necessary libraries
import tensorflow as tf
from tensorflow.keras import layers


In [9]:
# ✅ Define Multi-Head Attention
class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim // num_heads)  # key_dim is the size of each head

    def call(self, x):
        return self.att(x, x)

# ✅ Define Transformer Encoder Layer
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, hidden_dim):
        super().__init__()
        self.attention = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential([
            layers.Dense(hidden_dim, activation="relu"),
            layers.Dense(embed_dim)
        ])
        self.layernorm1 = layers.LayerNormalization()
        self.layernorm2 = layers.LayerNormalization()

    def call(self, x):
        x = self.layernorm1(x + self.attention(x))
        x = self.layernorm2(x + self.ffn(x))
        return x

# ✅ Initialize Model (no positional encoding is added here, so token order is not represented)
inputs = layers.Input(shape=(500,))
x = layers.Embedding(10000, 128)(inputs)
x = TransformerEncoder(128, 4, 256)(x)
x = layers.Dense(10000, activation="softmax")(x)
model = tf.keras.Model(inputs, x)
model.summary()


# ❓ **Key Questions for Interviews**

---

### 1️⃣ **What is the main advantage of Transformers over RNNs and LSTMs?**  
Transformers process **entire sequences in parallel**, unlike RNNs, which process text **sequentially**. This enables **faster training** and helps capture **long-range dependencies** better.

---

### 2️⃣ **How does the self-attention mechanism work?**  
Self-attention allows the model to determine the **importance of each word** in a sentence relative to others. It computes:

$$
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
$$

where \( Q, K, V \) are **Query, Key, and Value matrices** derived from input embeddings.

---

### 3️⃣ **Why do we scale the dot product by \( \sqrt{d_k} \) in self-attention?**  
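Without scaling, the dot product of **high-dimensional vectors** tends to be **large in magnitude**: for components with unit variance, its variance grows in proportion to \( d_k \). Large scores push softmax toward extreme outputs (close to 0 or 1), where gradients become very small. Scaling by \( \sqrt{d_k} \) keeps the scores at a consistent scale, giving **stability and smooth gradients**.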
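
A quick numerical check makes the effect visible. This sketch assumes query and key components are independent standard normal values, which is an idealisation of real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
results = {}

for d_k in (4, 64, 512):
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)  # one dot product per sample
    raw_var = dots.var()
    scaled_var = (dots / np.sqrt(d_k)).var()
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.1f}  scaled variance={scaled_var:.2f}")
```

Under this assumption the raw variance comes out close to `d_k`, while the scaled variance stays close to 1 regardless of dimension.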

---

### 4️⃣ **What is multi-head attention and why is it important?**  
Multi-head attention allows the model to **focus on multiple aspects** of a sentence **simultaneously**. Instead of using a single attention mechanism, we apply multiple **independent attention heads** and concatenate their results:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W_O
$$

This helps the model capture **different relationships** within the sentence.

---

### 5️⃣ **Why do Transformers use positional encoding?**  
Transformers **do not have recurrence** (unlike RNNs), so they need **positional encodings** to retain word order. The encoding function is:

$$
PE(i, 2j) = \sin \left(\frac{i}{10000^{2j/d}}\right), \quad PE(i, 2j+1) = \cos \left(\frac{i}{10000^{2j/d}}\right)
$$

where \( i \) is the position, \( j \) indexes the sine/cosine dimension pairs, and \( d \) is the embedding dimension.

---

### 6️⃣ **What are LayerNorm and residual connections used for in Transformers?**  
✔ **LayerNorm** stabilizes training by normalizing activations across feature dimensions.  
✔ **Residual connections** help **prevent vanishing gradients** by allowing direct information flow:

$$
\text{Output} = \text{LayerNorm}(x + \text{Self-Attention}(x))
$$
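
The "add, then normalize" step can be sketched in NumPy as follows. This simplified `layer_norm` omits the learnable scale and shift that `nn.LayerNorm` includes, and `sublayer_out` is a random stand-in for the attention output.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 8))  # deliberately off-center input
sublayer_out = rng.normal(size=(4, 8))           # stand-in for Self-Attention(x)

y = layer_norm(x + sublayer_out)  # residual connection, then LayerNorm
print(y.mean(axis=-1))  # approximately 0 for every token
print(y.std(axis=-1))   # approximately 1 for every token
```

The residual term `x` passes through the addition unchanged, which gives gradients a direct path around the sublayer.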

---

### 7️⃣ **What is the Feedforward Neural Network (FFN) in a Transformer?**  
Each Transformer layer contains a **fully connected feedforward network (FFN)** after the attention mechanism:

$$
\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2
$$

This introduces **non-linearity** and allows the model to learn more **complex representations**.

---

### 8️⃣ **How do Transformers handle long sequences efficiently?**  
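Standard **self-attention** compares every token with every other token, so its cost grows as **\( O(n^2) \)** in the sequence length \( n \). To handle long sequences, efficient variants reduce this cost: **Longformer** uses **sparse attention** (a sliding window plus a few global tokens) that scales as **\( O(n) \)**, and **Reformer** uses **locality-sensitive hashing attention** that scales as **\( O(n \log n) \)**.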
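
To get a feel for the savings, the sketch below counts how many query-key pairs a sliding-window pattern keeps compared with full attention. It illustrates the local-window idea only and is not an implementation of any specific model. The function name and sizes are assumptions.

```python
import numpy as np

def sliding_window_mask(n, window):
    # True where token i may attend to token j, i.e. |i - j| <= window
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 1024, 16
mask = sliding_window_mask(n, window)

full_pairs = n * n
local_pairs = int(mask.sum())
print(full_pairs)   # 1048576
print(local_pairs)  # 33520
```

For a fixed window size, the number of kept pairs grows linearly with `n`, while full attention grows quadratically.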

---

### 9️⃣ **What is the difference between an Encoder and a Decoder in a Transformer?**  
✔ **Encoder** → Processes input sequences and extracts meaningful representations.  
✔ **Decoder** → Generates output sequences (used in translation, text generation).  
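✔ **Both** use **self-attention**, but the decoder's self-attention is **masked** so each position attends only to earlier positions. In encoder-decoder models, **decoders** also have **cross-attention**, allowing them to attend to encoder outputs.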
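
Decoder self-attention is usually causally masked so that a position cannot look at later positions. A minimal NumPy sketch of that masking, with hypothetical helper names and random inputs, might look like this:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # exp(-inf) = 0 after softmax
    return softmax(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))

weights = causal_attention_weights(Q, K)
print(np.round(weights, 2))  # lower-triangular: no weight on future tokens
```

The first row places all of its weight on the first token, since nothing earlier exists for it to attend to.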

---

### 🔟 **What are the main differences between BERT and GPT?**  
| Feature | BERT | GPT |
|---------|------|-----|
| Architecture | Encoder-only, bidirectional | Decoder-only, unidirectional (left-to-right) |
| Pretraining Task | Masked Language Model (MLM), plus next-sentence prediction in the original BERT | Causal Language Model (CLM) |
| Use Case | Text Understanding | Text Generation |

BERT **understands** text by predicting missing words, while GPT **generates** text based on previous tokens.
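
The two pretraining objectives differ mainly in how inputs and targets are built. The toy sketch below uses plain Python and a hand-picked set of masked positions. Real MLM pretraining samples the positions randomly and uses a more involved replacement scheme, so treat this as a simplification.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal LM (GPT-style): predict each next token from the tokens before it
clm_inputs = tokens[:-1]
clm_targets = tokens[1:]

# Masked LM (BERT-style): hide some positions and predict them using both sides
masked_positions = [1, 4]  # chosen by hand for this example
mlm_inputs = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

print(clm_inputs, "->", clm_targets)
print(mlm_inputs, "->", mlm_targets)
```

In the causal case every position has a target, while in the masked case only the hidden positions do.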

---