<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/07.trans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/07.trans.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# The Transformer

📝 SALP chapter 9

## **The Transformer Overview**

- **Transformers** are the core architecture of **large language models (LLMs)**,
  - which are revolutionizing speech and language processing.
- Focus: **Left-to-right (causal) language modeling**, where tokens are predicted sequentially based on prior context.
- The key mechanism: **Self-attention** (multi-head attention) allows the model to integrate information from surrounding tokens to capture long-range dependencies and contextual relationships.

---

### **Transformer Architecture Components**

- **1. Transformer Blocks**: 
  - Each block contains **multi-head attention**, a **feedforward network**, and **layer normalization**.
  - A series of blocks maps the input vectors $(𝐱_1, \dots, 𝐱_n)$ to output vectors $(𝐡_1, \dots, 𝐡_n)$ of the same length, allowing deep processing over multiple layers.
  
- **2. Input Encoding**: 
  - Transforms input tokens into **contextual vectors** using an **embedding matrix $𝐄$** and **positional encoding** to capture token order.

- **3. Language Modeling Head**: 
  - Projects hidden states through an **unembedding matrix $𝐔$**, applying **softmax** over the vocabulary to generate a token prediction.

![The architecture of a (left-to-right) transformer](./images/trans/lrtrans.png)

---

### **Next Steps in Transformer Exploration**

- **Multi-head attention** and detailed transformer block mechanisms are covered in the sections below.
- **Pretraining** and **token generation** via **sampling** (Chapter 10).
- **Masked language modeling** (Chapter 11) introduces **BERT** models.
- Prompting LLMs and aligning with **human preferences** (Chapter 12).
- **Encoder-decoder** architecture for **machine translation** (Chapter 13).

## Attention

- **Problem with Static Embeddings**:
  - Word embeddings like **word2vec** are static, i.e., the word’s meaning remains constant across contexts.
  - 🍎 The word "it" is always represented by the same vector, even though its meaning varies in sentences:
    - The chicken didn’t cross the road because **it** was too tired.
      - it → chicken
    - The chicken didn’t cross the road because **it** was too wide.
      - it → road
  
- **Need for Contextual Representations**: Context-dependent meanings are crucial
  - e.g., "it" refers to different entities in the two sentences above.

---

### **Challenges of Context in Language Models**

- **Context Sensitivity**:
  - Left-to-right language models struggle with context.
    - e.g., The chicken didn’t cross the road because **it**
    - At this point, the model can't resolve whether "it" refers to the chicken or the road, requiring further context.

- **Distant Dependencies**:
  - Linguistic relationships, like subject-verb agreement or disambiguation, often span across long distances.
  - e.g., "The **keys** to the cabinet **are** on the table"
    - The model must understand that "keys" (plural) governs "are" (plural verb).
  - e.g., "I walked along the **pond**, and noticed one of the trees along the **bank**."
    - Here "bank" refers to the side of a pond or river, not a financial institution.

---

### **Transformers and Contextual Representations**

- **Solution: Transformers**:
  - **Transformers** address the need for contextualized word meanings by integrating information from surrounding words.
  - Layer-by-layer, transformers build up **contextual embeddings**, refining token meanings by considering neighboring tokens.

- **Self-Attention**:
  - Attention is the core mechanism allowing transformers to focus on **relevant words** within the context.
  - At each layer, token representations are refined by weighing contributions from surrounding tokens in the previous layer.

---

### **Self-Attention in Action: Example with "It"**
- In the sentence: "The chicken didn’t cross the road because **it**," the attention mechanism weighs heavily on the tokens **chicken** and **road**.
- The transformer dynamically adjusts the representation of "it" by considering both **chicken** and **road** as possible references.

![The self-attention weight distribution α](./images/trans/getit.png)

---

### **Formal Definition of Attention**

- **Attention Mechanism**:
  - Takes the input token representation $𝐱_i[1×d]$ and a context window of prior inputs $𝐱_1, \dots, 𝐱_{i-1}$, producing an output $𝐚_i[1×d]$.
    - $d$ - model dimensionality
  - **Context Window**: In left-to-right models, the model attends only to previous tokens.

![Information flow in causal self-attention](./images/trans/casat.png)

- $\text{self-attention}: (𝐱_1, \dots, 𝐱_n) ↦ (𝐚_1, ⋯, 𝐚_n)$ 

---

### **Simplified Attention: Weighted Sum of Context Vectors**

- **Core Idea**: Attention is a **weighted sum** of context vectors.
  - **Weighting**: The weight $\alpha_{ij}$ is computed via the similarity between tokens $𝐱_i$ and $𝐱_j$, typically using the dot product.
  - $𝐚_i = \sum_{j \leq i} \alpha_{ij} 𝐱_j$
  - $\alpha_{ij}$: Weight determining how much token $𝐱_j$ contributes to the final representation $𝐚_i$.
- **Similarity Scores**:
  - Dot product computes similarity:
    - $s_{ij} = \text{score}(𝐱_i, 𝐱_j) = 𝐱_i \cdot 𝐱_j$
  - These scores are normalized using softmax to create a **probability distribution** over the tokens.
    - $\displaystyle \underset{j≤ i}{\alpha_{ij}} = \frac{e^{s_{ij}}}{\sum_{k \leq i} e^{s_{ik}}}$
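
The simplified version above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming token vectors are stacked as rows of an array `X`; the function name `simple_causal_attention` and the toy values are made up for this example.

```python
import numpy as np

def simple_causal_attention(X):
    """Simplified causal self-attention: no learned weights, only dot products.

    X has shape [n, d]; row i is the vector for token i.
    Returns A with shape [n, d], where A[i] = sum over j <= i of alpha_ij * X[j].
    """
    n, _ = X.shape
    A = np.zeros_like(X, dtype=float)
    for i in range(n):
        context = X[: i + 1]                     # vectors for tokens up to and including i
        scores = context @ X[i]                  # s_ij = x_i . x_j for every j <= i
        weights = np.exp(scores - scores.max())  # subtract the max for numerical safety
        weights /= weights.sum()                 # softmax over j <= i
        A[i] = weights @ context                 # weighted sum of the context vectors
    return A

# Toy input: 3 tokens, d = 2 (values chosen arbitrarily)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
A = simple_causal_attention(X)
```

The first token can only attend to itself, so `A[0]` equals `X[0]`; later rows blend the earlier vectors according to their dot-product similarity.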

---

### **A Single Attention Head Using Query, Key, and Value Matrices**
- An attention head in a transformer is a structured layer built around three weight matrices.
  - They represent the three different roles that each input embedding plays during attention.
- **Roles of Query, Key, and Value**:
  - **Query** $𝐪_i[1×d_k]$: Represents the current token as it is compared against the preceding tokens.
    - $d_k$ - dimension for the key and query vectors
  - **Key** $𝐤_i[1×d_k]$: Represents a token in its role as a preceding input that is compared with the current token to determine a similarity weight.
  - **Value** $𝐯_i[1×d_v]$: The content of a preceding token that is weighted and summed to compute the output for the current token.
    - $d_v$ - dimension for the value vectors

- **Projections**: 
  - $𝐪_i = 𝐱_i 𝐖^𝐐, \quad 𝐤_i = 𝐱_i 𝐖^𝐊, \quad 𝐯_i = 𝐱_i 𝐖^𝐕$
    - weight matrix shapes: $𝐖^𝐐[d×d_k], 𝐖^𝐊[d×d_k], 𝐖^𝐕[d×d_v]$
  - These projections enable each token to play different roles in the attention process.

---

### **Scaling Attention for Stability**

- **Dot Product Scaling**: To prevent large dot product values from causing instability, similarity scores are divided by the square root of the query/key dimension $d_k$:
  - $s_{ij} = \text{score}(𝐱_i, 𝐱_j) = \dfrac{𝐪_i \cdot 𝐤_j}{\sqrt{d_k}}$
  - Without scaling, large scores push the softmax into saturation, where gradients become very small; scaling keeps training more stable.
- **Final Attention Output**: The output $𝐚_i$ for token $i$ is the weighted sum of the value vectors:
  - $\displaystyle 𝐚_i = \sum_{j \leq i} \alpha_{ij} 𝐯_j$
    - This sums up the information from previous tokens, weighted by their relevance to the current token $𝐱_i$.
    - $𝐪_i = 𝐱_i 𝐖^𝐐, \quad 𝐤_j = 𝐱_j 𝐖^𝐊, \quad 𝐯_j = 𝐱_j 𝐖^𝐕$

- **Attention Layer**: This computation happens for all tokens simultaneously, creating an output sequence of the same length as the input.
- 🍎 Calculating the value of $𝐚_3$ 
  - the third element of a sequence using causal (left-to-right) self-attention.

![Calculating the value of a₃](./images/trans/cala.png)
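
As a rough sketch of the computation of $𝐚_3$, the snippet below follows the query, key, and value equations step by step in NumPy. The sizes and the random weight matrices are placeholders chosen only for illustration; in a trained model they would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k, d_v = 8, 4, 4             # illustrative sizes, not taken from the text

X = rng.normal(size=(3, d))       # x_1, x_2, x_3 stacked as rows
W_Q = rng.normal(size=(d, d_k))   # random stand-ins for learned weights
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

i = 2                             # zero-based index of the third token
q = X[i] @ W_Q                    # q_3 = x_3 W^Q
K = X[: i + 1] @ W_K              # k_1, k_2, k_3 as rows
V = X[: i + 1] @ W_V              # v_1, v_2, v_3 as rows

scores = K @ q / np.sqrt(d_k)     # s_3j = q_3 . k_j / sqrt(d_k)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()              # softmax over j <= 3
a_3 = alpha @ V                   # a_3 = sum_j alpha_3j v_j
```

Only the query of the current token is needed, while keys and values are computed for every token up to and including position 3.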

---

### **Multi-Head Attention: Expanding Model Capability**

- **Multiple Attention Heads**: Transformers use $h$ **attention heads** in parallel, each with its own **query, key, and value matrices**.
  - Each head focuses on different patterns or relationships within the sequence.
  
- **Concatenation and Projection**:
  - $𝐚_i = (\textbf{head}_i^1 \oplus \textbf{head}_i^2 \dots \oplus \textbf{head}_i^h) 𝐖^𝐎$
    - The outputs from all heads are concatenated and projected back to the original dimensionality $d$.
    
    - $\displaystyle\textbf{head}_i^c = ∑_{j≤i}α_{ij}^c 𝐯_j^c,\quad ∀c (1≤c≤h)$
    
    - $\displaystyle \underset{j≤ i}{\alpha_{ij}^c} = \underset{j≤i}{\text{softmax}}(s^c_{ij}) = \frac{e^{s^c_{ij}}}{\sum_{m \leq i} e^{s^c_{im}}}$
    
    - $s^c_{ij} = \text{score}^c(𝐱_i, 𝐱_j) = \dfrac{𝐪^c_i \cdot 𝐤^c_j}{\sqrt{d_k}}$
    
    - $𝐪^c_i = 𝐱_i 𝐖^{𝐐_c}, \quad 𝐤^c_j = 𝐱_j 𝐖^{𝐊_c}, \quad 𝐯^c_j = 𝐱_j 𝐖^{𝐕_c}$

- 🍎 The multi-head attention computation for input $𝐱_i$ producing output $𝐚_i$
![The multi-head attention computation for input 𝐱ᵢ producing output 𝐚ᵢ](./images/trans/mutihead.png)

## Transformer Blocks
- Self-attention lies at the core of the transformer block
- The processing of a token through the transformer block is called a **residual stream**.

![The architecture of a transformer block showing the residual stream](./images/trans/block.png)

- An input vector $𝐱_i$ is carried through the block as a stream of $d$-dimensional representations.
- Layer norm is applied before the attention and feedforward layers (prenorm architecture).
- Each layer adds its output back into the residual stream.

---

### **Feedforward Layer (FFN)**
- A fully connected, 2-layer network
  - One hidden layer, two weight matrices
  - Same weights for each token, different across layers
- Dimensionality $d_{\text{ff}}$ of the hidden layer is usually larger than the model dimensionality $d$
  - e.g.,  in the original transformer model: $d = 512$, $d_{\text{ff}} = 2048$

- $\text{FFN}(𝐱_i) = \text{ReLU}(𝐱_i 𝐖^1 + b_1) 𝐖^2 + b_2$
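
A minimal NumPy sketch of the FFN formula is shown here. The weights are random placeholders (a trained model would learn them), and the helper name `ffn` is just a label for this example.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward network: ReLU(x W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU, shape [..., d_ff]
    return hidden @ W2 + b2                 # back to shape [..., d]

rng = np.random.default_rng(0)
d, d_ff = 512, 2048                         # sizes quoted for the original transformer
W1, b1 = rng.normal(scale=0.02, size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d)), np.zeros(d)

x_i = rng.normal(size=d)                    # one token vector
X = rng.normal(size=(5, d))                 # five token vectors stacked as rows
y_i = ffn(x_i, W1, b1, W2, b2)              # shape [d]
Y = ffn(X, W1, b1, W2, b2)                  # shape [5, d]
```

Because the same weights are applied to every row, calling `ffn` on the stacked matrix gives the same result as calling it on each token separately.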

---

### **Layer Normalization (Layer Norm)**
- Applied to a single token's embedding vector, not to the entire layer
- Normalizes vector of dimensionality $d$ to have zero mean and unit variance
  - Mean $\mu$ over the vector components:
    - $\displaystyle\mu = \dfrac{1}{d} \sum_{i=1}^{d} x_i$
  - Standard deviation $\sigma$:
    - $\displaystyle\sigma = \sqrt{\dfrac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2}$
  - Normalized vector $\hat{𝐱}$:
    - $\displaystyle\hat{𝐱} = \dfrac{(𝐱 - \mu)}{\sigma}$
- Keeps values within a range for gradient-based training
  - Layer normalization with two learnable parameters $\gamma$ and $\beta$:
  - $\displaystyle\text{LayerNorm}(𝐱) = \gamma \dfrac{(𝐱 - \mu)}{\sigma} + \beta$
  - $\gamma$ - gain, and $\beta$ - offset
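
The formulas above translate almost directly into NumPy. In this sketch the small constant `eps` is an addition for numerical safety that does not appear in the formulas, and the input values are arbitrary.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over the last axis (the d components of each token vector).

    eps guards against division by zero; it is an assumption of this sketch.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)   # population std, divides by d
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([2.0, 4.0, 6.0, 8.0])          # one token vector with d = 4
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gain 1 and offset 0 the output has mean close to 0 and standard deviation close to 1; the learnable $\gamma$ and $\beta$ then let the model rescale and shift that normalized vector.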

---

### **Putting it All Together: Transformer Block Process**
- **Input:** Token embedding vector $𝐱_i$
- **Step-by-step process:**
  1. Layer Norm:

     - $𝐭_i^1 = \text{LayerNorm}(𝐱_i)$

  2. Multi-Head Attention (over the normalized vectors of all tokens $𝐭_1^1, \ldots, 𝐭_N^1$):

     - $𝐭_i^2 = \text{MultiHeadAttention}(𝐭_i^1, [𝐭_1^1, \ldots, 𝐭_N^1])$

  3. Residual connection:

     - $𝐭_i^3 = 𝐭_i^2 + 𝐱_i$

  4. Second Layer Norm:

     - $𝐭_i^4 = \text{LayerNorm}(𝐭_i^3)$

  5. Feedforward Network:

     - $𝐭_i^5 = \text{FFN}(𝐭_i^4)$

  6. Final residual connection (output of the block):

     - $𝐡_i = 𝐭_i^5 + 𝐭_i^3$
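
The six steps can be sketched as one NumPy function that processes every token row at once. This is a simplified illustration under several assumptions: layer norm uses a fixed gain of 1 and offset of 0, all weights are random placeholders, and every helper name is invented for the example.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Per-row normalization; gain and offset fixed to 1 and 0 for brevity.
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def softmax(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def causal_multi_head_attention(T, heads, W_O):
    N = T.shape[0]
    mask = np.triu(np.full((N, N), -np.inf), k=1)        # -inf above the diagonal, 0 elsewhere
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = T @ W_Q, T @ W_K, T @ W_V
        weights = softmax(Q @ K.T / np.sqrt(W_K.shape[1]) + mask)
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1) @ W_O        # concatenate heads, project to d

def ffn(T, W1, b1, W2, b2):
    return np.maximum(0.0, T @ W1 + b1) @ W2 + b2

def transformer_block(X, heads, W_O, ffn_params):
    T1 = layer_norm(X)                                   # step 1: layer norm
    T2 = causal_multi_head_attention(T1, heads, W_O)     # step 2: attention
    T3 = T2 + X                                          # step 3: residual connection
    T4 = layer_norm(T3)                                  # step 4: second layer norm
    T5 = ffn(T4, *ffn_params)                            # step 5: feedforward
    return T5 + T3                                       # step 6: final residual

rng = np.random.default_rng(0)
N, d, h, d_k, d_ff = 4, 8, 2, 4, 32                      # toy sizes
heads = [tuple(rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(scale=0.1, size=(h * d_k, d))
ffn_params = (rng.normal(scale=0.1, size=(d, d_ff)), np.zeros(d_ff),
              rng.normal(scale=0.1, size=(d_ff, d)), np.zeros(d))
X = rng.normal(size=(N, d))
H = transformer_block(X, heads, W_O, ffn_params)         # shape [N, d]
```

Attention is the only step where rows interact; layer norm, the feedforward network, and the residual additions all act on each token's vector independently.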

-  An attention head can move information from token A’s residual stream into token B’s residual stream.

![pass from one residual stream to another](./images/trans/pass.png)

---

### **Stacking Transformer Blocks**

- Blocks can be stacked (12 layers in GPT-3-small, 96 layers in GPT-3-large)
- Residual stream metaphor: input processed across layers
- Each block refines token representation, leading to better predictions
- Essential for large-scale models like GPT and T5

## Parallelization in Transformers Using A Single Matrix 𝐗
- Computation for each token in a transformer block can be parallelized.
  - ∵ The computation of $𝐡_i$ from the input $𝐱_i$ for one token is independent of the computation for every other token.
- Input embeddings for all tokens are packed into a matrix $𝐗$ of size $[N \times d]$
  - where $N$ is the number of tokens, and $d$ is the embedding dimensionality.
- Common input sizes range from 1K to 32K tokens.

---

### **Attention Computation for a Single Attention Head**

- Multiply $𝐗$ by key, query, and value weight matrices $𝐖^𝐊∈[d×d_k]$,  $𝐖^𝐐∈[d×d_k]$, and $𝐖^𝐕∈[d×d_v]$.
- Compute $𝐐\in \mathbb{R}^{N \times d_k}$, $𝐊\in \mathbb{R}^{N \times d_k}$, and $𝐕\in \mathbb{R}^{N \times d_v}$ matrices for the entire input sequence.
  - $𝐐 = 𝐗𝐖^𝐐, \quad 𝐊 = 𝐗𝐖^𝐊, \quad 𝐕 = 𝐗𝐖^𝐕$

- **Efficient Query-Key Comparisons**
  - Perform all query-key comparisons using matrix multiplication $𝐐𝐊^𝐓∈[N \times N]$.
  - ![all qᵢ · kⱼ comparisons in a single matrix multiple](./images/trans/qk.png)

- The entire self-attention step for an entire sequence of $N$ tokens for one head
  - $\displaystyle 𝐀 = \text{softmax}\left( \text{mask}\left(\frac{𝐐𝐊^𝐓}{\sqrt{d_k}}\right) \right) 𝐕$
  - This computes the attention for all tokens simultaneously.
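
A compact NumPy sketch of this matrix form follows. It assumes the mask sets every score with $j > i$ to $-\infty$ before the softmax so that future tokens receive zero weight; the function name and the toy sizes are illustrative only.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """One attention head over a whole sequence: softmax(mask(Q K^T / sqrt(d_k))) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # [N, N]; entry (i, j) scores q_i against k_j
    N = scores.shape[0]
    scores = np.where(np.tri(N, dtype=bool), scores, -np.inf)  # keep j <= i, mask j > i
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d, d_k, d_v = 4, 6, 3, 3                              # toy sizes
X = rng.normal(size=(N, d))
W_Q, W_K = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))
A, weights = causal_self_attention(X, W_Q, W_K, W_V)
```

The returned `weights` matrix is lower triangular: row $i$ holds the attention distribution of token $i$ over tokens $1, \dots, i$.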

---

### **Masking Future Tokens**
- For language modeling, we prevent access to future tokens by masking the upper-triangular part of the $𝐐𝐊^𝐓$ matrix.
  - $M_{ij} = 
  \begin{cases}
    -\infty & \text{if } j > i \text{, i.e. for the upper-triangular portion} \\
    0 & \text{otherwise}
  \end{cases}$
  - ![mask](./images/trans/mask.png)
  - This avoids cheating by looking at the following tokens.
- The attention computation for a single attention head in parallel

![The attention computation for a single attention head in parallel](./images/trans/singleall.png)

---

### **Parallelizing Multi-Head Attention**
- Multiple attention heads run in parallel.
- For each head $i$, calculate $𝐐^i[N×d_k]$, $𝐊^i[N×d_k]$, and $𝐕^i[N×d_v]$ 
  - $𝐐^i = 𝐗𝐖^{𝐐_i}, \quad 𝐊^i = 𝐗𝐖^{𝐊_i}, \quad 𝐕^i = 𝐗𝐖^{𝐕_i}$
    - Matrix shape: $𝐖^{𝐐_i}∈ ℝ^{d×d_k}$, $𝐖^{𝐊_i}∈ ℝ^{d×d_k}$, and $𝐖^{𝐕_i}∈ ℝ^{d×d_v}$
  - $\textbf{head}_i = \text{SelfAttention}(𝐐^i, 𝐊^i, 𝐕^i) = \text{softmax}\left( \text{mask}\left( \dfrac{𝐐^i (𝐊^i)^𝐓}{\sqrt{d_k}} \right) \right) 𝐕^i$
    - $\textbf{head}_i ∈ ℝ^{N×d_v}$, so the concatenation of the $h$ head outputs has shape $[N×hd_v]$

- **Final Multi-Head Attention Output:**
  - The outputs of the $h$ heads are concatenated and projected back to shape $[N \times d]$.
  - $\text{MultiHeadAttention}(𝐗) = (\textbf{head}_1 \oplus \textbf{head}_2 \oplus \cdots \oplus \textbf{head}_h) 𝐖^𝐎$
    - $𝐖^𝐎 \in \mathbb{R}^{hd_v \times d}$ is the linear projection matrix.
  - This ensures the dimensionality is preserved for further transformer layers.
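
As an illustration of the shapes involved, here is a NumPy sketch of multi-head attention over a whole sequence, with a causal mask applied inside each head. The per-head weights are stored as stacked arrays, which is a convenience of this example rather than something the text prescribes, and all values are random placeholders.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Causal multi-head attention over a whole sequence.

    X: [N, d]; W_Q, W_K: [h, d, d_k]; W_V: [h, d, d_v]; W_O: [h * d_v, d].
    """
    N = X.shape[0]
    causal = np.tri(N, dtype=bool)                       # True where j <= i
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):             # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v              # [N, d_k], [N, d_k], [N, d_v]
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        scores = np.where(causal, scores, -np.inf)       # mask future positions
        heads.append(softmax_rows(scores) @ V)           # [N, d_v]
    concat = np.concatenate(heads, axis=-1)              # [N, h * d_v]
    return concat @ W_O                                  # [N, d]

rng = np.random.default_rng(0)
N, d, h, d_k, d_v = 5, 16, 4, 4, 4                       # toy sizes
W_Q = rng.normal(size=(h, d, d_k))
W_K = rng.normal(size=(h, d, d_k))
W_V = rng.normal(size=(h, d, d_v))
W_O = rng.normal(size=(h * d_v, d))
X = rng.normal(size=(N, d))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)        # shape [N, d]
```

The heads are independent of one another, so a real implementation would typically batch them into a single tensor operation instead of looping.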

---

### **Putting It All Together with the Parallelized Input Matrix $𝐗$**
- The computation of a transformer block over all $N$ input tokens is parallelized.
- Layer normalization and feedforward layers are applied in parallel to each token.

   - $𝐎 = 𝐗 + \text{MultiHeadAttention}(\text{LayerNorm}(𝐗))$
   - $𝐇 = 𝐎 + \text{FFN}(\text{LayerNorm}(𝐎))$

- Break it down with one equation for each component computation (prenorm, matching the per-token steps):
   - $𝐓^1 = \text{LayerNorm}(𝐗)$
   - $𝐓^2 = \text{MultiHeadAttention}(𝐓^1)$
   - $𝐓^3 = 𝐓^2 + 𝐗$
   - $𝐓^4 = \text{LayerNorm}(𝐓^3)$
   - $𝐓^5 = \text{FFN}(𝐓^4)$
   - $𝐇 = 𝐓^5 + 𝐓^3$


## The Input: Embeddings for Token and Position

  - The transformer creates two embeddings: 
    - `token` embedding and `positional` embedding.
  - The combined embeddings form the input matrix $𝐗$ of shape $[N \times d]$
    - where $N$ is the number of tokens, and $d$ is the embedding dimensionality.
- Token Embeddings from the Vocabulary
  - Token embeddings are vectors of dimension $d$ stored in the embedding matrix $𝐄$.
  - Each word from the vocabulary $V$ has a corresponding row in $𝐄$.
    - $𝐄 \in \mathbb{R}^{|V| \times d}$

---

### **Selecting the Embedding for a Sequence of Tokens**
  - Convert tokens into `vocabulary indices` and select the corresponding rows from the embedding matrix $𝐄$.
    - This is equivalent to multiplying a `one-hot vector` with the embedding matrix to retrieve the token embedding.

- 🍎 **One-Hot Vector:**
  - $[0\ 0\ 0\ 0\ 1\ 0\ 0 \dots 0]$
  - The embedding for token $i$ is selected as $𝐄[i]$.

![Selecting the embedding vector for word V₅](./images/trans/seli.png)

- The entire token sequence can be selected similarly
  
![Selecting the embedding matrix for the input sequence of token ids W](./images/trans/selseq.png)
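
The equivalence between one-hot multiplication and row lookup can be checked with a small NumPy example. The vocabulary size, the embedding values, and the token ids below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                      # toy sizes
E = rng.normal(size=(vocab_size, d))       # embedding matrix, one row per vocabulary item

token_ids = np.array([5, 2, 5, 7])         # hypothetical ids for a 4-token input
one_hot = np.eye(vocab_size)[token_ids]    # [N, |V|], a single 1 in each row

by_matmul = one_hot @ E                    # one-hot rows times E
by_index = E[token_ids]                    # direct row lookup
```

Both routes give the same $[N \times d]$ matrix; in practice the index lookup is preferred because it avoids building the large one-hot matrix.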

---

### **Positional Embeddings**
- Token embeddings are position-independent.
- Positional embeddings represent the position of each token in the sequence.
  - Stored in the `Positional Embedding Matrix`: $𝐏 \in \mathbb{R}^{N \times d}$, with one row per position
- How to get these positional embeddings?
  - Absolute positional embeddings are randomly initialized and learned during training.
  - Each position in the sequence has a corresponding embedding.

- The final input representation $𝐗[N \times d]$ is obtained by `adding token embeddings and positional embeddings.`
  - $𝐗[i] = 𝐄[\text{id}(i)] + 𝐏[i]$

![A simple way to model position](./images/trans/posem.png)
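
A short NumPy sketch of this addition is given below. Both matrices are random stand-ins for learned parameters, and the names `max_len` and `token_ids` are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 10, 6, 4          # toy sizes
E = rng.normal(size=(vocab_size, d))       # token embeddings
P = rng.normal(size=(max_len, d))          # absolute position embeddings, one row per position

token_ids = np.array([3, 1, 3])            # the same token appears at positions 0 and 2
N = len(token_ids)
X = E[token_ids] + P[:N]                   # X[i] = E[id(i)] + P[i]
```

The repeated token gets two different input vectors because its position embedding differs, which is how the model can tell the two occurrences apart.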

---

### **Limitations of Absolute Position Embeddings**

- Early positions are well-trained due to their frequency in training
  - while later positions may be undertrained and may not generalize well.

- A more robust alternative is using a `static function` like sine and cosine functions to map integer inputs to real-valued vectors in a way that captures the inherent relationships among the positions.
- `Relative positional embeddings` capture relationships between positions and are often implemented in the attention mechanism at each layer.
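
One well-known static function is the sinusoidal encoding from the original transformer paper, sketched below in NumPy. The notes above do not specify an exact formula, so treat this as one possible instance; the function name is illustrative and the code assumes an even dimensionality $d$.

```python
import numpy as np

def sinusoidal_positions(num_positions, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos of the same angle."""
    positions = np.arange(num_positions)[:, None]        # [N, 1]
    even_dims = np.arange(0, d, 2)[None, :]              # [1, d/2], holds the values 2i
    angles = positions / np.power(10000.0, even_dims / d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                         # even columns
    pe[:, 1::2] = np.cos(angles)                         # odd columns
    return pe

P = sinusoidal_positions(num_positions=50, d=16)         # shape [50, 16]
```

Since nothing is learned, every position gets a well-defined vector, including positions rarely or never seen during training.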

## Language Modeling Head
  - A "head" refers to additional neural circuitry added to the transformer for specific tasks.
  - It enables transformers to predict the next word given a context.
- **Word Prediction with Language Models**
  - Given a context like “Thanks for all the”, the model computes $P(\text{fish}|\text{Thanks for all the})$.
  - The language model outputs a distribution over the entire vocabulary for predicting the next word.
  - In transformers, the context size is determined by the model’s window, allowing large contexts (e.g., 2K, 4K tokens).

---

### **Structure of the Language Modeling Head**
- The head takes the output embedding $𝐡^𝐋_𝐍$ from the last transformer layer $L$ for the last token $N$.
- The goal is to produce a probability distribution over the vocabulary for the next token $N+1$.

![Word Prediction with Language Models](./images/trans/lmhead.png)

- Linear Layer: From Embedding to Logits
  - The first step is a linear layer that projects the output embedding $𝐡^𝐋_𝐍[1 \times d]$ into a logit vector $𝐮[1 \times |V|]$.
  - This vector contains a score for each word in the vocabulary $V$.
  - $𝐮 = 𝐡^𝐋_𝐍 𝐄^T$
- Weight Tying with the Embedding Matrix
  - The same embedding matrix $𝐄[|V| \times d]$ used to embed tokens is reused as $𝐄^T$ to map embeddings back to the vocabulary space.
  - This process is known as **unembedding**.
- Softmax: Turning Logits into Probabilities
  - The logit vector $𝐮$ is passed through a **softmax** function to produce a probability distribution $𝐲$ over the vocabulary.
    - $𝐲 = \text{softmax}(𝐮)$
    - The output $𝐲$ represents the probabilities of each word in the vocabulary being the next word.
- Generating Text with the Language Model
  - The most probable word (highest $y_k$) can be selected (greedy decoding) or sampled probabilistically using other methods.
  - The word corresponding to the selected index $k$ is the generated next word.

![A transformer language model (decoder-only)](./images/trans/gen.png)
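
The head's three steps (unembedding, softmax, token selection) can be sketched with NumPy as follows. The five-word vocabulary, the random embedding matrix, and the vector `h_last` are placeholders standing in for a real model's values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["thanks", "for", "all", "the", "fish"]   # toy vocabulary, purely illustrative
d = 4
E = rng.normal(size=(len(vocab), d))              # embedding matrix [|V|, d]
h_last = rng.normal(size=d)                       # stand-in for the final-layer vector of the last token

u = h_last @ E.T                                  # logits via weight tying, shape [|V|]
y = np.exp(u - u.max())
y /= y.sum()                                      # softmax: probabilities over the vocabulary

k = int(np.argmax(y))                             # greedy decoding picks the most probable index
sampled = int(rng.choice(len(vocab), p=y))        # alternative: sample an index from y
next_word = vocab[k]
```

Greedy decoding always returns the same word for a given context, whereas sampling can return any word with nonzero probability.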

---

### **The Logit Lens: A Tool for Model Interpretability**
- The **logit lens** is a method to interpret the internal states of a transformer model.
-  **How it Works:**
   1. Take a vector from **any layer** of the transformer.
   2. Multiply it by the **unembedding layer** (transpose of the embedding matrix).
   3. Apply a **softmax** to compute a probability distribution over words.
   4. This gives insight into the words that internal layers might be predicting.

---

### **Transformer Decoder for Language Modeling**
- Transformers originally used an **encoder-decoder** architecture.
- For causal language modeling, we only use the **decoder** part to generate the next word in sequence.
  - This architecture is commonly used for text generation and machine translation.

### 🍎 Building a basic transformer-based language model

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to predict
model.eval()

def predict_next_word(text, model, tokenizer):
    # Tokenize the input text
    input_ids = tokenizer.encode(text, return_tensors='pt')

    # Generate the next token logits
    with torch.no_grad():
        outputs = model(input_ids)
        logits = outputs.logits

    # Get the predicted next token id
    predicted_token_id = torch.argmax(logits[:, -1, :], dim=-1).item()

    # Decode the predicted token id back to a word
    predicted_word = tokenizer.decode(predicted_token_id)

    return predicted_word

# Test the function
input_text = "The quick brown fox jumps over"
predicted_word = predict_next_word(input_text, model, tokenizer)
print(f"Input text: {input_text}")
print(f"Predicted next word: {predicted_word}")

# References
- [Formal Algorithms for Transformers](https://arxiv.org/pdf/2207.09238)