#### References:
- _Illustrations_ are from this great [blog post](https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse)
- For _intuition_ on self-attention and LLMs, watch 3Blue1Brown animated videos [DL5](https://www.youtube.com/watch?v=wjZofJX0v4M) and [DL6](https://www.youtube.com/watch?v=eMlx5fFNoYc)
- For a _deep dive_, see for example this [textbook chapter](https://web.stanford.edu/~jurafsky/slp3/9.pdf).

# **Transformers: Lecture Notes**

Transformers are a deep learning architecture introduced in the paper *"Attention Is All You Need"* (Vaswani et al., 2017). They are widely used in NLP, computer vision, and other AI applications.

- Fully **attention-based**, replacing recurrence (RNNs) and convolution (CNNs).
- Uses **self-attention** to model relationships between words in a sequence.
- Processes sequences in **parallel**, enabling faster training.


---
## **I. Input Representation in Transformers**
Transformers operate on a **sequence of vectors**. The input data must first be encoded as a sequence of numerical representations. 

### **I.1 Tokenization**
For different types of data, tokenization methods vary:
- **Text:** Splitting into words, subwords (e.g., BPE, WordPiece), or characters.
- **Images:** Splitting into patches (e.g., ViTs use 16x16 pixel patches).
- **Other Data:** Graphs, audio, and protein sequences have specialized tokenization approaches.
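
As a minimal sketch of text tokenization, the snippet below splits a sentence at the word level and at the character level, then maps words to integer ids. The sentence and the toy vocabulary are illustrative assumptions; real systems typically rely on a trained subword tokenizer (BPE, WordPiece) rather than whitespace splitting.

```python
sentence = "the cat chased the dog"

# Word-level tokenization: split on whitespace.
word_tokens = sentence.split()

# Character-level tokenization: every character (spaces included) is a token.
char_tokens = list(sentence)

# Toy vocabulary: each distinct word gets an integer id (sorted for determinism).
vocab = {word: idx for idx, word in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[word] for word in word_tokens]

print(word_tokens)  # ['the', 'cat', 'chased', 'the', 'dog']
print(vocab)        # {'cat': 0, 'chased': 1, 'dog': 2, 'the': 3}
print(token_ids)    # [3, 0, 1, 3, 2]
```

Both occurrences of "the" receive the same id; it is this sequence of ids that is passed to the embedding layer.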

#### **I.1.1 Embedding Layer for Text**
For text, each token is mapped to a **learnable embedding vector**:
1. **Token Embeddings:** Each token is assigned a fixed-length vector.
2. **Positional Encodings:** Since Transformers have no recurrence, position information is added explicitly.

Given a vocabulary size $V$ and embedding dimension $d$, the embedding matrix is $E \in \mathbb{R}^{V \times d}$, where each token index $i$ maps to the row $E[i]$.
<div style="max-width:400px">
<img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fd4ac84-3925-428c-8f6a-64dfed5268ad_1714x848.png" alt="Embedding layer" />
</div>
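
The lookup $E[i]$ can be sketched with PyTorch's `nn.Embedding`, which stores the matrix $E$ as a trainable parameter. The sizes `V = 10`, `d = 4` and the token ids are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

V, d = 10, 4                     # toy vocabulary size and embedding dimension
embedding = nn.Embedding(V, d)   # holds the matrix E of shape (V, d)

token_ids = torch.tensor([3, 0, 1, 3, 2])   # a sequence of 5 token indices
X = embedding(token_ids)                    # row lookup: X[t] = E[token_ids[t]]

print(embedding.weight.shape)  # torch.Size([10, 4])
print(X.shape)                 # torch.Size([5, 4])
```

Positions 0 and 3 hold the same token id, so they receive identical vectors at this stage; nothing yet distinguishes their positions.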

### **I.2 Positional encodings**
Transformers process input tokens **in parallel**, meaning they lack an inherent sense of token order. To distinguish sequences where the same words appear in different orders (e.g., *"The cat chased the dog."* vs. *"The dog chased the cat."*), **positional encodings** are added to token embeddings.  

- The original **sinusoidal positional encoding** (Vaswani et al., 2017) assigns fixed patterns based on position $p$:
$$PE_{(p, 2i)} = \sin(p / 10000^{2i/d}) , \quad PE_{(p, 2i+1)} = \cos(p / 10000^{2i/d})$$

- Alternatively, **learned positional embeddings** use a trainable lookup table (e.g., `nn.Embedding(max_length, d_model)`).  

- Modern **LLMs use variants** like:  
    - **ALiBi (Attention Linear Bias)**: Adds a decaying bias to attention scores.  
    - **Rotary Positional Embeddings (RoPE)**: Rotates query/key vectors based on position for better generalization.  
    - **T5-style relative positions**: Uses learned biases based on relative token distances.  

The final input representation is the sum of **token embeddings** and **positional encodings**:

$$X = E[tokens] + P[position]$$

where $P[position]$ is a fixed or learned positional encoding matrix.
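
A minimal sketch of the sinusoidal encoding and of the sum $X = E[tokens] + P[position]$ follows. The helper name `sinusoidal_positional_encoding` is hypothetical, the token embeddings are random stand-ins, and $d$ is assumed to be even.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d: int) -> torch.Tensor:
    """Return a (max_len, d) matrix P built from the sin/cos formula (d assumed even)."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i/d) for i = 0 .. d/2 - 1
    two_i = torch.arange(0, d, 2, dtype=torch.float32)
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d)                 # (d/2,)
    P = torch.zeros(max_len, d)
    P[:, 0::2] = torch.sin(positions * inv_freq)  # even dimensions
    P[:, 1::2] = torch.cos(positions * inv_freq)  # odd dimensions
    return P

torch.manual_seed(0)
L, d = 6, 8
token_embeddings = torch.randn(L, d)   # stand-in for E[tokens]
P = sinusoidal_positional_encoding(max_len=L, d=d)
X = token_embeddings + P               # final input representation

print(P.shape)  # torch.Size([6, 8])
print(X.shape)  # torch.Size([6, 8])
```

At position 0 every sine entry is 0 and every cosine entry is 1, so each position receives a distinct, fixed pattern that does not depend on the token itself.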


---
## II. **Self-Attention Layer (Single-Head)**  

#### **Intuition**  
Self-attention allows each token in a sequence to attend to all other tokens, dynamically adjusting its representation based on their relevance. Instead of treating words independently, self-attention **mixes information** across tokens by computing a weighted sum of their embeddings.  

<div style="max-width:400px">
<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85c1f60b-8521-411b-9b84-54973929c251_1500x843.gif" alt="Self-attention" />
</div>

Each token generates:  
- A **query** vector ($Q$) – what it is looking for.  
- A **key** vector ($K$) – how relevant it is to others.  
- A **value** vector ($V$) – what information it carries.  

Each token then gathers information from others based on how well its **query** matches their **keys**.  

---

#### **Scalar Formulation**  
For two tokens $i$ and $j$, their **attention score** is computed as:  

$$\text{score}(i, j) = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}$$

where:  
- $\mathbf{q}_i$ is the query vector of token $i$.  
- $\mathbf{k}_j$ is the key vector of token $j$.  
- $d_k$ is the dimension of the query and key vectors; dividing by $\sqrt{d_k}$ keeps the dot products within a reasonable range before the softmax is applied.  

The scores are normalized using **softmax**, producing attention weights:  

$$
\alpha_{ij} = \frac{\exp(\text{score}(i, j))}{\sum_k \exp(\text{score}(i, k))}
$$

Each token's final representation is the **weighted sum** of value vectors:  

$$
\mathbf{x}_i \leftarrow \sum_{j} \alpha_{ij} \mathbf{v}_j
$$
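
The three formulas above can be traced by hand on a tiny example. The sketch below uses plain Python lists; the function name `attend` and the 2-dimensional query, key, and value vectors are made up for illustration.

```python
import math

def attend(query, keys, values):
    """Scalar formulation for one token: scores -> softmax weights -> weighted sum."""
    d_k = len(query)
    # score(i, j) = (q_i . k_j) / sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Softmax; subtracting the max is a standard stability trick that leaves the result unchanged.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors, one coordinate at a time.
    dim_v = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim_v)]
    return weights, output

q_i = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = attend(q_i, keys, values)
print(weights)  # three non-negative weights that sum to 1
print(output)   # the updated representation of token i
```

The first and third keys have the same dot product with the query, so they receive equal weight, while the second key (orthogonal to the query) receives less.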

---

#### **Matrix Formulation**  
Given a sequence of $L$ tokens, represented as a matrix $X$ of shape $(L, d)$:  

1. Compute **queries, keys, and values** using weight matrices:  

   $$
   Q = X W^Q, \quad K = X W^K, \quad V = X W^V
   $$

   where $W^Q, W^K, W^V$ are learnable weight matrices of shape $(d, d_k)$. 

2. Compute the **scaled attention scores**:  

   $$
   A = \frac{QK^T}{\sqrt{d_k}}
   $$

   where $A$ is an $(L \times L)$ matrix of pairwise scores.  

3. Apply **softmax** to get attention weights:  

   $$
   \tilde{A} = \text{softmax}(A)
   $$

4. Compute the **final output** as a weighted sum of values:  

   $$
   Z = \tilde{A} V
   $$

   where $Z$ is the updated representation of shape $(L, d_k)$.  

This forms the **core operation** of a single **self-attention head**, which will later be extended to **multi-head attention**.
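
As a sanity check, the sketch below writes the four matrix steps as a plain function and verifies that, for one token, the result agrees with the scalar formulation. The function name `self_attention`, the random weights, and the sizes are assumptions for illustration only.

```python
import torch

def self_attention(X, W_q, W_k, W_v):
    """Matrix formulation of single-head self-attention for one sequence X of shape (L, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (L, d_k)
    d_k = K.shape[-1]
    A = Q @ K.T / d_k ** 0.5                  # (L, L) scaled pairwise scores
    A_tilde = torch.softmax(A, dim=-1)        # each row sums to 1
    return A_tilde @ V, A_tilde               # (L, d_k), (L, L)

torch.manual_seed(0)
L, d, d_k = 5, 8, 4
X = torch.randn(L, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))

Z, A_tilde = self_attention(X, W_q, W_k, W_v)

# Cross-check token i against the scalar formulation.
i = 2
q_i = X[i] @ W_q
scores = torch.stack([q_i @ (X[j] @ W_k) for j in range(L)]) / d_k ** 0.5
alpha = torch.softmax(scores, dim=0)
z_i = sum(alpha[j] * (X[j] @ W_v) for j in range(L))

print(Z.shape)                              # torch.Size([5, 4])
print(torch.allclose(Z[i], z_i, atol=1e-5))
```

The matrix form computes all $L$ rows at once, which is what makes the operation easy to parallelize.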

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()

        # For simplicity, queries, keys and values keep the input dimension (d_k = embed_dim)
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)  # Query projection
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)  # Key projection
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)  # Value projection

    def forward(self, x):
        # Input x: (batch_size, seq_length, embed_dim)
        Q = self.W_q(x)  # (batch_size, seq_length, embed_dim)
        K = self.W_k(x)  # (batch_size, seq_length, embed_dim)
        V = self.W_v(x)  # (batch_size, seq_length, embed_dim)
        d_k = K.shape[-1]  # Key dimension

        # Dot-product similarities: (batch_size, seq_length, seq_length)
        scores = Q @ K.transpose(1, 2)
        # Scale by the square root of the key dimension
        scores = scores / d_k ** 0.5
        # Transform the scores into probabilities with the softmax function
        weights = torch.softmax(scores, dim=-1)

        # Optional: store the attention weights for visualization
        self.attention_weights = weights

        # Update the vectors x: weighted sum of the values
        x = weights @ V

        return x
```

---
## Multi-Head Attention

### **Intuition**
A single self-attention head learns a specific way to mix information across tokens. However, a single attention function might not be enough to capture different types of relationships in the data. **Multi-head attention** improves this by using multiple attention heads in parallel, each learning different attention patterns.

Instead of computing a single set of **queries (Q), keys (K), and values (V)**, we compute multiple sets (one per head). Each head processes information differently, and their outputs are concatenated and linearly transformed to form the final representation.

### **Formulation**
Given an input matrix $X \in \mathbb{R}^{L \times d}$ (where $L$ is the sequence length and $d$ is the embedding size), multi-head attention performs the following steps:

1. **Compute multiple sets of Q, K, V**  
   Each attention head has independent learned projection matrices $W^Q_h, W^K_h, W^V_h$:

   $$
   Q_h = X W^Q_h, \quad K_h = X W^K_h, \quad V_h = X W^V_h
   $$

   where $h$ indexes the head.

2. **Compute self-attention for each head**  
   Each head applies scaled dot-product attention independently:

   $$
   A_h = \frac{Q_h K_h^T}{\sqrt{d_k}}, \quad \tilde{A}_h = \text{softmax}(A_h), \quad Z_h = \tilde{A}_h V_h
   $$

   The output of each head $Z_h$ has shape $(L, d_k)$.

3. **Concatenate and project**  
   The outputs from all $H$ heads are concatenated:

   $$
   Z = \text{Concat}(Z_1, Z_2, ..., Z_H)
   $$

   Since each head outputs a vector of dimension $d_k$, the concatenated representation has shape $(L, H \cdot d_k)$. We then apply a final linear transformation with weight matrix $W^O \in \mathbb{R}^{(H \cdot d_k) \times d}$:

   $$
   Z_{\text{final}} = Z W^O
   $$
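
The three steps above can be sketched as a PyTorch module. This is one possible implementation, not the only one: it assumes $d_k = d / H$ and stores the $H$ per-head matrices $W^Q_h$ side by side in a single `nn.Linear`, which is mathematically equivalent to $H$ separate projections. The class name `MultiHeadAttention` and the sizes in the usage lines are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = embed_dim // num_heads  # assumption: d_k = d / H
        # Each Linear stacks the H per-head (d, d_k) matrices side by side.
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim, bias=False)  # final projection W^O

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        # (B, L, H * d_k) -> (B, H, L, d_k)
        B, L, _ = t.shape
        return t.view(B, L, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, d = x.shape
        # Step 1: per-head queries, keys, values
        Q = self._split_heads(self.W_q(x))
        K = self._split_heads(self.W_k(x))
        V = self._split_heads(self.W_v(x))
        # Step 2: scaled dot-product attention, independently for each head
        A = Q @ K.transpose(-2, -1) / self.d_k ** 0.5   # (B, H, L, L)
        A_tilde = torch.softmax(A, dim=-1)
        Z_heads = A_tilde @ V                           # (B, H, L, d_k)
        # Step 3: concatenate the heads and apply the output projection
        Z = Z_heads.transpose(1, 2).reshape(B, L, d)    # (B, L, H * d_k)
        return self.W_o(Z)

torch.manual_seed(0)
mha = MultiHeadAttention(embed_dim=16, num_heads=4)
x = torch.randn(2, 5, 16)   # (batch_size, seq_length, embed_dim)
out = mha(x)
print(out.shape)            # torch.Size([2, 5, 16])
```

Handling all heads as one extra tensor dimension avoids a Python loop over heads; the output has the same shape as the input, so the layer can be stacked.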


<div style="max-width:300px">
<figure>
<img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png" alt="Multi-head attention" />
    <figcaption>Multi-head attention: each head performs self-attention in parallel. Outputs are concatenated.</figcaption>
</figure>
</div>

### **Key Takeaways**
- Multiple attention heads allow the model to capture different relationships between tokens.
- Each head computes independent self-attention, but the results are combined into a single representation.
- When each head uses a reduced dimension (typically $d_k = d / H$), this improves the expressiveness of Transformers at a computational cost comparable to that of a single head of full dimension $d$.
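
The cost claim can be checked with a little arithmetic. The sketch below counts the projection weights ($W^Q_h, W^K_h, W^V_h$ for every head, plus $W^O$) under the common assumption $d_k = d / H$. The helper name and the sizes ($d = 512$, $H = 8$) are illustrative, and an output projection is included in both cases for a like-for-like comparison.

```python
def attention_projection_params(d: int, num_heads: int, d_k: int) -> int:
    """Number of weights in W^Q_h, W^K_h, W^V_h for all heads plus W^O (no biases)."""
    per_head_qkv = 3 * d * d_k             # three (d, d_k) matrices per head
    output_proj = (num_heads * d_k) * d    # W^O of shape (H * d_k, d)
    return num_heads * per_head_qkv + output_proj

d = 512
single = attention_projection_params(d, num_heads=1, d_k=d)
multi = attention_projection_params(d, num_heads=8, d_k=d // 8)

print(single)  # 1048576
print(multi)   # 1048576
```

With $d_k = d / H$ the parameter count is independent of the number of heads, so adding heads changes how the dimensions are partitioned rather than how many there are.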

---
## III. Transformer Layer and Transformer Network

A **Transformer layer** consists of two main components:  
1. **Self-attention mixing** – Updates each token representation by attending to all other tokens.  
2. **Token-wise transformations** – Applies feedforward transformations to each token independently.
<div style="max-width:600px">
<figure>
    <img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21b82064-554d-43d2-b6f1-589090e830ce_1294x410.png" alt="Tokenwise transformation" />
    <figcaption>Tokenwise transformation: the <em>same</em> network is applied to each token independently</figcaption>
</figure>
</div>


To stabilize training, **residual connections** and **layer normalization** are applied after both self-attention and feedforward transformations.

<div style="max-width:300px">
<figure>
    <img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5024bcc5-33c9-4d53-9bd7-56cbcf9c4627_874x1108.png" alt="Transformer Layer" />
    <figcaption>Transformer layer architecture</figcaption>
</figure>
</div>


A **Transformer network** is built by **stacking multiple Transformer layers**, allowing deeper contextual representations to emerge.
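
A minimal sketch of one layer and of a stacked network is given below, assuming the "add, then normalize" ordering described above and PyTorch's built-in `nn.MultiheadAttention` for the mixing step. The class names, the ReLU feedforward network, and all sizes are illustrative choices.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        # Token-wise feedforward network: the same MLP is applied to every position.
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. Self-attention mixing, with residual connection and layer normalization
        attn_out, _ = self.attn(x, x, x)   # queries, keys, values all come from x
        x = self.norm1(x + attn_out)
        # 2. Token-wise transformation, with residual connection and layer normalization
        x = self.norm2(x + self.ff(x))
        return x

class TransformerNetwork(nn.Module):
    def __init__(self, num_layers: int, embed_dim: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [TransformerLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

torch.manual_seed(0)
network = TransformerNetwork(num_layers=3, embed_dim=16, num_heads=4, ff_dim=32)
x = torch.randn(2, 5, 16)   # (batch_size, seq_length, embed_dim)
out = network(x)
print(out.shape)            # torch.Size([2, 5, 16])
```

Because every layer maps a `(batch_size, seq_length, embed_dim)` tensor to a tensor of the same shape, layers can be stacked to any depth.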


## IV. Task-Specific Processing After Transformers

Once the Transformer has computed **rich contextualized vectors**, the next step depends on the specific task. Different types of processing heads can be applied:

1. **Single prediction (sequence-level classification)**:  
    - A **pooling operation** (e.g., mean/max pooling over all token embeddings), or
    - a **single special token** (e.g., `[CLS]` in BERT) is used to represent the entire sequence.  
    - **Examples**: Sentiment analysis, document classification.

2. **One-to-One Prediction (Token-Level Classification)**
    - A classifier is applied **independently to each token's representation**.  
    - **Examples**: Sorting a sequence of numbers, Named Entity Recognition (NER), next-token prediction (during training).
    <div style="max-width:500px">
    <figure>
        <img src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cad60e3-cec7-4bfa-9ad1-356b6d181f7c_1640x862.png" alt="Classification head" />
        <figcaption>Classification head, with one output per token</figcaption>
    </figure>
    </div>

3. **Sequence-to-Sequence (Decoder-Based Tasks)**
    - The transformer generates an **output sequence** based on an **input sequence**.  
    - **Decoder mechanism**: Uses both self-attention and cross-attention (attending to encoder outputs).  
    - **Examples**: Machine translation (English → French), text summarization.

Each task type leverages the **same underlying Transformer architecture**, with different output processing layers tailored to the task.
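
The first two head types can be sketched as follows. The tensor `Z` is a random stand-in for the Transformer output, and the sizes and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, d, num_classes = 2, 7, 16, 3
Z = torch.randn(batch_size, seq_length, d)   # stand-in for contextualized vectors

# 1. Sequence-level classification: pool over tokens, then classify once per sequence.
sequence_head = nn.Linear(d, num_classes)
pooled_logits = sequence_head(Z.mean(dim=1))  # mean pooling -> (batch_size, num_classes)
# Alternative: use the vector of a special first token (as with [CLS] in BERT).
cls_logits = sequence_head(Z[:, 0])           # (batch_size, num_classes)

# 2. Token-level classification: the same classifier is applied to every token.
token_head = nn.Linear(d, num_classes)
token_logits = token_head(Z)                  # (batch_size, seq_length, num_classes)

print(pooled_logits.shape)  # torch.Size([2, 3])
print(token_logits.shape)   # torch.Size([2, 7, 3])
```

`nn.Linear` acts on the last dimension only, so applying it to the full tensor is the same as applying it to each token's vector separately.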


## Cross-Attention

Cross-attention extends the self-attention mechanism by allowing queries, keys, and values to come from **different sequences**. This is particularly useful in **encoder-decoder architectures**, where the decoder attends to the encoder's output.

### **Key Difference from Self-Attention**
- In **self-attention**, queries, keys, and values all come from the **same** sequence.
- In **cross-attention**, queries come from the **decoder**, while keys and values come from the **encoder**.

### **Key Property**
- The number of **queries** does not have to match the number of **keys**.  
- This allows attention to be computed **between sequences of different lengths**.
- **Example**: In machine translation, the input sentence (source) may have a different length than the output sentence (target).

### **Formulation**
Given:
- **Decoder queries**: $Q \in \mathbb{R}^{n_{\text{dec}} \times d_k}$
- **Encoder keys/values**: $K, V \in \mathbb{R}^{n_{\text{enc}} \times d_k}$

The attention scores are computed as:

$$
A = \frac{QK^T}{\sqrt{d_k}}
$$

where $A \in \mathbb{R}^{n_{\text{dec}} \times n_{\text{enc}}}$ captures the relevance of each encoder token to each decoder token.

Applying softmax and weighting by $V$:

$$
Z = \text{softmax}(A) V
$$

where $Z \in \mathbb{R}^{n_{\text{dec}} \times d_k}$ is the updated decoder representation.

This mechanism allows the decoder to focus on the most relevant encoder tokens at each step, enabling **context-aware generation** in sequence-to-sequence tasks.
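
A minimal sketch of the computation, with sequences of different lengths. The function name `cross_attention`, the random inputs, and the sizes are assumptions for illustration; in a real model $Q$ would be a projection of the decoder states, and $K$, $V$ projections of the encoder outputs.

```python
import torch

def cross_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor):
    """Q: (n_dec, dim) from the decoder; K, V: (n_enc, dim) from the encoder."""
    d_k = K.shape[-1]
    A = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n_dec, n_enc)
    weights = torch.softmax(A, dim=-1)         # each decoder token distributes weight over encoder tokens
    return weights @ V, weights                # (n_dec, dim), (n_dec, n_enc)

torch.manual_seed(0)
n_dec, n_enc, dim = 3, 5, 8
Q = torch.randn(n_dec, dim)
K = torch.randn(n_enc, dim)
V = torch.randn(n_enc, dim)

Z, weights = cross_attention(Q, K, V)
print(weights.shape)  # torch.Size([3, 5])
print(Z.shape)        # torch.Size([3, 8])
```

The output has one row per decoder token, regardless of how many encoder tokens there are.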


## Masked Language Modeling (MLM)

**Masked Language Modeling (MLM)** is a training objective where a model learns to predict missing words in a sentence. This is a **one-to-one prediction task**: a classifier is applied to each token's contextualized representation, but only the masked positions are used for supervision.

### **How It Works**
1. **Randomly mask** some tokens in the input sequence (e.g., 15% of tokens in BERT).
2. **Pass the masked sequence** through the Transformer.
3. **Predict the original tokens** for the masked positions using a classifier on top of each token's representation.
4. **Loss is only computed** on masked tokens, ignoring others.

### **Example**
**Input:**  
*"The cat sat on the [MASK]."*  
**Target:**  
*"The cat sat on the mat."*

### **Formulation**
Let $X = (x_1, x_2, ..., x_n)$ be the input tokens. A subset is masked, creating $X'$.  
The model computes contextualized embeddings for all tokens:

$$
Z = \text{Transformer}(X')
$$

A classification layer predicts the masked words:

$$
p(x_i) = \text{softmax}(W Z_i)
$$

where $W$ maps hidden states to vocabulary logits.
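
The whole procedure can be sketched on random token ids. This is a simplified illustration under several assumptions: id `0` plays the role of `[MASK]`, the last position is always masked so the toy loss is never empty, positional encodings are omitted, and a single `nn.TransformerEncoderLayer` stands in for the full Transformer. Labels at unmasked positions are set to `-100`, the default `ignore_index` of PyTorch's cross-entropy, so they do not contribute to the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model, mask_id = 20, 16, 0           # toy sizes; id 0 is reserved for [MASK]
tokens = torch.randint(1, vocab_size, (2, 10))     # X: (batch_size, seq_length)

# 1. Randomly mask about 15% of the tokens.
mask = torch.rand(tokens.shape) < 0.15
mask[:, -1] = True                                 # toy safeguard: at least one masked token per sequence
inputs = tokens.masked_fill(mask, mask_id)         # X': masked positions replaced by [MASK]
labels = tokens.masked_fill(~mask, -100)           # supervise masked positions only

# 2. Pass the masked sequence through a (tiny) Transformer.
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=32, batch_first=True)
Z = encoder(embed(inputs))                         # (batch_size, seq_length, d_model)

# 3. Predict a distribution over the vocabulary at every position.
to_vocab = nn.Linear(d_model, vocab_size)          # W: hidden states -> vocabulary logits
logits = to_vocab(Z)                               # (batch_size, seq_length, vocab_size)

# 4. Cross-entropy on masked positions only (softmax is applied inside the loss).
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
)
print(logits.shape)  # torch.Size([2, 10, 20])
print(loss.item())
```

BERT's actual recipe is slightly more involved (some selected tokens are kept or replaced by random tokens instead of `[MASK]`); the sketch keeps only the plain masking described above.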

### **Why MLM?**
- Enables **bidirectional context learning**, unlike traditional left-to-right language modeling.
- Pretraining with MLM improves performance on **downstream NLP tasks** like classification, QA, and summarization.
