
# Self-Attention: Complete Master Notes

---

# 1️⃣ Why Self-Attention Exists

Traditional sequence models (RNNs, LSTMs):

- Process tokens sequentially
- Struggle with long-range dependencies
- Cannot be parallelized across time steps
- Gradients can vanish or explode over long sequences

Self-attention solves this by:

> Allowing every token to directly attend to every other token in a single operation.

Benefits:

- Global context access
- Parallel computation
- Better long-range modeling
- Simpler architecture

---

# 2️⃣ Core Intuition

Example:

"The animal didn’t cross the street because it was too tired."

To understand "it", the model must:

- Look at "animal"
- Compare semantic compatibility
- Assign importance

Self-attention lets each token compute:

> “Which tokens are relevant to me?”

Each token dynamically builds its own context.

---

# 3️⃣ Core Components: Query, Key, Value (Q, K, V)

Each input embedding is linearly projected into three vectors:

- Query (Q) → what this token is searching for
- Key (K) → what this token represents
- Value (V) → information to pass forward

Think:

| Component | Analogy |
|------------|----------|
| Query | Question |
| Key | Label |
| Value | Content |

These are learned projections, not manually defined.

---

# 4️⃣ Mathematical Formulation

Let:

- Sequence length = N
- Model dimension = d_model
- Key dimension = d_k

Input embeddings:

X ∈ ℝ^(N × d_model)

Linear projections:

Q = XW_Q
K = XW_K
V = XW_V

Where:

W_Q, W_K, W_V ∈ ℝ^(d_model × d_k)

Now:

Q, K, V ∈ ℝ^(N × d_k)
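
As a rough illustration of these shapes, here is a minimal NumPy sketch. The sizes and the random weights are assumptions chosen only for the example; in a real model the three matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, d_k = 5, 16, 4          # illustrative sizes, not from any real model

X = rng.normal(size=(N, d_model))   # one row per token
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)    # (5, 4) (5, 4) (5, 4)
```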

---

# 5️⃣ Attention Score Matrix

Compute similarity:

Score = QKᵀ

Shape:

(N × d_k) · (d_k × N) = N × N

Each element (i, j) is the raw similarity between the query of token i and the key of token j.

After scaling and softmax, this becomes the attention matrix.

---

# 6️⃣ Scaled Dot-Product Attention

Full formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Why divide by √d_k?

Without scaling:

- Dot products grow large as dimension increases
- Softmax becomes extremely sharp
- Gradients through the softmax become vanishingly small

If the components of q and k are independent with mean 0 and variance 1, their dot product has variance d_k; dividing by √d_k brings it back to 1.
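
The variance argument can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))  # entries with mean 0, variance 1
k = rng.normal(size=(10_000, d_k))

dots = np.sum(q * k, axis=1)               # 10,000 independent dot products
raw_var = dots.var()                       # should land near d_k
scaled_var = (dots / np.sqrt(d_k)).var()   # should land near 1
print(round(raw_var, 1), round(scaled_var, 2))
```

With unscaled scores of this spread, the softmax would put almost all of its mass on one key.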

---

# 7️⃣ Softmax Step

Softmax is applied row-wise.

Each row becomes a probability distribution:

- Values between 0 and 1
- Each row sums to 1

Interpretation:

For each token:
- A distribution over all tokens
- Determines importance weights
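
A small sketch of the row-wise softmax, assuming NumPy. The function name `softmax_rows` and the toy score matrix are illustrative only.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax over the last axis, with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])
weights = softmax_rows(scores)
print(weights.sum(axis=-1))  # each row sums to 1
print(weights[1])            # equal scores give equal weights (1/3 each)
```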

---

# 8️⃣ Weighted Sum

Final output:

Output = Attention_weights · V

Shape:

(N × N) · (N × d_k) = N × d_k

Each token becomes a weighted mixture of all tokens.

This creates contextualized representations.
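
Putting the score, scaling, softmax and weighted-sum steps together gives roughly the following. This is a single-head, unbatched sketch with random inputs; the function name is an assumption, not a library API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention on 2-D inputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                       # (N, d_k), (N, N)

rng = np.random.default_rng(0)
N, d_k = 5, 4
Q, K, V = (rng.normal(size=(N, d_k)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (5, 4) (5, 5)
```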

---

# 9️⃣ Masking (Critical Detail)

Self-attention behaves differently in encoder and decoder.

## Encoder Self-Attention
- Bidirectional
- Tokens can attend to past and future

## Decoder Self-Attention
- Uses causal mask
- Tokens cannot see future tokens

Masking sets future positions to -∞ before softmax.

This ensures autoregressive behavior.
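
A minimal sketch of a causal mask, assuming NumPy and placeholder scores that are all equal so the effect of the mask is easy to read off.

```python
import numpy as np

N = 4
scores = np.zeros((N, N))                            # placeholder scores, all equal
future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where j > i
masked = np.where(future, -np.inf, scores)

exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(weights)  # row i spreads its weight evenly over positions 0..i
```

Because exp(-∞) is exactly 0, masked positions receive zero weight and the remaining weights still sum to 1.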

---

# 🔟 Computational Complexity

Self-attention cost:

O(N² · d_k)

Because the attention matrix is N × N.

This is why very long sequences become expensive.

Modern research tries to reduce this: Longformer uses sparse attention patterns, while FlashAttention keeps exact attention but avoids materializing the full N × N matrix in slow memory.

---

# 1️⃣1️⃣ Tensor Shape Example

Example:

Batch size = 2  
Sequence length = 5  
Hidden size = 768  
Number of heads = 12  

Per head:

d_k = 768 / 12 = 64

Shapes:

Q, K, V → (2, 12, 5, 64)

Attention matrix → (2, 12, 5, 5)

Final output → (2, 5, 768)

Heads are concatenated and projected back to d_model.
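
These shapes can be reproduced with plain reshapes and transposes. The sketch below uses a zero array as a stand-in for the projected tensors, since only the shapes matter here.

```python
import numpy as np

batch, seq_len, d_model, n_heads = 2, 5, 768, 12
d_k = d_model // n_heads                               # 64

x = np.zeros((batch, seq_len, d_model))                # stand-in for projected Q, K or V
heads = x.reshape(batch, seq_len, n_heads, d_k).transpose(0, 2, 1, 3)
scores = heads @ heads.transpose(0, 1, 3, 2)           # Q Kᵀ per head
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print(heads.shape, scores.shape, merged.shape)
# (2, 12, 5, 64) (2, 12, 5, 5) (2, 5, 768)
```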

---

# 1️⃣2️⃣ What Self-Attention Is NOT

It is not:

- A hard (discrete) memory lookup
- A symbolic reasoning engine
- True understanding

It is:

> Learned statistical weighting of token interactions.

Over layers, this produces:

- Syntax awareness
- Coreference resolution
- Semantic similarity
- Structural patterns

---

# 1️⃣3️⃣ Relationship to Multi-Head Attention

A single attention head produces one weighting per token, so it tends to capture one type of relationship.

Multi-head attention:

- Runs attention multiple times in parallel
- Each head learns different patterns
- Outputs are concatenated

Self-attention = core mechanism  
Multi-head = parallel enhancement

---

# 1️⃣4️⃣ Role in Transformer Layer

A full Transformer encoder layer:

1. Multi-head self-attention
2. Add & LayerNorm
3. Feedforward network (MLP)
4. Add & LayerNorm

Self-attention is only one component of the layer.

---

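# 1️⃣5️⃣ Training vs Inference Behavior

The attention computation itself does NOT change.

What changes:

- Training: the whole sequence is processed in parallel, and the causal mask keeps each position from seeing later tokens
- Inference: tokens are generated one at a time (autoregressive feeding), typically reusing cached keys and values

The attention formula is identical in both cases.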
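
One way to see that the computation is the same in both settings is to compare a masked pass over the whole sequence with a step-by-step pass that only looks at the keys and values seen so far. This is a single-head NumPy sketch with random inputs, not a full decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, d_k = 6, 8
Q, K, V = (rng.normal(size=(N, d_k)) for _ in range(3))

# Training-style: all positions at once, with a causal mask.
future = np.triu(np.ones((N, N), dtype=bool), k=1)
scores = np.where(future, -np.inf, Q @ K.T / np.sqrt(d_k))
full = softmax(scores) @ V

# Inference-style: one position at a time, attending to keys/values seen so far.
steps = [softmax(Q[t] @ K[: t + 1].T / np.sqrt(d_k)) @ V[: t + 1] for t in range(N)]
incremental = np.stack(steps)

print(np.allclose(full, incremental))  # True
```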

---

# 1️⃣6️⃣ Common Interview Questions

Q: Why is self-attention better than RNNs?  
A: Parallelization + direct long-range dependency modeling.

Q: Why scale by √d_k?  
A: Prevents softmax saturation and stabilizes gradients.

Q: What is the shape of the attention matrix?  
A: N × N (per head).

Q: Why separate Q, K, V?  
A: To decouple similarity computation from information content.

Q: What is the computational bottleneck?  
A: O(N²) memory and compute.

---

# 1️⃣7️⃣ One-Line Summary

Self-attention allows each token to dynamically compute a weighted combination of all tokens in the sequence (itself included) using learned similarity projections, producing contextualized representations.

---

# 1️⃣8️⃣ Ultra-Compressed Formula View

Given X:

Q = XW_Q  
K = XW_K  
V = XW_V  

Attention = softmax(QKᵀ / √d_k) V

That is the entire mechanism.


# Multi-Head Attention: Complete Master Notes

---

# 1️⃣ Why Multi-Head Attention Exists

A single attention head produces one attention distribution per token, so it tends to capture **one type of relationship** at a time.

Example relationships:
- Subject–verb agreement
- Coreference resolution
- Long-range dependency
- Positional patterns
- Semantic similarity

Instead of forcing one attention mechanism to capture everything,
Transformers use **multiple attention heads in parallel**.

> Each head learns a different representation subspace.

This increases expressiveness.

---

# 2️⃣ Core Idea

Instead of computing:

Attention(Q, K, V)

Once,

We compute it **h times in parallel**.

Each head:

- Has its own W_Q, W_K, W_V
- Works in a smaller dimensional space
- Learns different patterns

Then we:

1. Concatenate outputs of all heads
2. Apply a final linear projection

---

# 3️⃣ Mathematical Formulation

Let:

- d_model = model dimension
- h = number of heads
- d_k = d_model / h

Input:

X ∈ ℝ^(N × d_model)

For each head i:

Q_i = XW_Q_i  
K_i = XW_K_i  
V_i = XW_V_i  

Where:

W_Q_i, W_K_i, W_V_i ∈ ℝ^(d_model × d_k)

Compute attention per head:

head_i = softmax(Q_i K_iᵀ / √d_k) V_i

Each head output shape:

ℝ^(N × d_k)

---

# 4️⃣ Concatenation Step

After computing all heads:

Concat(head₁, head₂, ..., head_h)

Shape:

ℝ^(N × (h × d_k))

Since:

h × d_k = d_model

So concatenated output shape:

ℝ^(N × d_model)

---

# 5️⃣ Final Linear Projection

Apply output projection:

Output = Concat(...) W_O

Where:

W_O ∈ ℝ^(d_model × d_model)

This mixes information from all heads.

Final output shape:

ℝ^(N × d_model)

---

# 6️⃣ Full Formula

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) W_O

Where:

head_i = Attention(QW_Q_i, KW_K_i, VW_V_i)
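
The formula can be sketched directly in NumPy. The version below is unbatched, loops over heads for readability, and uses random weights; the function name and the choice to stack per-head weights in one array are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_k) stacks of per-head weights; W_O: (d_model, d_model)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # (N, d_k)
    return np.concatenate(heads, axis=-1) @ W_O              # (N, d_model)

rng = np.random.default_rng(0)
N, d_model, h = 5, 16, 4
d_k = d_model // h
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_model))
X = rng.normal(size=(N, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```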

---

# 7️⃣ Why Split Into Smaller Dimensions?

If every head kept the full d_model dimension:

Projection parameters and attention compute would grow by a factor of h.

Instead:

d_k = d_model / h

So total computation stays similar to single-head attention at full dimension.

Benefits:

- Multiple representation subspaces
- No increase in output dimension
- Better learning capacity
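
A quick back-of-the-envelope count makes the saving concrete. It uses d_model = 768, h = 12 and N = 5 as in the shape examples in these notes, ignores biases, and counts only the Q/K/V projections and the QKᵀ products; the "full" variant is a hypothetical design where every head keeps d_model dimensions.

```python
d_model, h, N = 768, 12, 5          # sizes from the shape example in these notes
d_k = d_model // h                  # 64

# Q/K/V projection parameters (biases ignored)
params_split = h * 3 * d_model * d_k        # each head projects to d_k
params_full = h * 3 * d_model * d_model     # hypothetical: each head keeps d_model

# Multiply-adds for the score matrices QKᵀ across all heads
scores_split = h * N * N * d_k
scores_full = h * N * N * d_model

print(params_split, params_full // params_split)   # 1769472 12
print(scores_split, scores_full // scores_split)   # 19200 12
```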

---

# 8️⃣ Intuition Example

Imagine 8 heads (an idealized picture; trained heads are rarely this cleanly specialized):

Head 1 → grammar relationships  
Head 2 → semantic similarity  
Head 3 → positional relations  
Head 4 → coreference  
Head 5 → long-range dependency  
Head 6 → local context  
Head 7 → phrase boundaries  
Head 8 → global sentence meaning  

Each head sees the same sentence,
but focuses on different aspects.

---

# 9️⃣ Tensor Shape Example (Concrete)

Example:

Batch size = 2  
Sequence length = 5  
d_model = 768  
h = 12  

Then:

d_k = 768 / 12 = 64

After projection:

Q, K, V → (2, 5, 768)

Reshaped per head:

(2, 12, 5, 64)

Attention matrix per head:

(2, 12, 5, 5)

Output per head:

(2, 12, 5, 64)

After concatenation:

(2, 5, 768)

After W_O:

(2, 5, 768)

---

# 🔟 Computational Complexity

Still dominated by:

O(N² · d_model)

Because each head computes N × N attention.

Multi-head does NOT change quadratic complexity.

---

# 1️⃣1️⃣ Encoder vs Decoder Multi-Head Attention

In Encoder:

- Multi-head self-attention
- Bidirectional

In Decoder:

Two types of multi-head attention:

1. Masked self-attention
2. Cross-attention (attends to encoder output)

Cross-attention uses:

Queries → from decoder  
Keys/Values → from encoder  
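
A single-head sketch of cross-attention, assuming NumPy and random stand-ins for the decoder states and encoder output. Note that the attention matrix is no longer square: it has one row per decoder position and one column per encoder position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
M, N, d_model, d_k = 3, 5, 16, 8    # M decoder positions, N encoder positions
decoder_states = rng.normal(size=(M, d_model))
encoder_output = rng.normal(size=(N, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = decoder_states @ W_Q            # queries come from the decoder
K = encoder_output @ W_K            # keys come from the encoder
V = encoder_output @ W_V            # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_k))   # (M, N)
context = weights @ V                        # (M, d_k)
print(weights.shape, context.shape)          # (3, 5) (3, 8)
```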

---

# 1️⃣2️⃣ Why It Works So Well

Because:

✔ Parallel attention mechanisms  
✔ Different learned projection spaces  
✔ Richer feature extraction  
✔ Improves gradient flow  
✔ Enables specialization  

It increases model capacity without increasing output dimension.

---

# 1️⃣3️⃣ What Multi-Head Attention Is NOT

It is not:

- Multiple independent models
- Multiple independent sequences

All heads share the same input,
but learn different projections.

---

# 1️⃣4️⃣ Implementation Insight (PyTorch Style)

In practice:

Instead of storing separate matrices for each head,
frameworks often:

- Use one big W_Q, W_K, W_V
- Then reshape into heads

Example shape:

W_Q ∈ ℝ^(d_model × d_model)

Then split into h chunks.

More efficient.
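
The equivalence between one fused matrix and h separate per-head matrices can be checked numerically. The sketch uses NumPy rather than PyTorch to stay dependency-light, and assumes head i owns columns i·d_k through (i+1)·d_k − 1 of the fused matrix, which is one common layout but not the only possible one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(N, d_model))
W_Q = rng.normal(size=(d_model, d_model))        # one fused matrix for all heads

# Fused: one matmul, then reshape (N, d_model) -> (h, N, d_k)
Q_fused = (X @ W_Q).reshape(N, h, d_k).transpose(1, 0, 2)

# Per-head: slice out each head's d_k columns and project separately
Q_separate = np.stack([X @ W_Q[:, i * d_k:(i + 1) * d_k] for i in range(h)])

print(np.allclose(Q_fused, Q_separate))  # True
```

The fused form replaces h small matrix multiplications with one large one, which tends to map better onto hardware.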

---

# 1️⃣5️⃣ Common Interview Questions

Q: Why use multiple heads instead of one large head?  
A: Allows learning diverse relationships in different subspaces.

Q: Does multi-head increase output size?  
A: No. After concatenation and projection, output size = d_model.

Q: Does it increase computational complexity?  
A: Still O(N²), but more expressive.

Q: Why must d_model be divisible by number of heads?  
A: Because we split hidden dimension evenly across heads.

---

# 1️⃣6️⃣ Key Insight

Self-attention learns relationships.

Multi-head attention learns multiple types of relationships simultaneously.

---

# 1️⃣7️⃣ One-Line Summary

Multi-head attention runs multiple scaled dot-product attention mechanisms in parallel, concatenates their outputs, and projects them back to the original dimension to increase representational power.

---

# 1️⃣8️⃣ Ultra-Compressed Formula View

For head i:

head_i = softmax((XW_Q_i)(XW_K_i)ᵀ / √d_k)(XW_V_i)

MultiHead(X) = Concat(head₁, ..., head_h) W_O


# Positional Encoding: Complete Master Notes

---

# 1️⃣ Why Positional Encoding Is Needed

Self-attention has **no built-in notion of order**.

It treats input as a set, not a sequence.

Example:

"dog bites man"
"man bites dog"

Without positional information,
self-attention would treat both as identical collections of tokens.

Therefore:

> We must inject position information into token embeddings.

---

# 2️⃣ Where Positional Encoding Is Applied

Before entering the Transformer layers:

Final input to model:

Input_Embedding = Token_Embedding + Positional_Encoding

Both have shape:

ℝ^(N × d_model)

Where:

N = sequence length  
d_model = hidden dimension  

---

# 3️⃣ Two Main Types of Positional Encoding

There are two primary approaches:

1️⃣ Fixed (Sinusoidal) Positional Encoding  
2️⃣ Learned Positional Embeddings  

---

# 4️⃣ Sinusoidal Positional Encoding (Original Transformer)

Introduced in:

"Attention Is All You Need"

It uses sine and cosine functions of different frequencies.

Formula:

For position pos and dimension i:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))  
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Properties:

- Even dimensions use sine
- Odd dimensions use cosine
- Different frequencies per dimension
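
The formula translates into a few lines of NumPy. This is a sketch that assumes an even d_model; the function name is chosen for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Return an (n_positions, d_model) matrix; assumes d_model is even."""
    pos = np.arange(n_positions)[:, None]              # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # (n_positions, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(5, 8)
print(pe.shape)   # (5, 8)
print(pe[0])      # [0. 1. 0. 1. 0. 1. 0. 1.]
```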

---

# 5️⃣ Why Sine and Cosine?

Important properties:

✔ Unique encoding for each position  
✔ Allows model to learn relative positions  
✔ Periodic structure  
✔ May extrapolate to longer sequences than seen in training  

Because:

sin(a + b) = sin(a)cos(b) + cos(a)sin(b)  
cos(a + b) = cos(a)cos(b) - sin(a)sin(b)

So PE(pos + k) is a linear function of PE(pos) whose coefficients depend only on the offset k, which lets the model infer relative distance.
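
The angle-addition argument can be verified for a single frequency using only the standard library. The frequency below corresponds to i = 1 with d_model = 8; the position and offset are arbitrary values picked for the check.

```python
import math

def pe_pair(pos, omega):
    """The (sin, cos) pair that one frequency omega contributes at position pos."""
    return math.sin(omega * pos), math.cos(omega * pos)

omega = 1 / 10000 ** (2 / 8)     # frequency for i = 1 when d_model = 8
pos, k = 7, 3                    # arbitrary position and offset

s, c = pe_pair(pos, omega)
# Angle-addition identities: the coefficients depend on k but not on pos.
shifted = (s * math.cos(omega * k) + c * math.sin(omega * k),
           c * math.cos(omega * k) - s * math.sin(omega * k))

direct = pe_pair(pos + k, omega)
print(all(math.isclose(a, b) for a, b in zip(shifted, direct)))  # True
```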

---

# 6️⃣ Shape Example

Example:

Sequence length = 5  
d_model = 8  

Positional Encoding matrix shape:

(5 × 8)

Each row corresponds to a position.

Example (conceptual):

Position 0 → [0, 1, 0, 1, 0, 1, 0, 1]  
Position 1 → [sin(1), cos(1), sin(0.1), cos(0.1), sin(0.01), cos(0.01), sin(0.001), cos(0.001)]  
Position 2 → same frequencies evaluated at pos = 2  

---

# 7️⃣ Learned Positional Embeddings

Instead of fixed sinusoids:

We create a learnable embedding matrix:

P ‚àà ‚Ñù^(max_seq_length √ó d_model)

Each position has its own trainable vector.
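
In code, a learned positional embedding is just a second lookup table indexed by position. The sketch below uses random NumPy arrays as stand-ins for trained tables; the sizes and token ids are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_length, d_model = 100, 16, 8   # illustrative sizes

token_table = rng.normal(size=(vocab_size, d_model))          # would be trained
position_table = rng.normal(size=(max_seq_length, d_model))   # P, also trained

token_ids = np.array([12, 7, 7, 45, 3])       # hypothetical token ids
positions = np.arange(len(token_ids))         # 0, 1, 2, 3, 4

x = token_table[token_ids] + position_table[positions]   # (N, d_model)
print(x.shape)                                 # (5, 8)
print(np.allclose(x[1], x[2]))                 # False: same token, different positions
```

A position index at or beyond `max_seq_length` has no row in the table, which is why this scheme cannot be applied to longer sequences without changes.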

Used in:

- Early GPT models (GPT, GPT-2)
- BERT

Many more recent LLMs use rotary or relative schemes instead.

Advantages:

✔ More flexible  
✔ Learns task-specific position patterns  

Disadvantages:

✖ Cannot extrapolate beyond trained sequence length  

---

# 8️⃣ Absolute vs Relative Positional Encoding

## Absolute Position Encoding
Encodes:

"This token is at position 5."

Used in:
- Original Transformer
- BERT
- GPT-2

---

## Relative Position Encoding

Encodes:

"This token is 3 positions away from another token."

Used in:
- Transformer-XL
- T5
- Modern architectures

More powerful for long sequences.

---

# 9️⃣ Why Addition (Not Concatenation)?

We add positional encoding instead of concatenating.

Why?

If concatenated:

Dimension would double → computational cost increases.

By adding:

- Keeps dimension = d_model
- Forces model to integrate position into representation

---

# 🔟 Important Insight

Token embedding captures:

"What this word means."

Positional encoding captures:

"Where this word is."

The sum provides:

"What this word means at this position."

---

# 1️⃣1️⃣ How Position Information Propagates

After addition:

Multi-head attention processes position-aware embeddings.

Through layers:

- Relative distances become encoded
- Structural information emerges
- Word order relationships are learned

---

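# 1️⃣2️⃣ What Happens Without Positional Encoding?

Without a causal mask, self-attention becomes permutation equivariant.

Meaning:

Reordering the input tokens just reorders the outputs the same way; each token's representation is unchanged.

The model could not tell "dog bites man" from "man bites dog", which would break language modeling.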
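
A small numerical check of this, assuming unmasked single-head attention with random weights and no positional information: shuffling the input rows shuffles the output rows in the same way, so each token's output vector is unaffected by where it sits.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
N, d_model, d_k = 5, 16, 8
X = rng.normal(size=(N, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

perm = np.array([3, 0, 4, 1, 2])     # an arbitrary reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

print(np.allclose(out_perm, out[perm]))   # True
```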

---

# 1️⃣3️⃣ Computational Cost

Positional encoding:

- Very cheap
- O(N × d_model)

Compared to attention:

- O(N² × d_model)

Position encoding is negligible in cost.

---

# 1️⃣4️⃣ Modern Variants

Modern large models often use:

- Rotary Positional Embeddings (RoPE)
- ALiBi (Attention with Linear Biases)
- Relative bias encodings

These improve:

✔ Long context performance  
✔ Extrapolation  
✔ Stability  

---

# 1️⃣5️⃣ Example: GPT vs BERT

BERT:
- Learned absolute position embeddings
- Encoder-only
- Bidirectional

GPT:
- Learned absolute position embeddings (earlier versions)
- Decoder-only
- Causal mask applied

Many later decoder-only models (for example GPT-NeoX and LLaMA) use rotary position embeddings instead.

---

# 1️⃣6️⃣ Common Interview Questions

Q: Why do we need positional encoding?  
A: Self-attention has no inherent order awareness.

Q: Why sine and cosine?  
A: Enables relative position inference and generalization.

Q: Why add instead of concatenate?  
A: Keeps dimension fixed and computationally efficient.

Q: What is the difference between absolute and relative encoding?  
A: Absolute encodes index; relative encodes distance between tokens.

---

# 1️⃣7️⃣ One-Line Summary

Positional encoding injects order information into token embeddings so that self-attention can model sequence structure.

---

# 1️⃣8️⃣ Ultra-Compressed View

Input = TokenEmbedding + PositionEncoding

Without it → model cannot distinguish word order.
