# Attention


### Attention Mechanism Formula

The core equation for **scaled dot-product attention** is:

$$
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
$$

#### Components:
- **\( Q \) (Query)**: Represents what the model is "looking for" (e.g., a word in a sentence).  
- **\( K \) (Key)**: Encodes what each element in the input sequence "offers" (used to compute relevance).  
- **\( V \) (Value)**: Contains the actual content to be retrieved if the query matches the key.  
- **\( d_k \)**: Dimensionality of the key vectors. Dot products grow in magnitude as \( d_k \) grows, which pushes softmax into saturated regions where gradients become extremely small; dividing by \( \sqrt{d_k} \) counteracts this.
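
To make the role of the \( \sqrt{d_k} \) factor concrete, here is a minimal sketch. It assumes query and key entries are independent with mean 0 and variance 1 (an assumption made only for illustration) and compares the spread of raw and scaled dot products for a few values of `d_k`.

```python
import torch

torch.manual_seed(0)  # fixed seed so the illustration is repeatable

for d_k in (4, 64, 1024):
    # Assumption: entries are independent, mean 0, variance 1
    q = torch.randn(2000, d_k)
    k = torch.randn(2000, d_k)

    raw = (q * k).sum(dim=-1)    # 2000 sample dot products q . k
    scaled = raw / d_k ** 0.5    # the same dot products divided by sqrt(d_k)

    print(f"d_k={d_k:5d}  std(raw)={raw.std().item():7.2f}  "
          f"std(scaled)={scaled.std().item():.2f}")
```

Under this assumption the raw standard deviation grows roughly like \( \sqrt{d_k} \), while the scaled one stays close to 1, so the softmax inputs remain in a range where gradients are still useful.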

---

### Intuition:
1. **\( QK^T \)**: Computes pairwise similarity scores between queries and keys.  
2. **Softmax**: Normalizes scores into probabilities (attention weights).  
3. **Weighted Sum (\( \cdot V \))**: Aggregates values based on attention weights.  

---

### Key Properties:
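- **Efficiency**: All positions are computed at once with matrix multiplications, whereas an RNN has to step through the sequence one token at a time.
- **Interpretability**: Attention weights show how strongly each output draws on each input, which gives a rough view of what the model focuses on.
- **Scalability**: Attention is the core building block of Transformers, used for tasks like translation (e.g., "hello" → "hola").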
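
The parallelism claim can be checked with a small sketch: processing one query at a time in a Python loop gives the same result as a single matrix multiplication over all queries. The tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(5, 8)   # 5 queries, d_k = 8 (illustrative sizes)
K = torch.randn(5, 8)
V = torch.randn(5, 8)
scale = Q.size(-1) ** 0.5

# Sequential version: handle one query at a time
rows = []
for q in Q:
    w = F.softmax(q @ K.T / scale, dim=-1)   # (5,) weights over all keys
    rows.append(w @ V)                       # (8,) weighted sum of values
looped = torch.stack(rows)                   # (5, 8)

# Parallel version: all queries in one matrix multiplication
batched = F.softmax(Q @ K.T / scale, dim=-1) @ V

print(torch.allclose(looped, batched, atol=1e-5))
```

The two results agree up to floating-point rounding; the batched form is the one that maps well onto GPUs.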

---

### Example (PyTorch):
```python
import torch
import torch.nn.functional as F

# Input tensors (batch_size=1, d_k=3): one query attending over two key/value pairs
Q = torch.tensor([[[1.0, 2.0, 3.0]]])  # Query
K = V = torch.tensor([[[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])  # Key=Value

scores = torch.matmul(Q, K.transpose(-1, -2)) / (3 ** 0.5)  # QK^T / sqrt(d_k)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, V)  # Weighted sum

print("Attention Output:", output)
```

**Output** (weighted combination of values):  
```
Attention Output: tensor([[[6.9999, 7.9999, 8.9999]]])
```
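
As a sanity check, the manual computation can be compared with PyTorch's built-in `F.scaled_dot_product_attention`. This is a sketch that assumes PyTorch 2.0 or newer, where that function is available; by default it scales by one over the square root of the query's last dimension, matching the formula above.

```python
import torch
import torch.nn.functional as F

Q = torch.tensor([[[1.0, 2.0, 3.0]]])                        # (1, 1, 3)
K = V = torch.tensor([[[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])   # (1, 2, 3)

# Manual scaled dot-product attention
scores = Q @ K.transpose(-1, -2) / Q.size(-1) ** 0.5
manual = F.softmax(scores, dim=-1) @ V

# Built-in version (assumes PyTorch >= 2.0)
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, builtin, atol=1e-4))
```

Almost all of the attention weight falls on the second key, which is why the output sits so close to the second value vector `[7, 8, 9]`.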

---

### Applications:
1. **Transformers**: Self-attention in encoder/decoder layers.  
2. **Vision**: Image captioning (attending to image regions).  
3. **NLP**: BERT, GPT (contextual word representations).  

**Reference**: [Vaswani et al. (2017), "Attention Is All You Need"](https://arxiv.org/abs/1706.03762).

In [1]:
import torch

# Input: 3 words with embedding size 4 (for simplicity)
X = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],   # Word 1 ("I")
    [0.0, 1.0, 1.0, 0.0],    # Word 2 ("love")
    [1.0, 0.5, 0.0, 1.0]     # Word 3 ("ice")
])


In [2]:
# Step 1: Initialize random weight matrices for Q, K, V
d_k = X.shape[1]  # Dimension of embeddings (4)
W_Q = torch.randn(d_k, d_k)
W_K = torch.randn(d_k, d_k)
W_V = torch.randn(d_k, d_k)

In [6]:
Q = X @ W_Q.T   # (3, 4)
K = X @ W_K.T   # (3, 4)
V = X @ W_V.T   # (3, 4)

In [8]:
# Compute attention
scores = Q @ K.T               # (3, 3)

In [15]:
import torch.nn.functional as F

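scaled_scores = scores / (d_k ** 0.5)  # scale by sqrt(d_k) = 2, since d_k = 4
attention_weights = F.softmax(scaled_scores, dim=-1)  # (3, 3)
output = attention_weights @ V   # (3, 4)
# W_Q, W_K, W_V are random and unseeded, so the printed values change on every run
print("Attention Output:\n", output)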

Attention Output:
 tensor([[ 2.2379,  0.5276, -1.0577,  0.5538],
        [ 0.5632, -0.0875, -1.0984, -0.2464],
        [ 0.4480, -0.1025, -1.1207, -0.3185]])
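
Because the weight matrices are random, the numbers above differ between runs. The sketch below repeats the same steps in one self-contained cell with a fixed seed (the seed value 0 is an arbitrary choice) and checks two properties that hold regardless of the weights: each row of the attention weights sums to 1, and the output has one row per input word.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # any fixed seed makes the run repeatable

X = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],   # Word 1 ("I")
    [0.0, 1.0, 1.0, 0.0],   # Word 2 ("love")
    [1.0, 0.5, 0.0, 1.0],   # Word 3 ("ice")
])
d_k = X.shape[1]
W_Q, W_K, W_V = (torch.randn(d_k, d_k) for _ in range(3))

Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
attention_weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # (3, 3)
output = attention_weights @ V                                # (3, 4)

print(attention_weights.sum(dim=-1))  # each row sums to 1 up to rounding
print(output.shape)
```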


## Self-Attention as a PyTorch Module

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        self.embed_size = embed_size
        self.W_Q = nn.Linear(embed_size, embed_size)
        self.W_K = nn.Linear(embed_size, embed_size)
        self.W_V = nn.Linear(embed_size, embed_size)
        
    def forward(self, X):
        Q = self.W_Q(X)
        K = self.W_K(X)
        V = self.W_V(X)
        
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        
        return output

In [19]:
# Usage
attention = SelfAttention(embed_size=4)
output = attention(X)
print("Output with PyTorch Module:\n", output)

Output with PyTorch Module:
 tensor([[-0.8030, -0.0612, -0.5328,  1.0383],
        [-0.8797, -0.0825, -0.5166,  1.0086],
        [-0.8244, -0.0654, -0.5287,  1.0291]], grad_fn=<MmBackward0>)
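
Since the module only uses `nn.Linear` and matrix operations on the last two dimensions, it should also accept batched input. The sketch below repeats a condensed copy of the class so that it runs on its own, then checks the output shapes and counts the learnable parameters. The input tensors are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Condensed copy of the module above, repeated so this snippet runs alone."""
    def __init__(self, embed_size):
        super().__init__()
        self.embed_size = embed_size
        self.W_Q = nn.Linear(embed_size, embed_size)
        self.W_K = nn.Linear(embed_size, embed_size)
        self.W_V = nn.Linear(embed_size, embed_size)

    def forward(self, X):
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        scores = Q @ K.transpose(-2, -1) / self.embed_size ** 0.5
        return F.softmax(scores, dim=-1) @ V

attention = SelfAttention(embed_size=4)

single = torch.randn(3, 4)       # (seq_len, embed_size)
batch = torch.randn(2, 3, 4)     # (batch_size, seq_len, embed_size)

print(attention(single).shape)   # torch.Size([3, 4])
print(attention(batch).shape)    # torch.Size([2, 3, 4])

# Three Linear layers, each with a 4x4 weight and a bias of length 4
n_params = sum(p.numel() for p in attention.parameters())
print(n_params)                  # 60
```

Each sequence in the batch is processed independently: attention is only computed between positions of the same sequence.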


## Multi-Head Attention

In [11]:
# Define projection matrices for 2 heads
W_q1 = torch.randn(d_k, d_k)
W_k1 = torch.randn(d_k, d_k)
W_v1 = torch.randn(d_k, d_k)

W_q2 = torch.randn(d_k, d_k)
W_k2 = torch.randn(d_k, d_k)
W_v2 = torch.randn(d_k, d_k)



In [12]:
# Head 1
Q1 = X @ W_q1.T
K1 = X @ W_k1.T
V1 = X @ W_v1.T
scores1 = Q1 @ K1.T / (d_k ** 0.5)  # scale by sqrt(d_k) = 2
attn1 = F.softmax(scores1, dim=-1)
out1 = attn1 @ V1

# Head 2
Q2 = X @ W_q2.T
K2 = X @ W_k2.T
V2 = X @ W_v2.T
scores2 = Q2 @ K2.T / (d_k ** 0.5)  # scale by sqrt(d_k) = 2
attn2 = F.softmax(scores2, dim=-1)
out2 = attn2 @ V2

In [13]:
# Concatenate heads
multi_head_output = torch.cat([out1, out2], dim=-1)

print("Multi-head Attention Output:\n", multi_head_output)

Multi-head Attention Output:
 tensor([[ 0.7570, -0.1414,  1.1894,  0.8090, -0.8680, -0.5871, -1.7324, -1.8429],
        [ 0.7806, -0.1700,  1.2400,  0.7191, -0.8476, -0.5985, -1.5904, -1.8747],
        [ 0.8430, -0.2384,  1.3093,  0.5228, -0.8673, -0.5944, -1.6964, -1.8339]])


**Each head sees the data differently: one might focus on similar directions, another on differences, and so on.**

Above, I created a separate random weight matrix for each head (W_q1, W_q2, etc.) explicitly, to illustrate how multi-head attention works with different views. In practice, **these matrices are learnable parameters** that the model optimizes during training.

When building a real multi-head attention module in PyTorch, we **create the projection matrices with nn.Linear**, which handles the random weight initialization and adds a bias term by default.
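
As a minimal sketch of that idea, the two heads can be rebuilt with `nn.Linear` layers so that their weights are registered as learnable parameters. The container layout (`nn.ModuleList` of `nn.ModuleDict`) and the dummy loss are choices made here for illustration, not a prescribed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim, num_heads = 4, 2   # illustrative sizes matching the toy example

# One (q, k, v) projection triple per head, registered as learnable parameters
heads = nn.ModuleList([
    nn.ModuleDict({
        "q": nn.Linear(embed_dim, embed_dim),
        "k": nn.Linear(embed_dim, embed_dim),
        "v": nn.Linear(embed_dim, embed_dim),
    })
    for _ in range(num_heads)
])

X = torch.randn(3, embed_dim)   # 3 tokens (random placeholder input)

outs = []
for head in heads:
    Q, K, V = head["q"](X), head["k"](X), head["v"](X)
    attn = F.softmax(Q @ K.T / embed_dim ** 0.5, dim=-1)
    outs.append(attn @ V)
multi_head_output = torch.cat(outs, dim=-1)   # (3, 8)

# A dummy loss, used only to show that gradients reach every head's weights
multi_head_output.sum().backward()
print(all(p.grad is not None for p in heads.parameters()))
```

An optimizer given `heads.parameters()` would then update every head's projections during training.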


### Multi-Head Attention Mechanism

Multi-head attention extends the standard attention mechanism by running several attention heads in parallel, allowing the model to jointly attend to information from different representation subspaces. The key equations are:

#### **1. Single Attention Head**  
For each head \( i \):  

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)  
$$

- **\( Q, K, V \)**: Input Query, Key, and Value matrices.  
- **\( W_i^Q, W_i^K, W_i^V \)**: Learnable weight matrices for head \( i \).  
- **Attention**: Scaled dot-product attention:  
  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $$

#### **2. Concatenation and Projection**  
All heads are concatenated and linearly projected:  

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O  
$$

- **\( h \)**: Number of attention heads.  
- **\( W^O \)**: Learnable output projection matrix.  
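
The two equations can be followed literally in a short sketch, with one explicit weight triple per head. The sizes and the random matrices are assumptions for illustration; in a trained model these matrices would be learned.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, h = 3, 8, 2   # illustrative sizes
d_k = d_model // h              # 4 dimensions per head

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs."""
    return F.softmax(Q @ K.T / K.size(-1) ** 0.5, dim=-1) @ V

X = torch.randn(seq_len, d_model)   # self-attention: Q = K = V = X

# One (W_i^Q, W_i^K, W_i^V) triple per head, each of shape (d_model, d_k)
W_Q = [torch.randn(d_model, d_k) for _ in range(h)]
W_K = [torch.randn(d_model, d_k) for _ in range(h)]
W_V = [torch.randn(d_model, d_k) for _ in range(h)]
W_O = torch.randn(h * d_k, d_model)   # output projection W^O

# head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]

# MultiHead = Concat(head_1, ..., head_h) W^O
multi_head = torch.cat(heads, dim=-1) @ W_O

print(heads[0].shape)     # torch.Size([3, 4])
print(multi_head.shape)   # torch.Size([3, 8])
```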

---

### **Key Intuitions**  
1. **Parallel Processing**:  
   - Each head learns different attention patterns (e.g., local vs. global dependencies).  
   - Example: In translation, one head may focus on syntax, another on semantics.  

2. **Dimensionality Management**:  
   - Each head projects the input into its own lower-dimensional subspace (typically \( d_k = d_v = d_{\text{model}}/h \)), so the total cost stays close to that of a single full-width head.

3. **Expressiveness**:  
   - Multi-head attention captures diverse relationships (e.g., coreference resolution + word order in NLP).  
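
Implementations usually realize the per-head projections with one wide `nn.Linear` whose output is reshaped into \( h \) heads. The sketch below (sizes are arbitrary) checks that this is equivalent to \( h \) separate projections, each using its own block of rows from the weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, h = 2, 5, 8, 2   # illustrative sizes
d_k = d_model // h

proj = nn.Linear(d_model, d_model)
X = torch.randn(batch, seq_len, d_model)

# Fused: project once, then split the last dimension into (h, d_k)
fused = proj(X).view(batch, seq_len, h, d_k).transpose(1, 2)   # (batch, h, seq_len, d_k)

# Separate: head i uses rows i*d_k:(i+1)*d_k of the weight and bias
separate = torch.stack(
    [X @ proj.weight[i * d_k:(i + 1) * d_k].T + proj.bias[i * d_k:(i + 1) * d_k]
     for i in range(h)],
    dim=1,
)   # (batch, h, seq_len, d_k)

print(torch.allclose(fused, separate, atol=1e-5))
```

The fused form does the same arithmetic in a single matrix multiplication, which is why it is the common choice.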

---

### **PyTorch Implementation**  
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h
        
        # Weight matrices
        self.W_Q = nn.Linear(d_model, d_model)  # Queries
        self.W_K = nn.Linear(d_model, d_model)  # Keys
        self.W_V = nn.Linear(d_model, d_model)  # Values
        self.W_O = nn.Linear(d_model, d_model)  # Output

    def forward(self, Q, K, V):
        batch_size = Q.size(0)
        
        # Linear projections split into h heads
        Q = self.W_Q(Q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_K(K).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_V(V).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention per head
        scores = torch.matmul(Q, K.transpose(-1, -2)) / (self.d_k ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        head = torch.matmul(weights, V)
        
        # Concatenate and project
        head = head.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_O(head)
```
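
For comparison, PyTorch ships its own layer, `nn.MultiheadAttention`. The sketch below uses that built-in layer rather than the class above, with the same `d_model=512` and `h=8`; the `batch_first=True` argument is assumed to be available (it is in recent PyTorch releases), and the internal details of the built-in layer may differ from the hand-written version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, h = 512, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=h, batch_first=True)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model), random placeholder
out, weights = mha(x, x, x)       # self-attention: query = key = value

print(out.shape)       # torch.Size([2, 10, 512])
print(weights.shape)   # torch.Size([2, 10, 10]), averaged over heads by default
```

The output has the same shape as the input, which is also what the `forward` method above returns for inputs of shape `(batch, seq_len, d_model)`.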

---

### **Why Multi-Head?**  
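| **Single-Head**                                          | **Multi-Head**                                           |
|----------------------------------------------------------|----------------------------------------------------------|
| One attention distribution per position                  | Several distributions per position, one per head         |
| A single softmax average can blur distinct relationships | Heads can attend to different positions without blending |
| One representation subspace                              | Several lower-dimensional subspaces at similar total cost|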

---

### **Applications**  
1. **Transformers**:  
   - Self-attention in encoder/decoder layers (e.g., BERT, GPT).  
2. **Vision**:  
   - Vision Transformers (ViT) for image patches.  
3. **Speech**:  
   - Conformer models combine CNNs + multi-head attention.  

**Reference**: [Vaswani et al. (2017), "Attention Is All You Need"](https://arxiv.org/abs/1706.03762).  

In [None]:
class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        
        # Output projection after concatenation of all heads
        self.W_o = nn.Linear(embed_dim, embed_dim)
        
        