```{contents}
```
## Cross-Attention

**Cross-attention** is an attention mechanism where one sequence (the **query**) attends to a *different* sequence (the **context / source**) in order to extract relevant information.

It is the mechanism that allows models to **connect two different information sources**.

---

### **Core Intuition**

Imagine writing a summary while reading a document.

* Your **current sentence** = query
* The **document you're reading** = keys & values
* Your brain focuses on the most relevant parts of the document for each word you write

That focus process is **cross-attention**.

---

### **How It Works**

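Given:

* Query matrix (Q), computed from one sequence
* Key matrix (K) and value matrix (V), both computed from another sequence

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V
$$

Here (d) is the dimension of the query and key vectors. The softmax is applied row by row, so each query position gets a set of weights over the context positions that sums to 1, and its output is the corresponding weighted average of the value vectors.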
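The formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names `softmax` and `cross_attention`, the shapes, and the random inputs are all assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def cross_attention(Q, K, V):
    """Scaled dot-product attention with Q from one sequence and K, V from another.

    Q: (n_q, d)     one query vector per position of the querying sequence
    K: (n_kv, d)    one key vector per position of the context sequence
    V: (n_kv, d_v)  one value vector per position of the context sequence
    Returns (output, weights) with shapes (n_q, d_v) and (n_q, n_kv).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


# Illustrative random inputs: 3 query positions, 5 context positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))

output, weights = cross_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 4) (3, 5)
```

Note that the weight matrix is rectangular: one row per query position and one column per context position, because the two sequences may have different lengths.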

---

### **Architecture Placement**

#### Encoder–Decoder Transformers

```
Encoder Output ──► K, V
Decoder State ───► Q
```

The decoder queries the encoder’s output to generate each new token.
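As a rough sketch of that wiring, the snippet below projects a decoder state into queries and an encoder output into keys and values, then applies scaled dot-product attention. It assumes a single head; the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16          # illustrative model width
n_src, n_tgt = 6, 4   # source tokens, target tokens generated so far

encoder_output = rng.normal(size=(n_src, d_model))  # one row per source token
decoder_state = rng.normal(size=(n_tgt, d_model))   # one row per target token

# Hypothetical projection matrices (learned in a real model, random here).
W_q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

Q = decoder_state @ W_q    # queries come from the decoder
K = encoder_output @ W_k   # keys come from the encoder
V = encoder_output @ W_v   # values come from the encoder

scores = Q @ K.T / np.sqrt(d_model)                   # (n_tgt, n_src)
scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

context = weights @ V      # (n_tgt, d_model): source information per target token
print(weights.shape, context.shape)  # (4, 6) (4, 16)
```

No causal mask is applied here: the whole source sequence is already available, so every target position may attend to every source position.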

---

### **Why It Is Essential**

Cross-attention allows:

* Translation (target language attends to source language)
* Summarization (summary attends to document)
* Multimodal grounding (text attends to images, audio, video)
* Retrieval-Augmented Generation (the generator attends to retrieved documents, in architectures that encode those documents separately)

---

### **Applications**

#### Machine Translation

Target sentence attends to source sentence.

#### Question Answering

Answer attends to the relevant parts of the passage.

#### RAG Systems

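Generated output attends to retrieved documents. This applies to designs that encode the retrieved documents separately from the prompt; systems that simply paste retrieved text into the prompt of a decoder-only model rely on self-attention instead.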

#### Multimodal AI

Text attends to image features or audio embeddings.
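One practical detail is that the two modalities usually have different feature widths, so the projections map both into a shared attention dimension. The sketch below illustrates this under assumed sizes (`d_text`, `d_image`, `d_attn`) and random stand-in weights; it is not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_image, d_attn = 12, 20, 8   # illustrative sizes only
n_tokens, n_patches = 5, 9

text_states = rng.normal(size=(n_tokens, d_text))       # one row per text token
image_features = rng.normal(size=(n_patches, d_image))  # one row per image patch

# Hypothetical projections into a shared attention space (random, not learned).
W_q = rng.normal(size=(d_text, d_attn)) / np.sqrt(d_text)
W_k = rng.normal(size=(d_image, d_attn)) / np.sqrt(d_image)
W_v = rng.normal(size=(d_image, d_attn)) / np.sqrt(d_image)

Q = text_states @ W_q      # (n_tokens, d_attn)
K = image_features @ W_k   # (n_patches, d_attn)
V = image_features @ W_v   # (n_patches, d_attn)

scores = Q @ K.T / np.sqrt(d_attn)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)  # (n_tokens, n_patches)

fused = weights @ V        # image information gathered for each text token
print(fused.shape)  # (5, 8)
```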

#### Vision-Language Models

Caption or answer tokens attend to image features, as in image captioning and visual question answering.

---

### **Benefits**

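| Benefit               | Explanation                                                                              |
| --------------------- | ---------------------------------------------------------------------------------------- |
| Information grounding | Connects output to external knowledge                                                    |
| Precision             | Weights each context position by its relevance to the current query                      |
| Modularity            | Decouples source and target sequences, which can differ in length and modality           |
| Scalability           | Cost grows with query length × context length, not the square of their combined length   |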

---

### **Cross-Attention vs Self-Attention**

| Feature           | Self-Attention                     | Cross-Attention                         |
| ----------------- | ---------------------------------- | --------------------------------------- |
| Source of Q, K, V | Q, K, V all from the same sequence | Q from one sequence; K, V from another  |
| Purpose           | Understand internal context        | Fuse external context                   |
| Used in           | Encoder & Decoder                  | Decoder & multimodal models             |
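The difference in the first row can be shown with a single attention function called two ways. In this sketch the learned projections are omitted for brevity (the raw inputs are used directly as Q, K, and V), and the function name `attention` and all sizes are assumptions made for the example.

```python
import numpy as np


def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # the sequence being processed, 4 tokens
context = rng.normal(size=(7, 8))  # a different sequence, 7 tokens

# Self-attention: Q, K, and V all come from x.
_, self_w = attention(x, x, x)

# Cross-attention: Q comes from x; K and V come from the other sequence.
_, cross_w = attention(x, context, context)

print(self_w.shape)   # (4, 4)
print(cross_w.shape)  # (4, 7)
```

The self-attention weight matrix is always square, while the cross-attention weight matrix has one column per context position.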

---

**Intuition Summary**

Cross-attention is the model’s way of **looking up relevant information from another source while generating each output token**.