```{contents}
```
## Sparse Attention

**Sparse Attention** is an optimization of the standard attention mechanism where each token attends to **only a subset of other tokens** instead of all tokens.

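It cuts the number of query-key comparisons from quadratic to roughly linear in sequence length, and in practice it often keeps model quality close to that of full attention on long sequences.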

---

### **Core Intuition**

In human reading, you do not attend to every word equally.
You focus on **nearby context** and a few **important distant points**.

Sparse attention does the same:

> **Attend locally, plus a few global anchors.**

---

### **Problem with Full Attention**

Standard attention scores every pair of tokens, so for a sequence of n tokens its time and memory cost is:

$$
O(n^2)
$$

Doubling the sequence length quadruples the cost, which becomes impractical for long sequences.

Sparse attention limits each token to about k other tokens, which reduces the cost to:

$$
O(n \cdot k)
\quad \text{where } k \ll n
$$
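To make the gap concrete, the short sketch below counts query-key score computations for a few sequence lengths. The function name `attention_score_counts` and the choice `k = 512` are illustrative assumptions, not values taken from any particular model.

```python
def attention_score_counts(n: int, k: int) -> tuple[int, int]:
    """Return (dense, sparse) numbers of query-key scores for n tokens.

    Dense attention scores every pair of tokens. Sparse attention scores
    at most k keys per query (k is capped at n for short sequences).
    """
    dense = n * n
    sparse = n * min(k, n)
    return dense, sparse


if __name__ == "__main__":
    # k = 512 is an arbitrary illustrative budget of keys per query.
    for n in (1_024, 16_384, 131_072):
        dense, sparse = attention_score_counts(n, k=512)
        print(f"n={n:>7,}  dense={dense:>17,}  sparse={sparse:>11,}  "
              f"saving={dense // sparse}x")
```

With k held fixed, the saving factor is n / k, so it grows linearly with the sequence length.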

---

### **How Sparse Attention Works**

Instead of connecting every token to every other token, sparse attention connects tokens according to **structured patterns**.

#### Common Patterns

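* Local window attention: each token attends to neighbors within a fixed distance
* Strided attention: each token attends to positions at regular intervals
* Block sparse attention: tokens are grouped into blocks, and attention is computed only between selected blocks
* Global tokens: a few designated tokens attend to, and are attended by, every token
* Random attention: each token attends to a small random set of other tokens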
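As a rough illustration, each of the patterns above can be written as a boolean mask in which `mask[i][j]` is true when token `i` may attend to token `j`. The helper names and parameters (`window`, `stride`, `block`, `per_row`) are hypothetical, chosen for this sketch. Models such as Longformer or BigBird define their own variants, and the block mask here keeps only same-block attention for simplicity.

```python
import random

Mask = list[list[bool]]


def local_window_mask(n: int, window: int) -> Mask:
    """Token i attends to token j when they are at most `window` apart."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def strided_mask(n: int, stride: int) -> Mask:
    """Token i attends to token j when their distance is a multiple of `stride`."""
    return [[(i - j) % stride == 0 for j in range(n)] for i in range(n)]


def block_mask(n: int, block: int) -> Mask:
    """Token i attends to token j when both fall in the same block."""
    return [[i // block == j // block for j in range(n)] for i in range(n)]


def global_mask(n: int, global_tokens: list[int]) -> Mask:
    """Global tokens attend to everything, and everything attends to them."""
    g = set(global_tokens)
    return [[i in g or j in g for j in range(n)] for i in range(n)]


def random_mask(n: int, per_row: int, seed: int = 0) -> Mask:
    """Each token attends to `per_row` randomly chosen tokens (per_row <= n)."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in rng.sample(range(n), per_row):
            mask[i][j] = True
    return mask


def union(*masks: Mask) -> Mask:
    """Combine patterns: a link is allowed if any pattern allows it."""
    n = len(masks[0])
    return [[any(m[i][j] for m in masks) for j in range(n)] for i in range(n)]


def show(mask: Mask) -> str:
    """Render a mask as text, '#' for allowed links and '.' for masked ones."""
    return "\n".join("".join("#" if x else "." for x in row) for row in mask)


if __name__ == "__main__":
    combined = union(
        local_window_mask(8, window=1),
        global_mask(8, global_tokens=[0]),
        random_mask(8, per_row=1),
    )
    print(show(combined))
```

Combining masks with `union` mirrors how several patterns are usually layered in a single attention layer.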

These patterns together ensure information flow without full connectivity.
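One way to check the information-flow claim is to count how many attention layers it takes for one token's information to reach another. The sketch below rests on simplifying assumptions: each layer lets token `i` read from the tokens in `neighbors[i]`, and `build_neighbors` and `layers_to_reach` are hypothetical helpers written for this example.

```python
from collections import deque


def build_neighbors(n, window, global_tokens=()):
    """For each token, the set of tokens it reads from in one layer."""
    g = set(global_tokens)
    neighbors = []
    for i in range(n):
        if i in g:
            # A global token reads from every position.
            neighbors.append(set(range(n)))
        else:
            local = set(range(max(0, i - window), min(n, i + window + 1)))
            # Every token also reads from the global tokens.
            neighbors.append(local | g)
    return neighbors


def layers_to_reach(neighbors, source, target):
    """Fewest layers until `target` can depend on `source`, or None if never.

    Breadth-first search backwards from `target` along "reads from" links.
    """
    dist = {target: 0}
    queue = deque([target])
    while queue:
        cur = queue.popleft()
        if cur == source:
            return dist[cur]
        for j in neighbors[cur]:
            if j not in dist:
                dist[j] = dist[cur] + 1
                queue.append(j)
    return None


if __name__ == "__main__":
    local_only = build_neighbors(33, window=2)
    with_global = build_neighbors(33, window=2, global_tokens=[16])
    print(layers_to_reach(local_only, source=0, target=32))   # 16
    print(layers_to_reach(with_global, source=0, target=32))  # 2
```

In this toy setting a single global token shortens the path between the two ends of the sequence from 16 layers to 2, which is the sense in which a few global links preserve information flow.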

---

### **Architecture Example**

```
[Local Window]  [Local Window]  [Local Window]
      ↓              ↓              ↓
    Global Token  ← Strided Links →  Global Token
```
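The diagram can be read as a recipe: every token attends to its local window, and designated global tokens link distant windows. The following minimal sketch, in plain Python with no libraries, computes scaled dot-product attention only over each query's allowed keys. The function names, the `window` parameter, and the toy inputs are assumptions for illustration. Production implementations use batched tensor kernels instead.

```python
import math


def allowed_keys(n, window, global_tokens):
    """For each query position, the sorted key positions it may attend to."""
    g = set(global_tokens)
    keys = []
    for i in range(n):
        if i in g:
            keys.append(list(range(n)))  # global token sees everything
        else:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            keys.append(sorted(set(range(lo, hi)) | g))
    return keys


def sparse_attention(q, k, v, keys):
    """Scaled dot-product attention restricted to keys[i] for each query i.

    q, k, v are lists of n vectors (lists of floats). Each keys[i] must be
    non-empty.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, idx in enumerate(keys):
        # Score only the allowed keys for this query.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the allowed value vectors.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, idx)) / total
            for c in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    n = 6
    q = [[float(i), 1.0] for i in range(n)]
    k = [[1.0, float(i)] for i in range(n)]
    v = [[float(i), float(n - i)] for i in range(n)]
    keys = allowed_keys(n, window=1, global_tokens=[0])
    for i, row in enumerate(sparse_attention(q, k, v, keys)):
        print(i, keys[i], [round(x, 3) for x in row])
```

The inner loops touch only `len(keys[i])` entries per query, which is where the O(n · k) cost comes from.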

---

### **Applications**

#### Long Document Modeling

Books, research papers, legal contracts.

#### Genomic Modeling

DNA sequences with millions of tokens.

#### Video & Audio Processing

Long temporal dependencies.

#### Multimodal Models

Large images and long text.

#### Efficient LLMs

Handling extended context windows.

---

### **Benefits**

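| Benefit                   | Explanation                                                   |
| ------------------------- | ------------------------------------------------------------- |
| Scales to long sequences  | Cost grows as O(nk) rather than O(n²)                         |
| Lower memory usage        | Only the allowed query-key scores are computed and stored     |
| Maintains performance     | Global, strided, or random links carry long-range information |
| Enables long-context LLMs | Longer inputs fit within a fixed compute and memory budget    |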

---

### **Popular Sparse Attention Models**

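* Longformer (sliding window plus global tokens)
* BigBird (window, global, and random attention)
* Reformer (locality-sensitive hashing groups similar tokens, roughly O(n log n))
* Sparse Transformer (strided and fixed factorized patterns)
* GPT-4 long-context variants (architecture not publicly disclosed, so the use of sparse attention is unconfirmed)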

---

### **Sparse vs Dense Attention**

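| Feature      | Dense Attention           | Sparse Attention                 |
| ------------ | ------------------------- | -------------------------------- |
| Computation  | O(n²)                     | O(nk)                            |
| Memory       | O(n²) scores              | O(nk) scores                     |
| Long-context | Limited by quadratic cost | Practical for long inputs        |
| Accuracy     | Reference quality         | Often comparable, task dependent |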

---

### **Intuition Summary**

Sparse attention tells the model **where to look** instead of having it look everywhere.

