```{contents}
```
## KV Cache

The **KV Cache** (Key–Value Cache) is an **inference-time optimization** in autoregressive language models.
It stores the **Key (K)** and **Value (V)** vectors computed during attention so they do **not** need to be recomputed at every new token generation step.

---

### Why KV Cache Is Needed

In an autoregressive model (GPT-like):

* Tokens are generated **one-by-one**.
* Each new token must attend to **all previous tokens**.

Without caching, when generating token *t*, the model recomputes:

```
K1, V1
K2, V2
...
K(t-1), V(t-1)
Kt , Vt
```

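Every step therefore **repeats the K/V projections for all earlier tokens**, even though causal masking guarantees those keys and values have not changed.

Over a sequence of *n* tokens this adds up to **O(n²)** K/V projections, on top of the attention computation itself, which makes decoding extremely slow.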

---

### What KV Cache Does

#### **It stores all previously computed K and V vectors.**

So at step *t*, the model computes **only the new token's** K and V and reuses the cache, which already holds:

```
K_cache = [K1, K2, ..., K(t-1)]
V_cache = [V1, V2, ..., V(t-1)]
```

At the new step:

1. Compute Kt, Vt
2. Append them to the cache
3. Run attention only between:

   * Query Qt (new token)
   * Cached Ks & Vs
   * Newly computed Kt & Vt

Thus, the model **never recomputes** K1…K(t−1) again.

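Over a sequence of *n* tokens, this cuts the K/V projection work from **O(n²)** to **O(n)**. Attention at step *t* still reads all *t* cached entries, so the cost of each step grows linearly with the context length rather than staying constant.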
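The claim that caching changes the cost but not the result can be checked with a minimal sketch. The code below assumes a single attention head in a single layer, uses small random matrices in place of learned weights, and is written in plain Python so it runs without extra dependencies. The names `W_q`, `W_k`, `W_v`, `step_cached`, and `last_output_recompute` are illustrative, not taken from any library.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

random.seed(0)
DIM = 4  # toy embedding size (assumption)

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]

def last_output_recompute(xs):
    # No cache: project K and V for every token seen so far, every time.
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    return attend(matvec(W_q, xs[-1]), keys, values)

k_cache, v_cache = [], []

def step_cached(x_t):
    # With cache: project K and V for the newest token only, then append.
    k_cache.append(matvec(W_k, x_t))
    v_cache.append(matvec(W_v, x_t))
    return attend(matvec(W_q, x_t), k_cache, v_cache)

cached_outputs = [step_cached(x) for x in tokens]
recomputed_outputs = [last_output_recompute(tokens[:t + 1])
                      for t in range(len(tokens))]

# Largest elementwise difference between the two strategies.
max_diff = max(abs(a - b)
               for c, r in zip(cached_outputs, recomputed_outputs)
               for a, b in zip(c, r))
```

Both strategies feed the same keys and values into the same attention function, so `max_diff` should be zero up to floating-point error. The only difference is how many projections were computed along the way.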

---

### How KV Cache Works Internally

#### Prefill Phase

* The prompt (initial input) is processed *all at once*.
* K and V for each prompt token are computed and placed in cache.
* All prompt positions are handled in parallel in a single forward pass, so this phase uses the hardware efficiently.

#### Decode Phase

* Model predicts one token at a time.
* For each step:

  * Reuse cached Ks and Vs
  * Compute only K/V for the newest token
  * Append them to the cache

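This phase tends to be limited by memory bandwidth rather than compute: each step reads the model weights and the entire cache from memory but performs comparatively little arithmetic, because only one token is processed.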
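The two phases can be sketched as a toy loop. This is a structural illustration only: `project_kv` is a hypothetical stand-in for a layer's K/V projections, the cache is a plain dictionary, and the generated tokens are placeholder strings rather than samples from a model.

```python
def project_kv(token):
    """Hypothetical stand-in for computing one token's K and V."""
    return ("K", token), ("V", token)

def prefill(prompt):
    """Process the whole prompt in one pass and return a populated cache."""
    # In a real model this is one batched matrix multiply, not a Python loop.
    pairs = [project_kv(tok) for tok in prompt]
    return {"k": [k for k, _ in pairs], "v": [v for _, v in pairs]}

def decode_step(token, cache):
    """Add one token's K/V to the cache; return how many entries attention reads."""
    k, v = project_kv(token)
    cache["k"].append(k)
    cache["v"].append(v)
    return len(cache["k"])

prompt = ["The", "cache", "stores"]
cache = prefill(prompt)

context_lengths = []
for step in range(4):
    new_token = f"gen{step}"  # placeholder for sampling from the model's output
    context_lengths.append(decode_step(new_token, cache))
```

After the loop, `context_lengths` records how many cached entries each decode step had to read. The count grows by one per step, which is why long generations put increasing pressure on memory bandwidth.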

---

### Why KV Cache Improves Speed

#### Without KV cache:

To generate 100 tokens starting from an empty context, K and V vectors are computed over:

```
1 + 2 + 3 + ... + 100 = 5050 token-steps
```

#### With KV cache:

K and V vectors are computed over only:

```
100 new token-steps
```

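This reduction in repeated work is what makes inference faster. The actual speedup depends on model size, sequence length, and hardware, and it grows as sequences get longer.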
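The counts above can be reproduced with a few lines of arithmetic. The function names here are illustrative, and the sketch counts operations per attention head and layer. It also counts the query-key scores that attention still has to compute when a cache is used, since the cache removes repeated projections but not the attention over past tokens.

```python
def kv_projections_without_cache(n_new_tokens, prompt_len=0):
    # Step t re-projects K/V for every token in the context so far.
    return sum(prompt_len + t for t in range(1, n_new_tokens + 1))

def kv_projections_with_cache(n_new_tokens, prompt_len=0):
    # Each token's K/V is projected exactly once.
    return prompt_len + n_new_tokens

def attention_scores_with_cache(n_new_tokens, prompt_len=0):
    # The new query is still scored against every cached key at each step.
    return sum(prompt_len + t for t in range(1, n_new_tokens + 1))

without_cache = kv_projections_without_cache(100)
with_cache = kv_projections_with_cache(100)
scores = attention_scores_with_cache(100)
```

For 100 generated tokens and an empty prompt, `without_cache` is 5050 and `with_cache` is 100, matching the figures above. `scores` is also 5050: the cache makes each step cheaper, but the total attention work still grows quadratically with sequence length.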

---

### Memory Tradeoff

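The KV cache grows linearly with:

* Sequence length
* Batch size
* Model depth (layers)
* Number of key/value heads and the dimension of each head
* Bytes per stored element (numeric precision)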

It can consume several GBs of GPU memory for long sequences.

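Several techniques reduce this cost:

* **Multi-Query Attention (MQA)**: all query heads share a single K/V head, so far fewer vectors are cached.
* **Grouped-Query Attention (GQA)**: groups of query heads share a K/V head, a middle ground between standard multi-head attention and MQA.
* **PagedAttention (vLLM)**: a serving-side technique that stores the cache in fixed-size blocks to cut wasted and fragmented memory, rather than a change to the model itself.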
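A rough size estimate follows directly from the factors listed above: two tensors (K and V) per layer, each holding one vector of size `head_dim` per K/V head, per token, per sequence in the batch. The sketch below assumes a hypothetical configuration (32 layers, head dimension 128, 4096 tokens, batch size 1, 2 bytes per element as in fp16) chosen only for illustration. It is not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    # Factor of 2 accounts for storing both K and V.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

GIB = 1024 ** 3

# Hypothetical 32-layer model with 32 query heads of dimension 128.
mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
gqa = kv_cache_bytes(32, num_kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)
mqa = kv_cache_bytes(32, num_kv_heads=1, head_dim=128, seq_len=4096, batch_size=1)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / GIB:.4f} GiB")
```

Under these assumptions the cache takes 2 GiB with one K/V head per query head, 0.5 GiB with 8 K/V heads (GQA), and 0.0625 GiB with a single shared K/V head (MQA). Because the formula is linear in every factor, doubling the batch size or the sequence length doubles the memory.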

---

### Simple PyTorch-Style Pseudocode

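```python
import math
import torch

DIM = 64  # illustrative model width
# Illustrative single-head projections; a real model has one set per layer and head.
W_Q = torch.nn.Linear(DIM, DIM, bias=False)
W_K = torch.nn.Linear(DIM, DIM, bias=False)
W_V = torch.nn.Linear(DIM, DIM, bias=False)

class KVCache:
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        # k_new, v_new: (batch, 1, dim)
        if self.k is None:
            self.k = k_new
            self.v = v_new
        else:
            # Grow the cache along the sequence dimension.
            self.k = torch.cat([self.k, k_new], dim=1)
            self.v = torch.cat([self.v, v_new], dim=1)

@torch.no_grad()
def generate_step(x_t, cache):
    # x_t: (batch, dim), the hidden state of the newest token only.
    K_t = W_K(x_t).unsqueeze(1)
    V_t = W_V(x_t).unsqueeze(1)
    cache.append(K_t, V_t)

    Q_t = W_Q(x_t).unsqueeze(1)  # (batch, 1, dim)

    # Scaled dot-product scores against every cached key: (batch, 1, t)
    scores = Q_t @ cache.k.transpose(-2, -1) / math.sqrt(Q_t.size(-1))
    attn = torch.softmax(scores, dim=-1)
    out = attn @ cache.v  # (batch, 1, dim)
    return out, cache
```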

This snippet shows:

* Only K/V for the **new** token are computed
* Cached K/V supply all previous context

---

**One-Sentence Summary**

**The KV cache stores all past attention keys and values so the model never recomputes them, reducing the K/V projection work in autoregressive decoding from quadratic to linear in sequence length and enabling fast text generation.**