```{contents}
```
## Caching

### 1. Motivation

LLM inference is computationally expensive because **each generated token depends on all previous tokens** through self-attention.
Caching avoids recomputing redundant intermediate results, reducing:

* **Latency**
* **Compute cost**
* **Memory bandwidth**
* **Energy consumption**

Caching is therefore fundamental to practical LLM deployment.

---

### 2. What Is Cached in LLMs?

At each transformer layer, self-attention computes:

$$
Q = XW_Q,\quad K = XW_K,\quad V = XW_V
$$

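During autoregressive generation with causal masking, **the keys and values of past tokens never change**, because each position attends only to itself and earlier positions.
So we cache: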

| Cached Object               | Purpose                            |
| --------------------------- | ---------------------------------- |
| **Key vectors (K)**         | Attention scoring                  |
| **Value vectors (V)**       | Context aggregation                |
| **Optional:** hidden states | For partial recomputation          |
| **Prompt embeddings**       | For repeated system / user prompts |

The stored keys and values together are called the **KV cache**.
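To see why past keys and values can be reused, here is a minimal NumPy sketch. Random matrices stand in for one layer's input `X` and its projections `W_K` and `W_V`, and the sizes are arbitrary assumptions. It checks that appending a token leaves the earlier rows of K and V untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4                              # toy hidden size (assumption)
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

X_old = rng.normal(size=(3, d_model))    # hidden states of 3 past tokens
x_new = rng.normal(size=(1, d_model))    # hidden state of the new token
X_all = np.vstack([X_old, x_new])

# K = X W_K and V = X W_V are computed row by row, one row per token.
K_old, V_old = X_old @ W_K, X_old @ W_V
K_all, V_all = X_all @ W_K, X_all @ W_V

# The first 3 rows are identical, so they can be cached and reused.
past_keys_unchanged = np.allclose(K_all[:3], K_old)
past_values_unchanged = np.allclose(V_all[:3], V_old)
print(past_keys_unchanged, past_values_unchanged)
```

The sketch treats each token's input row as fixed once computed. In a deeper model this holds at every layer only because causal masking prevents later tokens from influencing earlier positions.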

---

### 3. Without vs With Cache

#### Without Cache

For each new token at position $t$, the attention cost scales as:

$$
O(t^2 \cdot L)
$$

All previous tokens are reprocessed at each of the $L$ layers.

#### With Cache

For each new token at position $t$, the attention cost scales as:

$$
O(t \cdot L)
$$

Only the **new token** is processed, while the past K and V are reused.
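The gap between the two regimes can be made concrete by counting query-key score computations. The sketch below uses a deliberately simplified cost model (an assumption, not a profiler measurement): at step $t$, one layer costs $t^2$ scores without a cache and $t$ scores with one. The token and layer counts are arbitrary.

```python
def attention_cost(n_tokens: int, n_layers: int, use_cache: bool) -> int:
    """Count query-key score computations needed to generate n_tokens.

    Simplified model: at step t (t tokens in context), one layer costs
    t * t scores without a cache and t scores with one.
    """
    total = 0
    for t in range(1, n_tokens + 1):
        per_layer = t if use_cache else t * t
        total += per_layer * n_layers
    return total


no_cache = attention_cost(n_tokens=1000, n_layers=32, use_cache=False)
with_cache = attention_cost(n_tokens=1000, n_layers=32, use_cache=True)
print(f"without cache: {no_cache:,} scores")
print(f"with cache:    {with_cache:,} scores")
print(f"ratio:         {no_cache / with_cache:.0f}x")
```

Summed over a whole generation of $n$ tokens, the totals grow as $O(n^3 \cdot L)$ without a cache and $O(n^2 \cdot L)$ with one, so the saving widens as sequences get longer.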

---

### 4. Autoregressive Generation with KV Cache

**Workflow**

1. Encode prompt tokens.
2. For each layer:

   * Compute K,V and store in cache.
3. For each new token:

   * Compute K,V only for that token.
   * Append to cache.
   * Attend against cached K,V.

**Visualization**

```
Past tokens: [ t1  t2  t3 ... t(n-1) ]
Cached K,V:  [ K1  K2  K3 ... K(n-1) ]

New token tn:
    compute Kn, Vn
    append to cache
    attention over [K1 ... Kn]
```
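The workflow above can be checked numerically. The following is a minimal single-layer, single-head NumPy sketch with random weights (all sizes and names are illustrative assumptions). It decodes token by token with a growing cache and compares the result against recomputing full causal attention from scratch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, W_Q, W_K, W_V):
    """Recompute attention for every position, with a causal mask."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V

def cached_decode(X, W_Q, W_K, W_V):
    """Process tokens one at a time, appending K and V to a cache."""
    K_cache = np.empty((0, W_K.shape[1]))
    V_cache = np.empty((0, W_V.shape[1]))
    outputs = []
    for x in X:                                   # x: hidden state of one token
        x = x[None, :]
        K_cache = np.vstack([K_cache, x @ W_K])   # compute Kn, append
        V_cache = np.vstack([V_cache, x @ W_V])   # compute Vn, append
        scores = (x @ W_Q) @ K_cache.T / np.sqrt(K_cache.shape[-1])
        outputs.append(softmax(scores) @ V_cache) # attend over [K1 ... Kn]
    return np.vstack(outputs)

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
outputs_match = np.allclose(full_causal_attention(X, W_Q, W_K, W_V),
                            cached_decode(X, W_Q, W_K, W_V))
print(outputs_match)
```

Both paths produce the same outputs. The cached path simply never recomputes a key or value it has already seen.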

---

### 5. Types of Caching in LLM Systems

| Type                | Scope              | Purpose                               |
| ------------------- | ------------------ | ------------------------------------- |
| **KV Cache**        | Model-internal     | Speed up generation                   |
| **Prompt Cache**    | Application-level  | Reuse repeated system/context prompts |
| **Embedding Cache** | Retrieval layer    | Avoid recomputing embeddings          |
| **Response Cache**  | API / app          | Reuse full outputs                    |
| **Prefix Cache**    | Multi-user servers | Share common prompt prefixes          |
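As one illustration of the application-level entries in this table, a response cache can be sketched as an exact-match lookup keyed on a hash of the full request. The class name, the `fake_generate` stand-in, and the parameter names are all hypothetical. A production cache would also need expiry and size limits.

```python
import hashlib
import json

class ResponseCache:
    """Exact-match cache keyed on a hash of the full request."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of dict ordering.
        payload = json.dumps({"model": model, "prompt": prompt, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, model, prompt, params, generate):
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = generate(prompt)
        self._store[key] = result
        return result


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return prompt.upper()

cache = ResponseCache()
params = {"temperature": 0.0, "max_tokens": 16}
first = cache.get_or_generate("demo-model", "hello", params, fake_generate)
second = cache.get_or_generate("demo-model", "hello", params, fake_generate)
print(first, second, cache.hits, cache.misses)
```

Including the sampling parameters in the key matters: reusing a stored output is only clearly safe when the request, including its decoding settings, is identical.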

---

### 6. Prompt Caching Example

If many requests start with the same system prompt:

```
"You are a financial assistant..."
```

The serving system can cache the KV state computed for this prefix and process only the user-specific suffix of each request.
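A toy sketch of this idea follows. Strings stand in for per-token KV entries, and `prefill`, `process_request`, and the token-counting dictionary are hypothetical names introduced only for illustration. A real server would store per-layer tensors instead.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in: a real system would store per-layer K and V tensors.
PrefixState = List[str]
calls = {"prefill_tokens": 0}

def prefill(tokens: List[str], state: Optional[PrefixState] = None) -> PrefixState:
    """Pretend to run the model over `tokens`, extending an existing state."""
    calls["prefill_tokens"] += len(tokens)
    return (state or []) + [f"kv({tok})" for tok in tokens]

prefix_cache: Dict[Tuple[str, ...], PrefixState] = {}

def process_request(system_tokens: List[str], user_tokens: List[str]) -> PrefixState:
    key = tuple(system_tokens)
    if key not in prefix_cache:
        prefix_cache[key] = prefill(system_tokens)   # computed once
    # Only the user-specific suffix is processed per request.
    return prefill(user_tokens, state=prefix_cache[key])

system = "You are a financial assistant".split()
process_request(system, ["What", "is", "APR?"])
process_request(system, ["Explain", "bonds"])
print(calls["prefill_tokens"])
```

Across the two requests, 10 tokens are processed instead of 15, because the five-token system prompt is prefilled only once.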

---

### 7. Code Demonstration (KV Cache)

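#### PyTorch-style Sketch

```python
import torch

def generate_next_token(x, cache, layers):
    # x: (batch, 1, d_model) hidden state of the newest token only.
    # layers: per-layer dicts of projection weights (toy single-head layout;
    # residuals, MLP blocks, and multi-head splitting are omitted).
    new_cache = []
    for l, w in enumerate(layers):
        # Project only the new token; past K and V come from the cache.
        Q, K, V = x @ w["W_Q"], x @ w["W_K"], x @ w["W_V"]

        if cache[l] is not None:
            # Append along the sequence dimension.
            K = torch.cat([cache[l][0], K], dim=1)
            V = torch.cat([cache[l][1], V], dim=1)

        # The new token's query attends over all cached positions.
        scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
        x = torch.softmax(scores, dim=-1) @ V
        new_cache.append((K, V))

    return x, new_cache

d_model, num_layers = 8, 2
layers = [{n: torch.randn(d_model, d_model) for n in ("W_Q", "W_K", "W_V")}
          for _ in range(num_layers)]
cache = [None] * num_layers
for _ in range(3):  # three decoding steps; the cache grows by one each time
    x, cache = generate_next_token(torch.randn(1, 1, d_model), cache, layers)
```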

---

### 8. Memory–Performance Trade-off

| Effect                                | Impact                             |
| ------------------------------------- | ---------------------------------- |
| Cache size grows with sequence length | Higher VRAM usage                  |
| Large batch + long context            | Cache becomes dominant memory cost |
| Quantized cache                       | Lower memory, slight quality loss  |
| Cache eviction                        | Enables long conversations         |
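The memory cost in this table can be estimated directly: the cache holds one K and one V tensor per layer, each with one vector per head per token. The sketch below uses a hypothetical model configuration chosen for round numbers, and it ignores any extra metadata a quantized cache would need (such as scale factors).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Size of the KV cache: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical model configuration, chosen for round numbers.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128)

fp16 = kv_cache_bytes(**config, seq_len=4096, batch_size=1, bytes_per_elem=2)
int8 = kv_cache_bytes(**config, seq_len=4096, batch_size=1, bytes_per_elem=1)
batched = kv_cache_bytes(**config, seq_len=4096, batch_size=16, bytes_per_elem=2)

for name, size in [("fp16, batch 1", fp16), ("int8, batch 1", int8), ("fp16, batch 16", batched)]:
    print(f"{name}: {size / 2**30:.1f} GiB")
```

Under these assumptions a single 4096-token sequence needs 2 GiB in fp16, halving to 1 GiB at 8 bits, while a batch of 16 such sequences needs 32 GiB. This is why the cache can dominate memory at large batch sizes and long contexts.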

---

### 9. Advanced Caching Strategies

| Technique                | Idea                                |
| ------------------------ | ----------------------------------- |
| **Paged KV Cache**       | GPU-friendly memory paging          |
| **Sliding Window Cache** | Keep only last N tokens             |
| **Speculative Cache**    | Cache draft model tokens            |
| **Shared Prefix Cache**  | Multi-tenant inference optimization |
| **Offloaded Cache**      | Move old cache to CPU / disk        |
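As an example of one row in this table, a sliding window cache can be sketched with a bounded `collections.deque`. The class name is hypothetical and strings stand in for real tensors. Actual implementations typically use preallocated ring buffers on the GPU rather than Python containers.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps (key, value) entries for only the most recent `window` tokens."""

    def __init__(self, window: int):
        # A deque with maxlen drops the oldest entry automatically on append.
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        self.entries.append((key, value))

    def keys(self):
        return [k for k, _ in self.entries]

    def values(self):
        return [v for _, v in self.entries]


cache = SlidingWindowKVCache(window=3)
for t in range(1, 6):
    cache.append(f"K{t}", f"V{t}")   # strings stand in for real tensors

print(cache.keys())   # ['K3', 'K4', 'K5']
```

Memory stays constant regardless of conversation length, at the cost of the model no longer attending to tokens that have fallen out of the window.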

---

### 10. Why Caching Is Fundamental

Without caching:

* Real-time chat becomes impractical as contexts grow
* Per-token cost grows quadratically with context length
* Long-context models become unusable

With caching:

* Per-token cost grows only linearly with context length
* Long conversations are feasible
* LLMs become economically deployable

---

### Summary Table

| Concept          | Role                                         |
| ---------------- | -------------------------------------------- |
| KV Cache         | Core acceleration of autoregressive decoding |
| Prompt Cache     | Avoid repeated prefix computation            |
| Embedding Cache  | Accelerates retrieval pipelines              |
| Prefix Cache     | Shares common prefixes in multi-user services |
| Cache Management | Controls memory & performance                |

