
## Prompt Caching 


**Prompt caching** is a technique where **responses to previously seen prompts are stored**, so that when the same (or a sufficiently similar) prompt arrives again, the system returns the stored result instead of calling the LLM.

This improves:

* **Latency**
* **Cost efficiency**
* **System throughput**

---

### Where Prompt Caching Fits

```
User Query
   ↓
Cache Lookup ── hit → return cached response
   ↓ miss
Prompt → LLM → Response → Store in Cache
```

---

### Why Prompt Caching Matters

| Benefit           | Impact                 |
| ----------------- | ---------------------- |
| Lower latency     | Faster responses       |
| Reduced cost      | Fewer LLM calls        |
| Higher throughput | Handles more users     |
| Stability         | Protects during spikes |

---

### Types of Prompt Caching

| Type              | Description                                           |
| ----------------- | ----------------------------------------------------- |
| Exact Match       | An identical prompt returns the stored answer         |
| Semantic Cache    | A sufficiently similar prompt reuses a stored answer  |
| Partial Cache     | Cache parts of the pipeline (e.g., retrieved context) |
| Multi-layer Cache | Cache at the retrieval, prompt, and response layers   |

---

### Demonstration

---

#### A. Simple Exact-Match Cache

```python
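# Assumes `llm` is an already-configured chat model client whose
# `invoke()` returns an object with a `.content` string.
cache = {}

def get_answer(prompt):
    # Cache hit: return the stored response without calling the LLM.
    if prompt in cache:
        return cache[prompt]

    # Cache miss: call the LLM, then store the response for next time.
    response = llm.invoke(prompt).content
    cache[prompt] = response
    return response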
```

---

#### B. Semantic Prompt Cache (Embedding-Based)

```python
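from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (prompt, embedding, answer) tuples
SIMILARITY_THRESHOLD = 0.9  # tune per application

def embed(prompt):
    # Unit-length vectors make the dot product equal to cosine similarity.
    return model.encode(prompt, normalize_embeddings=True)

def semantic_cache(prompt):
    """Return a cached answer for a similar prompt, or None on a miss."""
    emb = embed(prompt)
    for _, cached_emb, answer in cache:
        if np.dot(emb, cached_emb) > SIMILARITY_THRESHOLD:
            return answer
    return None

def add_to_cache(prompt, answer):
    # Store the entry so later lookups can match against it.
    cache.append((prompt, embed(prompt), answer))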
```

---

#### C. Cache Invalidation (Important)

```python
def invalidate_cache(key):
    # Remove one entry from a dict-style cache; no error if the key is absent.
    cache.pop(key, None)
```

Invalidate on:

* Prompt template changes
* Model upgrades
* Knowledge base updates
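
Removing keys one at a time does not scale when a model upgrade or knowledge base update makes every entry stale. One common approach, sketched below, is to scope each key to a version tag so that bumping the tag invalidates everything at once. The `VersionedCache` class and the `"kb-v1"`-style tags are hypothetical names used only for illustration.

```python
class VersionedCache:
    """Dict-backed cache whose keys are scoped to a version tag (illustrative)."""

    def __init__(self, version):
        self.version = version
        self._store = {}

    def _key(self, prompt):
        # The version tag is part of the key, so entries written under an
        # older tag can never be returned by a lookup under the current one.
        return (self.version, prompt)

    def get(self, prompt):
        return self._store.get(self._key(prompt))

    def set(self, prompt, response):
        self._store[self._key(prompt)] = response

    def bump_version(self, new_version):
        # Switch to the new tag and drop entries written under other tags.
        self.version = new_version
        self._store = {k: v for k, v in self._store.items() if k[0] == new_version}


versioned = VersionedCache("kb-v1")
versioned.set("What is RAG?", "answer from kb-v1")
versioned.bump_version("kb-v2")  # e.g., after a knowledge base update
stale = versioned.get("What is RAG?")  # None: the old answer is no longer served
```

The same idea applies to the model name and the prompt template version: anything that changes what the LLM would answer can be folded into the tag.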

---

#### D. TTL-Based Cache

```python
from cachetools import TTLCache

# Holds at most 1000 entries; each entry expires 3600 seconds after insertion.
cache = TTLCache(maxsize=1000, ttl=3600)
```
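
To make the expiry behavior concrete, here is a minimal standard-library sketch of what a TTL cache does. It is not how `cachetools` is implemented, and it omits the size limit; `SimpleTTLCache` and its injectable `clock` parameter are hypothetical, added so the expiry can be demonstrated without waiting.

```python
import time


class SimpleTTLCache:
    """Illustrative TTL cache: entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Expired: evict lazily on read and report a miss.
            del self._store[key]
            return default
        return value


# Usage with a fake clock so the expiry is visible immediately.
fake_now = [0.0]
ttl_cache = SimpleTTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
ttl_cache.set("prompt", "response")
fresh = ttl_cache.get("prompt")    # still within the TTL
fake_now[0] = 3600.0
expired = ttl_cache.get("prompt")  # past the TTL, treated as a miss
```

A TTL bounds how stale an answer can get even when no explicit invalidation fires, which is why it pairs well with the invalidation triggers above.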

---

### What Should Be Cached?

| Layer            | Cache Candidate     |
| ---------------- | ------------------- |
| User queries     | Full responses      |
| RAG retrieval    | Retrieved documents |
| Prompt templates | Compiled prompts    |
| Embeddings       | Query embeddings    |
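
As an example of caching a single layer from the table, the sketch below memoizes query embeddings with `functools.lru_cache`. The `_expensive_embed` function is a hypothetical stand-in for a real embedding model call; it returns a tuple so the cached value cannot be mutated by callers.

```python
from functools import lru_cache

embed_calls = {"count": 0}  # tracks how often the "model" is actually invoked


def _expensive_embed(text):
    # Hypothetical placeholder for a real embedding model call.
    embed_calls["count"] += 1
    return tuple(float(ord(ch)) for ch in text[:8])


@lru_cache(maxsize=10_000)
def embed_query(text):
    # Repeated queries are served from memory instead of re-running the model.
    return _expensive_embed(text)


first = embed_query("what is prompt caching")
second = embed_query("what is prompt caching")  # served from the cache
```

Caching at this layer still helps when the full response cannot be reused, for instance when the retrieved documents have changed but the query text has not.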

---

### Production Best Practices

* Hash normalized prompts
* Include model + prompt version in cache key
* Use Redis for distributed cache
* Apply TTL and eviction policies
* Log cache hit ratio
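
A few of these practices can be sketched with the standard library alone. The normalization rule (collapse whitespace, lowercase) is an assumption that only suits applications where casing does not change the expected answer, and `cache_key`, `CacheStats`, and the model name in the usage lines are hypothetical.

```python
import hashlib
import re


def normalize(prompt):
    # Assumption: whitespace and casing differences should map to the same entry.
    return re.sub(r"\s+", " ", prompt).strip().lower()


def cache_key(prompt, model, prompt_version):
    # Model and prompt version are part of the key, so changing either
    # automatically misses entries written under the old configuration.
    payload = "\x1f".join([model, prompt_version, normalize(prompt)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CacheStats:
    """Counts hits and misses so the hit ratio can be logged."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


key_a = cache_key("  What is   RAG? ", "example-model", "v1")
key_b = cache_key("what is rag?", "example-model", "v1")
key_c = cache_key("what is rag?", "example-model", "v2")
# key_a and key_b match after normalization; key_c differs because the version changed.
```

The resulting hex digest is a fixed-length string, which makes it convenient as a key in a shared store such as Redis.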

---

### Mental Model

```
Prompt Caching = Memory for your LLM system
```

---

### Key Takeaways

* Prompt caching is one of the highest-ROI optimizations for LLM applications
* A cache hit skips the LLM call entirely, cutting both cost and latency for repeated prompts
* Must be version-aware and invalidation-safe
* Essential for scalable LLM applications
