```{contents}
```
## Embeddings Cache 


An **embeddings cache** stores the **vector representation** of previously processed text so the system does **not recompute embeddings** for the same content.

This avoids repeated model calls, which cuts both latency and cost for retrieval, search, and similarity operations.

---

### Where It Fits in the Pipeline

```
Text Input
   ↓
Embedding Cache Lookup ── hit → Return Vector
   ↓ miss
Embedding Model → Vector → Store in Cache → Return
```

**Demonstration**

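On a hit, the cached vector is returned immediately. Only a miss calls the embedding model, and its result is stored so the next request for the same text is a hit.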

---

### Why Embeddings Caching Matters

| Benefit            | Impact                           |
| ------------------ | -------------------------------- |
| Lower latency      | Faster retrieval & search        |
| Lower cost         | Fewer embedding model calls      |
| Higher throughput  | More concurrent users            |
| Stable performance | Protection during traffic spikes |

---

### Simple Exact-Match Embedding Cache

#### Demonstration (Python)

```python
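# `embedding_model` is assumed to be defined elsewhere and to expose
# an `encode(text)` method (for example, a sentence-transformers model).
embedding_cache = {}

def get_embedding(text: str):
    # Normalize so trivially different inputs share one entry.
    # Lowercasing is only safe if the model is case-insensitive.
    key = text.strip().lower()

    if key in embedding_cache:
        print("Embedding Cache Hit")
        return embedding_cache[key]

    print("Embedding Cache Miss: Computing Embedding")
    # Encode the normalized text so the cached vector matches its key.
    vector = embedding_model.encode(key)
    embedding_cache[key] = vector
    return vector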
```

---

### Persistent Cache Using Disk

#### Demonstration

```python
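import hashlib
import os
import pickle

# `embedding_model` is assumed to be defined elsewhere with an `encode(text)` method.
CACHE_FILE = "embeddings.cache"

# Load the cache from a previous run. Only unpickle files you wrote yourself:
# pickle can execute arbitrary code when loading untrusted data.
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as f:
        embedding_cache = pickle.load(f)
else:
    embedding_cache = {}

def cache_key(text):
    # A fixed-length SHA-256 digest keeps keys small and stable across runs.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_embedding(text):
    key = cache_key(text)

    if key in embedding_cache:
        print("Disk Cache Hit")
        return embedding_cache[key]

    vector = embedding_model.encode(text)
    embedding_cache[key] = vector

    # Rewrites the whole file on every miss: simple, but slow for large caches.
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(embedding_cache, f)

    return vector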
```
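
The example above rewrites the entire pickle file on every miss, which gets slower as the cache grows. One alternative is sketched below, assuming SQLite (through Python's standard `sqlite3` module) as the store and JSON as the vector encoding. `SqliteEmbeddingCache`, the `embeddings` table, and `embed_fn` are hypothetical names chosen for illustration, not part of any library.

```python
import hashlib
import json
import sqlite3

class SqliteEmbeddingCache:
    """Persistent cache that writes one row per miss instead of the whole file."""

    def __init__(self, path, embed_fn):
        self.embed_fn = embed_fn  # hypothetical callable: str -> sequence of floats
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vector TEXT)"
        )

    @staticmethod
    def cache_key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_embedding(self, text):
        key = self.cache_key(text)
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # hit: decode the stored vector

        vector = [float(x) for x in self.embed_fn(text)]
        # Miss: insert only the new row and commit it.
        self.conn.execute(
            "INSERT OR REPLACE INTO embeddings (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self.conn.commit()
        return vector
```

Passing `":memory:"` as `path` gives a throwaway database for tests, while a file path keeps entries across restarts. JSON text is larger than a binary encoding, so a packed float format may be a better fit for large caches.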

---

### TTL-Based Embeddings Cache

#### Demonstration

```python
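from cachetools import TTLCache

# `embedding_model` is assumed to be defined elsewhere with an `encode(text)` method.
# Holds up to 50,000 entries; each expires 24 hours (86,400 s) after insertion.
embedding_cache = TTLCache(maxsize=50000, ttl=86400)

def get_embedding(text):
    # A single `get` avoids the entry expiring between a membership test and the read.
    vector = embedding_cache.get(text)
    if vector is not None:
        return vector

    vector = embedding_model.encode(text)
    embedding_cache[text] = vector
    return vector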
```

---

### Distributed Cache (Production Pattern)

```
Application → Redis → Embedding Model
```

#### Demonstration (Conceptual)

```python
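import hashlib

# Conceptual: `redis` is assumed to be a connected client, and `serialize` /
# `deserialize` are placeholder helpers that convert a vector to and from bytes.
def get_embedding(text):
    # Built-in hash() is randomized per process for strings, so use a
    # stable digest that every application instance computes identically.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()

    cached = redis.get(key)
    if cached is not None:
        return deserialize(cached)

    vector = embedding_model.encode(text)
    redis.set(key, serialize(vector), ex=86400)  # expire after 24 hours
    return vector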
```
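
The `serialize` and `deserialize` helpers above are left undefined. One possible pair is sketched here, assuming vectors are flat sequences of floats and that 32-bit precision is acceptable. It uses only the standard library's `array` module; the implementation is an illustrative assumption, not a prescribed format.

```python
from array import array

def serialize(vector):
    """Pack a sequence of floats into bytes as 32-bit floats."""
    return array("f", vector).tobytes()

def deserialize(data):
    """Unpack bytes produced by serialize() back into a list of floats."""
    values = array("f")
    values.frombytes(data)
    return values.tolist()
```

The bytes use the machine's native byte order, so this sketch assumes every reader and writer runs on the same architecture. Storing 32-bit values halves the payload relative to 64-bit floats at the cost of some precision.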

---

### Invalidation Strategy

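Invalidate cached vectors whenever something that shapes the embedding changes, because old and new vectors are no longer comparable: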

* Embedding model changes
* Tokenization rules change
* Text normalization rules change
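
One way to make these invalidations automatic is to fold the relevant factors into the cache key, so a change produces new keys and stale entries are never read again. A minimal sketch follows; `MODEL_NAME`, `NORMALIZATION_VERSION`, `normalize`, and `versioned_cache_key` are hypothetical names and values, and tokenizer changes are assumed to ship under a new model identifier.

```python
import hashlib

MODEL_NAME = "example-embedding-model-v1"  # hypothetical model identifier
NORMALIZATION_VERSION = "norm-1"           # bump when normalization rules change

def normalize(text):
    # Illustrative normalization: trim and collapse runs of whitespace.
    return " ".join(text.split())

def versioned_cache_key(text, model_name=MODEL_NAME, norm_version=NORMALIZATION_VERSION):
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    # The prefix namespaces entries by model and normalization version.
    return f"emb:{model_name}:{norm_version}:{digest}"
```

Entries written under an old prefix still occupy space until a TTL evicts them or they are deleted, and the shared prefix makes them easy to find for bulk removal.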

---

### What Should Be Cached

| Layer      | Cached Data         |
| ---------- | ------------------- |
| Text input | Embedding vector    |
| Chunks     | Chunk embeddings    |
| Queries    | Query embeddings    |
| Documents  | Document embeddings |
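
Chunks and documents usually arrive in batches, and many embedding models accept a list of texts per call. The sketch below sends only the uncached texts to the model. It assumes a hypothetical `embed_batch` callable that maps a list of strings to a list of vectors in the same order, and a dict-like `cache` keyed by the raw text.

```python
def get_embeddings(texts, cache, embed_batch):
    """Return one vector per input text, calling embed_batch only for misses."""
    # Deduplicate misses while preserving first-seen order.
    misses = list(dict.fromkeys(t for t in texts if t not in cache))

    if misses:
        for text, vector in zip(misses, embed_batch(misses)):
            cache[text] = vector

    return [cache[t] for t in texts]
```

Duplicate chunks within one batch are embedded once, which helps with documents that repeat headers, footers, or boilerplate.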

---

### Mental Model

```
Embeddings Cache = Memory of all meaning your system has already computed
```

---

### Key Takeaways

* Embeddings are expensive to compute, so caching them pays off quickly at scale
* Combine exact-match cache + TTL + persistent storage
* Include model version in the cache key
* Often one of the highest-ROI optimizations in a RAG system
