# 📓 The GenAI Revolution Cookbook

**Title:** Semantic Cache LLM: How to Implement with Redis Vector to Cut Costs

**Description:** Build a semantic cache LLM using embeddings and Redis Vector with TTLs, thresholds, metrics to reduce LLM spend and latency.

**📖 Read the full article:** [Semantic Cache LLM: How to Implement with Redis Vector to Cut Costs](https://blog.thegenairevolution.com/article/semantic-cache-llm-how-to-implement-with-redis-vector-to-cut-costs-2)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



Here's the thing about LLM applications: most of them burn money and time answering what is essentially the same question, just phrased a bit differently. A semantic cache fixes this by recognizing when a new query means the same thing as one you've already answered, then returning the cached response without calling the LLM.

I'm going to walk you through building a production-grade semantic cache using embeddings and Redis Vector. We'll create a FastAPI microservice with a Redis-backed semantic cache, complete with thresholds, TTLs, and metrics. By the time we're done, you'll have working code, a tunable architecture, and a clear path to cutting latency and cost on repeated queries.

**What we're building:**
<ul><li>A Redis HNSW vector index for semantic similarity search
</li><li>A cache layer that normalizes queries, generates embeddings, and retrieves cached responses
</li><li>A FastAPI endpoint to serve cached or fresh LLM answers
</li><li>A demo script to validate cache hit rates and latency improvements
</li></ul>

**What you'll need:**
<ul><li>Python 3.9+
</li><li>Redis Stack (either local via Docker or managed Redis Cloud)
</li><li>OpenAI API key
</li><li>Basic familiarity with embeddings and vector search
</li></ul>

Quick note: if you're using Google Colab or a cloud notebook, connect to a managed Redis Stack instance (like Redis Cloud) instead of trying to run Docker locally. It'll save you some headaches.

For a deeper understanding of how LLMs manage memory and the concept of context rot, check out our article on <a target="_blank" rel="noopener noreferrer nofollow" href="/article/context-rot-why-llms-forget-as-their-memory-grows">why LLMs "forget" as their memory grows</a>.
<hr>

## How It Works (High-Level Overview)

**The paraphrase problem:** Users ask the same question in countless ways. "What's your refund policy?" and "Can I get my money back?" mean the same thing, but traditional caching treats them as completely different keys. It's frustrating.

**The embedding advantage:** Embeddings map text into a high-dimensional vector space where semantically similar phrases cluster together. By comparing query embeddings using cosine similarity, you can detect paraphrases and return cached responses. Pretty elegant, actually.

**Why Redis Vector:** Redis Stack provides HNSW (Hierarchical Navigable Small World) indexing for fast approximate nearest neighbor search. What I really like about it: it combines low-latency vector search with Redis's native TTL, tagging, and filtering capabilities. Perfect for production caching.
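
To make the embedding comparison concrete, here is a minimal sketch of the distance a `COSINE` index reports, which is 1 minus the cosine similarity. The three-dimensional vectors are made-up toy values (real embeddings from `text-embedding-3-small` have 1536 dimensions), and the variable names are only illustrative.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity (0 means same direction, 1 means orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by a real model)
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
shipping_time = [0.0, 0.2, 0.9]

near = cosine_distance(refund_policy, money_back)
far = cosine_distance(refund_policy, shipping_time)
print(f"paraphrase distance: {near:.3f}")
print(f"unrelated distance:  {far:.3f}")
```

A lower distance means a closer match, which is why the cache later accepts a hit only when the distance falls below the threshold.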
**The architecture works like this:**
<ol><li>Normalize the user query (lowercase, strip out volatile patterns like timestamps)
</li><li>Generate an embedding for the normalized query
</li><li>Search the Redis HNSW index for the nearest cached embedding
</li><li>If distance < threshold and metadata matches (model, temperature, system prompt hash), return the cached response
</li><li>Otherwise, call the LLM, cache the new response with its embedding, and return it
</li></ol>

<hr>

## Setup & Installation

### Option 1: Managed Redis (Recommended for Notebooks)

Sign up for a free Redis Cloud account at <a target="_blank" rel="noopener noreferrer nofollow" href="https://redis.com/try-free">redis.com/try-free</a> and create a Redis Stack database. Grab the connection URL once it's ready.

In your notebook or terminal:

In [None]:
%pip install redis openai python-dotenv numpy fastapi uvicorn

Set your environment variables:

In [None]:
import os
os.environ["REDIS_URL"] = "redis://default:password@your-redis-host:port"
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["EMBEDDING_MODEL"] = "text-embedding-3-small"
os.environ["CHAT_MODEL"] = "gpt-4o-mini"
os.environ["SIMILARITY_THRESHOLD"] = "0.10"
os.environ["TOP_K"] = "5"
os.environ["CACHE_TTL_SECONDS"] = "86400"
os.environ["CACHE_NAMESPACE"] = "sc:v1:"
os.environ["CORPUS_VERSION"] = "v1"
os.environ["TEMPERATURE"] = "0.2"

### Option 2: Local Redis with Docker

In [None]:
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

Create a `.env` file:
<pre><code>REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...
EMBEDDING_MODEL=text-embedding-3-small
CHAT_MODEL=gpt-4o-mini
SIMILARITY_THRESHOLD=0.10
TOP_K=5
CACHE_TTL_SECONDS=86400
CACHE_NAMESPACE=sc:v1:
CORPUS_VERSION=v1
TEMPERATURE=0.2
</code></pre>

Install dependencies:

In [None]:
pip install redis openai python-dotenv numpy fastapi uvicorn

<hr>

## Step-by-Step Implementation

### Step 1: Create the Redis HNSW Index

The index stores embeddings and metadata for cached responses. We're using HNSW for approximate nearest neighbor search: it trades a little recall for much lower query latency than an exact scan.

In [None]:
import os
import redis
from dotenv import load_dotenv

load_dotenv()

r = redis.Redis.from_url(os.getenv("REDIS_URL"))

INDEX = "sc_idx"
PREFIX = os.getenv("CACHE_NAMESPACE", "sc:v1:")
DIM = 1536  # Dimension for text-embedding-3-small
M = 16  # HNSW graph connectivity
EF_CONSTRUCTION = 200  # HNSW construction quality

def create_index():
    # FT.INFO raises ResponseError when the index does not exist yet
    try:
        r.execute_command("FT.INFO", INDEX)
        print("Index already exists.")
        return
    except redis.ResponseError:
        pass

    # Create index with vector field and metadata tags
    cmd = [
        "FT.CREATE", INDEX,
        "ON", "HASH",
        "PREFIX", "1", PREFIX,
        "SCHEMA",
        "prompt_hash", "TAG",
        "model", "TAG",
        "sys_hash", "TAG",
        "corpus_version", "TAG",
        "temperature", "NUMERIC",
        "created_at", "NUMERIC",
        "last_hit_at", "NUMERIC",
        "response", "TEXT",
        "vector", "VECTOR", "HNSW", "10",  # 5 pairs = 10 args
        "TYPE", "FLOAT32",
        "DIM", str(DIM),
        "DISTANCE_METRIC", "COSINE",
        "M", str(M),
        "EF_CONSTRUCTION", str(EF_CONSTRUCTION),
    ]
    r.execute_command(*cmd)
    print("Index created.")

create_index()

**Quick validation:**

In [None]:
info = r.execute_command("FT.INFO", INDEX)
print("Index info:", info)

You should see `num_docs: 0` initially. Good to go.
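
The raw `FT.INFO` reply is hard to read because, under the default protocol, it comes back as one flat list of alternating names and values. A small helper like the one below can turn it into a dictionary. This is a sketch: `pairs_to_dict` is a hypothetical helper, and `sample_reply` is a hand-written, heavily truncated stand-in for a real reply, whose exact fields and value types depend on your Redis version.

```python
def pairs_to_dict(reply):
    """Turn a flat [name, value, name, value, ...] reply into a dict, decoding bytes."""
    def text(x):
        return x.decode() if isinstance(x, bytes) else x
    return {text(reply[i]): text(reply[i + 1]) for i in range(0, len(reply) - 1, 2)}

# Hypothetical, truncated FT.INFO reply for a freshly created index
sample_reply = [b"index_name", b"sc_idx", b"num_docs", b"0", b"max_doc_id", b"0"]

info_dict = pairs_to_dict(sample_reply)
print("num_docs:", info_dict["num_docs"])
```

In the notebook you would pass the real `info` list from the cell above instead of `sample_reply`.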
<hr>

### Step 2: Normalize Queries for Stable Cache Keys

Canonicalization matters here. We remove volatile elements (timestamps, UUIDs, long numeric IDs) and normalize case and whitespace, so queries that differ only in those details produce the same normalized text, embedding, and cache key. Actual paraphrases are handled later by the embedding search, not by this step. I learned this the hard way in a previous project where timestamps in user queries were killing our cache hit rate.

In [None]:
import re
import hashlib

VOLATILE_PATTERNS = [
    # ISO dates, with an optional time and timezone suffix
    r"\b\d{4}-\d{2}-\d{2}((T|\s)\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?)?\b",
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b",  # UUID v4
    r"\b\d{6,}\b",  # Long IDs
]

def canonicalize(text: str) -> str:
    t = text.strip().lower()
    for pat in VOLATILE_PATTERNS:
        # IGNORECASE because the text is already lowercased ("T"/"Z" become "t"/"z")
        t = re.sub(pat, " ", t, flags=re.IGNORECASE)
    t = re.sub(r"\s+", " ", t).strip()
    return t

def sha256(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def scope_hash(prompt_norm: str, model: str, sys_hash: str, temperature: float, corpus_version: str) -> str:
    # Unique hash for cache scope including all parameters
    payload = f"{prompt_norm}|{model}|{sys_hash}|{temperature}|{corpus_version}"
    return sha256(payload)

**Let's test it:**

In [None]:
q1 = "What is our refund policy on 2025-01-15?"
q2 = "what is our refund policy on 2025-01-20?"
print(canonicalize(q1))
print(canonicalize(q2))
# Both should print: what is our refund policy on ?
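
To see why the scope hash folds in the model, system prompt hash, temperature, and corpus version, here is a small standalone sketch. `scope_key` is a hypothetical helper that mirrors `scope_hash` plus the namespace prefix, and the inputs (`"abc123"`, `"gpt-4o"`, `"v2"`) are placeholder values.

```python
import hashlib

def scope_key(prompt_norm, model, sys_hash, temperature, corpus_version, namespace="sc:v1:"):
    """Build a namespaced key from everything that can change the answer."""
    payload = f"{prompt_norm}|{model}|{sys_hash}|{temperature}|{corpus_version}"
    return namespace + hashlib.sha256(payload.encode("utf-8")).hexdigest()

prompt = "what is our refund policy"
base = scope_key(prompt, "gpt-4o-mini", "abc123", 0.2, "v1")
same = scope_key(prompt, "gpt-4o-mini", "abc123", 0.2, "v1")
other_model = scope_key(prompt, "gpt-4o", "abc123", 0.2, "v1")
other_corpus = scope_key(prompt, "gpt-4o-mini", "abc123", 0.2, "v2")

print("same inputs, same key:   ", base == same)
print("different model, new key:", base != other_model)
print("new corpus, new key:     ", base != other_corpus)
```

Identical inputs always overwrite the same Redis hash, while changing any one setting produces a separate entry, so answers generated under different settings never collide.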

<hr>

### Step 3: Initialize Clients and Embedding Function

In [None]:
import numpy as np
from openai import OpenAI

client = OpenAI()

EMBED_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
THRESH = float(os.getenv("SIMILARITY_THRESHOLD", 0.10))
TOP_K = int(os.getenv("TOP_K", 5))
TTL = int(os.getenv("CACHE_TTL_SECONDS", 86400))
NS = os.getenv("CACHE_NAMESPACE", "sc:v1:")
CORPUS_VERSION = os.getenv("CORPUS_VERSION", "v1")
TEMPERATURE = float(os.getenv("TEMPERATURE", 0.2))

def embed(text: str) -> np.ndarray:
    # Generate embedding and normalize for cosine distance
    e = client.embeddings.create(model=EMBED_MODEL, input=text)
    vec = np.array(e.data[0].embedding, dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / max(norm, 1e-12)

def to_bytes(vec: np.ndarray) -> bytes:
    # Redis expects the vector as raw FLOAT32 bytes
    return vec.astype(np.float32).tobytes()

**Quick test:**

In [None]:
test_vec = embed("hello world")
print(f"Embedding shape: {test_vec.shape}, norm: {np.linalg.norm(test_vec):.4f}")
# Should output shape (1536,) and norm ~1.0
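
If the byte conversion feels opaque, this standard-library sketch shows what `to_bytes` produces: each dimension becomes one 4-byte single-precision float, so a 1536-dimension embedding occupies 6144 bytes. The helper names here (`l2_normalize`, `pack_float32`, `unpack_float32`) are illustrative and not part of the notebook's code.

```python
import math
from array import array

def l2_normalize(vec):
    """Scale a vector to unit length, guarding against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]

def pack_float32(vec):
    """Pack floats as 4-byte single-precision values."""
    return array("f", vec).tobytes()

def unpack_float32(blob):
    """Inverse of pack_float32."""
    out = array("f")
    out.frombytes(blob)
    return list(out)

vec = l2_normalize([3.0, 4.0, 0.0])
blob = pack_float32(vec)
print(len(blob))             # 12: 4 bytes per dimension
print(unpack_float32(blob))  # approximately [0.6, 0.8, 0.0]
```

The round trip is only approximate because float32 keeps about seven significant digits, which is more than enough for similarity search.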

<hr>

### Step 4: Implement Vector Search

In [None]:
import time
from typing import Optional, Dict, Any, Tuple

def vector_search(query_vec, ef_runtime: int = 100, threshold: float = THRESH) -> Optional[Tuple[str, Dict[str, Any], float]]:
    # Perform KNN search with EF_RUNTIME parameter.
    # The threshold is not applied here; the caller compares the returned distance against it.
    params = ["vec", to_bytes(query_vec), "ef_runtime", ef_runtime]
    q = f"*=>[KNN {TOP_K} @vector $vec EF_RUNTIME $ef_runtime AS score]"
    try:
        res = r.execute_command(
            "FT.SEARCH", INDEX,
            q, "PARAMS", str(len(params)), *params,
            "SORTBY", "score", "ASC",
            "RETURN", "7", "response", "model", "sys_hash", "corpus_version", "temperature", "prompt_hash", "score",
            "DIALECT", "2"
        )
    except redis.RedisError:
        return None

    total = res[0] if res else 0
    if total < 1:
        return None

    # Results are sorted by ascending distance, so the first document is the nearest
    doc_id = res[1]
    fields = res[2]
    f = {fields[i].decode() if isinstance(fields[i], bytes) else fields[i]:
         fields[i+1].decode() if isinstance(fields[i+1], bytes) else fields[i+1]
         for i in range(0, len(fields), 2)}

    try:
        distance = float(f["score"])
    except (KeyError, ValueError):
        distance = 1.0

    return doc_id.decode() if isinstance(doc_id, bytes) else doc_id, f, distance

<hr>

### Step 5: Build the Cache Layer

Now we're getting to the meat of it. This is where everything comes together.
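
Before the Redis-backed version, it may help to see the lookup-then-generate flow in isolation. The sketch below swaps Redis for a brute-force in-memory list and the LLM for a stub callable, so every name in it (`InMemorySemanticCache`, `answer_with_cache`, the toy vectors) is hypothetical. It illustrates the control flow only, not the notebook's implementation.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class InMemorySemanticCache:
    """Brute-force stand-in for the vector index: a list of (vector, scope, response)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []

    def lookup(self, vec, scope):
        # Find the nearest entry that was produced under the same settings
        best = None
        for cached_vec, cached_scope, response in self.entries:
            if cached_scope != scope:
                continue
            d = cosine_distance(vec, cached_vec)
            if best is None or d < best[0]:
                best = (d, response)
        if best is not None and best[0] < self.threshold:
            return best
        return None

    def store(self, vec, scope, response):
        self.entries.append((vec, scope, response))

def answer_with_cache(cache, vec, scope, generate):
    hit = cache.lookup(vec, scope)
    if hit is not None:
        return {"source": "cache", "response": hit[1], "distance": hit[0]}
    response = generate()  # stands in for the LLM call
    cache.store(vec, scope, response)
    return {"source": "llm", "response": response, "distance": None}

cache = InMemorySemanticCache(threshold=0.10)
scope = ("gpt-4o-mini", 0.2, "v1")
first = answer_with_cache(cache, [0.9, 0.1, 0.0], scope, lambda: "stub answer A")
second = answer_with_cache(cache, [0.8, 0.2, 0.1], scope, lambda: "never called")
third = answer_with_cache(cache, [0.8, 0.2, 0.1], ("gpt-4o", 0.2, "v1"), lambda: "stub answer B")
print(first["source"], second["source"], third["source"])  # llm cache llm
```

The second query reuses the first answer because its vector is close enough, while the third misses because the scope (a different model) does not match even though the vector is identical.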

In [None]:
def sys_hash(system_prompt: str) -> str:
    return sha256(system_prompt.strip())

def key(doc_id_hash: str) -> str:
    return f"{NS}{doc_id_hash}"

def metadata_matches(f: Dict[str, Any], model: str, sys_h: str, temp: float, corpus: str) -> bool:
    # A cached answer is reusable only if it was generated under the same settings
    try:
        if f.get("model") != model: return False
        if f.get("sys_hash") != sys_h: return False
        if abs(float(f.get("temperature", temp)) - temp) > 1e-6: return False
        if f.get("corpus_version") != corpus: return False
        return True
    except (TypeError, ValueError):
        return False

def chat_call(system_prompt: str, user_prompt: str):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        temperature=TEMPERATURE,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    latency_ms = (time.perf_counter() - t0) * 1000
    content = resp.choices[0].message.content
    usage = getattr(resp, "usage", None)
    return content, latency_ms, usage

def cache_get_or_generate(system_prompt: str, user_prompt: str, ef_runtime: int = 100, threshold: float = THRESH):
    t0 = time.perf_counter()
    sp_hash = sys_hash(system_prompt)
    prompt_norm = canonicalize(user_prompt)
    p_hash = sha256(prompt_norm)

    qvec = embed(prompt_norm)
    res = vector_search(qvec, ef_runtime=ef_runtime, threshold=threshold)
    if res:
        doc_id, fields, distance = res
        # Cache hit: close enough and generated under the same settings
        if distance < threshold and metadata_matches(fields, CHAT_MODEL, sp_hash, TEMPERATURE, CORPUS_VERSION):
            try:
                r.hset(doc_id, mapping={"last_hit_at": time.time()})
            except redis.RedisError:
                pass
            return {
                "source": "cache",
                "response": fields["response"],
                "distance": distance,
                "latency_ms": (time.perf_counter() - t0) * 1000,
            }

    # Cache miss: call the LLM, then store the answer with its embedding
    content, llm_latency_ms, usage = chat_call(system_prompt, user_prompt)

    doc_scope = scope_hash(prompt_norm, CHAT_MODEL, sp_hash, TEMPERATURE, CORPUS_VERSION)
    doc_key = key(doc_scope)
    try:
        mapping = {
            "prompt_hash": p_hash,
            "model": CHAT_MODEL,
            "sys_hash": sp_hash,
            "corpus_version": CORPUS_VERSION,
            "temperature": TEMPERATURE,
            "created_at": time.time(),
            "last_hit_at": time.time(),
            "response": content,
            "vector": to_bytes(qvec),
        }
        # Write the hash and its TTL atomically
        pipe = r.pipeline(transaction=True)
        pipe.hset(doc_key, mapping=mapping)
        pipe.expire(doc_key, int(TTL))
        pipe.execute()
    except redis.RedisError:
        pass

    return {
        "source": "llm",
        "response": content,
        "distance": None,
        "latency_ms": llm_latency_ms,
        "usage": {
            "prompt_tokens": getattr(usage, "prompt_tokens", None) if usage else None,
            "completion_tokens": getattr(usage, "completion_tokens", None) if usage else None,
            "total_tokens": getattr(usage, "total_tokens", None) if usage else None,
        }
    }

<hr>

### Step 6: Add Metrics Tracking

You need metrics to know whether the cache is actually paying off: hit rate, plus latency for the cache path and the LLM path. Trust me on this one.
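
As a rough illustration of why hit rate is the number to watch, the sketch below turns hit and miss counts into an estimated saving. The prices are arbitrary units chosen to keep the arithmetic easy to follow, not real API prices, and `estimate_cost` is a hypothetical helper. It assumes every request pays for one embedding call and only misses pay for an LLM call, which matches the flow in Step 5.

```python
def estimate_cost(hits, misses, llm_cost_per_call, embed_cost_per_call):
    """Compare spend with and without the cache."""
    total = hits + misses
    without_cache = total * llm_cost_per_call
    # With the cache, every request is embedded but only misses reach the LLM
    with_cache = total * embed_cost_per_call + misses * llm_cost_per_call
    saved = without_cache - with_cache
    return {
        "hit_rate": hits / total if total else 0.0,
        "without_cache": without_cache,
        "with_cache": with_cache,
        "saved_fraction": saved / without_cache if without_cache else 0.0,
    }

# Hypothetical traffic and prices (arbitrary units)
report = estimate_cost(hits=70, misses=30, llm_cost_per_call=1.0, embed_cost_per_call=0.01)
print(report)
```

With these made-up numbers a 70% hit rate saves about 69% of spend: slightly less than the hit rate, because the embedding call is paid on every request.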

In [None]:
import math
import statistics

class Metrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.cache_latencies = []
        self.llm_latencies = []

    def record(self, result):
        if result["source"] == "cache":
            self.hits += 1
            self.cache_latencies.append(result["latency_ms"])
        else:
            self.misses += 1
            self.llm_latencies.append(result["latency_ms"])

    def snapshot(self):
        def safe_percentile(vals, p):
            # Nearest-rank percentile; None when there are no samples yet
            if not vals:
                return None
            sorted_vals = sorted(vals)
            idx = math.ceil(len(sorted_vals) * p / 100) - 1
            return sorted_vals[max(0, idx)]

        return {
            "hit_rate": self.hits / max(self.hits + self.misses, 1),
            "p50_cache_ms": statistics.median(self.cache_latencies) if self.cache_latencies else None,
            "p95_cache_ms": safe_percentile(self.cache_latencies, 95),
            "p50_llm_ms": statistics.median(self.llm_latencies) if self.llm_latencies else None,
            "p95_llm_ms": safe_percentile(self.llm_latencies, 95),
        }

metrics = Metrics()

def answer(system_prompt: str, user_prompt: str, ef_runtime: int = 100, threshold: float = THRESH):
    res = cache_get_or_generate(system_prompt, user_prompt, ef_runtime=ef_runtime, threshold=threshold)
    metrics.record(res)
    return res

<hr>

### Step 7: Build the FastAPI Service

Let's wrap this all up in a nice API.

In [None]:
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    system_prompt: str
    user_prompt: str
    ef_runtime: Optional[int] = 100  # Optional[...] instead of "int | None" keeps Python 3.9 working

@app.post("/semantic-cache/answer")
def semantic_answer(q: Query):
    res = answer(q.system_prompt, q.user_prompt, ef_runtime=q.ef_runtime or 100)
    return res

@app.get("/semantic-cache/metrics")
def get_metrics():
    return metrics.snapshot()

**Fire up the service** (save the code from the steps above as `app.py`, then run this from a terminal):

In [None]:
uvicorn app:app --reload

**Test it with curl:**

In [None]:
curl -X POST http://localhost:8000/semantic-cache/answer \
  -H "Content-Type: application/json" \
  -d '{"system_prompt": "You are a helpful assistant.", "user_prompt": "What is the capital of France?"}'
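
If you would rather call the endpoint from Python, the same request can be built with the standard library alone. This is a sketch: `build_request` and `ask` are hypothetical helper names, and `ask` only works once the service above is running on `localhost:8000`, so it is defined here but not called.

```python
import json
import urllib.request

def build_request(base_url, system_prompt, user_prompt, ef_runtime=100):
    """Build the POST request for the /semantic-cache/answer endpoint."""
    payload = {
        "system_prompt": system_prompt,
        "user_prompt": user_prompt,
        "ef_runtime": ef_runtime,
    }
    return urllib.request.Request(
        url=f"{base_url}/semantic-cache/answer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(base_url, system_prompt, user_prompt):
    """Send the request and decode the JSON reply (needs the service to be running)."""
    req = build_request(base_url, system_prompt, user_prompt)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_request(
    "http://localhost:8000",
    "You are a helpful assistant.",
    "What is the capital of France?",
)
print(req.get_method(), req.full_url)
```

Calling `ask(...)` twice with slightly different wording is a quick way to watch the `source` field flip from `llm` to `cache`.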

<hr>

## Run and Validate

### Warm the Cache

First, let's get some initial queries into the cache.

In [None]:
SYSTEM_PROMPT = "You are a concise support assistant for ACME Corp. Use internal policy v1 for refunds and returns."
seed_prompts = [
    "What is our refund policy?",
    "How long is the return window?",
    "Do you offer exchanges?",
]

print("Warming cache...")
for p in seed_prompts:
    res = answer(SYSTEM_PROMPT, p)
    print(f"{res['source']} {res['latency_ms']:.1f}ms")

### Test Paraphrases

Now here's where it gets interesting. Watch how the cache handles these variations:

In [None]:
paraphrases = [
    "Can I get a refund? What's the policy?",
    "What's the time limit to return an item?",
    "Is it possible to swap a product for another?",
    "How do refunds work here?",
    "For how many days can I return stuff?",
]

print("\nTesting paraphrases...")
for p in paraphrases:
    res = answer(SYSTEM_PROMPT, p)
    print(f"{p} => {res['source']} dist={res.get('distance')} {res['latency_ms']:.1f}ms")

### Print Metrics

In [None]:
print("\nMetrics:", metrics.snapshot())

**What you should see:**
<ul><li>First run: all `llm` sources, probably 500–1000ms latency
</li><li>Paraphrases: mostly `cache` sources with distance less than 0.10 and far lower latency than the LLM path (a hit still pays for one embedding call plus the Redis lookup)
</li><li>Hit rate: somewhere between 60–80% for paraphrases
</li></ul>

If you're not seeing good hit rates, your threshold might need tweaking. Let me show you how to dial that in.

<hr>

## Tuning the Similarity Threshold

The threshold is critical. It is a cosine distance, so lower means more similar. Too low and you'll miss obvious matches. Too high and you'll get false positives. Here's how I usually tune it:

In [None]:
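def sweep_thresholds(thresholds):
    # Read-only sweep: look up each paraphrase once without calling the LLM or
    # writing to the cache, so earlier thresholds cannot skew later ones.
    # Run it right after warming the cache: paraphrases that missed in the
    # previous cell were cached and will now match themselves at distance ~0.
    nearest = {p: vector_search(embed(canonicalize(p)), ef_runtime=150) for p in paraphrases}
    for t in thresholds:
        print(f"\nThreshold={t}")
        for p, res in nearest.items():
            dist = res[2] if res else None
            source = "cache" if dist is not None and dist < t else "llm"
            print(f"{p} => {source} dist={dist}")

sweep_thresholds([0.06, 0.08, 0.10, 0.12, 0.14])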

Start with 0.10 and adjust based on your false positive rate. In my experience with customer support queries, 0.10 works well. But for more technical content, you might need to go lower.
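
One way to make "adjust based on your false positive rate" measurable is to hand-label a few query pairs as true paraphrases or not, record their distances, and score each candidate threshold. The sketch below does that with invented distances; `evaluate_threshold` and the `labeled` data are hypothetical and are not measurements from a real model.

```python
def evaluate_threshold(labeled_distances, threshold):
    """labeled_distances: list of (distance, is_true_paraphrase) pairs."""
    positives = [d for d, same in labeled_distances if same]
    negatives = [d for d, same in labeled_distances if not same]
    # Recall: share of real paraphrases that would be served from cache
    recall = sum(d < threshold for d in positives) / len(positives)
    # False positive rate: share of unrelated queries that would wrongly hit
    false_positive_rate = sum(d < threshold for d in negatives) / len(negatives)
    return recall, false_positive_rate

# Hypothetical hand-labeled distances
labeled = [
    (0.04, True), (0.07, True), (0.09, True), (0.13, True),
    (0.11, False), (0.18, False), (0.25, False), (0.40, False),
]

for t in [0.06, 0.08, 0.10, 0.12, 0.14]:
    recall, fpr = evaluate_threshold(labeled, t)
    print(f"threshold={t:.2f} recall={recall:.2f} false_positive_rate={fpr:.2f}")
```

In this toy data, 0.10 catches three of four paraphrases with no false positives, while 0.14 catches all four at the price of one wrong hit. That is the trade-off to weigh on your own labeled queries.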
<hr>

## Inspect the Cache

**Count indexed documents:**

In [None]:
info = r.execute_command("FT.INFO", INDEX)
num_docs = info[info.index(b'num_docs') + 1]
print(f"Cached documents: {num_docs}")

**Take a peek at a document:**

In [None]:
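# SCAN iterates incrementally instead of blocking Redis the way KEYS can
keys = list(r.scan_iter(match=f"{NS}*", count=100))
if keys:
    doc = r.hgetall(keys[0])
    # Skip the raw float32 vector: it is binary and cannot be decoded as UTF-8
    print({k.decode(): v.decode() for k, v in doc.items() if k != b"vector"})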

<hr>

## Conclusion

And there you have it: a production-grade semantic cache with Redis Vector and FastAPI. The system normalizes queries, generates embeddings, performs fast vector search, and returns cached responses when similarity is high. In my testing, this typically cuts latency by 10–20x and reduces LLM costs by 60–80% for repeated queries.
**The key design decisions that make this work:**
<ul><li>**Canonicalization** stabilizes cache keys across queries that differ only in case, whitespace, or volatile details like timestamps and IDs - this was a game-changer
</li><li>**HNSW indexing** enables sub-50ms vector search at scale
</li><li>**Metadata gating** ensures cache hits respect model, temperature, and system prompt changes (learned this one the hard way)
</li><li>**TTL and namespace versioning** provide safe invalidation paths when you need them
</li></ul>

**Where to go from here:**
<ul><li>Add query-side metadata filters in `FT.SEARCH` to reduce false candidates (something like `(@model:{gpt\-4o\-mini} @sys_hash:{<hash>})=>[KNN ...]`; note that punctuation such as hyphens in TAG values has to be backslash-escaped)
</li><li>Integrate Prometheus and Grafana for observability - you'll want to track hit rate, p95 latency, cache size
</li><li>Implement LRU eviction or score-based pruning for long-running caches
</li><li>Look into quantization (FLOAT16) to reduce memory footprint - though honestly, I haven't needed this yet
</li><li>Scale with Redis Cluster for multi-tenant or high-throughput workloads
</li></ul>

For more on building intelligent systems, check out our guides on <a target="_blank" rel="noopener noreferrer nofollow" href="/article/build-rag-pipeline">building a RAG pipeline</a> and <a target="_blank" rel="noopener noreferrer nofollow" href="/article/optimize-llm-context">optimizing LLM context windows</a>.
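
As a starting point for the query-side metadata filter mentioned in the list above, here is a sketch of a query-string builder. `escape_tag` and `build_filtered_knn_query` are hypothetical helpers. Escaping every character that is not a letter, digit, or underscore is a conservative assumption on my part, so check the escaping rules for your Redis version before relying on it.

```python
import re

def escape_tag(value):
    """Backslash-escape every character that is not a letter, digit, or underscore."""
    return re.sub(r"([^A-Za-z0-9_])", r"\\\1", value)

def build_filtered_knn_query(model, sys_hash, corpus_version, top_k=5):
    """Restrict the KNN search to entries generated under the same settings."""
    prefilter = (
        f"@model:{{{escape_tag(model)}}} "
        f"@sys_hash:{{{escape_tag(sys_hash)}}} "
        f"@corpus_version:{{{escape_tag(corpus_version)}}}"
    )
    return f"({prefilter})=>[KNN {top_k} @vector $vec EF_RUNTIME $ef_runtime AS score]"

# "abc123" is a placeholder for a real system prompt hash
query = build_filtered_knn_query("gpt-4o-mini", "abc123", "v1")
print(query)
```

The resulting string would replace the `*=>[KNN ...]` query in `vector_search`, so candidates with the wrong model or system prompt are excluded before the nearest neighbor is chosen, rather than being rejected afterwards.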