# 📓 The GenAI Revolution Cookbook

**Title:** Semantic Cache LLM: How to Implement with Redis Vector to Cut Costs

**Description:** Build a semantic cache for LLM applications using embeddings and Redis Vector, with TTLs, thresholds, and metrics, to reduce LLM spend and latency.

**📖 Read the full article:** [Semantic Cache LLM: How to Implement with Redis Vector to Cut Costs](https://blog.thegenairevolution.com/article/semantic-cache-llm-how-to-implement-with-redis-vector-to-cut-costs)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Why This Matters

Most LLM applications waste money and time answering the same question phrased slightly differently. A semantic cache solves this by recognizing when a new query is semantically similar to a previous one and returning the cached response without a chat-completion call. Only an embedding lookup is needed.

This guide walks you through building a production-grade semantic cache using embeddings and Redis Vector. You'll create a FastAPI microservice with a Redis-backed semantic cache, complete with thresholds, TTLs, and metrics. By the end, you'll have working code, a tunable architecture, and a clear path to lower latency and cost.

**What you'll build:**

- A Redis HNSW vector index for semantic similarity search
- A cache layer that normalizes queries, generates embeddings, and retrieves cached responses
- A FastAPI endpoint to serve cached or fresh LLM answers
- A demo script to validate cache hit rates and latency improvements

**Prerequisites:**

- Python 3.9+
- Redis Stack (local via Docker or managed Redis Cloud)
- OpenAI API key
- Basic familiarity with embeddings and vector search

If you're using Google Colab or a cloud notebook, connect to a managed Redis Stack instance (e.g., Redis Cloud) instead of running Docker locally.

For a deeper understanding of how LLMs manage memory and the concept of context rot, see our article on [why LLMs "forget" as their memory grows](/article/context-rot-why-llms-forget-as-their-memory-grows).
---

## How It Works (High-Level Overview)

**The paraphrase problem:** Users ask the same question in many ways. "What's your refund policy?" and "Can I get my money back?" ask for the same information, but traditional caching treats them as different keys.
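
To make the problem concrete, here is a minimal sketch of an exact-match cache key. The helper name `exact_key` is made up for illustration and is not part of this guide's code. It assumes a typical exact-match cache that hashes the lowercased, whitespace-collapsed query.

```python
import hashlib


def exact_key(query: str) -> str:
    """Key an exact-match cache might use (hypothetical helper)."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


q1 = "What's your refund policy?"
q2 = "Can I get my money back?"
q3 = "  what's YOUR refund policy? "

# Paraphrases produce unrelated keys, so the second query misses the cache.
print(exact_key(q1) == exact_key(q2))
# Only trivial differences (case, whitespace) collapse to the same key.
print(exact_key(q1) == exact_key(q3))
```

String normalization alone only recovers near-identical wording, which is why the lookup needs to compare meaning rather than characters.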
**The embedding advantage:** Embeddings map text into a high-dimensional vector space where semantically similar phrases cluster together. By comparing query embeddings using cosine similarity, you can detect paraphrases and return cached responses. Redis reports cosine *distance* (1 minus cosine similarity), so smaller values mean closer matches.
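
The relationship between similarity, distance, and a hit threshold can be sketched with plain Python. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds or thousands of dimensions), and the 0.10 cutoff simply mirrors the default threshold used later in this guide.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance as used by a COSINE vector index: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors, made up for illustration only.
refund = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather = [0.0, 0.2, 0.9]

THRESHOLD = 0.10  # maximum distance that counts as a cache hit

for name, vec in [("money_back", money_back), ("weather", weather)]:
    d = cosine_distance(refund, vec)
    print(f"{name}: distance={d:.3f} hit={d < THRESHOLD}")
```

A distance threshold of 0.10 corresponds to a similarity of at least 0.90. Lowering the threshold makes the cache stricter, while raising it increases hit rate at the risk of returning answers to questions that only look alike.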
**Why Redis Vector:** Redis Stack provides HNSW (Hierarchical Navigable Small World) indexing for fast approximate nearest neighbor search. It combines low-latency vector search with Redis's native TTL, tagging, and filtering capabilities, which makes it a good fit for production caching.

**Architecture:**

1. Normalize the user query (lowercase, strip volatile patterns like timestamps)
2. Generate an embedding for the normalized query
3. Search the Redis HNSW index for the nearest cached embedding
4. If the cosine distance is below the threshold and the metadata matches (model, temperature, system prompt hash, corpus version), return the cached response
5. Otherwise, call the LLM, cache the new response with its embedding, and return it

---

## Setup & Installation

### Option 1: Managed Redis (Recommended for Notebooks)

Sign up for a free Redis Cloud account at [redis.com/try-free](https://redis.com/try-free) and create a Redis Stack database. Copy the connection URL.

In your notebook or terminal:

In [None]:
%pip install redis openai python-dotenv numpy

Set environment variables:

In [None]:
import os
# os.environ["REDIS_URL"] = "redis://default:password@your-redis-host:port" # Replace with your actual Redis URL
# os.environ["OPENAI_API_KEY"] = "sk-..." # OpenAI API key
os.environ["EMBEDDING_MODEL"] = "text-embedding-3-small" # OpenAI embedding model to use
os.environ["CHAT_MODEL"] = "gpt-4o-mini" # OpenAI chat model to use
os.environ["SIMILARITY_THRESHOLD"] = "0.10" # Maximum cosine distance (1 - similarity) for a cache hit; lower is stricter
os.environ["TOP_K"] = "5" # Number of nearest neighbors to retrieve in vector search
os.environ["CACHE_TTL_SECONDS"] = "86400" # Time-to-live for cache entries in seconds (1 day)
os.environ["CACHE_NAMESPACE"] = "sc:v1:" # Namespace for cache keys in Redis
os.environ["CORPUS_VERSION"] = "v1" # Version of the underlying data corpus (for cache invalidation)
os.environ["TEMPERATURE"] = "0.2" # Temperature parameter for the LLM
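
Because every setting arrives as a string, a typo can silently change cache behavior. The sketch below shows one way to parse and range-check the values before use. `load_cache_settings` is a hypothetical helper, not part of this guide's code, and the accepted ranges are assumptions based on how the settings are used later.

```python
import os
from typing import Mapping


def load_cache_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Parse and range-check cache settings (hypothetical helper)."""
    settings = {
        "threshold": float(env.get("SIMILARITY_THRESHOLD", "0.10")),
        "top_k": int(env.get("TOP_K", "5")),
        "ttl_seconds": int(env.get("CACHE_TTL_SECONDS", "86400")),
        "temperature": float(env.get("TEMPERATURE", "0.2")),
    }
    # Cosine distance lies between 0 (same direction) and 2 (opposite direction).
    if not 0.0 <= settings["threshold"] <= 2.0:
        raise ValueError("SIMILARITY_THRESHOLD must be a cosine distance in [0, 2]")
    if settings["top_k"] < 1:
        raise ValueError("TOP_K must be at least 1")
    if settings["ttl_seconds"] <= 0:
        raise ValueError("CACHE_TTL_SECONDS must be positive")
    return settings


# An explicit mapping keeps the example independent of the real environment.
settings = load_cache_settings({"SIMILARITY_THRESHOLD": "0.10", "TOP_K": "5"})
print(settings)
```

Failing fast at startup is usually cheaper than debugging a cache that never hits because the threshold was entered as a similarity (for example 0.90) instead of a distance.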

### Option 2: Local Redis with Docker

In [None]:
!docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

Create a `.env` file:
<pre><code>REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...
EMBEDDING_MODEL=text-embedding-3-small
CHAT_MODEL=gpt-4o-mini
SIMILARITY_THRESHOLD=0.10
TOP_K=5
CACHE_TTL_SECONDS=86400
CACHE_NAMESPACE=sc:v1:
CORPUS_VERSION=v1
TEMPERATURE=0.2
</code></pre>

Install dependencies:

In [None]:
%pip install redis openai python-dotenv numpy fastapi uvicorn

---

## Step-by-Step Implementation

### Step 1: Create the Redis HNSW Index

The index stores embeddings and metadata for cached responses. We use HNSW for fast approximate nearest neighbor search.
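
One detail of an `FT.CREATE` vector field that is easy to get wrong is the number that follows `HNSW`: it counts every attribute token after it (names and values), so five name/value pairs give `10`. The sketch below derives that count instead of hard-coding it. `hnsw_vector_field` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, List


def hnsw_vector_field(name: str, attrs: Dict[str, object]) -> List[str]:
    """Build the schema tokens for one HNSW vector field (hypothetical helper)."""
    tokens: List[str] = []
    for attr_name, attr_value in attrs.items():
        tokens += [attr_name, str(attr_value)]
    # The count after HNSW is the number of tokens that follow it.
    return [name, "VECTOR", "HNSW", str(len(tokens)), *tokens]


field = hnsw_vector_field(
    "vector",
    {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": "COSINE",
        "M": 16,
        "EF_CONSTRUCTION": 200,
    },
)
print(field)
```

Deriving the count avoids a mismatch when an attribute is added or removed later, which Redis would otherwise reject as a malformed schema.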

In [None]:
import os
import redis
from dotenv import load_dotenv

load_dotenv()

r = redis.Redis.from_url(os.getenv("REDIS_URL"))

INDEX = "sc_idx"
PREFIX = os.getenv("CACHE_NAMESPACE", "sc:v1:")
DIM = 1536  # Dimension for text-embedding-3-small
M = 16  # HNSW graph connectivity
EF_CONSTRUCTION = 200  # HNSW construction quality

def create_index():
    try:
        r.execute_command("FT.INFO", INDEX)
        print("Index already exists.")
        return
    except redis.ResponseError:
        pass

    # Create index with vector field and metadata tags
    cmd = [
        "FT.CREATE", INDEX,  # Create a search index with the given name
        "ON", "HASH",  # Index applies to Redis Hash data structures
        "PREFIX", "1", PREFIX,  # Only index keys starting with the defined prefix
        "SCHEMA",  # Define the schema of the index
        "prompt_hash", "TAG",  # Hash of the canonicalized prompt
        "model", "TAG",  # LLM model used
        "sys_hash", "TAG",  # Hash of the system prompt
        "corpus_version", "TAG",  # Version of the underlying corpus
        "temperature", "NUMERIC",  # Temperature parameter used by the LLM
        "created_at", "NUMERIC",  # Creation timestamp
        "last_hit_at", "NUMERIC",  # Timestamp of the last cache hit
        "response", "TEXT",  # The LLM's response
        "user_question", "TEXT",  # The original user question
        "vector", "VECTOR", "HNSW", "10",  # Vector field using HNSW; "10" is the number of attribute tokens that follow (5 name/value pairs)
        "TYPE", "FLOAT32",  # Data type of the vector embeddings
        "DIM", str(DIM),  # Dimension of the vector embeddings
        "DISTANCE_METRIC", "COSINE",  # Distance metric for vector similarity search
        "M", str(M),  # HNSW: max connections per node in the graph
        "EF_CONSTRUCTION", str(EF_CONSTRUCTION),  # HNSW: candidate list size during graph construction
    ]
    r.execute_command(*cmd)
    print("Index created.")

create_index()

**Validation:**

In [None]:
info = r.execute_command("FT.INFO", INDEX)
print("Index info:", info)

You should see `num_docs: 0` initially.

---

### Step 2: Normalize Queries for Stable Cache Keys

Canonicalization removes volatile elements (timestamps, UUIDs, IDs) and normalizes case and whitespace, so that queries differing only in such details normalize to the same text and the same cache key.

In [None]:
import re
import hashlib
# Note: Normalization adequacy depends on expected query variations and embedding model robustness.

VOLATILE_PATTERNS = [
    # ISO timestamps and variations
    r"\b\d{4}-\d{2}-\d{2}(T|\s)\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?\b",
    # Common date formats (MM/DD/YYYY, DD/MM/YYYY, YYYY/MM/DD, YYYY-MM-DD).
    # Separators are required; with optional separators this would strip any 4+ digit number.
    r"\b\d{1,4}[-/.]\d{1,2}[-/.]\d{2,4}\b",
    # UUID v4
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b",
    # Long IDs (6+ digits)
    r"\b\d{6,}\b",
    # Email addresses (often contain volatile parts or personally identifiable info)
    r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b",
]

def canonicalize(text: str) -> str:
    t = text.strip().lower()
    for pat in VOLATILE_PATTERNS:
        t = re.sub(pat, " ", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t

def sha256(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def scope_hash(prompt_norm: str, model: str, sys_hash: str, temperature: float, corpus_version: str) -> str:
    # Unique hash for cache scope including all parameters
    payload = f"{prompt_norm}|{model}|{sys_hash}|{temperature}|{corpus_version}"
    return sha256(payload)

**Test:**

In [None]:
q1 = "What is our refund policy on 2025-01-15?"
q2 = "what is our refund policy on 2025-01-20?"
print(canonicalize(q1))
print(canonicalize(q2))
# Both should output: "what is our refund policy on ?" (the date is stripped, the punctuation is kept)

---

### Step 3: Initialize Clients and Embedding Function

In [None]:
import numpy as np
from openai import OpenAI

client = OpenAI()

EMBED_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
THRESH = float(os.getenv("SIMILARITY_THRESHOLD", 0.10))  # Maximum cosine distance for a cache hit
TOP_K = int(os.getenv("TOP_K", 5))
TTL = int(os.getenv("CACHE_TTL_SECONDS", 86400))
NS = os.getenv("CACHE_NAMESPACE", "sc:v1:")
CORPUS_VERSION = os.getenv("CORPUS_VERSION", "v1")
TEMPERATURE = float(os.getenv("TEMPERATURE", 0.2))

def embed(text: str) -> np.ndarray:
    # Generate embedding and normalize for cosine distance
    e = client.embeddings.create(model=EMBED_MODEL, input=text)
    vec = np.array(e.data[0].embedding, dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / max(norm, 1e-12)

def to_bytes(vec: np.ndarray) -> bytes:
    return vec.astype(np.float32).tobytes()

**Test:**

In [None]:
test_vec = embed("hello world")
print(f"Embedding shape: {test_vec.shape}, norm: {np.linalg.norm(test_vec):.4f}")
# Should output shape (1536,) and norm ~1.0

---

### Step 4: Implement Vector Search

In [None]:
import time
from typing import Optional, Dict, Any, Tuple

def vector_search(query_vec, ef_runtime: int = 100, threshold: float = THRESH) -> Optional[Tuple[str, Dict[str, Any], float]]:
    # Perform KNN search with EF_RUNTIME parameter.
    # `threshold` is accepted for symmetry with the caller, which applies the cutoff itself.
    params = ["vec", to_bytes(query_vec), "ef_runtime", ef_runtime]
    q = f"*=>[KNN {TOP_K} @vector $vec EF_RUNTIME $ef_runtime AS score]"
    try:
        res = r.execute_command(
            "FT.SEARCH", INDEX,
            q, "PARAMS", str(len(params)), *params,
            "SORTBY", "score", "ASC",
            "RETURN", "8", "response", "model", "sys_hash", "corpus_version", "temperature", "prompt_hash", "user_question", "score",
            "DIALECT", "2"
        )
    except redis.RedisError:
        return None

    total = res[0] if res else 0
    if total < 1:
        return None

    # Reply layout: [total, doc_id, [field, value, ...], ...]; only the nearest match is used
    doc_id = res[1]
    fields = res[2]
    f = {fields[i].decode() if isinstance(fields[i], bytes) else fields[i]:
         fields[i+1].decode() if isinstance(fields[i+1], bytes) else fields[i+1]
         for i in range(0, len(fields), 2)}

    try:
        distance = float(f["score"])
    except Exception:
        distance = 1.0

    return doc_id.decode() if isinstance(doc_id, bytes) else doc_id, f, distance

---

### Step 5: Build the Cache Layer

In [None]:
import time
from typing import Optional, Dict, Any, Tuple

def sys_hash(system_prompt: str) -> str:
    return sha256(system_prompt.strip())

def key(doc_id_hash: str) -> str:
    return f"{NS}{doc_id_hash}"

def metadata_matches(f: Dict[str, Any], model: str, sys_h: str, temp: float, corpus: str) -> bool:
    try:
        if f.get("model") != model: return False
        if f.get("sys_hash") != sys_h: return False
        if abs(float(f.get("temperature", temp)) - temp) > 1e-6: return False
        if f.get("corpus_version") != corpus: return False
        return True
    except Exception:
        return False

def chat_call(system_prompt: str, user_prompt: str):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        temperature=TEMPERATURE,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    latency_ms = (time.perf_counter() - t0) * 1000
    content = resp.choices[0].message.content
    usage = getattr(resp, "usage", None)
    return content, latency_ms, usage

def cache_get_or_generate(system_prompt: str, user_prompt: str, ef_runtime: int = 100, threshold: float = THRESH):
    # Start timing the cache lookup process
    t0 = time.perf_counter()

    # Generate hashes and canonicalize prompt for consistent caching
    sp_hash = sys_hash(system_prompt)
    prompt_norm = canonicalize(user_prompt)
    p_hash = sha256(prompt_norm)

    # Generate embedding for the normalized prompt
    qvec = embed(prompt_norm)

    # Perform vector search in Redis to find similar cached entries
    res = vector_search(qvec, ef_runtime=ef_runtime, threshold=threshold)

    # Check if a cached result was found and meets criteria
    if res:
        doc_id, fields, distance = res
        # Check if the semantic distance is within the threshold and metadata matches
        if distance < threshold and metadata_matches(fields, CHAT_MODEL, sp_hash, TEMPERATURE, CORPUS_VERSION):
            try:
                # If cache hit, update the last hit timestamp in Redis
                r.hset(doc_id, mapping={"last_hit_at": time.time()})
            except redis.RedisError:
                # A failed timestamp update should not turn a hit into an error
                pass
            # Return the cached response with details
            return {
                "source": "cache",
                "response": fields["response"],
                "distance": distance,
                "latency_ms": (time.perf_counter() - t0) * 1000,
                "user_question": fields.get("user_question"),  # Question that produced the cached answer
            }

    # If no cache hit, call the LLM to generate a new response
    content, llm_latency_ms, usage = chat_call(system_prompt, user_prompt)

    # Generate a unique key for the new cache entry based on prompt scope
    doc_scope = scope_hash(prompt_norm, CHAT_MODEL, sp_hash, TEMPERATURE, CORPUS_VERSION)
    doc_key = key(doc_scope)

    # Prepare data to be cached
    try:
        mapping = {
            "prompt_hash": p_hash,
            "model": CHAT_MODEL,
            "sys_hash": sp_hash,
            "corpus_version": CORPUS_VERSION,
            "temperature": TEMPERATURE,
            "created_at": time.time(),
            "last_hit_at": time.time(),
            "response": content,
            "user_question": user_prompt,
            "vector": to_bytes(qvec),
        }
        # Use a Redis pipeline for atomic HSET and EXPIRE commands
        pipe = r.pipeline(transaction=True)
        pipe.hset(doc_key, mapping=mapping)
        pipe.expire(doc_key, int(TTL))
        pipe.execute()
    except redis.RedisError:
        # A failed cache write should not prevent returning the fresh answer
        pass

    # Return the LLM-generated response with details
    return {
        "source": "llm",
        "response": content,
        "distance": None,
        "latency_ms": llm_latency_ms,
        "usage": {
            "prompt_tokens": getattr(usage, "prompt_tokens", None) if usage else None,
            "completion_tokens": getattr(usage, "completion_tokens", None) if usage else None,
            "total_tokens": getattr(usage, "total_tokens", None) if usage else None,
        },
        "user_question": user_prompt,  # Include user_question in LLM response
    }

---

### Step 6: Add Metrics Tracking

In [None]:
import math
import statistics

class Metrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.cache_latencies = []
        self.llm_latencies = []

    def record(self, result):
        if result["source"] == "cache":
            self.hits += 1
            self.cache_latencies.append(result["latency_ms"])
        else:
            self.misses += 1
            self.llm_latencies.append(result["latency_ms"])

    def snapshot(self):
        def safe_percentile(vals, p):
            # Nearest-rank percentile; returns None when there are no samples
            if not vals:
                return None
            sorted_vals = sorted(vals)
            idx = math.ceil(len(sorted_vals) * p / 100) - 1
            return sorted_vals[max(0, idx)]

        return {
            "hit_rate": self.hits / max(self.hits + self.misses, 1),
            "p50_cache_ms": statistics.median(self.cache_latencies) if self.cache_latencies else None,
            "p95_cache_ms": safe_percentile(self.cache_latencies, 95),
            "p50_llm_ms": statistics.median(self.llm_latencies) if self.llm_latencies else None,
            "p95_llm_ms": safe_percentile(self.llm_latencies, 95),
        }

metrics = Metrics()

def answer(system_prompt: str, user_prompt: str, ef_runtime: int = 100, threshold: float = THRESH):
    res = cache_get_or_generate(system_prompt, user_prompt, ef_runtime=ef_runtime, threshold=threshold)
    metrics.record(res)
    return res

---

### Step 7: Build the FastAPI Service

In [None]:
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()

class Query(BaseModel):
    system_prompt: str
    user_prompt: str
    ef_runtime: Optional[int] = 100  # Optional[...] keeps this compatible with Python 3.9

@app.post("/semantic-cache/answer")
def semantic_answer(q: Query):
    res = answer(q.system_prompt, q.user_prompt, ef_runtime=q.ef_runtime or 100)
    return res

@app.get("/semantic-cache/metrics")
def get_metrics():
    return metrics.snapshot()

**Run the service:**

In [None]:
%%writefile app.py
from fastapi import FastAPI
from pydantic import BaseModel
import os
import redis
import time
import re
import hashlib
import numpy as np
from openai import OpenAI
import statistics
from dotenv import load_dotenv
from typing import Optional, Dict, Any, Tuple
# Load environment variables (assuming .env or environment variables are set)
load_dotenv()

# Define constants from environment variables - Ensure these are defined before app is created
EMBED_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
CHAT_MODEL = os.getenv("CHAT_MODEL", "gpt-4o-mini")
THRESH = float(os.getenv("SIMILARITY_THRESHOLD", 0.10))
TOP_K = int(os.getenv("TOP_K", 5))
TTL = int(os.getenv("CACHE_TTL_SECONDS", 86400))
NS = os.getenv("CACHE_NAMESPACE", "sc:v1:")
CORPUS_VERSION = os.getenv("CORPUS_VERSION", "v1")
TEMPERATURE = float(os.getenv("TEMPERATURE", 0.2))
INDEX = "sc_idx"  # Define INDEX here as it's used in create_index

# Initialize Redis client - Ensure REDIS_URL is set in environment variables
try:
    r = redis.Redis.from_url(os.getenv("REDIS_URL"))
    # Check connection
    r.ping()
    print("Connected to Redis successfully!")
except redis.exceptions.ConnectionError as e:
    print(f"Could not connect to Redis: {e}")
    r = None  # Set r to None if connection fails

# Define HNSW parameters (should match index creation)
DIM = 1536
M = 16
EF_CONSTRUCTION = 200

# --- Index Creation Function (Moved from earlier cell) ---
# This function should ideally be run once during setup, not every time the app starts.
# For this notebook demo, we include it, but in a production app, index creation
# is usually handled separately.
def create_index():
    if not r: return  # Skip if Redis connection failed
    try:
        r.execute_command("FT.INFO", INDEX)
        print("Index already exists.")
        return
    except redis.ResponseError:
        pass
    cmd = [
        "FT.CREATE", INDEX,
        "ON", "HASH",
        "PREFIX", "1", NS,  # Use NS here as defined from env var
        "SCHEMA",
        "prompt_hash", "TAG",
        "model", "TAG",
        "sys_hash", "TAG",
        "corpus_version", "TAG",
        "temperature", "NUMERIC",
        "created_at", "NUMERIC",
        "last_hit_at", "NUMERIC",
        "response", "TEXT",
        "user_question", "TEXT",
        "vector", "VECTOR", "HNSW", "10",  # "10" is the number of attribute tokens that follow
        "TYPE", "FLOAT32",
        "DIM", str(DIM),
        "DISTANCE_METRIC", "COSINE",
        "M", str(M),
        "EF_CONSTRUCTION", str(EF_CONSTRUCTION),
    ]
    try:
        r.execute_command(*cmd)
        print("Index created.")
    except redis.RedisError as e:
        print(f"Error creating index: {e}")

# Create the index when the app module is imported
create_index()

# --- Normalization Functions (Moved from earlier cell) ---
VOLATILE_PATTERNS = [
    # ISO timestamps and variations
    r"\b\d{4}-\d{2}-\d{2}(T|\s)\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?\b",
    # Common date formats; separators are required so plain 4+ digit numbers are kept
    r"\b\d{1,4}[-/.]\d{1,2}[-/.]\d{2,4}\b",
    # UUID v4
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b",
    # Long IDs (6+ digits)
    r"\b\d{6,}\b",
    # Email addresses
    r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b",
]

def canonicalize(text: str) -> str:
    t = text.strip().lower()
    for pat in VOLATILE_PATTERNS:
        t = re.sub(pat, " ", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t

def sha256(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def scope_hash(prompt_norm: str, model: str, sys_hash: str, temperature: float, corpus_version: str) -> str:
    payload = f"{prompt_norm}|{model}|{sys_hash}|{temperature}|{corpus_version}"
    return sha256(payload)

def sys_hash(system_prompt: str) -> str:
    return sha256(system_prompt.strip())

def key(doc_id_hash: str) -> str:
    return f"{NS}{doc_id_hash}"

# --- OpenAI Client & Embedding Function (Moved from earlier cell) ---
# Initialize OpenAI client - Ensure OPENAI_API_KEY is set in environment variables
try:
    client = OpenAI()
    # Optional: check if client can connect/authenticate
    # client.models.list()
    print("OpenAI client initialized.")
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    client = None  # Set client to None if initialization fails

def embed(text: str) -> np.ndarray:
    if not client: raise ConnectionError("OpenAI client not initialized.")
    # Generate embedding and normalize for cosine distance
    e = client.embeddings.create(model=EMBED_MODEL, input=text)
    vec = np.array(e.data[0].embedding, dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / max(norm, 1e-12)

def to_bytes(vec: np.ndarray) -> bytes:
    return vec.astype(np.float32).tobytes()

# --- Vector Search Function (Moved from earlier cell) ---
def vector_search(query_vec, ef_runtime: int = 100, threshold: float = THRESH) -> Optional[Tuple[str, Dict[str, Any], float]]:
    if not r: return None  # Skip if Redis connection failed
    # Perform KNN search with EF_RUNTIME parameter
    params = ["vec", to_bytes(query_vec), "ef_runtime", ef_runtime]
    # Use $vec for the vector parameter in the KNN query
    q = f"*=>[KNN {TOP_K} @vector $vec EF_RUNTIME $ef_runtime AS score]"
    try:
        res = r.execute_command(
            "FT.SEARCH", INDEX,
            q, "PARAMS", str(len(params)), *params,
            "SORTBY", "score", "ASC",
            "RETURN", "8", "response", "model", "sys_hash", "corpus_version", "temperature", "prompt_hash", "user_question", "score",
            "DIALECT", "2"
        )
    except redis.RedisError as e:
        print(f"Redis search error: {e}")
        return None
    total = res[0] if res else 0
    if total < 1:
        return None

    # Process search results
    # The result format is [total_results, doc1_id, [field1, value1, field2, value2, ...], doc2_id, [...], ...]
    # We expect the first result to be the best match
    if len(res) > 1:
        doc_id = res[1]
        fields = res[2]
        # Decode bytes to string for keys and values where appropriate
        f = {fields[i].decode() if isinstance(fields[i], bytes) else fields[i]:
             fields[i+1].decode() if isinstance(fields[i+1], bytes) else fields[i+1]
             for