![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Context-Enabled Semantic Caching with Redis


<a href="https://colab.research.google.com/drive/1zBkga1q8fty0esJX-M2e2nPg2PyXaFwn?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What is Context-Enabled Semantic Caching?


Most caching systems today are **exact match**. They only return results if the query matches a key 1:1.  
Ask **“What’s the weather in NYC?”**, and the system might cache and return that exact string.  
But change it slightly—**“Is it raining in New York?”**—and you miss the cache completely.

**Semantic caching** fixes that. It uses **vector embeddings** to find conceptually similar queries.  
So whether a user asks “forecast for NYC,” “weather in Manhattan,” or “umbrella needed in NYC?”, they all hit the **same cached result** if the meaning aligns.
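
To make the contrast concrete, here is a minimal sketch of exact-match versus similarity-based lookup. The three-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration (real embedding models produce hundreds of dimensions), and `exact_lookup` and `semantic_lookup` are hypothetical helper names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real system would compute these with a model.
TOY_EMBEDDINGS = {
    "What's the weather in NYC?": [0.90, 0.10, 0.05],
    "Is it raining in New York?": [0.85, 0.20, 0.05],
    "How do I reset my password?": [0.05, 0.10, 0.95],
}

def exact_lookup(cache, query):
    """Traditional cache: only an identical key is a hit."""
    return cache.get(query)

def semantic_lookup(cache, query, threshold=0.85):
    """Return the value of the most similar cached key if it clears the threshold."""
    q = TOY_EMBEDDINGS[query]
    best_key = max(cache, key=lambda k: cosine_similarity(q, TOY_EMBEDDINGS[k]))
    if cosine_similarity(q, TOY_EMBEDDINGS[best_key]) >= threshold:
        return cache[best_key]
    return None

cache = {"What's the weather in NYC?": "Sunny, 75F"}
```

With this toy data, the rephrased weather question misses the exact-match cache but hits the semantic one, while the unrelated password question misses both.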

But here’s the problem:  
Even if you nail semantic similarity, **not all users want the same level of detail or format**.  
As LLM applications retain more history and memory about each user, there is an opportunity to personalize responses at a fraction of the cost of a fresh LLM call.

That’s where **Context-Enabled Semantic Caching (CESC)** comes in.

---



### The Business Problem

Enterprise LLM applications face three critical challenges:
- **Cost**: GPT-4o calls can cost $0.0025-0.01 per 1K tokens
- **Latency**: Cold LLM calls take 2-5 seconds, hurting user experience  
- **Relevance**: Generic responses don't account for user roles, preferences, or context

### Why It Matters

| Challenge       | Traditional Caching         | Semantic Caching                      | CESC (Personalized)                       |
|----------------|-----------------------------|----------------------------------------|-------------------------------------------|
| **Match Type**  | Exact string                | Vector similarity                      | Vector + user context                     |
| **Relevance**   | Low                         | Medium                                 | High                                      |
| **Latency**     | Fast                        | Fast                                   | Still fast (cached + lightweight model)   |
| **Cost**        | Low                         | Low                                    | Low (personalization uses GPT-4o-mini, not a full GPT-4o call) |



---

### Our Solution Architecture

CESC creates a three-tier response system:
1. **Cold Start**: Fresh LLM call for new queries (expensive, slow, but comprehensive)
2. **Cache Hit**: Instant return of semantically similar cached responses (fast, cheap, generic)
3. **Personalized Cache Hit**: Lightweight model personalizes cached content using user memory (balanced speed/cost/relevance)
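
The three tiers above amount to a small routing decision. The sketch below is one way to express it; `choose_tier` and the tier labels are illustrative names, not identifiers from the notebook's classes.

```python
from typing import Optional

def choose_tier(cached_response: Optional[str], user_memory: dict) -> str:
    """Pick which of the three tiers should serve a request."""
    if cached_response is None:
        # No semantically similar entry: pay for a full LLM call.
        return "cold_start"
    if not user_memory:
        # Similar entry found, but nothing is known about the user.
        return "cache_hit"
    # Similar entry found and user memory exists: rewrite with a lightweight model.
    return "personalized_cache_hit"
```

Only the first tier calls the large model; the third tier trades a small model call for relevance.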

Let's see this in action with a real enterprise IT support scenario.
[![](https://mermaid.ink/img/pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg?type=png)](https://mermaid.live/edit#pako:eNpdkU1uwjAQha9izTpQfkyAqEJCqdQNlSBpWTRh4SYDiRTbaOKUAkLqFXrFnqROgmjVWdnz5n1-8pwh0SmCB9tCH5JMkGGLIFbM1ip6KZHYqkI6blinM2NhtMbEaGIhCkqy-ze6mwWY5uV6sWk9oZ1jSjMpTJI1nkX0uHz-_vzimvmiKFqQH4UWgyxXtplkeHX7jRhEAZqKFDOa1Qn-on-583qKcnxHNlfl4TY2vyao6uwSpaZjS_0j_9eWt4wdmaucLZFKrUSRn7DNG4ADO8pT8LaiKNEBiSRFfYdzzY3BZCgxBs8eU9yKqjAxxOpifXuhXrWW4BmqrJN0tctunGqfCoMPudiRkLcuoUqRfF0pAx7vTxsIeGf4AG867Lp8POmNXT4YuLYcOILXd6ddPhzzSd8d8Snn3L04cGqe7XUn45EDdk32y5_aZTc7v_wAqpSdUg)

In [1]:
# 📦 Install required Python packages
!pip install -q "redisvl>=0.8.0" sentence-transformers openai tiktoken python-dotenv redis pandas

In [2]:
%%sh
# NBVAL_SKIP
# The %%sh cell magic must be the first line of the cell, otherwise the shell commands are parsed as Python.
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install -y redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

## Infrastructure Setup

We're using Redis with vector search capabilities to store embeddings and enable semantic similarity matching. This simulates a production environment where your cache would be persistent across sessions.

**Note**: In production, you'd typically use Redis Enterprise, or a managed Redis service such as Redis Cloud or Azure Managed Redis with proper clustering, persistence, and security configurations.
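
For a managed or TLS-protected deployment it is often easier to pass a single connection URL. The sketch below builds one from the same `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` variables used in this notebook; `build_redis_url` and the `REDIS_TLS` flag are hypothetical additions for illustration.

```python
import os
from urllib.parse import quote

def build_redis_url(env=None) -> str:
    """Assemble a redis:// (or rediss:// for TLS) URL from environment-style settings."""
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    password = env.get("REDIS_PASSWORD", "")
    # REDIS_TLS is an assumed, illustrative flag; the notebook does not define it.
    scheme = "rediss" if env.get("REDIS_TLS", "") == "1" else "redis"
    # Percent-encode the password so characters like "@" do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"
```

The resulting string can be handed to `redis.Redis.from_url(...)` or to a RedisVL index as its `redis_url`.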

In [3]:
import os
import redis

# Redis connection params
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# Create Redis client
redis_client = redis.Redis(
  host=REDIS_HOST,
  port=REDIS_PORT,
  password=REDIS_PASSWORD
)

# Test connection
redis_client.ping()


In [None]:
import os
from google.colab import userdata

# 🔐 Ask user whether to use Azure OpenAI or OpenAI
use_azure = input("Use Azure OpenAI? (y/n): ").strip().lower() == "y"

if use_azure:
    print("🔒 Azure OpenAI selected.")
    print("📌 Please ensure the following secrets are added via the 🔐 Colab > Secrets menu:")
    print("- AZURE_OPENAI_API_KEY")
    print("- AZURE_OPENAI_ENDPOINT (e.g. https://your-resource.openai.azure.com)")
    print("- AZURE_OPENAI_API_VERSION (e.g. 2024-05-01-preview)")
    print("💡 Make sure 'gpt-4o' and 'gpt-4o-mini' models are deployed in your Azure Foundry.\n")

    os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
    os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")
    os.environ["AZURE_OPENAI_API_VERSION"] = userdata.get("AZURE_OPENAI_API_VERSION")

    # Optional model deployment names
    os.environ.setdefault("AZURE_OPENAI_GPT4_MODEL", "gpt-4o")
    os.environ.setdefault("AZURE_OPENAI_GPT4mini_MODEL", "gpt-4o-mini")

else:
    print("🔒 OpenAI selected.")
    print("📌 Please ensure the following secret is added via the 🔐 Colab > Secrets menu:")
    print("- OPENAI_API_KEY\n")

    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

    # Optional model names (if using gpt-4o via OpenAI)
    os.environ.setdefault("OPENAI_GPT4_MODEL", "gpt-4o")
    os.environ.setdefault("OPENAI_GPT4mini_MODEL", "gpt-4o-mini")


In [None]:
import time
import uuid
import numpy as np
from typing import List, Dict
import redis
from sentence_transformers import SentenceTransformer
from redisvl.index import SearchIndex
from redisvl.utils.vectorize import HFTextVectorizer
from openai import AzureOpenAI, OpenAI
import tiktoken
import pandas as pd

# Connect to Redis
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

# RedisVL index
index_config = {
    "index": {
        "name": "cesc_index",
        "prefix": "cesc",
        "storage_type": "hash"
    },
    "fields": [
        {
            "name": "content_vector",
            "type": "vector",
            "attrs": {
                "dims": 384,
                "distance_metric": "cosine",
                "algorithm": "hnsw"
            }
        },
        {"name": "content", "type": "text"},
        {"name": "user_id", "type": "tag"}
    ]
}
# Pass the connection URL when building the index (connect() is deprecated in newer RedisVL releases)
search_index = SearchIndex.from_dict(index_config, redis_url="redis://localhost:6379")
search_index.create(overwrite=True)

if use_azure:
    client = AzureOpenAI(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        api_version=os.getenv("AZURE_OPENAI_API_VERSION")
    )
    GPT4_MODEL = os.getenv("AZURE_OPENAI_GPT4_MODEL")
    GPT4mini_MODEL = os.getenv("AZURE_OPENAI_GPT4mini_MODEL")
else:
    client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY")
    )
    GPT4_MODEL = os.getenv("OPENAI_GPT4_MODEL")
    GPT4mini_MODEL = os.getenv("OPENAI_GPT4mini_MODEL")


# Embedding vectorizer (all-MiniLM-L6-v2 produces 384-dim vectors, matching "dims" in the index schema)
vectorizer = HFTextVectorizer(model="all-MiniLM-L6-v2")

# Token counter
class TokenCounter:
    def __init__(self, model_name="gpt-4o"):
        try:
            self.encoding = tiktoken.encoding_for_model(model_name)
        except KeyError:
            self.encoding = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        if not text:
            return 0
        return len(self.encoding.encode(text))

token_counter = TokenCounter()

class TelemetryLogger:
    def __init__(self):
        self.logs = []

    def log(self, user_id, method, latency_ms, input_tokens, output_tokens, cache_status, response_source):
        # response_source is the model name (e.g. "gpt-4o", "gpt-4o-mini") or "cache" for raw hits
        cost = self.calculate_cost(response_source, input_tokens, output_tokens)
        # Baseline: what the same token counts would have cost on a cold gpt-4o call
        baseline = self.calculate_cost("gpt-4o", input_tokens, output_tokens)

        self.logs.append({
            "timestamp": time.time(),
            "user_id": user_id,
            "method": method,
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
            "cache_status": cache_status,
            "response_source": response_source,
            "cost_usd": cost,
            "baseline_cost_usd": baseline
        })

    def show_logs(self):
        return pd.DataFrame(self.logs)

    def summarize(self):
        df = pd.DataFrame(self.logs)
        if df.empty:
            print("No telemetry yet.")
            return

        df["total_tokens"] = df["input_tokens"] + df["output_tokens"]

        display(df[[
            "user_id",
            "cache_status",
            "latency_ms",
            "response_source",
            "input_tokens",
            "output_tokens",
            "total_tokens"
        ]])

        # Compare cold start vs personalized
        try:
            cold_latency = df.loc[df["user_id"] == "user_cold", "latency_ms"].values[0]
            cx_latency = df.loc[df["user_id"] == "user_withcontext", "latency_ms"].values[0]

            if cx_latency < cold_latency:
                delta = cold_latency - cx_latency
                pct = (delta / cold_latency) * 100
                print(f"\n⚡ Personalized response (user_withcontext) was faster than the plain LLM by {int(delta)} ms — a {pct:.1f}% speed boost.")
            else:
                delta = cx_latency - cold_latency
                pct = (delta / cold_latency) * 100  # slowdown relative to the cold call
                print(f"\n⏱️ Personalized response (user_withcontext) was {int(delta)} ms slower than the plain LLM — a {pct:.1f}% slowdown.")
                print("📌 However, it returned a tailored response based on user memory, offering higher relevance.")
        except Exception as e:
            print("\n⚠️ Could not compute latency comparison:", e)

    def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # Illustrative per-1K-token prices for this demo; check your provider's current pricing
        pricing = {
            "gpt-4o": {"input": 0.005, "output": 0.015},
            "gpt-4o-mini": {"input": 0.0015, "output": 0.003}
        }

        if model not in pricing:
            return 0.0

        input_cost = (input_tokens / 1000) * pricing[model]["input"]
        output_cost = (output_tokens / 1000) * pricing[model]["output"]
        return round(input_cost + output_cost, 6)

    def display_cost_summary(self):
      df = self.show_logs()
      if df.empty:
          print("No telemetry logged yet.")
          return

      # Calculate savings per row
      df["savings_usd"] = df["baseline_cost_usd"] - df["cost_usd"]

      total_cost = df["cost_usd"].sum()
      baseline_cost = df["baseline_cost_usd"].sum()
      total_savings = df["savings_usd"].sum()
      savings_pct = (total_savings / baseline_cost * 100) if baseline_cost > 0 else 0

      # Display summary table
      display(df[[
          "user_id", "cache_status", "response_source",
          "input_tokens", "output_tokens", "latency_ms",
          "cost_usd", "baseline_cost_usd", "savings_usd"
      ]])

      # 💸 Compare cost of plain LLM vs personalized
      try:
          cost_plain = df.loc[df["user_id"] == "user_cold", "cost_usd"].values[0]
          cost_personalized = df.loc[df["user_id"] == "user_withcontext", "cost_usd"].values[0]

          print(f"\n🧾 Total Cost of Plain LLM Response: ${cost_plain:.4f}")
          print(f"🧾 Total Cost of Personalized Response: ${cost_personalized:.4f}")

          if cost_personalized < cost_plain:
              delta = cost_plain - cost_personalized
              pct = (delta / cost_plain) * 100
              print(f"\n💡 Personalized response (user_withcontext) was cheaper than plain LLM by ${delta:.4f} — a {pct:.1f}% cost improvement.")
          else:
              delta = cost_personalized - cost_plain
              pct = (delta / cost_plain) * 100  # increase relative to the plain LLM cost
              print(f"\n⏱️ Personalized response (user_withcontext) was ${delta:.4f} more expensive than plain LLM — a {pct:.1f}% cost increase.")
              print("📌 However, it returned a tailored response based on user memory, offering higher relevance.")
      except Exception as e:
          print("\n⚠️ Could not compute cost comparison:", e)


In [None]:
class AzureLLMClient:
    def __init__(self, client, token_counter, gpt4_model="gpt-4o", gpt4mini_model="gpt-4o-mini"):
        self.client = client
        self.token_counter = token_counter
        self.gpt4_model = gpt4_model
        self.gpt4mini_model = gpt4mini_model

    def call_llm(self, prompt: str, model: str = "gpt-4o") -> Dict:
        """Call the chat model and record latency and token usage (cost is computed by TelemetryLogger)"""
        start_time = time.time()
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=200
        )
        latency = (time.time() - start_time) * 1000

        output = response.choices[0].message.content
        input_tokens = self.token_counter.count_tokens(prompt)
        output_tokens = self.token_counter.count_tokens(output)

        return {
            "response": output,
            "latency_ms": round(latency, 2),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "model": model
        }

    def call_gpt4(self, prompt: str) -> Dict:
        return self.call_llm(prompt, model=self.gpt4_model)

    def call_gpt4mini(self, prompt: str) -> Dict:
        return self.call_llm(prompt, model=self.gpt4mini_model)

    def personalize_response(self, cached_response: str, user_context: Dict, original_prompt: str) -> Dict:
        context_prompt = self._build_context_prompt(cached_response, user_context, original_prompt)
        start_time = time.time()
        response = self.client.chat.completions.create(
            model=self.gpt4mini_model,
            messages=[
                {"role": "system", "content": context_prompt},
                {"role": "user", "content": "Please personalize this cached response for the user. Keep your response under 3 sentences."}
            ]
        )
        latency = (time.time() - start_time) * 1000  # ms
        reply = response.choices[0].message.content

        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        total_tokens = response.usage.total_tokens

        return {
            "response": reply,
            "latency_ms": round(latency, 2),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "tokens": total_tokens,
            "model": self.gpt4mini_model
        }

    def _build_context_prompt(self, cached_response: str, user_context: Dict, prompt: str) -> str:
        context_parts = []
        if user_context.get("preferences"):
            context_parts.append("User preferences: " + ", ".join(user_context["preferences"]))
        if user_context.get("goals"):
            context_parts.append("User goals: " + ", ".join(user_context["goals"]))
        if user_context.get("history"):
            context_parts.append("User history: " + ", ".join(user_context["history"]))
        context_blob = "\n".join(context_parts)
        return f"""You are a personalization assistant. A cached response was previously generated for the prompt: "{prompt}".

Here is the cached response:
\"\"\"{cached_response}\"\"\"

Use the user's context below to personalize and refine the response:
{context_blob}

Respond in a way that feels tailored to this user, adjusting tone, content, or suggestions as needed. Keep your response under 3 sentences no matter what.
"""


    # Cache lookup and request routing are handled by ContextEnabledSemanticCache.query.


In [None]:
from redisvl.query import VectorQuery

class ContextEnabledSemanticCache:
    def __init__(self, redis_index, vectorizer, llm_client: AzureLLMClient, telemetry: TelemetryLogger):
        self.index = redis_index
        self.vectorizer = vectorizer
        self.llm = llm_client
        self.telemetry = telemetry
        self.user_memories: Dict[str, Dict] = {}

    def add_user_memory(self, user_id: str, memory_type: str, content: str):
        if user_id not in self.user_memories:
            self.user_memories[user_id] = {"preferences": [], "history": [], "goals": []}
        self.user_memories[user_id][memory_type].append(content)

    def get_user_memory(self, user_id: str) -> Dict:
        return self.user_memories.get(user_id, {})

    def generate_embedding(self, text: str) -> List[float]:
        return self.vectorizer.embed(text)


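    def search_cache(self, embedding: List[float], threshold=0.85):
        """Return the closest cached response if its cosine similarity is >= threshold, else None."""
        query = VectorQuery(
            vector=embedding,
            vector_field_name="content_vector",
            return_fields=["content", "user_id"],
            num_results=1,
            return_score=True
        )
        results = self.index.query(query)

        if results:
            first = results[0]
            # RedisVL reports cosine *distance* (lower is closer) under "vector_distance";
            # convert it to a similarity before comparing against the threshold.
            distance = first.get("vector_distance")
            if distance is not None and 1.0 - float(distance) >= threshold:
                return first["content"]

        return None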

    def store_response(self, prompt: str, response: str, embedding: List[float], user_id: str):
        # The prompt's embedding is the lookup key; the response text is the cached value.
        # Hash storage expects the vector as raw float32 bytes.
        vec_bytes = np.array(embedding, dtype=np.float32).tobytes()

        doc = {
            "content": response,
            "content_vector": vec_bytes,
            "user_id": user_id
        }
        self.index.load([doc])  # writes the document to Redis under the index prefix

    def query(self, prompt: str, user_id: str):
      embedding = self.generate_embedding(prompt)
      cached_response = self.search_cache(embedding)

      if cached_response:
          user_context = self.get_user_memory(user_id)
          if user_context:
              result = self.llm.personalize_response(cached_response, user_context, prompt)
              self.telemetry.log(
                  user_id=user_id,
                  method="context_query",
                  latency_ms=result["latency_ms"],
                  input_tokens=result["input_tokens"],
                  output_tokens=result["output_tokens"],
                  cache_status="hit_personalized",
                  response_source=result["model"]
              )
              return result["response"]
          else:
              # Raw hit: no LLM call is made, so tokens are zero; latency is recorded as 0 rather than measured
              self.telemetry.log(
                  user_id=user_id,
                  method="context_query",
                  latency_ms=0,
                  input_tokens=0,
                  output_tokens=0,
                  cache_status="hit_raw",
                  response_source="cache"
              )
              return cached_response

      else:
          result = self.llm.call_gpt4(prompt)  # use the configured model/deployment name, not the hardcoded default
          self.store_response(prompt, result["response"], embedding, user_id)
          self.telemetry.log(
              user_id=user_id,
              method="context_query",
              latency_ms=result["latency_ms"],
              input_tokens=result["input_tokens"],
              output_tokens=result["output_tokens"],
              cache_status="miss",
              response_source=result["model"]
          )
          return result["response"]

telemetry_logger = TelemetryLogger()
# ✅ Initialize engine
cesc = ContextEnabledSemanticCache(
    redis_index=search_index,
    vectorizer=vectorizer,
    llm_client=AzureLLMClient(client, token_counter, GPT4_MODEL, GPT4mini_MODEL),
    telemetry=telemetry_logger
)
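
The `threshold=0.85` in `search_cache` is easiest to reason about as a cosine similarity, whereas RedisVL vector queries report cosine distance (lower is closer) in a `vector_distance` field, returned as a string. The sketch below shows the conversion on plain dictionaries shaped like query results; `pick_cached_content` is a hypothetical helper, and the result rows are made up.

```python
from typing import Optional

def distance_to_similarity(cosine_distance: float) -> float:
    """Cosine similarity is one minus cosine distance."""
    return 1.0 - cosine_distance

def pick_cached_content(results: list, similarity_threshold: float = 0.85) -> Optional[str]:
    """Return the content of the closest result if it is similar enough, else None."""
    if not results:
        return None
    best = results[0]  # assumed to be sorted closest-first, as a KNN query returns them
    similarity = distance_to_similarity(float(best["vector_distance"]))
    return best["content"] if similarity >= similarity_threshold else None

# Made-up rows shaped like query results
near = [{"content": "cached answer", "vector_distance": "0.05"}]
far = [{"content": "unrelated answer", "vector_distance": "0.40"}]
```

A lower threshold raises the hit rate but increases the chance of returning an answer to a different question, so it is worth tuning on real traffic.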


## Scenario Setup: IT Support Dashboard Access

We'll simulate three different approaches to handling the same IT support query:
- **User A (Cold)**: No cache, fresh LLM call every time
- **User B (No Context)**: Cache hit, but generic response  
- **User C (With Context)**: Cache hit + personalization based on user memory

The query: *A user in the finance department can't access the dashboard — what should I check?*

### User Context Profile
User C represents an experienced IT support agent who:
- Specializes in finance department issues
- Has solved similar dashboard access problems before
- Uses specific tools and follows established troubleshooting patterns
- Needs responses tailored to their expertise level and current context
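
As a rough sketch, a profile like this can be held as three lists and flattened into the text block that the personalization prompt receives. The memory entries below are the ones this demo adds for User C; `format_user_context` is an illustrative helper name that mirrors what the client's prompt builder does.

```python
user_memory = {
    "preferences": ["uses Chrome browser on macOS"],
    "goals": ["resolve access issues efficiently for finance team users"],
    "history": [
        "frequently resolves issues with 'finance_dashboard_viewer' role misconfigurations",
        "troubleshot recent problems with finance dashboard access and SSO",
    ],
}

def format_user_context(memory: dict) -> str:
    """Render non-empty memory categories as one labeled line each."""
    labels = [
        ("preferences", "User preferences"),
        ("goals", "User goals"),
        ("history", "User history"),
    ]
    lines = [
        f"{label}: " + ", ".join(memory[key])
        for key, label in labels
        if memory.get(key)
    ]
    return "\n".join(lines)

context_blob = format_user_context(user_memory)
```

Users with no stored memory produce an empty string, which is the case where the cached response is returned unchanged.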

In [None]:
# 🔁 Reset Redis index and telemetry (optional for rerun clarity)
search_index.delete()  # DANGER: removes all vectors
search_index.create(overwrite=True)
telemetry_logger.logs = []

def print_divider(title: str = "", width: int = 60):
    line = "=" * width
    if title:
        print(f"\n{line}\n{title}\n{line}\n")
    else:
        print(f"\n{line}\n")


# 🧪 Define demo prompt and users
prompt = "A user in the finance department can't access the dashboard — what should I check? Answer in 2-3 sentences max."
users = {
    "cold": "user_cold",
    "nocx": "user_nocontext",
    "cx": "user_withcontext"
}

# 🧠 Add memory for personalized user (e.g., finance IT support agent)
cesc.add_user_memory(users["cx"], "preferences", "uses Chrome browser on macOS")
cesc.add_user_memory(users["cx"], "goals", "resolve access issues efficiently for finance team users")
cesc.add_user_memory(users["cx"], "history", "frequently resolves issues with 'finance_dashboard_viewer' role misconfigurations")
cesc.add_user_memory(users["cx"], "history", "troubleshot recent problems with finance dashboard access and SSO")

# 🔍 Run prompt for each scenario
print_divider("🧊 Scenario 1: Plain LLM – cache miss")
response_1 = cesc.query(prompt, user_id=users["cold"])
print(response_1, "\n")

print_divider("📦 Scenario 2: Semantic Cache Hit – generic, no user memory")
response_2 = cesc.query(prompt, user_id=users["nocx"])
print(response_2, "\n")

print_divider("🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory")
response_3 = cesc.query(prompt, user_id=users["cx"])
print(response_3, "\n")


🧊 Scenario 1: Plain LLM – cache miss

First, verify the user's permissions and access rights to the dashboard in the system settings. Ensure they are assigned the correct role or group. Next, check for any connectivity issues, browser compatibility, or recent changes to the dashboard configuration that might affect access. 


📦 Scenario 2: Semantic Cache Hit – generic, no user memory

First, verify the user's permissions and access rights to the dashboard in the system settings. Ensure they are assigned the correct role or group. Next, check for any connectivity issues, browser compatibility, or recent changes to the dashboard configuration that might affect access. 


🧠 Scenario 3: Context-Enabled Semantic Cache Hit – personalized with user memory

First, check the user's permissions to ensure they have the 'finance_dashboard_viewer' role correctly assigned in the system settings. Since you’re using Chrome on macOS, confirm there are no browser compatibility issues and that your SSO 

## Key Observations

Notice the different response patterns:

1. **Cold Start Response**: Comprehensive but generic; the slowest and most expensive path
2. **Cache Hit Response**: Identical to cold start, near-instant retrieval, minimal cost
3. **Personalized Response**: Adapted for user's specific role, tools, and experience level

The personalized response demonstrates how CESC can:
- Reference user's specific browser/OS (Chrome on macOS)
- Mention role-specific permissions (finance_dashboard_viewer role)
- Reference past experience (SSO troubleshooting history)
- Maintain professional tone appropriate for experienced IT staff

In [None]:
# 📊 Show telemetry summary
print_divider("📈 Telemetry Summary:")
telemetry_logger.summarize()  # displays its own output; wrapping it in print() would also print "None"

print_divider("💸 Cost Breakdown:")
telemetry_logger.display_cost_summary()


📈 Telemetry Summary:



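|   | user_id          | cache_status     | latency_ms | response_source | input_tokens | output_tokens | total_tokens |
|---|------------------|------------------|------------|-----------------|--------------|---------------|--------------|
| 0 | user_cold        | miss             | 1283.51    | gpt-4o          | 25           | 50            | 75           |
| 1 | user_nocontext   | hit_raw          | 0.0        | cache           | 0            | 0             | 0            |
| 2 | user_withcontext | hit_personalized | 838.04     | gpt-4o-mini     | 224          | 66            | 290          |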



⚡ Personalized response (user_withcontext) was faster than the plain LLM by 445 ms — a 34.7% speed boost.


💸 Cost Breakdown:



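|   | user_id          | cache_status     | response_source | input_tokens | output_tokens | latency_ms | cost_usd | baseline_cost_usd | savings_usd |
|---|------------------|------------------|-----------------|--------------|---------------|------------|----------|-------------------|-------------|
| 0 | user_cold        | miss             | gpt-4o          | 25           | 50            | 1283.51    | 0.000875 | 0.000875          | 0.0         |
| 1 | user_nocontext   | hit_raw          | cache           | 0            | 0             | 0.0        | 0.0      | 0.0               | 0.0         |
| 2 | user_withcontext | hit_personalized | gpt-4o-mini     | 224          | 66            | 838.04     | 0.000534 | 0.00211           | 0.001576    |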



🧾 Total Cost of Plain LLM Response: $0.0009
🧾 Total Cost of Personalized Response: $0.0005

💡 Personalized response (user_withcontext) was cheaper than plain LLM by $0.0003 — a 39.0% cost improvement.
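
The figures above can be checked by hand. The sketch below recomputes them from the logged token counts and latencies, using the same per-1K-token prices that `calculate_cost` hardcodes; those prices are the notebook's demo values, not a quote of current provider pricing.

```python
# Demo prices per 1K tokens, copied from TelemetryLogger.calculate_cost
PRICES_PER_1K = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.0015, "output": 0.003},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, given token counts and the per-1K prices above."""
    price = PRICES_PER_1K[model]
    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    return round(cost, 6)

# Token counts and latencies taken from the telemetry rows above
cold_cost = call_cost("gpt-4o", 25, 50)
personalized_cost = call_cost("gpt-4o-mini", 224, 66)

cost_saving_pct = (cold_cost - personalized_cost) / cold_cost * 100
latency_saving_pct = (1283.51 - 838.04) / 1283.51 * 100
```

Note that the personalized call uses far more input tokens (224 vs 25) because the cached response and user memory are injected into the prompt; it is still cheaper only because of the lower per-token price.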


# Enterprise Significance & Large-Scale Impact

## Production Metrics That Matter

The results above demonstrate significant improvements across three critical enterprise metrics:

### 💰 Cost Optimization
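- **Immediate Savings**: Potentially 60-80% cost reduction on repeated queries, depending on hit rate and model pricing (in the run above, the personalized hit cost about 39% less than the cold call, and the raw hit incurred no LLM cost)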
- **Scale Impact**: For enterprises processing 100K+ LLM queries daily, this translates to $1000s in monthly savings
- **Strategic Model Usage**: Expensive models (GPT-4o) for new content, efficient models (GPT-4o-mini) for personalization
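
To see how these savings depend on traffic mix, here is a back-of-the-envelope projection. Every input is an assumption for illustration: the per-call costs are the two values measured in this demo's short prompts, and the 60% hit rate and 50% personalized share are invented, so treat the function as a template rather than a forecast.

```python
def monthly_llm_cost(daily_queries, hit_rate, personalized_share,
                     cold_cost, personalized_cost, days=30):
    """Estimated monthly LLM spend with the cache in front of the model.

    hit_rate: fraction of queries answered from the cache.
    personalized_share: fraction of cache hits that go through the lightweight model.
    Raw cache hits are assumed to incur no LLM cost.
    """
    misses = daily_queries * (1 - hit_rate)
    personalized_hits = daily_queries * hit_rate * personalized_share
    daily_cost = misses * cold_cost + personalized_hits * personalized_cost
    return days * daily_cost

# Assumed inputs (hypothetical traffic mix, demo-measured per-call costs)
no_cache = monthly_llm_cost(100_000, 0.0, 0.0, 0.000875, 0.000534)
with_cache = monthly_llm_cost(100_000, 0.6, 0.5, 0.000875, 0.000534)
savings_pct = (no_cache - with_cache) / no_cache * 100
```

Real prompts are usually much longer than the demo's, so absolute figures will differ; the ratio between the two scenarios is the more useful output.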

### ⚡ Performance Enhancement  
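- **Latency Reduction**: Raw cache hits skip the LLM call entirely and can respond in <100ms vs 2-5 seconds for cold calls; personalized hits still make a lightweight model call (about 0.8 s in the run above)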
- **User Experience**: Sub-second responses feel instantaneous to end users
- **Scalability**: Redis can handle millions of vector operations per second

### 🎯 Relevance & Personalization
- **Context Awareness**: Responses adapt to user roles, departments, and experience levels
- **Continuous Learning**: User memory grows with each interaction
- **Business Intelligence**: System learns organizational patterns and common solutions

## ROI Calculations for Enterprise Deployment

### Quantifiable Benefits
- **Cost Savings**: 60-80% reduction in LLM API costs
- **Productivity Gains**: 2-3x faster response times improve user productivity  
- **Quality Improvement**: Consistent, personalized responses reduce error rates
- **Scalability**: LLM spend grows with the cache-miss rate rather than with total query volume

### Investment Considerations
- **Infrastructure**: Redis Enterprise, vector compute resources
- **Development**: Initial implementation, integration with existing systems
- **Maintenance**: Ongoing optimization, user memory management
- **Training**: Staff education on new capabilities and best practices

### Break-Even Analysis
For most enterprise deployments:
- **Break-even**: 3-6 months with >10K daily LLM queries
- **Positive ROI**: 200-400% in first year through combined cost savings and productivity gains
- **Compound Benefits**: Value increases as user memory and cache coverage grow
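
The break-even and ROI figures above depend entirely on a deployment's own costs. The sketch below shows the arithmetic behind such estimates; the function names and all dollar amounts are hypothetical placeholders, not measurements from this notebook.

```python
import math

def break_even_months(upfront_cost, monthly_savings, monthly_running_cost=0.0):
    """Months until cumulative net savings cover the upfront investment (None if never)."""
    net = monthly_savings - monthly_running_cost
    if net <= 0:
        return None
    return math.ceil(upfront_cost / net)

def first_year_roi_pct(upfront_cost, monthly_savings, monthly_running_cost=0.0):
    """First-year return on the upfront investment, as a percentage."""
    net_year = 12 * (monthly_savings - monthly_running_cost)
    return (net_year - upfront_cost) / upfront_cost * 100

# Hypothetical inputs: $20,000 to build, $6,000/month saved, $1,000/month to run
months = break_even_months(20_000, 6_000, 1_000)
roi = first_year_roi_pct(20_000, 6_000, 1_000)
```

If running costs exceed the monthly savings, `break_even_months` returns `None`, which is a useful sanity check before committing to infrastructure.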

The combination of semantic caching with user context represents a fundamental shift from generic AI responses to truly personalized, enterprise-aware intelligence that scales efficiently and cost-effectively.