![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# LangCache: Semantic Caching with Redis Cloud

This notebook demonstrates end-to-end semantic caching using **LangCache** - a managed Redis Cloud service accessed through the RedisVL library. LangCache provides enterprise-grade semantic caching with zero infrastructure management, making it ideal for production LLM applications.

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/semantic-cache/04_langcache_semantic_caching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Introduction

**LangCache** is a fully managed semantic cache service built on Redis Cloud. It was integrated into RedisVL in version 0.11.0 as an `LLMCache` interface implementation, making it easy for RedisVL users to:

- Transition to a fully managed caching service
- Reduce LLM API costs by caching similar queries
- Improve application response times
- Access enterprise features without managing infrastructure

### What You'll Learn

In this tutorial, you will:
1. Set up LangCache with Redis Cloud
2. Load and process a knowledge base (PDF documents)
3. Generate FAQs using the Doc2Cache technique
4. Pre-populate a semantic cache with tagged FAQs
5. Test different cache matching strategies and thresholds
6. Optimize cache performance using evaluation datasets
7. Use the `langcache-embed` cross-encoder model
8. Integrate the cache into a RAG pipeline
9. Measure performance improvements


## 1. Environment Setup

First, we'll install the required packages and set up our environment.


### Install Required Packages


In [None]:
%pip install -q "redisvl>=0.11.0" "openai>=1.0.0" "langchain>=0.3.0" "langchain-community" "langchain-openai"
%pip install -q "pypdf" "sentence-transformers" "redis-retrieval-optimizer>=0.2.0"

print("Packages installed. Please restart the kernel if prompted, then continue with the next cell.")



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
✓ Packages installed. Please restart the kernel if prompted, then continue with the next cell.


In [2]:
# Note: If you see import errors, restart the kernel and continue from this cell


### Import Dependencies


In [None]:
import os
import time
import json
from getpass import getpass
from typing import List, Dict, Any

# RedisVL imports
from redisvl.extensions.cache.llm import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer

# LangChain imports
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field

# Optimization
from redis_retrieval_optimizer.threshold_optimization import CacheThresholdOptimizer

print("All imports successful")


  from .autonotebook import tqdm as notebook_tqdm


11:13:29 numexpr.utils INFO   NumExpr defaulting to 10 threads.
✓ All imports successful


## 2. LangCache Setup

### Sign Up for LangCache

If you haven't already, sign up for a free LangCache account:

**[Sign up for LangCache →](https://langcache.io/signup)**

After signing up:
1. Create a new cache instance
2. Copy your **Endpoint URL** (looks like: `redis-xxxxx.langcache.io:xxxxx`)
3. Copy your **Access Token/Password**
4. Note your **Cache ID** (you'll use this as a prefix for organizing caches)

> **Note:** For this workshop, you can alternatively use a standard Redis Cloud instance with Redis Stack. Simply provide your Redis Cloud connection details instead.


### Configure Environment Variables


In [None]:
# Redis/LangCache credentials
if "REDIS_URL" not in os.environ:
    redis_host = input("Enter your Redis/LangCache host (e.g., redis-xxxxx.langcache.io or localhost): ")
    redis_port = input("Enter your Redis port (default: 6379): ") or "6379"
    redis_password = getpass("Enter your Redis password (leave empty for local): ")
    
    # Build Redis URL
    if redis_password:
        os.environ["REDIS_URL"] = f"rediss://:{redis_password}@{redis_host}:{redis_port}"
    else:
        os.environ["REDIS_URL"] = f"redis://{redis_host}:{redis_port}"

# OpenAI API key for LLM and embeddings
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

print("Environment variables configured")
print(f"  Redis URL: {os.environ['REDIS_URL'].split('@')[-1] if '@' in os.environ['REDIS_URL'] else os.environ['REDIS_URL'].split('//')[1]}")


✓ Environment variables configured
  Redis URL: :6379


### Initialize Semantic Cache with LangCache-Embed Model

We'll create a cache instance using the `redis/langcache-embed-v1` model, which is specifically optimized for semantic caching tasks.


In [None]:
# Initialize the vectorizer with the LangCache embedding model
# This model is specifically optimized for semantic caching with better precision/recall
vectorizer = HFTextVectorizer(
    model="redis/langcache-embed-v1"
)

# Create Semantic Cache instance
cache = SemanticCache(
    name="rag_faq_cache",
    redis_url=os.environ["REDIS_URL"],
    vectorizer=vectorizer,
    distance_threshold=0.15  # Initial threshold, we'll optimize this later
)

print(f"Semantic Cache initialized: {cache.index.name}")
print(f"  Using model: redis/langcache-embed-v1")
print(f"  Distance threshold: {cache.distance_threshold}")


11:14:02 datasets INFO   PyTorch version 2.7.0 available.
11:14:03 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: mps
11:14:03 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: redis/langcache-embed-v1
✓ Semantic Cache initialized: rag_faq_cache
  Using model: redis/langcache-embed-v1
  Distance threshold: 0.15


### Initialize OpenAI LLM


In [None]:
# Initialize OpenAI LLM for FAQ generation and RAG
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.3,
    max_tokens=2000
)

print("OpenAI LLM initialized")


✓ OpenAI LLM initialized


## 3. Load and Prepare Datasets

We'll work with three types of data:
1. **Knowledge Base**: PDF document(s) that contain factual information
2. **FAQs**: Derived from the knowledge base using Doc2Cache technique
3. **Test Dataset**: For evaluating and optimizing cache performance


### Load PDF Knowledge Base


In [None]:
# Download sample PDF if not already present
!mkdir -p data
!wget -q -O data/nvidia-10k.pdf https://raw.githubusercontent.com/redis-developer/redis-ai-resources/main/python-recipes/RAG/resources/nvd-10k-2023.pdf

print("Sample PDF downloaded")


✓ Sample PDF downloaded


In [None]:
# Load and chunk the PDF
pdf_path = "data/nvidia-10k.pdf"

# Configure text splitter for optimal chunk sizes
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Load and split the document
loader = PyPDFLoader(pdf_path)
documents = loader.load()
chunks = text_splitter.split_documents(documents)

print(f"Loaded PDF: {pdf_path}")
print(f"  Total pages: {len(documents)}")
print(f"  Created chunks: {len(chunks)}")
print(f"\nSample chunk preview:")
print(f"{chunks[10].page_content[:300]}...")


✓ Loaded PDF: data/nvidia-10k.pdf
  Total pages: 169
  Created chunks: 388

Sample chunk preview:
Table of Contents
The world’s leading cloud service providers, or CSPs, and consumer internet companies use our GPUs and broader data center-scale
accelerated computing platforms to enable, accelerate or enrich the services they deliver to billions of end-users, including search,
recommendations, so...


### Generate FAQs Using Doc2Cache Technique

The Doc2Cache approach uses an LLM to generate frequently asked questions from document chunks. These FAQs are then used to pre-populate the semantic cache with high-quality, factual responses.


In [9]:
# Define the FAQ data model
class QuestionAnswer(BaseModel):
    question: str = Field(description="A frequently asked question derived from the document content")
    answer: str = Field(description="A factual answer to the question based on the document")
    category: str = Field(description="Category of the question (e.g., 'financial', 'products', 'operations')")

class FAQList(BaseModel):
    faqs: List[QuestionAnswer] = Field(description="List of question-answer pairs extracted from the document")

# Set up JSON output parser
json_parser = JsonOutputParser(pydantic_object=FAQList)


In [None]:
# Create the FAQ generation prompt
faq_prompt = PromptTemplate(
    template="""You are a document analysis expert. Extract 3-5 high-quality FAQs from the following document chunk.

Guidelines:
- Generate diverse, specific questions that users would realistically ask
- Provide accurate, complete answers based ONLY on the document content
- Assign each FAQ to a category: 'financial', 'products', 'operations', 'technology', or 'general'
- Avoid vague or overly generic questions
- If the chunk lacks substantial content, return fewer FAQs

{format_instructions}

Document Chunk:
{doc_content}

FAQs JSON:""",
    input_variables=["doc_content"],
    partial_variables={"format_instructions": json_parser.get_format_instructions()}
)

# Create the FAQ generation chain
faq_chain = faq_prompt | llm | json_parser

print("FAQ generation chain configured")


✓ FAQ generation chain configured


In [11]:
# Test FAQ generation on a single chunk
print("Testing FAQ generation on sample chunk...\n")
test_faqs = faq_chain.invoke({"doc_content": chunks[10].page_content})

print(f"Generated {len(test_faqs.get('faqs', []))} FAQs:")
for i, faq in enumerate(test_faqs.get('faqs', [])[:3], 1):
    print(f"\n{i}. Q: {faq['question']}")
    print(f"   Category: {faq['category']}")
    print(f"   A: {faq['answer'][:150]}...")


Testing FAQ generation on sample chunk...

11:14:29 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Generated 5 FAQs:

1. Q: What industries are leveraging NVIDIA's GPUs and software for automation?
   Category: operations
   A: A rapidly growing number of enterprises and startups across a broad range of industries, including transportation for autonomous driving, healthcare f...

2. Q: What was the reason for the termination of the Arm Share Purchase Agreement between NVIDIA and SoftBank?
   Category: general
   A: The termination of the Share Purchase Agreement was due to significant regulatory challenges that prevented the completion of the transaction....

3. Q: What acquisition termination cost did NVIDIA record in fiscal year 2023?
   Category: financial
   A: NVIDIA recorded an acquisition termination cost of $1.35 billion in fiscal year 2023, reflecting the write-off of the prepayment provided at signing f...


In [None]:
# Generate FAQs from all chunks (limited to first 25 for demo purposes)
def extract_faqs_from_chunks(chunks: List[Any], max_chunks: int = 25) -> List[Dict]:
    """Extract FAQs from document chunks using LLM."""
    all_faqs = []
    
    for i, chunk in enumerate(chunks[:max_chunks]):
        if i % 5 == 0:
            print(f"Processing chunk {i+1}/{min(len(chunks), max_chunks)}...", flush=True)
        
        try:
            result = faq_chain.invoke({"doc_content": chunk.page_content})
            if result and result.get("faqs"):
                all_faqs.extend(result["faqs"])
        except Exception as e:
            print(f"  Warning: Skipped chunk {i+1} due to error: {str(e)[:100]}")
            continue
    
    return all_faqs

# Extract FAQs
print("\nGenerating FAQs from document chunks...\n")
faqs = extract_faqs_from_chunks(chunks, max_chunks=25)

print(f"\nGenerated {len(faqs)} FAQs total")
print(f"\nCategory distribution:")
categories = {}
for faq in faqs:
    cat = faq.get('category', 'unknown')
    categories[cat] = categories.get(cat, 0) + 1
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):
    print(f"  {cat}: {count}")



Generating FAQs from document chunks...

Processing chunk 1/25...
11:14:36 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:14:45 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:14:54 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:15:00 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:15:07 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing chunk 6/25...
11:15:14 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:15:16 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:15:22 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:15:29 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.

### Create Test/Evaluation Dataset

We'll create a test dataset with:
- **Positive examples**: Questions that should match cached FAQs
- **Negative examples**: Questions that should NOT match cached FAQs
- **Edge cases**: Slightly different phrasings to test threshold sensitivity


In [13]:
# Select representative FAQs for test set
sample_faqs = faqs[:10]  # Take first 10 FAQs

print("Sample FAQs for testing:")
for i, faq in enumerate(sample_faqs[:3], 1):
    print(f"\n{i}. {faq['question'][:100]}...")


Sample FAQs for testing:

1. What is the fiscal year end date for NVIDIA Corporation as reported in the 10-K?...

2. What is the trading symbol for NVIDIA Corporation's common stock?...

3. Where is the principal executive office of NVIDIA Corporation located?...


In [None]:
# Create test dataset with negative examples (off-topic questions)
negative_examples = [
    {"query": "What is the weather today?", "expected_match": False, "category": "off-topic"},
    {"query": "How do I cook pasta?", "expected_match": False, "category": "off-topic"},
    {"query": "What is the capital of France?", "expected_match": False, "category": "off-topic"},
    {"query": "Tell me a joke", "expected_match": False, "category": "off-topic"},
    {"query": "What time is it?", "expected_match": False, "category": "off-topic"},
]

print(f"Test dataset created")
print(f"  Negative examples: {len(negative_examples)}")


✓ Test dataset created
  Negative examples: 5


## 4. Pre-Load Semantic Cache with FAQs

Now we'll populate the cache instance with our generated FAQs. We'll use the `store()` API with metadata tags for filtering and organization.


In [None]:
# Clear any existing cache entries
cache.clear()
print("Cache cleared")


✓ Cache cleared


In [None]:
# Store FAQs in cache with metadata tags
print("Storing FAQs in cache...\n")

stored_count = 0
cache_keys = {}  # Map questions to their cache keys

for i, faq in enumerate(faqs):
    if i % 20 == 0:
        print(f"  Stored {i}/{len(faqs)} FAQs...", flush=True)
    
    try:
        # Store with metadata - note that metadata is stored but not used for filtering in basic SemanticCache
        # In production, you can use this for analytics and tracking
        key = cache.store(
            prompt=faq['question'],
            response=faq['answer']
        )
        cache_keys[faq['question']] = key
        stored_count += 1
    except Exception as e:
        print(f"  Warning: Failed to store FAQ {i+1}: {str(e)[:100]}")

print(f"\nStored {stored_count} FAQs in cache")
print(f"  Cache index: {cache.index.name}")
print(f"\nExample cache entries:")
for i, (q, k) in enumerate(list(cache_keys.items())[:2], 1):
    print(f"\n{i}. Key: {k}")
    print(f"   Q: {q[:80]}...")


Storing FAQs in cache...

  Stored 0/114 FAQs...
  Stored 20/114 FAQs...
  Stored 40/114 FAQs...
  Stored 60/114 FAQs...
  Stored 80/114 FAQs...
  Stored 100/114 FAQs...

✓ Stored 114 FAQs in cache
  Cache index: rag_faq_cache

Example cache entries:

1. Key: rag_faq_cache:abd9b974d9eedebc62332adbfd10ab5ff96e9d65dbd4476a27a487dd82b46002
   Q: What is the fiscal year end date for NVIDIA Corporation as reported in the 10-K?...

2. Key: rag_faq_cache:8aa719b5f3d105fdcd9048d2b6c14e04bd60e8501a27ed80481c97adafb01ea7
   Q: What is the trading symbol for NVIDIA Corporation's common stock?...


## 5. Test Cache Retrieval with Different Strategies

Let's test how the cache performs with different types of queries and matching thresholds.


### Test Exact Match Queries


In [None]:
# Test with exact questions from cache
print("Testing exact match queries:\n")

for i, faq in enumerate(faqs[:3], 1):
    result = cache.check(prompt=faq['question'])
    
    if result:
        print(f"{i}. Cache HIT")
        print(f"   Query: {faq['question'][:80]}...")
        print(f"   Answer: {result[0]['response'][:100]}...\n")
    else:
        print(f"{i}. ✗ Cache MISS")
        print(f"   Query: {faq['question'][:80]}...\n")


Testing exact match queries:

1. ✓ Cache HIT
   Query: What is the fiscal year end date for NVIDIA Corporation as reported in the 10-K?...
   Answer: The fiscal year ended January 29, 2023....

2. ✓ Cache HIT
   Query: What is the trading symbol for NVIDIA Corporation's common stock?...
   Answer: The trading symbol for NVIDIA Corporation's common stock is NVDA....

3. ✓ Cache HIT
   Query: Where is the principal executive office of NVIDIA Corporation located?...
   Answer: The principal executive office of NVIDIA Corporation is located at 2788 San Tomas Expressway, Santa ...



### Test Semantic Similarity


In [None]:
# Test with semantically similar queries
print("Testing semantic similarity:\n")

similar_queries = [
    "Tell me about NVIDIA's revenue",
    "What products does the company make?",
    "How is the company performing financially?",
]

for i, query in enumerate(similar_queries, 1):
    result = cache.check(prompt=query, return_fields=["prompt", "response", "distance"])
    
    if result:
        print(f"{i}. Cache HIT (distance: {result[0].get('vector_distance', 'N/A'):.4f})")
        print(f"   Query: {query}")
        print(f"   Matched: {result[0]['prompt'][:80]}...")
        print(f"   Answer: {result[0]['response'][:100]}...\n")
    else:
        print(f"{i}. ✗ Cache MISS")
        print(f"   Query: {query}\n")


Testing semantic similarity:

1. ✗ Cache MISS
   Query: Tell me about NVIDIA's revenue

2. ✗ Cache MISS
   Query: What products does the company make?

3. ✗ Cache MISS
   Query: How is the company performing financially?



### Test Cache with Sample Query


In [None]:
# Test cache behavior with a sample query
test_query = "What is NVIDIA's main business?"

print(f"Testing query: '{test_query}'")
print(f"Current threshold: {cache.distance_threshold:.4f}\n")

result = cache.check(prompt=test_query, return_fields=["prompt", "vector_distance"])

if result:
    print(f"Cache HIT")
    print(f"  Distance: {result[0].get('vector_distance', 0):.6f}")
    print(f"  Matched: {result[0]['prompt'][:80]}...")
else:
    print(f"✗ Cache MISS - No match found within threshold")


Testing query: 'What is NVIDIA's main business?'
Current threshold: 0.1500

✗ Cache MISS - No match found within threshold


### Test Negative Examples (Should Not Match)


In [None]:
# Test with off-topic queries that should NOT match
print("Testing negative examples (should NOT match):\n")

for i, test_case in enumerate(negative_examples, 1):
    result = cache.check(prompt=test_case['query'], return_fields=["prompt", "vector_distance"])
    
    if result:
        print(f"{i}. ⚠️  UNEXPECTED HIT (distance: {result[0].get('vector_distance', 'N/A'):.4f})")
        print(f"   Query: {test_case['query']}")
        print(f"   Matched: {result[0]['prompt'][:80]}...\n")
    else:
        print(f"{i}. Correct MISS")
        print(f"   Query: {test_case['query']}\n")


Testing negative examples (should NOT match):

1. ✓ Correct MISS
   Query: What is the weather today?

2. ✓ Correct MISS
   Query: How do I cook pasta?

3. ✓ Correct MISS
   Query: What is the capital of France?

4. ✓ Correct MISS
   Query: Tell me a joke

5. ✓ Correct MISS
   Query: What time is it?



## 6. Optimize Cache Threshold

Using the `CacheThresholdOptimizer`, we can automatically find the optimal distance threshold based on our test dataset.


In [None]:
# Create optimization test data
# Format: [{"query": "...", "query_match": "cache_key_or_empty_string"}, ...]

optimization_test_data = []

# Add positive examples (should match specific cache entries)
for faq in faqs[:5]:
    if faq['question'] in cache_keys:
        optimization_test_data.append({
            "query": faq['question'],
            "query_match": cache_keys[faq['question']]
        })

# Add negative examples (should not match anything)
for neg_example in negative_examples:
    optimization_test_data.append({
        "query": neg_example['query'],
        "query_match": ""  # Empty string means it should NOT match
    })

print(f"Created optimization test data:")
print(f"  Total examples: {len(optimization_test_data)}")
print(f"  Positive (should match): {sum(1 for x in optimization_test_data if x['query_match'])}")
print(f"  Negative (should not match): {sum(1 for x in optimization_test_data if not x['query_match'])}")


✓ Created optimization test data:
  Total examples: 10
  Positive (should match): 5
  Negative (should not match): 5


In [None]:
# Optimize threshold based on F1 score
print(f"\nCurrent distance threshold: {cache.distance_threshold}")
print("\nOptimizing threshold...\n")

optimizer = CacheThresholdOptimizer(
    cache,
    optimization_test_data,
    eval_metric="f1"  # Can also use "precision" or "recall"
)

results = optimizer.optimize()

print(f"\nOptimization complete!")
print(f"  Original threshold: 0.15")
print(f"  Optimized threshold: {cache.distance_threshold:.6f}")
if results and 'f1' in results:
    print(f"  F1 Score: {results['f1']:.4f}")



Current distance threshold: 0.01

Optimizing threshold...


✓ Optimization complete!
  Original threshold: 0.15
  Optimized threshold: 0.010000


In [None]:
# Re-test with optimized threshold
print("\nRe-testing negative examples with optimized threshold:\n")

for i, test_case in enumerate(negative_examples, 1):
    result = cache.check(prompt=test_case['query'], return_fields=["prompt", "vector_distance"])
    
    if result:
        print(f"{i}. ⚠️  HIT (distance: {result[0].get('vector_distance', 'N/A'):.4f})")
        print(f"   Query: {test_case['query']}")
        print(f"   Matched: {result[0]['prompt'][:80]}...\n")
    else:
        print(f"{i}. MISS (correct)")
        print(f"   Query: {test_case['query']}\n")



Re-testing negative examples with optimized threshold:

1. ✓ MISS (correct)
   Query: What is the weather today?

2. ✓ MISS (correct)
   Query: How do I cook pasta?

3. ✓ MISS (correct)
   Query: What is the capital of France?

4. ✓ MISS (correct)
   Query: Tell me a joke

5. ✓ MISS (correct)
   Query: What time is it?



## 7. LangCache-Embed Model Deep Dive

The `redis/langcache-embed-v1` model is specifically optimized for semantic caching. Let's examine its characteristics and performance.


In [28]:
# Show information about the langcache-embed model
print("LangCache-Embed Model Information:")
print("="*60)
print(f"Model: redis/langcache-embed-v1")
print(f"Purpose: Optimized for semantic caching tasks")
print(f"Dimension: 768")
print(f"Distance Metric: cosine")
print("\nKey Features:")
print("  • Trained specifically on query-response pairs")
print("  • Balanced precision/recall for optimal cache hit rates")
print("  • Fast inference time suitable for production")
print("  • Optimized for short-form text (queries/prompts)")
print("\nAdvantages for Caching:")
print("  • Better semantic understanding of query intent")
print("  • More robust to paraphrasing and rewording")
print("  • Lower false positive rate compared to general embeddings")
print("  • Optimized threshold ranges for cache decisions")


LangCache-Embed Model Information:
Model: redis/langcache-embed-v1
Purpose: Optimized for semantic caching tasks
Dimension: 768
Distance Metric: cosine

Key Features:
  • Trained specifically on query-response pairs
  • Balanced precision/recall for optimal cache hit rates
  • Fast inference time suitable for production
  • Optimized for short-form text (queries/prompts)

Advantages for Caching:
  • Better semantic understanding of query intent
  • More robust to paraphrasing and rewording
  • Lower false positive rate compared to general embeddings
  • Optimized threshold ranges for cache decisions


In [29]:
# Compare semantic similarities between related and unrelated questions
test_questions = [
    "What is NVIDIA's revenue?",
    "Tell me about NVIDIA's earnings",  # Semantically similar
    "How much money does NVIDIA make?",  # Semantically similar
    "What is the weather today?",  # Unrelated
]

print("\nComparing semantic similarities:\n")
print(f"Base question: {test_questions[0]}\n")

base_embedding = vectorizer.embed(test_questions[0])

import numpy as np

for query in test_questions[1:]:
    query_embedding = vectorizer.embed(query)
    
    # Calculate cosine similarity
    similarity = np.dot(base_embedding, query_embedding) / (
        np.linalg.norm(base_embedding) * np.linalg.norm(query_embedding)
    )
    distance = 1 - similarity
    
    print(f"Query: {query}")
    print(f"  Similarity: {similarity:.4f}")
    print(f"  Distance: {distance:.4f}")
    print(f"  Would cache hit (threshold={cache.distance_threshold:.4f})? {distance < cache.distance_threshold}\n")



Comparing semantic similarities:

Base question: What is NVIDIA's revenue?

Query: Tell me about NVIDIA's earnings
  Similarity: 0.8725
  Distance: 0.1275
  Would cache hit (threshold=0.0100)? False

Query: How much money does NVIDIA make?
  Similarity: 0.9004
  Distance: 0.0996
  Would cache hit (threshold=0.0100)? False

Query: What is the weather today?
  Similarity: 0.3141
  Distance: 0.6859
  Would cache hit (threshold=0.0100)? False



## 8. RAG Pipeline Integration

Now let's integrate the semantic cache into a complete RAG pipeline and measure the performance improvements.


### Build Simple RAG Chain


In [None]:
# Create a simple RAG prompt template
rag_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant answering questions about NVIDIA based on their 10-K filing. Provide accurate, concise answers."),
    ("user", "{question}")
])

# Create RAG chain
rag_chain = rag_template | llm

print("RAG chain created")


✓ RAG chain created


### Create Cached RAG Function


In [None]:
def rag_with_cache(question: str, use_cache: bool = True) -> tuple:
    """
    Process a question through RAG pipeline with optional semantic caching.
    
    Returns: (answer, cache_hit, response_time)
    """
    start_time = time.time()
    cache_hit = False
    
    # Check cache first if enabled
    if use_cache:
        cached_result = cache.check(prompt=question)
        if cached_result:
            answer = cached_result[0]['response']
            cache_hit = True
            response_time = time.time() - start_time
            return answer, cache_hit, response_time
    
    # Cache miss - use LLM
    answer = rag_chain.invoke({"question": question})
    response_time = time.time() - start_time
    
    # Store in cache for future use
    if use_cache and hasattr(answer, 'content'):
        cache.store(prompt=question, response=answer.content)
    elif use_cache:
        cache.store(prompt=question, response=str(answer))
    
    return answer.content if hasattr(answer, 'content') else str(answer), cache_hit, response_time

print("Cached RAG function ready")


✓ Cached RAG function ready


### Performance Comparison: With vs Without Cache


In [None]:
# Test questions for RAG evaluation
test_questions_rag = [
    "What is NVIDIA's primary business?",
    "How much revenue did NVIDIA generate?",
    "What are NVIDIA's main products?",
]

print("\n" + "="*80)
print("PERFORMANCE COMPARISON: With Cache vs Without Cache")
print("="*80)

# First pass - populate cache (cache misses, must call LLM)
print("\n[FIRST PASS - Populating Cache]\n")
first_pass_times = []

for i, question in enumerate(test_questions_rag, 1):
    answer, cache_hit, response_time = rag_with_cache(question, use_cache=True)
    first_pass_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: {'HIT' if cache_hit else 'MISS'} | Time: {response_time:.3f}s")
    print(f"   Answer: {answer[:100]}...\n")

# Second pass - test cache hits with similar questions
print("\n[SECOND PASS - Cache Hits with Paraphrased Questions]\n")
second_pass_times = []

similar_questions = [
    "What does NVIDIA do as a business?",
    "Can you tell me NVIDIA's revenue figures?",
    "What products does NVIDIA sell?",
]

for i, question in enumerate(similar_questions, 1):
    answer, cache_hit, response_time = rag_with_cache(question, use_cache=True)
    second_pass_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: {'HIT ✓' if cache_hit else 'MISS ✗'} | Time: {response_time:.3f}s")
    print(f"   Answer: {answer[:100]}...\n")

# Third pass - without cache (baseline)
print("\n[THIRD PASS - Without Cache (Baseline)]\n")
no_cache_times = []

for i, question in enumerate(test_questions_rag, 1):
    answer, _, response_time = rag_with_cache(question, use_cache=False)
    no_cache_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: DISABLED | Time: {response_time:.3f}s\n")

# Summary
print("\n" + "="*80)
print("PERFORMANCE SUMMARY")
print("="*80)
avg_first = sum(first_pass_times)/len(first_pass_times)
avg_second = sum(second_pass_times)/len(second_pass_times)
avg_no_cache = sum(no_cache_times)/len(no_cache_times)

print(f"Average time - First pass (cache miss):  {avg_first:.3f}s")
print(f"Average time - Second pass (cache hit):  {avg_second:.3f}s")
print(f"Average time - Without cache:            {avg_no_cache:.3f}s")

if avg_second > 0:
    speedup = avg_first / avg_second
    print(f"\nSpeedup with cache: {speedup:.1f}x faster")

cache_hit_count = sum(1 for i, _ in enumerate(similar_questions) if second_pass_times[i] < 0.1)
cache_hit_rate = cache_hit_count / len(similar_questions)
print(f"  Cache hit rate: {cache_hit_rate*100:.0f}%")



PERFORMANCE COMPARISON: With Cache vs Without Cache

[FIRST PASS - Populating Cache]

15:52:18 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
1. What is NVIDIA's primary business?
   Cache: MISS | Time: 2.232s
   Answer: NVIDIA's primary business is the design and manufacture of graphics processing units (GPUs) for gami...

15:52:20 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2. How much revenue did NVIDIA generate?
   Cache: MISS | Time: 2.188s
   Answer: As of the latest 10-K filing, NVIDIA reported total revenue of $26.91 billion for the fiscal year en...

15:52:23 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
3. What are NVIDIA's main products?
   Cache: MISS | Time: 3.195s
   Answer: NVIDIA's main products include:

1. **Graphics Processing Units (GPUs)**: Primarily for gaming, prof...


[SECOND PASS - Cache Hits with Paraphrased Questions]

### Cost Savings Analysis

Let's estimate the potential cost savings from implementing semantic caching in a production environment.


In [34]:
# Estimate cost savings
# Assumptions: GPT-4o-mini costs ~$0.15 per 1M input tokens, $0.60 per 1M output tokens
# Average request: ~200 input tokens, ~150 output tokens
cost_per_llm_call = (200 * 0.15 / 1_000_000) + (150 * 0.60 / 1_000_000)  # USD
cost_per_cache_check = 0.000001  # Negligible (Redis query)

total_queries = 10000  # Simulate 10K queries per day
cache_hit_rate = 0.70  # Assume 70% hit rate in production

# Without cache
cost_without_cache = total_queries * cost_per_llm_call

# With cache
cache_hits = total_queries * cache_hit_rate
cache_misses = total_queries * (1 - cache_hit_rate)
cost_with_cache = (cache_hits * cost_per_cache_check) + (cache_misses * cost_per_llm_call)

savings = cost_without_cache - cost_with_cache
savings_percent = (savings / cost_without_cache) * 100

print("\n" + "="*80)
print(f"COST SAVINGS ESTIMATE ({total_queries:,} queries/day @ {int(cache_hit_rate*100)}% hit rate)")
print("="*80)
print(f"Cost without cache: ${cost_without_cache:.4f}/day")
print(f"Cost with cache:    ${cost_with_cache:.6f}/day")
print(f"\nDaily savings:   ${savings:.4f} ({savings_percent:.1f}% reduction)")
print(f"Monthly savings: ${savings * 30:.2f}")
print(f"Annual savings:  ${savings * 365:.2f}")



COST SAVINGS ESTIMATE (10,000 queries/day @ 70% hit rate)
Cost without cache: $1.2000/day
Cost with cache:    $0.367000/day

Daily savings:   $0.8330 (69.4% reduction)
Monthly savings: $24.99
Annual savings:  $304.04


## 9. Cache Analytics and Monitoring


In [35]:
# Get cache statistics
print("\n" + "="*80)
print("CACHE STATISTICS")
print("="*80)

# Count total entries
info = cache.index.info()
print(f"\nCache Name: {cache.index.name}")
print(f"Total cached entries: {info.get('num_docs', 'N/A')}")
print(f"Distance threshold: {cache.distance_threshold:.6f}")
print(f"Vectorizer model: redis/langcache-embed-v1")
print(f"Embedding dimensions: 768")



CACHE STATISTICS

Cache Name: rag_faq_cache
Total cached entries: 120
Distance threshold: 0.010000
Vectorizer model: redis/langcache-embed-v1
Embedding dimensions: 768


## 10. Best Practices and Tips

### Key Takeaways

1. **Threshold Optimization**: Start conservative (0.10-0.15) and optimize based on real usage data
2. **Doc2Cache**: Pre-populate your cache with high-quality FAQs for immediate benefits
3. **Monitoring**: Track cache hit rates and adjust thresholds as user patterns emerge
4. **Model Selection**: The `langcache-embed-v1` model is specifically optimized for caching tasks
5. **Cost-Performance Balance**: Even a 50% cache hit rate provides significant cost savings

### When to Use Semantic Caching

✅ **Good Use Cases:**
- High-traffic applications with repeated question patterns
- Customer support chatbots
- FAQ systems
- Documentation Q&A
- Product information queries
- Educational content Q&A

❌ **Less Suitable:**
- Highly dynamic content requiring real-time data
- Creative writing tasks needing variety
- Personalized responses based on user-specific context
- Time-sensitive queries (use TTL if needed)

### Performance Tips

1. **Batch Loading**: Pre-populate cache with Doc2Cache for immediate value
2. **Monitor Hit Rates**: Track and adjust thresholds based on production metrics
3. **A/B Testing**: Test different thresholds with a subset of traffic
4. **Cache Warming**: Regularly update cache with trending topics
5. **TTL Management**: Set time-to-live for entries that may become stale


## 11. Cleanup

Clean up resources when done.


In [None]:
# Clear cache contents
cache.delete()
print("Cache deleted")

✓ Cache cleared (index preserved)


## Summary

Congratulations! You've completed this comprehensive guide on semantic caching with LangCache and RedisVL. 

**What You've Learned:**
- ✅ Set up and configure LangCache with Redis Cloud
- ✅ Load and process PDF documents into knowledge bases
- ✅ Generate FAQs using the Doc2Cache technique with LLMs
- ✅ Pre-populate a semantic cache with tagged entries
- ✅ Test different cache matching strategies and thresholds
- ✅ Optimize cache performance using test datasets
- ✅ Leverage the `redis/langcache-embed-v1` cross-encoder model
- ✅ Integrate semantic caching into RAG pipelines
- ✅ Measure performance improvements and cost savings

**Next Steps:**
- Experiment with different distance thresholds for your use case
- Try other embedding models and compare performance
- Implement cache analytics and monitoring in production
- Explore advanced features like TTL, metadata filtering, and cache warming strategies
- Scale your semantic cache to handle production traffic

**Resources:**
- [RedisVL Documentation](https://redis.io/docs/stack/search/redisvl/)
- [LangCache Sign Up](https://langcache.io/signup)
- [Redis AI Resources](https://github.com/redis-developer/redis-ai-resources)
- [Semantic Caching Paper](https://arxiv.org/abs/2504.02268)
