![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# LangCache: Semantic Caching with Redis Cloud

This notebook demonstrates end-to-end semantic caching using **LangCache** - a managed Redis Cloud service accessed through the RedisVL library. LangCache provides enterprise-grade semantic caching with zero infrastructure management, making it ideal for production LLM applications.

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/semantic-cache/04_langcache_semantic_caching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Introduction

**LangCache** is a fully managed semantic cache service built on Redis Cloud. It was integrated into RedisVL in version 0.11.0 as an `LLMCache` interface implementation, making it easy for RedisVL users to:

- Transition to a fully managed caching service
- Reduce LLM API costs by caching similar queries
- Improve application response times
- Access enterprise features without managing infrastructure

### What You'll Learn

In this tutorial, you will:
1. Set up LangCache with Redis Cloud
2. Load and process a knowledge base (PDF documents)
3. Generate FAQs using the Doc2Cache technique
4. Pre-populate a semantic cache with the generated FAQs
5. Test different cache matching strategies and thresholds
6. Optimize cache performance using evaluation datasets
7. Use the `langcache-embed` embedding model
8. Integrate the cache into a RAG pipeline
9. Measure performance improvements


## 1. Environment Setup

First, we'll install the required packages and set up our environment.


### Install Required Packages


In [1]:
%pip install -q "redisvl>=0.11.0" "openai>=1.0.0" "langchain>=0.3.0" "langchain-community" "langchain-openai" "langcache"
%pip install -q "pypdf" "sentence-transformers" "redis-retrieval-optimizer>=0.2.0"

Note: you may need to restart the kernel to use updated packages.


### Import Dependencies


In [2]:
import os
import time
import json
from typing import List, Dict, Any

# RedisVL imports
from redisvl.extensions.cache.llm import LangCacheSemanticCache

# Optimization
from redis_retrieval_optimizer.threshold_optimization import CacheThresholdOptimizer

16:36:26 numexpr.utils INFO   NumExpr defaulting to 10 threads.


## 2. LangCache Setup

### Sign Up for LangCache

If you haven't already, sign up for a free Redis Cloud account:

**[Log in or sign up for Redis Cloud →](https://cloud.redis.io/#/)**

After signing up:
1. Create a new database
2. Create a new LangCache service (Select 'LangCache' on the left menu bar)
3. Copy your **API Key**
4. Copy your **Cache ID**
5. Copy your **URL**


### Configure Environment Variables


In [3]:
import getpass

# Prompt for the key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

### Initialize Semantic Cache with LangCache-Embed Model

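We'll create a cache instance that connects to the LangCache service. The embedding model is configured on the service rather than in this code; this tutorial assumes it is `redis/langcache-embed-v1`, which is specifically optimized for semantic caching tasks.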


In [4]:
langcache_api_key = os.environ.get('LANGCACHE_API_KEY') # found on your cloud console
langcache_id = os.environ.get('LANGCACHE_ID') # found on your cloud console
server_url = "https://aws-us-east-1.langcache.redis.io" # found on your cloud console

# Create Semantic Cache instance
cache = LangCacheSemanticCache(
    server_url=server_url,
    cache_id=langcache_id,
    api_key=langcache_api_key,
)

In [5]:
# Check that your cache is working
r = cache.check('hello world')
print(r) # should be empty on first run
cache.store('hello world', 'hello world from langcache')
r = cache.check('hi world')
print(r)

16:36:28 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
[{'entry_id': '5eb63bbbe01eeed093cb22bb8f5acdc3', 'prompt': 'hello world', 'response': 'hello world from langcache', 'vector_distance': 0.0, 'inserted_at': 0.0, 'updated_at': 0.0}]
16:36:28 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:36:28 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
[{'entry_id': '5eb63bbbe01eeed093cb22bb8f5acdc3', 'prompt': 'hello world', 'response': 'hello world from langcache', 'vector_distance': 0.07242219999999999, 'inserted_at': 0.0, 'updated_at': 0.0}]
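
The `vector_distance` field in the results above measures how far the query's embedding is from the stored prompt's embedding; smaller values mean a closer match. As a rough illustration only, the sketch below computes cosine distance between toy vectors. That LangCache reports cosine distance is an assumption here, and both the vectors and the `cosine_distance` helper are made up for the example.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
stored = [0.9, 0.1, 0.0]
paraphrase = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

d_same = cosine_distance(stored, stored)
d_close = cosine_distance(stored, paraphrase)
d_far = cosine_distance(stored, unrelated)
print(f"identical: {d_same:.4f}, paraphrase: {d_close:.4f}, unrelated: {d_far:.4f}")
```

A cache hit corresponds to a distance at or below the configured threshold, which is why the reworded query `'hi world'` above still returns the stored `'hello world'` entry.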


## 3. Load and Prepare Datasets

We'll work with three types of data:
1. **Knowledge Base**: PDF document(s) that contain factual information
2. **FAQs**: Derived from the knowledge base using Doc2Cache technique
3. **Test Dataset**: For evaluating and optimizing cache performance


In [6]:
# LangChain imports
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field



### Initialize OpenAI LLM


In [7]:
# Initialize OpenAI LLM for FAQ generation and RAG
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.3,
    max_tokens=2000
)

### Load PDF Knowledge Base


In [8]:
# Download sample PDF if not already present
!mkdir -p data
!wget -q -O data/nvidia-10k.pdf https://raw.githubusercontent.com/redis-developer/redis-ai-resources/main/python-recipes/RAG/resources/nvd-10k-2023.pdf

In [9]:
# Load and chunk the PDF
pdf_path = "data/nvidia-10k.pdf"

# Configure text splitter for optimal chunk sizes
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Load and split the document
loader = PyPDFLoader(pdf_path)
documents = loader.load()
chunks = text_splitter.split_documents(documents)

print(f"Loaded PDF: {pdf_path}")
print(f"  Total pages: {len(documents)}")
print(f"  Created chunks: {len(chunks)}")
print(f"\nSample chunk preview:")
print(f"{chunks[10].page_content[:300]}...")


Loaded PDF: data/nvidia-10k.pdf
  Total pages: 169
  Created chunks: 388

Sample chunk preview:
Table of Contents
The world’s leading cloud service providers, or CSPs, and consumer internet companies use our GPUs and broader data center-scale
accelerated computing platforms to enable, accelerate or enrich the services they deliver to billions of end-users, including search,
recommendations, so...
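
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on the listed separators); the `sliding_chunks` helper is hypothetical and only shows how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows whose neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for c in demo_chunks:
    print(c)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.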


### Generate FAQs Using Doc2Cache Technique

The Doc2Cache approach uses an LLM to generate frequently asked questions from document chunks. These FAQs are then used to pre-populate the semantic cache with high-quality, factual responses.


In [10]:
# Define the FAQ data model
class QuestionAnswer(BaseModel):
    question: str = Field(description="A frequently asked question derived from the document content")
    answer: str = Field(description="A factual answer to the question based on the document")
    category: str = Field(description="Category of the question (e.g., 'financial', 'products', 'operations')")

class FAQList(BaseModel):
    faqs: List[QuestionAnswer] = Field(description="List of question-answer pairs extracted from the document")

# Set up JSON output parser
json_parser = JsonOutputParser(pydantic_object=FAQList)


In [11]:
# Create the FAQ generation prompt
faq_prompt = PromptTemplate(
    template="""You are a document analysis expert. Extract 3-5 high-quality FAQs from the following document chunk.

Guidelines:
- Generate diverse, specific questions that users would realistically ask
- Provide accurate, complete answers based ONLY on the document content
- Assign each FAQ to a category: 'financial', 'products', 'operations', 'technology', or 'general'
- Avoid vague or overly generic questions
- If the chunk lacks substantial content, return fewer FAQs

{format_instructions}

Document Chunk:
{doc_content}

FAQs JSON:""",
    input_variables=["doc_content"],
    partial_variables={"format_instructions": json_parser.get_format_instructions()}
)

# Create the FAQ generation chain
faq_chain = faq_prompt | llm | json_parser

print("FAQ generation chain configured")


FAQ generation chain configured


In [12]:
# Test FAQ generation on a single chunk
print("Testing FAQ generation on sample chunk...\n")
test_faqs = faq_chain.invoke({"doc_content": chunks[10].page_content})

print(f"Generated {len(test_faqs.get('faqs', []))} FAQs:")
for i, faq in enumerate(test_faqs.get('faqs', [])[:3], 1):
    print(f"\n{i}. Q: {faq['question']}")
    print(f"   Category: {faq['category']}")
    print(f"   A: {faq['answer'][:150]}...")


Testing FAQ generation on sample chunk...

16:36:51 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Generated 5 FAQs:

1. Q: What industries are leveraging NVIDIA's GPUs for automation?
   Category: operations
   A: A rapidly growing number of enterprises and startups across a broad range of industries, including transportation for autonomous driving, healthcare f...

2. Q: What was the reason for the termination of the Arm Share Purchase Agreement?
   Category: financial
   A: The Share Purchase Agreement between NVIDIA and SoftBank was terminated due to significant regulatory challenges that prevented the completion of the ...

3. Q: What types of products do professional designers create using NVIDIA's technology?
   Category: products
   A: Professional designers use NVIDIA's GPUs and software to create visual effects in movies and to design a variety of products, including cell phones an...


In [13]:
# Generate FAQs from all chunks (limited to first 25 for demo purposes)
def extract_faqs_from_chunks(chunks: List[Any], max_chunks: int = 25) -> List[Dict]:
    """Extract FAQs from document chunks using LLM."""
    all_faqs = []
    
    for i, chunk in enumerate(chunks[:max_chunks]):
        if i % 5 == 0:
            print(f"Processing chunk {i+1}/{min(len(chunks), max_chunks)}...", flush=True)
        
        try:
            result = faq_chain.invoke({"doc_content": chunk.page_content})
            if result and result.get("faqs"):
                all_faqs.extend(result["faqs"])
        except Exception as e:
            print(f"  Warning: Skipped chunk {i+1} due to error: {str(e)[:100]}")
            continue
    
    return all_faqs

# Extract FAQs
print("\nGenerating FAQs from document chunks...\n")
faqs = extract_faqs_from_chunks(chunks, max_chunks=25)

print(f"\nGenerated {len(faqs)} FAQs total")
print(f"\nCategory distribution:")
categories = {}
for faq in faqs:
    cat = faq.get('category', 'unknown')
    categories[cat] = categories.get(cat, 0) + 1
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):
    print(f"  {cat}: {count}")



Generating FAQs from document chunks...

Processing chunk 1/25...
16:36:57 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:04 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:09 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:17 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:22 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing chunk 6/25...
16:37:28 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:30 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:36 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16:37:41 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.

### Create Test/Evaluation Dataset

We'll create a test dataset with:
- **Positive examples**: Questions that should match cached FAQs
- **Negative examples**: Questions that should NOT match cached FAQs
- **Edge cases**: Slightly different phrasings to test threshold sensitivity


In [14]:
# Select representative FAQs for test set
sample_faqs = faqs[:10]  # Take first 10 FAQs

print("Sample FAQs for testing:")
for i, faq in enumerate(sample_faqs[:3], 1):
    print(f"\n{i}. {faq['question'][:100]}...")


Sample FAQs for testing:

1. What is the fiscal year end date for NVIDIA Corporation as reported in the Form 10-K?...

2. What is the trading symbol for NVIDIA Corporation's common stock?...

3. Where is NVIDIA Corporation's principal executive office located?...


In [15]:
# Create test dataset with negative examples (off-topic questions)
negative_examples = [
    {"query": "What is the weather today?", "expected_match": False, "category": "off-topic"},
    {"query": "How do I cook pasta?", "expected_match": False, "category": "off-topic"},
    {"query": "What is the capital of France?", "expected_match": False, "category": "off-topic"},
    {"query": "Tell me a joke", "expected_match": False, "category": "off-topic"},
    {"query": "What time is it?", "expected_match": False, "category": "off-topic"},
]

print(f"Test dataset created")
print(f"  Negative examples: {len(negative_examples)}")


Test dataset created
  Negative examples: 5
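
The cells above build only the negative examples. A fuller evaluation set would also include positives and paraphrased edge cases, as listed at the start of this section. The sketch below shows one possible shape for such a dataset; the `build_eval_set` helper and the sample paraphrase are illustrative assumptions, not part of RedisVL or LangCache.

```python
from typing import Dict, List

def build_eval_set(
    faqs: List[Dict],
    paraphrases: Dict[str, List[str]],
    negatives: List[Dict],
) -> List[Dict]:
    """Combine exact, paraphrased, and off-topic queries into one labelled list."""
    rows = []
    for faq in faqs:
        question = faq["question"]
        # Positive: the cached question itself
        rows.append({"query": question, "expected_match": True, "category": "exact"})
        # Edge cases: hand-written rewordings that should still match
        for alt in paraphrases.get(question, []):
            rows.append({"query": alt, "expected_match": True, "category": "paraphrase"})
    rows.extend(negatives)  # Negatives: off-topic queries that must not match
    return rows

demo_faqs = [{"question": "What is the trading symbol for NVIDIA Corporation's common stock?"}]
demo_paraphrases = {demo_faqs[0]["question"]: ["Which ticker does NVIDIA trade under?"]}
demo_negatives = [{"query": "How do I cook pasta?", "expected_match": False, "category": "off-topic"}]

eval_set = build_eval_set(demo_faqs, demo_paraphrases, demo_negatives)
print(len(eval_set), "examples")
```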


## 4. Pre-Load Semantic Cache with FAQs

Now we'll populate the cache instance with our generated FAQs. We'll use the `store()` API, passing each question as the prompt and its answer as the response.


In [16]:
# Confirm the cache is reachable and re-store the smoke-test entry (this does not clear the cache)
r = cache.check('hello world')
cache.store('hello world', 'hello world from langcache')


16:39:29 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
16:39:29 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"


'5eb63bbbe01eeed093cb22bb8f5acdc3'

In [17]:
# Store FAQs in cache with metadata tags
print("Storing FAQs in cache...\n")

stored_count = 0
cache_keys = {}  # Map questions to their cache keys

for i, faq in enumerate(faqs):
    if i % 20 == 0:
        print(f"  Stored {i}/{len(faqs)} FAQs...", flush=True)
    
    try:
        # Store each question/answer pair; the FAQ category is not sent to the cache here.
        # The returned entry ID is kept so the optimizer test data can reference it later.
        key = cache.store(
            prompt=faq['question'],
            response=faq['answer']
        )
        cache_keys[faq['question']] = key
        stored_count += 1
    except Exception as e:
        print(f"  Warning: Failed to store FAQ {i+1}: {str(e)[:100]}")

print(f"\nStored {stored_count} FAQs in cache")
print(f"\nExample cache entries:")
for i, (q, k) in enumerate(list(cache_keys.items())[:2], 1):
    print(f"\n{i}. Key: {k}")
    print(f"   Q: {q[:80]}...")


Storing FAQs in cache...

  Stored 0/112 FAQs...
16:39:29 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:39:29 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:39:30 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:39:30 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:39:30 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:39:30 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 201 Created"
16:39:30 httpx 

## 5. Test Cache Retrieval with Different Strategies

Let's test how the cache performs with different types of queries and matching thresholds.


### Test Exact Match Queries


In [18]:
# Test with exact questions from cache
print("Testing exact match queries:\n")

for i, faq in enumerate(faqs[:3], 1):
    result = cache.check(prompt=faq['question'])
    
    if result:
        print(f"{i}. Cache HIT")
        print(f"   Query: {faq['question'][:80]}...")
        print(f"   Answer: {result[0]['response'][:100]}...\n")
    else:
        print(f"{i}. ✗ Cache MISS")
        print(f"   Query: {faq['question'][:80]}...\n")


Testing exact match queries:

16:39:42 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
1. Cache HIT
   Query: What is the fiscal year end date for NVIDIA Corporation as reported in the Form ...
   Answer: The fiscal year ended January 29, 2023....

16:39:42 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
2. Cache HIT
   Query: What is the trading symbol for NVIDIA Corporation's common stock?...
   Answer: The trading symbol for NVIDIA Corporation's common stock is NVDA....

16:39:42 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
3. Cache HIT
   Query: Where is NVIDIA Corporation's principal executive office located?...
   Answer: NVIDIA Corporation's principal executive office is located

### Test Semantic Similarity


In [19]:
# Test with semantically similar queries
print("Testing semantic similarity:\n")

similar_queries = [
    "Tell me about NVIDIA's revenue",
    "What products does the company make?",
    "How is the company performing financially?",
]

for i, query in enumerate(similar_queries, 1):
    result = cache.check(prompt=query, return_fields=["prompt", "response", "vector_distance"])
    
    if result:
        # Default to NaN so the :.4f format cannot fail if the field is missing
        print(f"{i}. Cache HIT (distance: {result[0].get('vector_distance', float('nan')):.4f})")
        print(f"   Query: {query}")
        print(f"   Matched: {result[0]['prompt'][:80]}...")
        print(f"   Answer: {result[0]['response'][:100]}...\n")
    else:
        print(f"{i}. ✗ Cache MISS")
        print(f"   Query: {query}\n")


Testing semantic similarity:

16:39:42 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
1. Cache HIT (distance: 0.0629)
   Query: Tell me about NVIDIA's revenue
   Matched: How much revenue did NVIDIA generate?...
   Answer: As of the latest available data in NVIDIA's 10-K filing for the fiscal year ended January 29, 2023, ...

16:39:42 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
2. ✗ Cache MISS
   Query: What products does the company make?

16:39:42 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
3. ✗ Cache MISS
   Query: How is the company performing financially?



### Test Cache with Sample Query


In [20]:
# Test cache behavior with a sample query
test_query = "What is NVIDIA's main business?"

print(f"Testing query: '{test_query}'")

result = cache.check(prompt=test_query, return_fields=["prompt", "vector_distance"])

if result:
    print(f"Cache HIT")
    print(f"  Distance: {result[0].get('vector_distance', 0):.6f}")
    print(f"  Matched: {result[0]['prompt'][:80]}...")
else:
    print(f"✗ Cache MISS - No match found within threshold")


Testing query: 'What is NVIDIA's main business?'
16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
Cache HIT
  Distance: 0.070051
  Matched: What are the main business segments reported by NVIDIA?...


### Test Negative Examples (Should Not Match)


In [21]:
# Test with off-topic queries that should NOT match
print("Testing negative examples (should NOT match):\n")

for i, test_case in enumerate(negative_examples, 1):
    result = cache.check(prompt=test_case['query'], return_fields=["prompt", "vector_distance"])
    
    if result:
        print(f"{i}. ⚠️  UNEXPECTED HIT (distance: {result[0].get('vector_distance', float('nan')):.4f})")
        print(f"   Query: {test_case['query']}")
        print(f"   Matched: {result[0]['prompt'][:80]}...\n")
    else:
        print(f"{i}. Correct MISS")
        print(f"   Query: {test_case['query']}\n")


Testing negative examples (should NOT match):

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
1. Correct MISS
   Query: What is the weather today?

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
2. Correct MISS
   Query: How do I cook pasta?

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
3. Correct MISS
   Query: What is the capital of France?

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
4. Correct MISS
   Query: Tell me a joke

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1

## 6. Optimize Cache Threshold

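Using the `CacheThresholdOptimizer`, we can search for the distance threshold that best separates the positive and negative examples in our test dataset. The cell below builds the test data in the format the optimizer expects.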


In [22]:
# Create optimization test data
# Format: [{"query": "...", "query_match": "cache_key_or_empty_string"}, ...]

optimization_test_data = []

# Add positive examples (should match specific cache entries)
for faq in faqs[:5]:
    if faq['question'] in cache_keys:
        optimization_test_data.append({
            "query": faq['question'],
            "query_match": cache_keys[faq['question']]
        })

# Add negative examples (should not match anything)
for neg_example in negative_examples:
    optimization_test_data.append({
        "query": neg_example['query'],
        "query_match": ""  # Empty string means it should NOT match
    })

print(f"Created optimization test data:")
print(f"  Total examples: {len(optimization_test_data)}")
print(f"  Positive (should match): {sum(1 for x in optimization_test_data if x['query_match'])}")
print(f"  Negative (should not match): {sum(1 for x in optimization_test_data if not x['query_match'])}")


Created optimization test data:
  Total examples: 10
  Positive (should match): 5
  Negative (should not match): 5
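
Conceptually, choosing a threshold means picking the distance cutoff that best separates queries that should hit from those that should not. The sketch below sweeps candidate cutoffs over hand-labelled distances and keeps the one with the highest F1 score. It is a simplified stand-in, not the `CacheThresholdOptimizer` implementation; `best_threshold` and the distance values are hypothetical.

```python
from typing import List, Tuple

def best_threshold(
    labelled: List[Tuple[float, bool]],
    candidates: List[float],
) -> Tuple[float, float]:
    """Return (threshold, f1) maximising F1, where a hit means distance <= threshold."""
    best = (candidates[0], -1.0)
    for t in candidates:
        tp = sum(1 for d, should_hit in labelled if d <= t and should_hit)
        fp = sum(1 for d, should_hit in labelled if d <= t and not should_hit)
        fn = sum(1 for d, should_hit in labelled if d > t and should_hit)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best[1]:  # strict ">" keeps the smallest threshold on ties
            best = (t, f1)
    return best

# (distance to nearest cached entry, should this query be a hit?) - made-up values
labelled = [(0.00, True), (0.06, True), (0.07, True), (0.12, True),
            (0.18, False), (0.35, False), (0.60, False)]
candidates = [0.05, 0.10, 0.15, 0.20, 0.25]

threshold, f1 = best_threshold(labelled, candidates)
print(f"best threshold: {threshold:.2f} (F1 = {f1:.2f})")
```

With only a handful of examples, several cutoffs can score equally well, so a larger and more varied test set gives a more trustworthy threshold.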


In [23]:
# Re-test with optimized threshold
print("\nRe-testing negative examples with optimized threshold:\n")

for i, test_case in enumerate(negative_examples, 1):
    result = cache.check(prompt=test_case['query'], return_fields=["prompt", "vector_distance"])

    if result:
        print(f"{i}. ⚠️  HIT (distance: {result[0].get('vector_distance', float('nan')):.4f})")
        print(f"   Query: {test_case['query']}")
        print(f"   Matched: {result[0]['prompt'][:80]}...\n")
    else:
        print(f"{i}. MISS (correct)")
        print(f"   Query: {test_case['query']}\n")



Re-testing negative examples with optimized threshold:

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
1. MISS (correct)
   Query: What is the weather today?

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
2. MISS (correct)
   Query: How do I cook pasta?

16:39:43 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
3. MISS (correct)
   Query: What is the capital of France?

16:39:44 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
4. MISS (correct)
   Query: Tell me a joke

16:39:44 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9be

## 7. RAG Pipeline Integration

Now let's integrate the semantic cache into a complete RAG pipeline and measure the performance improvements.


### Build Simple RAG Chain


In [24]:
# Create a simple RAG prompt template
rag_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant answering questions about NVIDIA based on their 10-K filing. Provide accurate, concise answers."),
    ("user", "{question}")
])

# Create RAG chain
rag_chain = rag_template | llm

print("RAG chain created")


RAG chain created


### Create Cached RAG Function


In [25]:
def rag_with_cache(question: str, use_cache: bool = True) -> tuple:
    """
    Process a question through RAG pipeline with optional semantic caching.
    
    Returns: (answer, cache_hit, response_time)
    """
    start_time = time.time()
    cache_hit = False
    
    # Check cache first if enabled
    if use_cache:
        cached_result = cache.check(prompt=question)
        if cached_result:
            answer = cached_result[0]['response']
            cache_hit = True
            response_time = time.time() - start_time
            return answer, cache_hit, response_time
    
    # Cache miss - use LLM
    result = rag_chain.invoke({"question": question})
    # Chat models return a message object; fall back to str() for plain outputs
    answer = result.content if hasattr(result, 'content') else str(result)
    response_time = time.time() - start_time
    
    # Store in cache for future use
    if use_cache:
        cache.store(prompt=question, response=answer)
    
    return answer, cache_hit, response_time

print("Cached RAG function ready")


Cached RAG function ready
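
The cache-aside flow above can be exercised without network calls by swapping in stand-ins. In this sketch, `InMemoryCache` is a hypothetical exact-match substitute that mimics only the `check`/`store` shape used in this notebook (a list of dicts with a `response` key), and `fake_llm` replaces the real chain. Neither reflects how LangCache matches prompts semantically.

```python
from typing import Callable, Dict, List, Tuple

class InMemoryCache:
    """Hypothetical exact-match stand-in with the check/store shape used above."""
    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def check(self, prompt: str) -> List[Dict[str, str]]:
        if prompt in self._entries:
            return [{"prompt": prompt, "response": self._entries[prompt]}]
        return []

    def store(self, prompt: str, response: str) -> None:
        self._entries[prompt] = response

def answer_with_cache(question: str, cache: InMemoryCache,
                      llm: Callable[[str], str]) -> Tuple[str, bool]:
    """Cache-aside: return a cached answer if present, else call llm and store the result."""
    hits = cache.check(prompt=question)
    if hits:
        return hits[0]["response"], True
    answer = llm(question)
    cache.store(prompt=question, response=answer)
    return answer, False

llm_calls = []

def fake_llm(question: str) -> str:
    llm_calls.append(question)  # record each simulated LLM call
    return f"stub answer to: {question}"

demo_cache = InMemoryCache()
first = answer_with_cache("What is NVIDIA's primary business?", demo_cache, fake_llm)
second = answer_with_cache("What is NVIDIA's primary business?", demo_cache, fake_llm)
print(first[1], second[1], len(llm_calls))
```

Returning the hit flag alongside the answer lets callers compute hit rates directly instead of inferring them from timing.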


### Performance Comparison: With vs Without Cache


In [26]:
# Test questions for RAG evaluation
test_questions_rag = [
    "What is NVIDIA's primary business?",
    "How much revenue did NVIDIA generate?",
    "What are NVIDIA's main products?",
]

print("\n" + "="*80)
print("PERFORMANCE COMPARISON: With Cache vs Without Cache")
print("="*80)

# First pass - populate cache (cache misses, must call LLM)
print("\n[FIRST PASS - Populating Cache]\n")
first_pass_times = []

for i, question in enumerate(test_questions_rag, 1):
    answer, cache_hit, response_time = rag_with_cache(question, use_cache=True)
    first_pass_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: {'HIT' if cache_hit else 'MISS'} | Time: {response_time:.3f}s")
    print(f"   Answer: {answer[:100]}...\n")

# Second pass - test cache hits with similar questions
print("\n[SECOND PASS - Cache Hits with Paraphrased Questions]\n")
second_pass_times = []

similar_questions = [
    "What does NVIDIA do as a business?",
    "Can you tell me NVIDIA's revenue figures?",
    "What products does NVIDIA sell?",
]

second_pass_hits = []  # Record the actual hit/miss flag for each question

for i, question in enumerate(similar_questions, 1):
    answer, cache_hit, response_time = rag_with_cache(question, use_cache=True)
    second_pass_times.append(response_time)
    second_pass_hits.append(cache_hit)
    print(f"{i}. {question}")
    print(f"   Cache: {'HIT ✓' if cache_hit else 'MISS ✗'} | Time: {response_time:.3f}s")
    print(f"   Answer: {answer[:100]}...\n")

# Third pass - without cache (baseline)
print("\n[THIRD PASS - Without Cache (Baseline)]\n")
no_cache_times = []

for i, question in enumerate(test_questions_rag, 1):
    answer, _, response_time = rag_with_cache(question, use_cache=False)
    no_cache_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: DISABLED | Time: {response_time:.3f}s\n")

# Summary
print("\n" + "="*80)
print("PERFORMANCE SUMMARY")
print("="*80)
avg_first = sum(first_pass_times)/len(first_pass_times)
avg_second = sum(second_pass_times)/len(second_pass_times)
avg_no_cache = sum(no_cache_times)/len(no_cache_times)

# The first pass can also hit the cache if the pre-loaded FAQs already cover a question
print(f"Average time - First pass:     {avg_first:.3f}s")
print(f"Average time - Second pass:    {avg_second:.3f}s")
print(f"Average time - Without cache:  {avg_no_cache:.3f}s")

# Compare the cached second pass against the no-cache baseline
if avg_second > 0:
    speedup = avg_no_cache / avg_second
    print(f"\nSpeedup with cache (vs. no-cache baseline): {speedup:.1f}x")

# Use the recorded hit flags rather than inferring hits from response time
cache_hit_rate = sum(second_pass_hits) / len(similar_questions)
print(f"  Cache hit rate: {cache_hit_rate*100:.0f}%")



PERFORMANCE COMPARISON: With Cache vs Without Cache

[FIRST PASS - Populating Cache]

16:39:44 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
1. What is NVIDIA's primary business?
   Cache: HIT | Time: 0.109s
   Answer: NVIDIA reports its business results in two segments: the Compute & Networking segment, which include...

16:39:44 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
2. How much revenue did NVIDIA generate?
   Cache: HIT | Time: 0.103s
   Answer: As of the latest available data in NVIDIA's 10-K filing for the fiscal year ended January 29, 2023, ...

16:39:44 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries/search "HTTP/1.1 200 OK"
3. What are NVIDIA's main products?
   Cache: HIT | Time: 0.104s
   An

## 8. Best Practices and Tips


### Key Takeaways

1. **Threshold Optimization**: Start conservative (0.10-0.15) and optimize based on real usage data
2. **Doc2Cache**: Pre-populate your cache with high-quality FAQs for immediate benefits
3. **Monitoring**: Track cache hit rates and adjust thresholds as user patterns emerge
4. **Model Selection**: The `langcache-embed-v1` model is specifically optimized for caching tasks
5. **Cost-Performance Balance**: Every cache hit replaces an LLM call with a much cheaper lookup, so a 50% hit rate avoids roughly half of your LLM calls
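
To put the cost-performance point in numbers, expected cost and latency per request can be estimated as a weighted average over hits and misses. The sketch below uses invented prices and latencies (`llm_cost`, `cache_cost`, and the timings are placeholders, not measured or published figures); substitute your own.

```python
def expected_per_request(hit_rate: float, hit_value: float, miss_value: float) -> float:
    """Weighted average of a per-request quantity over cache hits and misses."""
    return hit_rate * hit_value + (1.0 - hit_rate) * miss_value

# Placeholder figures for illustration only
llm_cost = 0.002      # dollars per LLM call (assumed)
cache_cost = 0.0001   # dollars per cache lookup (assumed)
llm_latency = 2.0     # seconds per LLM call (assumed)
cache_latency = 0.1   # seconds per cache lookup (assumed)
hit_rate = 0.5

# A miss pays for the lookup and the LLM call; a hit pays for the lookup only
cost = expected_per_request(hit_rate, cache_cost, cache_cost + llm_cost)
latency = expected_per_request(hit_rate, cache_latency, cache_latency + llm_latency)

cost_saving = 1.0 - cost / llm_cost
print(f"expected cost: ${cost:.5f} per request ({cost_saving:.0%} saved)")
print(f"expected latency: {latency:.2f}s vs {llm_latency:.2f}s uncached")
```

Because every request pays for a lookup, the saving is slightly below the hit rate itself; the gap widens as lookups become more expensive relative to LLM calls.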

### When to Use Semantic Caching

✅ **Good Use Cases:**
- High-traffic applications with repeated question patterns
- Customer support chatbots
- FAQ systems
- Documentation Q&A
- Product information queries
- Educational content Q&A

❌ **Less Suitable:**
- Highly dynamic content requiring real-time data
- Creative writing tasks needing variety
- Personalized responses based on user-specific context
- Time-sensitive queries (use TTL if needed)

### Performance Tips

1. **Batch Loading**: Pre-populate cache with Doc2Cache for immediate value
2. **Monitor Hit Rates**: Track and adjust thresholds based on production metrics
3. **A/B Testing**: Test different thresholds with a subset of traffic
4. **Cache Warming**: Regularly update cache with trending topics
5. **TTL Management**: Set time-to-live for entries that may become stale


## 9. Cleanup

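Clear the cache when you're done. In the run recorded below, `cache.clear()` was rejected with a 400 error (`attributes: cannot be blank`), so the entries were not removed. If you hit the same error, consult the LangCache documentation for the supported way to delete entries.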


In [27]:
# Clear cache contents
cache.clear()
print("Cache contents cleared")

16:39:52 httpx INFO   HTTP Request: DELETE https://aws-us-east-1.langcache.redis.io/v1/caches/56f7ba9bee374701a1253f21cd1ac35e/entries "HTTP/1.1 400 Bad Request"


BadRequestErrorResponseContent: {"detail":"attributes: cannot be blank.","status":400,"title":"Invalid Request","type":"/errors/invalid-data"}

## Summary

Congratulations! You've completed this comprehensive guide on semantic caching with LangCache and RedisVL. 

**What You've Learned:**
- ✅ Set up and configure LangCache with Redis Cloud
- ✅ Load and process PDF documents into knowledge bases
- ✅ Generate FAQs using the Doc2Cache technique with LLMs
- ✅ Pre-populate a semantic cache with generated FAQ entries
- ✅ Test different cache matching strategies and thresholds
- ✅ Optimize cache performance using test datasets
- ✅ Leverage the `redis/langcache-embed-v1` embedding model
- ✅ Integrate semantic caching into RAG pipelines
- ✅ Measure performance improvements and cost savings

**Next Steps:**
- Experiment with different distance thresholds for your use case
- Try other embedding models and compare performance
- Implement cache analytics and monitoring in production
- Explore advanced features like TTL, metadata filtering, and cache warming strategies
- Scale your semantic cache to handle production traffic

**Resources:**
- [RedisVL Documentation](https://docs.redisvl.com/en/stable/index.html)
- [LangCache Sign Up](https://redis.io/langcache/)
- [Redis AI Resources](https://github.com/redis-developer/redis-ai-resources)
- [Semantic Caching Paper](https://arxiv.org/abs/2504.02268)
