![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# LangCache: Semantic Caching with Redis Cloud

This notebook demonstrates end-to-end semantic caching using **LangCache** - a managed Redis Cloud service accessed through the RedisVL library. LangCache provides enterprise-grade semantic caching with zero infrastructure management, making it ideal for production LLM applications.

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/semantic-cache/04_langcache_semantic_caching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Introduction

**LangCache** is a fully managed semantic cache service built on Redis Cloud. It was integrated into RedisVL in version 0.11.0 as an `LLMCache` interface implementation, making it easy for RedisVL users to:

- Transition to a fully managed caching service
- Reduce LLM API costs by caching similar queries
- Improve application response times
- Access enterprise features without managing infrastructure

### What You'll Learn

In this tutorial, you will:
1. Set up LangCache with Redis Cloud
2. Load and process a knowledge base (PDF documents)
3. Generate FAQs using the Doc-to-Cache technique
4. Pre-populate a semantic cache with tagged FAQs
5. Test different cache matching strategies and thresholds
6. Integrate the cache into a RAG pipeline
7. Measure performance improvements


## 1. Environment Setup

First, we'll install the required packages and set up our environment.


### Install Required Packages


In [1]:
%pip install -q "redisvl>=0.11.0" "langcache" "sentence-transformers"
%pip install -q "pypdf" "openai>=1.0.0" "langchain>=0.3.0" "langchain-community" "langchain-openai"


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Import Dependencies


In [2]:
import os
import time
import json
from typing import List, Dict, Any

# RedisVL imports
from redisvl.extensions.cache.llm import LangCacheSemanticCache

## 2. LangCache setup

### Sign up for LangCache

If you haven't already, sign up for a free Redis Cloud account:

**[Log in or sign up for Redis Cloud →](https://cloud.redis.io/#/)**

After signing up:
1. Create a new database
2. Create a new LangCache service (Select 'LangCache' on the left menu bar)
3. Copy your **API Key**
4. Copy your **Cache ID**
5. Copy your **URL**


### Configure Environment Variables
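You'll need the LangCache API Key, Cache ID, and URL from the previous step. The code below reads the API Key and Cache ID from the `LANGCACHE_API_KEY` and `LANGCACHE_ID` environment variables.
You will also need access to an LLM. In this notebook we'll be using OpenAI.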
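
One way to catch a missing credential before the first API call is to check the environment up front. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, not part of RedisVL, and the variable names are the ones this notebook reads later.

```python
import os
from typing import Iterable, List, Mapping

# Names read later in this notebook; adjust if you store credentials differently
REQUIRED_VARS = ["LANGCACHE_API_KEY", "LANGCACHE_ID"]

def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set")
```

Reporting only the variable names keeps secret values out of the notebook output.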

### Initialize Semantic Cache with LangCache-Embed Model

We'll create a cache instance using the `redis/langcache-embed-v1` model, which is specifically optimized for semantic caching tasks.


In [3]:
langcache_api_key = os.environ.get('LANGCACHE_API_KEY') # found on your cloud console
langcache_id = os.environ.get('LANGCACHE_ID') # found on your cloud console
server_url = "https://aws-us-east-1.langcache.redis.io" # found on your cloud console


# Confirm the credentials are loaded without printing the secret API key
print("API key set:", bool(langcache_api_key))
print(langcache_id)
print(server_url)

# Create Semantic Cache instance
cache = LangCacheSemanticCache(
    server_url=server_url,
    cache_id=langcache_id,
    api_key=langcache_api_key,
)

API key set: True
50eb6a09acf5415d8b68619b1ccffd9a
https://aws-us-east-1.langcache.redis.io


In [4]:
# Check your cache is working
r = cache.check('hello world')
print(r) # should be empty on first run

cache.store('hello world', 'hello world from langcache')
result = cache.check('hi world')

print(result)

18:25:09 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
[]
18:25:09 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:25:09 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
[{'entry_id': '5eb63bbbe01eeed093cb22bb8f5acdc3', 'prompt': 'hello world', 'response': 'hello world from langcache', 'vector_distance': 0.07242219999999999, 'inserted_at': 0.0, 'updated_at': 0.0}]
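
The `vector_distance` in the result above measures how far the query embedding is from the stored prompt's embedding: lower means more similar. As a rough illustration, assuming a cosine-style distance (the exact metric and scaling used by the service are not shown in this notebook), a pure-Python version with made-up 3-dimensional vectors looks like this:

```python
import math
from typing import Sequence

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only)
hello_world = [0.9, 0.1, 0.0]
hi_world = [0.8, 0.2, 0.1]
cook_pasta = [0.0, 0.1, 0.9]

print(f"hello vs hi:    {cosine_distance(hello_world, hi_world):.4f}")
print(f"hello vs pasta: {cosine_distance(hello_world, cook_pasta):.4f}")
```

A cache hit is then a nearest entry whose distance falls below the configured threshold.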


# RAG with semantic caching

Now that we have a working semantic cache service running and we're connected to it, let's use it in an application.

We'll build a simple Retrieval Augmented Generation (RAG) app using a PDF of NVIDIA's 2023 Form 10-K filing.

To get the full benefit of semantic caching we'll preload our cache with Frequently Asked Questions (FAQs) generated by an LLM about our PDF.

## 3. Generate FAQs Using Doc-to-Cache Technique

The Doc-to-Cache approach uses an LLM to generate frequently asked questions from document chunks. These FAQs are then used to pre-populate the semantic cache with high-quality, factual responses.

We'll work with three types of data:
1. **Knowledge Base**: PDF document(s) that contain factual information
2. **FAQs**: Derived from the knowledge base using Doc-to-Cache technique
3. **Test Dataset**: For evaluating and optimizing cache performance
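
The overall Doc-to-Cache flow can be sketched independently of any LLM or cache service. In the sketch below, `doc_to_cache`, `fake_generate_faqs`, and `toy_cache` are hypothetical stand-ins for illustration; the notebook's actual implementation uses a LangChain chain and LangCache.

```python
from typing import Callable, Dict, List

def doc_to_cache(
    chunks: List[str],
    generate_faqs: Callable[[str], List[Dict[str, str]]],
    store: Callable[[str, str], None],
) -> int:
    """Generate FAQs for each chunk and store every question/answer pair.

    Returns the number of pairs stored.
    """
    stored = 0
    for chunk in chunks:
        for faq in generate_faqs(chunk):
            store(faq["question"], faq["answer"])
            stored += 1
    return stored

# Stand-in for the LLM call (hypothetical, for illustration only)
def fake_generate_faqs(chunk: str) -> List[Dict[str, str]]:
    return [{"question": f"What does this passage say? ({chunk[:20]})", "answer": chunk}]

# A plain dict stands in for the semantic cache
toy_cache: Dict[str, str] = {}
count = doc_to_cache(
    ["NVIDIA's fiscal year ended January 29, 2023.", "NVIDIA's stock trades as NVDA."],
    fake_generate_faqs,
    toy_cache.__setitem__,
)
print(f"Stored {count} FAQ pairs")
```

Passing the generator and the store as callables keeps the loop testable without network access.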


In [5]:
# LangChain imports
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

from pydantic import BaseModel, Field
import getpass



In [6]:
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Initialize OpenAI LLM for FAQ generation and RAG
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.3,
    max_tokens=2000
)

### Load PDF Knowledge Base


In [7]:
# Download sample PDF if not already present
!mkdir -p data
!wget -q -O data/nvidia-10k.pdf https://raw.githubusercontent.com/redis-developer/redis-ai-resources/main/python-recipes/RAG/resources/nvd-10k-2023.pdf

In [8]:
# Load and chunk the PDF
pdf_path = "data/nvidia-10k.pdf"

# Configure text splitter for optimal chunk sizes
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Load and split the document
loader = PyPDFLoader(pdf_path)
documents = loader.load()
chunks = text_splitter.split_documents(documents)

print(f"Loaded PDF: {pdf_path}")
print(f"  Total pages: {len(documents)}")
print(f"  Created chunks: {len(chunks)}")
print(f"\nSample chunk preview:")
print(f"{chunks[10].page_content[:300]}...")


Loaded PDF: data/nvidia-10k.pdf
  Total pages: 169
  Created chunks: 388

Sample chunk preview:
Table of Contents
The world’s leading cloud service providers, or CSPs, and consumer internet companies use our GPUs and broader data center-scale
accelerated computing platforms to enable, accelerate or enrich the services they deliver to billions of end-users, including search,
recommendations, so...
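
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only a sketch of the overlap idea: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on the configured separators rather than at arbitrary character positions.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for piece in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.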


In [9]:
# Define the FAQ data model
class QuestionAnswer(BaseModel):
    question: str = Field(description="A frequently asked question derived from the document content")
    answer: str = Field(description="A factual answer to the question based on the document")
    category: str = Field(description="Category of the question (e.g., 'financial', 'products', 'operations')")

class FAQList(BaseModel):
    faqs: List[QuestionAnswer] = Field(description="List of question-answer pairs extracted from the document")

# Set up JSON output parser
json_parser = JsonOutputParser(pydantic_object=FAQList)


In [10]:
# Create the FAQ generation prompt
faq_prompt = PromptTemplate(
    template="""You are a document analysis expert. Extract 3-5 high-quality FAQs from the following document chunk.

Guidelines:
- Generate diverse, specific questions that users would realistically ask
- Provide accurate, complete answers based ONLY on the document content
- Assign each FAQ to a category: 'financial', 'products', 'operations', 'technology', or 'general'
- Avoid vague or overly generic questions
- If the chunk lacks substantial content, return fewer FAQs

{format_instructions}

Document Chunk:
{doc_content}

FAQs JSON:""",
    input_variables=["doc_content"],
    partial_variables={"format_instructions": json_parser.get_format_instructions()}
)

# Create the FAQ generation chain
faq_chain = faq_prompt | llm | json_parser

print("FAQ generation chain configured")


FAQ generation chain configured


In [11]:
# Test FAQ generation on a single chunk
print("Testing FAQ generation on sample chunk...\n")
test_faqs = faq_chain.invoke({"doc_content": chunks[10].page_content})

print(f"Generated {len(test_faqs.get('faqs', []))} FAQs:")
for i, faq in enumerate(test_faqs.get('faqs', [])[:3], 1):
    print(f"\n{i}. Q: {faq['question']}")
    print(f"   Category: {faq['category']}")
    print(f"   A: {faq['answer'][:150]}...")


Testing FAQ generation on sample chunk...

18:25:33 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Generated 5 FAQs:

1. Q: What industries are utilizing NVIDIA's GPUs for automation?
   Category: products
   A: A rapidly growing number of enterprises and startups across a broad range of industries, including transportation for autonomous driving, healthcare f...

2. Q: What was the reason for the termination of the Arm Share Purchase Agreement?
   Category: operations
   A: The termination of the Arm Share Purchase Agreement was due to significant regulatory challenges that prevented the completion of the transaction, as ...

3. Q: How much did NVIDIA record as an acquisition termination cost in fiscal year 2023?
   Category: financial
   A: NVIDIA recorded an acquisition termination cost of $1.35 billion in fiscal year 2023, reflecting the write-off of the prepayment provided at signing f...


In [12]:
# Generate FAQs from all chunks (limited to first 25 for demo purposes)
def extract_faqs_from_chunks(chunks: List[Any], max_chunks: int = 25) -> List[Dict]:
    """Extract FAQs from document chunks using LLM.
        
        chunks: list of document chunks
        max_chunks: maximum number of chunks to process
        
        Returns: A list of question-answer pairs
    """
    all_faqs = []

    for i, chunk in enumerate(chunks[:max_chunks]):
        if i % 5 == 0:
            print(f"Processing chunk {i+1}/{min(len(chunks), max_chunks)}...", flush=True)

        try:
            result = faq_chain.invoke({"doc_content": chunk.page_content})
            if result and result.get("faqs"):
                all_faqs.extend(result["faqs"])
        except Exception as e:
            print(f"  Warning: Skipped chunk {i+1} due to error: {str(e)[:100]}")
            continue

    return all_faqs

# Extract FAQs
print("\nGenerating FAQs from document chunks...\n")
faqs = extract_faqs_from_chunks(chunks, max_chunks=25)

print(f"\nGenerated {len(faqs)} FAQs total")
print(f"\nCategory distribution:")
categories = {}
for faq in faqs:
    cat = faq.get('category', 'unknown')
    categories[cat] = categories.get(cat, 0) + 1
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):
    print(f"  {cat}: {count}")



Generating FAQs from document chunks...

Processing chunk 1/25...
18:25:40 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:25:52 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:25:58 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:26:04 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:26:09 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Processing chunk 6/25...
18:26:18 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:26:23 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:26:30 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:26:42 httpx INFO   HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.

## 4. Pre-load semantic cache with FAQs

Now we'll populate the cache instance with our generated FAQs. We'll use the `store()` API with metadata tags for filtering and organization.


In [13]:
# Store FAQs in cache with metadata tags
print("Storing FAQs in cache...\n")

stored_count = 0
cache_keys = {}  # Map questions to their cache keys

for i, faq in enumerate(faqs):
    if i % 20 == 0:
        print(f"  Stored {i}/{len(faqs)} FAQs...", flush=True)

    try:
        # Store with metadata - note that metadata is stored but not used for filtering in basic SemanticCache
        key = cache.store(prompt=faq['question'], response=faq['answer'], metadata={'category': faq['category']})
        cache_keys[faq['question']] = key
        stored_count += 1
    except Exception as e:
        print(f"  Warning: Failed to store FAQ {i+1}: {str(e)[:100]}")

print(f"\nStored {stored_count} FAQs in cache")

print(f"\nExample cache entries:")
for i, (q, k) in enumerate(list(cache_keys.items())[:2], 1):
    print(f"\n{i}. Key: {k}")
    print(f"   Q: {q[:150]}...")


Storing FAQs in cache...

  Stored 0/112 FAQs...
18:28:40 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:28:40 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:28:40 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:28:40 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:28:41 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:28:41 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 201 Created"
18:28:41 httpx 

## 5. Evaluating our semantic cache


### Create test/evaluation dataset

We'll create a test dataset with:
- **Positive examples**: Questions that should match cached FAQs
- **Negative examples**: Questions that should NOT match cached FAQs
- **Edge cases**: Slightly different phrasings to test threshold sensitivity
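
A labeled test set covering all three kinds of example could be assembled along these lines. This is a sketch under stated assumptions: `build_test_set` is a hypothetical helper, and the templated rewording is a crude placeholder for hand-written or LLM-generated paraphrases.

```python
from typing import Dict, List

def build_test_set(faq_questions: List[str], off_topic: List[str]) -> List[Dict]:
    """Label exact questions and templated rewordings as expected hits,
    and off-topic questions as expected misses."""
    cases = []
    for question in faq_questions:
        if not question:
            continue  # skip empty strings rather than index into them
        cases.append({"query": question, "expected_match": True, "kind": "positive"})
        # Crude rewording used as an edge case (placeholder for a real paraphrase)
        reworded = "Can you tell me: " + question[0].lower() + question[1:]
        cases.append({"query": reworded, "expected_match": True, "kind": "edge"})
    for question in off_topic:
        cases.append({"query": question, "expected_match": False, "kind": "negative"})
    return cases

test_set = build_test_set(
    ["What is the trading symbol for NVIDIA Corporation's common stock?"],
    ["How do I cook pasta?"],
)
for case in test_set:
    print(case["kind"], "|", case["query"])
```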


In [14]:
# Select representative FAQs for test set
sample_faqs = faqs[:10]  # Take first 10 FAQs

print("Sample FAQs for testing:")
for i, faq in enumerate(sample_faqs[:3], 1):
    print(f"\n{i}. {faq['question'][:100]}...")


Sample FAQs for testing:

1. What is the fiscal year end date for NVIDIA Corporation as reported in the Form 10-K?...

2. What is the trading symbol for NVIDIA Corporation's common stock?...

3. Where is the principal executive office of NVIDIA Corporation located?...


In [15]:
# Create test dataset with negative examples (off-topic questions)
negative_examples = [
    {"query": "What is the weather today?", "expected_match": False, "category": "off-topic"},
    {"query": "How do I cook pasta?", "expected_match": False, "category": "off-topic"},
    {"query": "What is the capital of France?", "expected_match": False, "category": "off-topic"},
    {"query": "Tell me a joke", "expected_match": False, "category": "off-topic"},
    {"query": "What time is it?", "expected_match": False, "category": "off-topic"},
]

print(f"Test dataset created")
print(f"  Negative examples: {len(negative_examples)}")


Test dataset created
  Negative examples: 5


### Test cache retrieval with different strategies

Let's test how the cache performs with different types of queries and matching thresholds.


### Test exact match queries


In [16]:
# Test with exact questions from cache
print("Testing exact match queries:\n")

for i, faq in enumerate(faqs[:3], 1):
    result = cache.check(prompt=faq['question'])
    
    if result:
        print(f"{i}. Cache HIT")
        print(f"   Query: {faq['question'][:80]}...")
        print(f"   Answer: {result[0]['response'][:100]}...\n")
    else:
        print(f"{i}. ✗ Cache MISS")
        print(f"   Query: {faq['question'][:80]}...\n")


Testing exact match queries:

18:28:54 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
1. Cache HIT
   Query: What is the fiscal year end date for NVIDIA Corporation as reported in the Form ...
   Answer: The fiscal year ended January 29, 2023....

18:28:54 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
2. Cache HIT
   Query: What is the trading symbol for NVIDIA Corporation's common stock?...
   Answer: The trading symbol for NVIDIA Corporation's common stock is NVDA....

18:28:54 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
3. Cache HIT
   Query: Where is the principal executive office of NVIDIA Corporation located?...
   Answer: The principal executive office of NVIDIA Corporation 

### Test semantic similarity

In [17]:
# Test with semantically similar queries
print("Testing semantic similarity:\n")

similar_queries = [
    "Tell me about NVIDIA's revenue",
    "What products does the company make?",
    "How is the company performing financially?",
]

for i, query in enumerate(similar_queries, 1):
    result = cache.check(prompt=query, return_fields=["prompt", "response", "vector_distance"])
    
    if result:
        print(f"{i}. Cache HIT (distance: {result[0].get('vector_distance', float('nan')):.4f})")
        print(f"   Query: {query}")
        print(f"   Matched: {result[0]['prompt'][:80]}...")
        print(f"   Answer: {result[0]['response'][:100]}...\n")
    else:
        print(f"{i}. ✗ Cache MISS")
        print(f"   Query: {query}\n")


Testing semantic similarity:

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
1. Cache HIT (distance: 0.1143)
   Query: Tell me about NVIDIA's revenue
   Matched: Where can I find NVIDIA's material financial information?...
   Answer: NVIDIA announces material financial information through its investor relations website, press releas...

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
2. ✗ Cache MISS
   Query: What products does the company make?

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
3. ✗ Cache MISS
   Query: How is the company performing financially?



### Test cache with sample query

In [18]:
# Test cache behavior with a sample query
test_query = "What is NVIDIA's main business?"

print(f"Testing query: '{test_query}'")

result = cache.check(prompt=test_query, return_fields=["prompt", "vector_distance"])

if result:
    print(f"Cache HIT")
    print(f"  Distance: {result[0].get('vector_distance', 0):.6f}")
    print(f"  Matched: {result[0]['prompt'][:80]}...")
else:
    print(f"✗ Cache MISS - No match found within threshold")


Testing query: 'What is NVIDIA's main business?'
18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
Cache HIT
  Distance: 0.076104
  Matched: What segments does NVIDIA report its business results in?...


### Test negative examples (should not match)


In [19]:
# Test with off-topic queries that should NOT match
print("Testing negative examples (should NOT match):\n")

for i, test_case in enumerate(negative_examples, 1):
    result = cache.check(prompt=test_case['query'], return_fields=["prompt", "vector_distance"])
    
    if result:
        print(f"{i}. ⚠️  UNEXPECTED HIT (distance: {result[0].get('vector_distance', float('nan')):.4f})")
        print(f"   Query: {test_case['query']}")
        print(f"   Matched: {result[0]['prompt'][:80]}...\n")
    else:
        print(f"{i}. Correct MISS")
        print(f"   Query: {test_case['query']}\n")


Testing negative examples (should NOT match):

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
1. Correct MISS
   Query: What is the weather today?

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
2. Correct MISS
   Query: How do I cook pasta?

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
3. Correct MISS
   Query: What is the capital of France?

18:28:55 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
4. Correct MISS
   Query: Tell me a joke

18:28:56 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1cc

## 6. Tune cache threshold

Using sample questions, we can find the optimal distance threshold based on our test dataset.
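
The metric behind this tuning step can be shown on its own. The sketch below scores a single threshold from `(distance, should_match)` pairs, counting a hit when `distance < threshold`; `score_threshold` is a hypothetical helper and the distances are made up for illustration.

```python
from typing import Dict, List, Tuple

def score_threshold(samples: List[Tuple[float, bool]], threshold: float) -> Dict[str, float]:
    """Compute precision, recall and F1 when a hit means distance < threshold.

    Each sample is (distance_to_nearest_entry, should_match).
    """
    tp = sum(1 for d, should in samples if d < threshold and should)
    fp = sum(1 for d, should in samples if d < threshold and not should)
    fn = sum(1 for d, should in samples if d >= threshold and should)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up distances: three expected hits, two expected misses
samples = [(0.02, True), (0.08, True), (0.30, True), (0.25, False), (0.60, False)]
for threshold in (0.05, 0.10, 0.35):
    print(threshold, score_threshold(samples, threshold))
```

A tight threshold keeps precision high but misses valid paraphrases, while a loose one raises recall and starts admitting false positives; F1 balances the two.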

In [20]:
# Create optimization test data

# Format: [{"query": "...", "query_match": "cache_key_or_empty_string"}, ...]

optimization_test_data = []

# Add positive examples (should match specific cache entries)
for faq in faqs[:5]:
    if faq['question'] in cache_keys:
        optimization_test_data.append({
            "query": faq['question'],
            "query_match": cache_keys[faq['question']]
        })

# Add negative examples (should not match anything)
for neg_example in negative_examples:
    optimization_test_data.append({
        "query": neg_example['query'],
        "query_match": ""  # Empty string means it should NOT match
    })

print(f"Created optimization test data:")
print(f"  Total examples: {len(optimization_test_data)}")
print(f"  Positive (should match): {sum(1 for x in optimization_test_data if x['query_match'])}")
print(f"  Negative (should not match): {sum(1 for x in optimization_test_data if not x['query_match'])}")


Created optimization test data:
  Total examples: 10
  Positive (should match): 5
  Negative (should not match): 5


In [21]:
# Test a range of different cache distance thresholds
import pandas as pd

# Define threshold ranges to test
thresholds_to_test = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00]

print("Testing cache performance across different similarity thresholds...")
print(f"Evaluation dataset: {len(optimization_test_data)} queries")
print(f"  - Positive examples (should match): {sum(1 for x in optimization_test_data if x['query_match'])}")
print(f"  - Negative examples (should NOT match): {sum(1 for x in optimization_test_data if not x['query_match'])}")
print("\n" + "="*100 + "\n")

# Store results for all queries to reuse across thresholds
query_results = []
for test_case in optimization_test_data:
    result = cache.check(prompt=test_case['query'], return_fields=["prompt", "response", "vector_distance", "entry_id"])
    
    query_results.append({
        'query': test_case['query'],
        'expected_match': bool(test_case['query_match']),
        'expected_key': test_case['query_match'],
        'cache_result': result[0] if result else None,
        'distance': result[0].get('vector_distance') if result else float('inf')
    })

# Evaluate each threshold
results = []

for threshold in thresholds_to_test:
    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0

    for query_data in query_results:
        # Determine if this would be a cache hit at this threshold
        is_cache_hit = query_data['distance'] < threshold
        should_match = query_data['expected_match']

        if is_cache_hit and should_match:
            true_positives += 1
        elif is_cache_hit and not should_match:
            false_positives += 1
        elif not is_cache_hit and not should_match:
            true_negatives += 1
        elif not is_cache_hit and should_match:
            false_negatives += 1

    # Calculate metrics
    total_hits = true_positives + false_positives
    total_misses = true_negatives + false_negatives

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    accuracy = (true_positives + true_negatives) / len(optimization_test_data)

    results.append({
        'Threshold': threshold,
        'Total Hits': total_hits,
        'Total Misses': total_misses,
        'True Positives': true_positives,
        'False Positives': false_positives,
        'True Negatives': true_negatives,
        'False Negatives': false_negatives,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1_score,
        'Accuracy': accuracy
    })

# Display results in a formatted table
df_results = pd.DataFrame(results)

print("THRESHOLD OPTIMIZATION RESULTS")
print("="*100)
print("\nPerformance Metrics by Threshold:")
print(df_results.to_string(index=False))

# Find optimal threshold based on F1 score
optimal_idx = df_results['F1 Score'].idxmax()
optimal_threshold = df_results.loc[optimal_idx, 'Threshold']
optimal_f1 = df_results.loc[optimal_idx, 'F1 Score']

print("\n" + "="*100)
print(f"OPTIMAL THRESHOLD: {optimal_threshold}")
print(f"  F1 Score: {optimal_f1:.3f}")
print(f"  Precision: {df_results.loc[optimal_idx, 'Precision']:.3f}")
print(f"  Recall: {df_results.loc[optimal_idx, 'Recall']:.3f}")
print(f"  Accuracy: {df_results.loc[optimal_idx, 'Accuracy']:.3f}")
print("="*100)

# Show detailed breakdown for optimal threshold
print(f"\nDetailed breakdown at optimal threshold ({optimal_threshold}):\n")
for query_data in query_results:
    is_cache_hit = query_data['distance'] < optimal_threshold
    should_match = query_data['expected_match']

    status = ""
    if is_cache_hit and should_match:
        status = "✓ TP (True Positive)"
    elif is_cache_hit and not should_match:
        status = "✗ FP (False Positive)"
    elif not is_cache_hit and not should_match:
        status = "✓ TN (True Negative)"
    elif not is_cache_hit and should_match:
        status = "✗ FN (False Negative)"

    print(f"Query: {query_data['query'][:60]:60s} | Distance: {query_data['distance']:.4f} | {status}")

18:28:56 numexpr.utils INFO   NumExpr defaulting to 10 threads.
Testing cache performance across different similarity thresholds...
Evaluation dataset: 10 queries
  - Positive examples (should match): 5
  - Negative examples (should NOT match): 5


18:28:56 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
18:28:56 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
18:28:56 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
18:28:56 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
18:28:57 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1cc

In [22]:
# Re-test with optimized threshold
# Assumes check() accepts a per-call distance_threshold override, as RedisVL's SemanticCache.check() does
print("\nRe-testing negative examples with optimized threshold:\n")

for i, test_case in enumerate(negative_examples, 1):
    result = cache.check(
        prompt=test_case['query'],
        return_fields=["prompt", "vector_distance"],
        distance_threshold=float(optimal_threshold),
    )

    if result:
        print(f"{i}. HIT (distance: {result[0].get('vector_distance', float('nan')):.4f})")
        print(f"   Query: {test_case['query']}")
        print(f"   Matched: {result[0]['prompt'][:80]}...\n")
    else:
        print(f"{i}. MISS (correct)")
        print(f"   Query: {test_case['query']}\n")



Re-testing negative examples with optimized threshold:

18:28:57 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
1. MISS (correct)
   Query: What is the weather today?

18:28:57 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
2. MISS (correct)
   Query: How do I cook pasta?

18:28:58 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
3. MISS (correct)
   Query: What is the capital of France?

18:28:58 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
4. MISS (correct)
   Query: Tell me a joke

18:28:58 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09a

## 7. RAG pipeline integration

Now let's integrate the semantic cache into a complete RAG pipeline and measure the performance improvements.
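
For the measurement part, a small helper can turn two lists of response times into a comparison. This is a sketch only: `summarize_latency` is a hypothetical helper and the timings are placeholders, not measurements from this notebook.

```python
from statistics import mean
from typing import Dict, List

def summarize_latency(baseline: List[float], cached: List[float]) -> Dict[str, float]:
    """Compare mean response times (seconds) for uncached and cached calls."""
    baseline_avg = mean(baseline)
    cached_avg = mean(cached)
    return {
        "baseline_avg_s": baseline_avg,
        "cached_avg_s": cached_avg,
        "speedup": baseline_avg / cached_avg,
        "time_saved_pct": 100 * (1 - cached_avg / baseline_avg),
    }

# Placeholder timings for illustration
summary = summarize_latency(baseline=[2.0, 1.0, 3.0], cached=[0.2, 0.1, 0.3])
print(summary)
```

With only a handful of questions the averages are noisy, so treat the speedup as indicative rather than as a benchmark.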

### Build a simple RAG chain


In [23]:
# Create a simple RAG prompt template
rag_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant answering questions about NVIDIA based on their 10-K filing. Provide accurate, concise answers."),
    ("user", "{question}")
])

# Create RAG chain
rag_chain = rag_template | llm

print("RAG chain created")


RAG chain created


### Create cached RAG function


In [24]:
def rag_with_cache(question: str, use_cache: bool = True) -> tuple:
    """
    Process a question through RAG pipeline with optional semantic caching.
    
    Returns: A tuple of (answer, cache_hit, response_time)
    """
    start_time = time.time()
    cache_hit = False
    
    # Check cache first if enabled
    if use_cache:
        cached_result = cache.check(prompt=question)
        if cached_result:
            answer = cached_result[0]['response']
            cache_hit = True
            response_time = time.time() - start_time
            return answer, cache_hit, response_time
    
    # Cache miss - use LLM
    response = rag_chain.invoke({"question": question})
    # Chat models return a message object; fall back to str() for plain outputs
    answer = response.content if hasattr(response, 'content') else str(response)
    response_time = time.time() - start_time
    
    # Store in cache for future use
    if use_cache:
        cache.store(prompt=question, response=answer)
    
    return answer, cache_hit, response_time

print("Cached RAG function ready")


Cached RAG function ready
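
The cache-aside flow in `rag_with_cache` can be exercised offline before pointing it at a live service. The following is a minimal sketch, assuming a hypothetical `InMemoryCache` stand-in that does exact-match lookup (not semantic search) and a hypothetical `fake_llm` function. Neither is part of RedisVL or LangCache; only the `check`/`store` call shape mirrors the cell above.

```python
import time
from typing import Callable


class InMemoryCache:
    """Hypothetical stand-in: exact-match lookup instead of semantic search."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def check(self, prompt: str) -> list[dict]:
        # Same shape as used above: a list of matches, each with a 'response' key
        key = prompt.strip().lower()
        return [{"response": self._entries[key]}] if key in self._entries else []

    def store(self, prompt: str, response: str) -> None:
        self._entries[prompt.strip().lower()] = response


def cached_answer(
    question: str,
    cache: InMemoryCache,
    generate: Callable[[str], str],
    use_cache: bool = True,
) -> tuple[str, bool, float]:
    """Cache-aside: look up first, generate on a miss, then store the result."""
    start = time.perf_counter()
    if use_cache:
        matches = cache.check(prompt=question)
        if matches:
            return matches[0]["response"], True, time.perf_counter() - start
    answer = generate(question)
    elapsed = time.perf_counter() - start
    if use_cache:
        cache.store(prompt=question, response=answer)
    return answer, False, elapsed


calls: list[str] = []  # records every question that reached the fake LLM


def fake_llm(question: str) -> str:
    calls.append(question)
    return f"stub answer to: {question}"


demo_cache = InMemoryCache()
first = cached_answer("What is NVIDIA's primary business?", demo_cache, fake_llm)
second = cached_answer("What is NVIDIA's primary business?", demo_cache, fake_llm)
print(f"first hit={first[1]}, second hit={second[1]}, LLM calls={len(calls)}")
```

The second call is served from the stand-in cache, so the fake LLM is invoked only once. A paraphrased question would miss here because the stand-in compares strings exactly, which is the gap a semantic cache closes.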


### Performance comparison: with vs without cache


In [25]:
# Test questions for RAG evaluation
test_questions_rag = [
    "What is NVIDIA's primary business?",
    "How much revenue did NVIDIA generate?",
    "What are NVIDIA's main products?",
]

print("\n" + "="*80)
print("PERFORMANCE COMPARISON: With Cache vs Without Cache")
print("="*80)

# First pass - populate cache (cache misses, must call LLM)
print("\n[FIRST PASS - Populating Cache]\n")
first_pass_times = []

for i, question in enumerate(test_questions_rag, 1):
    answer, cache_hit, response_time = rag_with_cache(question, use_cache=True)
    first_pass_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: {'HIT' if cache_hit else 'MISS'} | Time: {response_time:.3f}s")
    print(f"   Answer: {answer[:100]}...\n")

# Second pass - test cache hits with similar questions
print("\n[SECOND PASS - Cache Hits with Paraphrased Questions]\n")
second_pass_times = []
second_pass_hits = []  # actual hit/miss flag returned for each question

similar_questions = [
    "What does NVIDIA do as a business?",
    "Can you tell me NVIDIA's revenue figures?",
    "What products does NVIDIA sell?",
]

for i, question in enumerate(similar_questions, 1):
    answer, cache_hit, response_time = rag_with_cache(question, use_cache=True)
    second_pass_times.append(response_time)
    second_pass_hits.append(cache_hit)
    print(f"{i}. {question}")
    print(f"   Cache: {'HIT ✓' if cache_hit else 'MISS ✗'} | Time: {response_time:.3f}s")
    print(f"   Answer: {answer[:100]}...\n")

# Third pass - without cache (baseline)
print("\n[THIRD PASS - Without Cache (Baseline)]\n")
no_cache_times = []

for i, question in enumerate(test_questions_rag, 1):
    answer, _, response_time = rag_with_cache(question, use_cache=False)
    no_cache_times.append(response_time)
    print(f"{i}. {question}")
    print(f"   Cache: DISABLED | Time: {response_time:.3f}s\n")

# Summary
print("\n" + "="*80)
print("PERFORMANCE SUMMARY")
print("="*80)
avg_first = sum(first_pass_times)/len(first_pass_times)
avg_second = sum(second_pass_times)/len(second_pass_times)
avg_no_cache = sum(no_cache_times)/len(no_cache_times)

print(f"Average time - First pass (original questions):     {avg_first:.3f}s")
print(f"Average time - Second pass (paraphrased questions): {avg_second:.3f}s")
print(f"Average time - Without cache (baseline):            {avg_no_cache:.3f}s")

# Compare cached lookups against the no-cache baseline, since the first pass
# may itself be served from a pre-populated cache
if avg_second > 0:
    speedup = avg_no_cache / avg_second
    print(f"\nSpeedup with cache: {speedup:.1f}x faster than the no-cache baseline")

# Count hits from the flags returned by rag_with_cache, not from a latency cutoff
cache_hit_rate = sum(second_pass_hits) / len(second_pass_hits)
print(f"Cache hit rate (second pass): {cache_hit_rate*100:.0f}%")



PERFORMANCE COMPARISON: With Cache vs Without Cache

[FIRST PASS - Populating Cache]

18:28:58 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
1. What is NVIDIA's primary business?
   Cache: HIT | Time: 0.118s
   Answer: NVIDIA specializes in four large markets: Data Center, Gaming, Professional Visualization, and Autom...

18:28:58 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
2. How much revenue did NVIDIA generate?
   Cache: HIT | Time: 0.124s
   Answer: NVIDIA reports its business results in two segments: the Compute & Networking segment and the Data C...

18:28:58 httpx INFO   HTTP Request: POST https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries/search "HTTP/1.1 200 OK"
3. What are NVIDIA's main products?
   Cache: HIT | Time: 0.118s
   An

## 8. Best Practices and Tips

### Key Takeaways

1. **Threshold Optimization**: Start with a conservative distance threshold (0.10-0.15) and tune it based on real usage data
2. **Doc-to-Cache**: Pre-populate your cache with high-quality FAQs for immediate benefits
3. **Monitoring**: Track cache hit rates and adjust thresholds as user patterns emerge
4. **Model Selection**: The `langcache-embed-v1` model is specifically optimized for caching tasks
5. **Cost-Performance Balance**: Even a 50% cache hit rate removes half of the LLM calls, which lowers both cost and average latency
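
The cost-performance takeaway can be made concrete with a small back-of-the-envelope model. This is a rough sketch, not a benchmark: the function names are hypothetical, the 0.12 s lookup time is loosely based on the cache timings logged earlier in this notebook, and the 1.5 s LLM latency is an assumed placeholder that you should replace with your own measurements.

```python
def expected_latency(hit_rate: float, lookup_s: float, llm_s: float) -> float:
    """Average seconds per query: every query pays the lookup, misses also pay the LLM."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return lookup_s + (1.0 - hit_rate) * llm_s


def llm_calls_needed(hit_rate: float, total_queries: int) -> int:
    """Number of queries that still reach the LLM at a given hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round((1.0 - hit_rate) * total_queries)


LOOKUP_S = 0.12  # assumed cache lookup latency in seconds
LLM_S = 1.5      # assumed LLM call latency in seconds (placeholder)

for rate in (0.0, 0.25, 0.5, 0.75):
    latency = expected_latency(rate, LOOKUP_S, LLM_S)
    calls = llm_calls_needed(rate, total_queries=1000)
    print(f"hit rate {rate:.0%}: ~{latency:.2f}s per query, {calls} LLM calls per 1000 queries")
```

Note that at a 0% hit rate the cache adds the lookup time to every request, so the cache only pays off once the hit rate is high enough to offset that overhead.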

### When to Use Semantic Caching

✅ **Good Use Cases:**
- High-traffic applications with repeated question patterns
- Customer support chatbots
- FAQ systems
- Documentation Q&A
- Product information queries
- Educational content Q&A

❌ **Less Suitable:**
- Highly dynamic content requiring real-time data
- Creative writing tasks needing variety
- Personalized responses based on user-specific context
- Time-sensitive queries (use TTL if needed)

### Performance Tips

1. **Batch Loading**: Pre-populate cache with Doc-to-Cache for immediate value
2. **Monitor Hit Rates**: Track and adjust thresholds based on production metrics
3. **A/B Testing**: Test different thresholds with a subset of traffic
4. **Cache Warming**: Regularly update cache with trending topics
5. **TTL Management**: Set time-to-live for entries that may become stale
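
One way to act on the monitoring and A/B testing tips is to replay logged lookups against several candidate thresholds offline. The sketch below assumes you have logged, for each query, the distance to its nearest cached entry plus a judgment of whether that entry was a valid answer. The `LoggedLookup` type, the sample data, and the rule "a hit is `distance <= threshold`" are all illustrative assumptions, not part of the LangCache API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedLookup:
    distance: float    # distance from the query to its nearest cached entry
    should_hit: bool   # reviewer judgment: was that entry a valid answer?


def sweep_thresholds(lookups: list[LoggedLookup], thresholds: list[float]) -> list[dict]:
    """Compute hit rate, precision, and recall for each candidate threshold."""
    total_valid = sum(l.should_hit for l in lookups)
    rows = []
    for threshold in thresholds:
        hits = [l for l in lookups if l.distance <= threshold]
        true_hits = sum(l.should_hit for l in hits)
        rows.append({
            "threshold": threshold,
            "hit_rate": len(hits) / len(lookups) if lookups else 0.0,
            # Precision: share of served hits that were actually valid
            "precision": true_hits / len(hits) if hits else 1.0,
            # Recall: share of valid matches that the threshold let through
            "recall": true_hits / total_valid if total_valid else 0.0,
        })
    return rows


# Made-up sample log for illustration only
logged = [
    LoggedLookup(0.05, True), LoggedLookup(0.08, True), LoggedLookup(0.12, True),
    LoggedLookup(0.14, False), LoggedLookup(0.18, True),
    LoggedLookup(0.25, False), LoggedLookup(0.40, False),
]

rows = sweep_thresholds(logged, [0.10, 0.15, 0.20])
for row in rows:
    print(f"threshold={row['threshold']:.2f}  hit_rate={row['hit_rate']:.2f}  "
          f"precision={row['precision']:.2f}  recall={row['recall']:.2f}")
```

A looser threshold raises the hit rate and recall but can admit wrong answers, so precision is the figure to watch before widening the threshold for a larger share of traffic.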


## 9. Cleanup

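Clean up resources when done. In the run recorded below, `cache.clear()` returned `400 Bad Request` (`attributes: cannot be blank`), so the cached entries were not deleted and the confirmation message was never printed.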


In [26]:
# Clear cache contents
cache.clear()
print("Cache contents cleared")

18:29:07 httpx INFO   HTTP Request: DELETE https://aws-us-east-1.langcache.redis.io/v1/caches/50eb6a09acf5415d8b68619b1ccffd9a/entries "HTTP/1.1 400 Bad Request"


BadRequestErrorResponseContent: {"detail":"attributes: cannot be blank.","status":400,"title":"Invalid Request","type":"/errors/invalid-data"}

## Summary

Congratulations! You've completed this comprehensive guide on semantic caching with LangCache and RedisVL. 

**What You've Learned:**
- ✅ Set up and configure LangCache with Redis Cloud
- ✅ Load and process PDF documents into knowledge bases
- ✅ Generate FAQs using the Doc-to-Cache technique with LLMs
- ✅ Pre-populate a semantic cache with tagged entries
- ✅ Test different cache matching strategies and thresholds
- ✅ Optimize cache performance using test datasets
- ✅ Leverage the `redis/langcache-embed-v1` embedding model
- ✅ Integrate semantic caching into RAG pipelines
- ✅ Measure performance improvements and cost savings

**Next Steps:**
- Experiment with different distance thresholds for your use case
- Try other embedding models and compare performance
- Implement cache analytics and monitoring in production
- Explore advanced features like TTL, metadata filtering, and cache warming strategies
- Scale your semantic cache to handle production traffic

**Resources:**
- [RedisVL Documentation](https://docs.redisvl.com/en/stable/index.html)
- [LangCache Sign Up](https://redis.io/langcache/)
- [Redis AI Resources](https://github.com/redis-developer/redis-ai-resources)
- [Semantic Caching Paper](https://arxiv.org/abs/2504.02268)
