# Semantic Caching

Semantic caching stores and retrieves responses based on the meaning of a query rather than its exact text. Traditional caching requires identical strings; a semantic cache can return a cached response for a question that is semantically similar to one it has already seen, even when the wording differs.

## Semantic Caching vs. Traditional Caching vs. LLM Re-generation

**Traditional caching** stores responses using exact query strings as keys:
- **Fast retrieval** for identical queries
- **Cache misses** for any variation in phrasing, even minor differences
- **Low cache hit rates** in conversational applications where users rarely phrase questions identically
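
To make the exact-match limitation concrete, here is a minimal sketch of a string-keyed cache. The dictionary and the `exact_lookup` helper are illustrative names, not part of any library; the stored question and answer are taken from the FAQ data used later in this notebook.

```python
# Toy exact-match cache: the raw query string is the key.
exact_cache = {
    "What are the available models of the Colorado?":
        "The available models of the Colorado are WT, LT, Z71, and ZR2.",
}

def exact_lookup(query):
    """Return the cached response, or None when this exact string was never stored."""
    return exact_cache.get(query)

hit = exact_lookup("What are the available models of the Colorado?")
miss = exact_lookup("What models of chevy colorado are available?")

print(hit is not None)  # True: identical string
print(miss is None)     # True: same intent, different wording
```

The second lookup asks the same thing, but because the key is the literal string, it falls through to a miss.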

**LLM re-generation** involves calling the language model for every query:
- **Flexible** handling of any question variation
- **High API costs** and latency for repeated similar questions

**Semantic caching** uses vector similarity to match queries with cached responses:
- **Higher cache hit rates** than exact matching, because rephrased questions can still match a cached entry
- **Cost reduction** by avoiding redundant LLM calls for similar queries
- **Fast retrieval** through vector similarity search
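
The matching step can be sketched in plain Python. This is an illustration under stated assumptions, not RedisVL's implementation: it assumes cosine distance (1 minus cosine similarity), uses made-up 3-dimensional vectors in place of real embeddings, and scans the entries linearly where a real vector index would not. `cosine_distance`, `semantic_lookup`, and `cached` are hypothetical names.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones come from an embedding model.
cached = [
    ("What are the available models of the Colorado?", [0.9, 0.1, 0.0]),
    ("What entertainment system is included in the vehicle?", [0.1, 0.9, 0.1]),
]

def semantic_lookup(query_vector, threshold):
    """Return (prompt, distance) for the nearest cached prompt, or None if too far."""
    best_prompt, best_distance = None, float("inf")
    for prompt, vector in cached:
        distance = cosine_distance(query_vector, vector)
        if distance < best_distance:
            best_prompt, best_distance = prompt, distance
    if best_distance <= threshold:
        return best_prompt, best_distance
    return None

similar = semantic_lookup([0.8, 0.2, 0.0], threshold=0.2)    # close to the first entry
unrelated = semantic_lookup([0.0, 0.1, 0.9], threshold=0.2)  # far from both entries

print(similar[0])  # What are the available models of the Colorado?
print(unrelated)   # None
```

The threshold is the knob that trades hit rate against the risk of returning an answer to a different question.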

In this notebook, we'll implement semantic caching using RedisVL with pre-generated FAQs about a Chevrolet Colorado vehicle brochure, demonstrating how semantic similarity lets rephrased questions hit the cache where exact string matching would miss.

## Installing Dependencies

This semantic caching implementation requires two Python libraries, which together provide vector embeddings and caching functionality:

- RedisVL - Provides the semantic caching functionality built on top of Redis. This library handles vector storage, similarity search, and the caching interface we'll use to store and retrieve semantically similar queries.
- Sentence Transformers - Supplies pre-trained models for converting text into high-quality vector embeddings. These embeddings capture semantic meaning, allowing us to find similar queries even when they're phrased differently.

In [1]:
%pip install -q "redisvl>=0.8.2" sentence-transformers

Note: you may need to restart the kernel to use updated packages.


## Loading Pre-Generated FAQs

For this semantic caching demonstration, we'll use pre-generated frequently asked questions (FAQs) about a Chevrolet Colorado vehicle brochure. These FAQs were created by processing the vehicle documentation and extracting question-answer pairs using an LLM.


In [3]:
import json

# Read the saved FAQs
with open('../data/3_colorado_faqs.json', 'r', encoding='utf-8') as f:
    all_faqs = json.load(f)

print(f"Loaded {len(all_faqs)} FAQs from file")

Loaded 346 FAQs from file
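
The later cells read a `"prompt"` and a `"response"` key from every record, so it can be worth checking the data before embedding it. The helper below is a hypothetical sketch, not part of the original notebook; the file's schema is inferred only from those two key accesses, and the second sample record is invented to show a failure.

```python
def validate_faqs(records):
    """Return indices of records lacking a non-empty 'prompt' or 'response' string."""
    bad = []
    for i, record in enumerate(records):
        for field in ("prompt", "response"):
            value = record.get(field) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one bad field is enough to flag the record
    return bad

sample = [
    {"prompt": "What are the available models of the Colorado?",
     "response": "The available models of the Colorado are WT, LT, Z71, and ZR2."},
    {"prompt": "", "response": "orphaned answer"},  # hypothetical bad record
]

bad_indices = validate_faqs(sample)
print(bad_indices)  # [1]
```

In the notebook, the same check would be `validate_faqs(all_faqs)`, with an empty list meaning every record is usable.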


## Setting up the Text Vectorizer

The vectorizer converts text into numerical vectors that capture semantic meaning. RedisVL provides several vectorizers, including ones backed by OpenAI and VertexAI. This example uses the Hugging Face text vectorizer, which loads a Sentence Transformers model locally.

In [4]:
from redisvl.utils.vectorize import HFTextVectorizer

vectorizer = HFTextVectorizer(
    model="sentence-transformers/all-MiniLM-L6-v2"
)

20:06:30 numexpr.utils INFO   NumExpr defaulting to 11 threads.
20:06:32 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: mps
20:06:32 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


## Vectorizing the FAQ record pairs

In [5]:
# Embed the prompt (question) of each FAQ record
embeddings = vectorizer.embed_many([pair["prompt"] for pair in all_faqs])

# Confirm we got exactly one embedding per FAQ record
len(embeddings) == len(all_faqs)

True

## Creating the SemanticCache

In [7]:
from redisvl.extensions.llmcache import SemanticCache

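# No redis_url is passed, so the cache connects to the default local Redis instance.
# distance_threshold: maximum vector distance for a cached prompt to count as a hit.
# overwrite=True: recreate the cache index if one already exists.
cache = SemanticCache(vectorizer=vectorizer, distance_threshold=0.2, overwrite=True)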

20:06:45 redisvl.index.index INFO   Index already exists, overwriting.
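
The effect of `distance_threshold` can be illustrated with the two distances that `cache.check()` reports in the "Testing the Semantic Cache" section of this notebook. This is a sketch, not RedisVL code: `observed` and `hits_at` are illustrative names, and treating the boundary as inclusive is an assumption.

```python
# Distances copied from the cache.check() outputs in this notebook.
observed = {
    "What models of chevy colorado are available?": 0.18787831068,
    "What entertainment system comes with the car?": 0.0709647536278,
}

def hits_at(threshold):
    """Queries whose nearest cached prompt lies within the threshold (assumed inclusive)."""
    return [query for query, distance in observed.items() if distance <= threshold]

print(len(hits_at(0.2)))  # 2
print(len(hits_at(0.1)))  # 1
```

At 0.2 both rephrased questions are served from the cache. Tightening the threshold to 0.1 would turn the model-lineup question into a miss, since its distance of about 0.188 only just fits under the value used here.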


## Adding the previously vectorized FAQ pairs to the semantic cache

In [8]:
# Pass the precomputed vectors so the cache does not have to embed each prompt again
for entry, embedding in zip(all_faqs, embeddings):
    cache.store(prompt=entry["prompt"], response=entry["response"], vector=embedding)

## Testing the Semantic Cache

In [9]:
cache.check("What models of chevy colorado are available?")

[{'entry_id': '93f888c0c7ff6f0852dd10581cfe0851d728e785a2a5daeef2b426f86f45dc28',
  'prompt': 'What are the available models of the Colorado?',
  'response': 'The available models of the Colorado are WT, LT, Z71, and ZR2.',
  'vector_distance': 0.18787831068,
  'inserted_at': 1760465207.2,
  'updated_at': 1760465207.2,
  'key': 'llmcache:93f888c0c7ff6f0852dd10581cfe0851d728e785a2a5daeef2b426f86f45dc28'}]

In [10]:
cache.check("What entertainment system comes with the car?")

[{'entry_id': '5464af81efbd39a09db61aa346cbea9416538cbd09edfa5a2930d92f0c9e4a65',
  'prompt': 'What entertainment system is included in the vehicle?',
  'response': 'The vehicle includes the Chevrolet Infotainment 3 Plus system with an 8-inch diagonal HD color touch-screen.',
  'vector_distance': 0.0709647536278,
  'inserted_at': 1760465207.27,
  'updated_at': 1760465207.27,
  'key': 'llmcache:5464af81efbd39a09db61aa346cbea9416538cbd09edfa5a2930d92f0c9e4a65'}]

In [11]:
cache.check("Does the colorado drive on the water?")

[]
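
An empty list means no cached prompt was within the distance threshold. In an application, that is the point where the query would go to the LLM and the new answer would be stored. The sketch below shows that check-then-generate flow under loud assumptions: `InMemoryCache` is a hypothetical stand-in that only imitates the `check()`/`store()` calls used above (it matches on lowercased text and does no vector search), and `fake_llm` is a placeholder for a real model call.

```python
class InMemoryCache:
    """Hypothetical stand-in with check()/store(); exact lowercase match, no vectors."""

    def __init__(self):
        self._entries = {}

    def check(self, prompt):
        response = self._entries.get(prompt.lower())
        if response is None:
            return []
        return [{"prompt": prompt, "response": response}]

    def store(self, prompt, response):
        self._entries[prompt.lower()] = response


def answer(query, cache, generate):
    """Return (response, from_cache). Calls generate() only on a cache miss."""
    hits = cache.check(query)
    if hits:
        return hits[0]["response"], True
    response = generate(query)
    cache.store(prompt=query, response=response)
    return response, False


calls = []

def fake_llm(query):
    """Placeholder for an LLM call; records each invocation."""
    calls.append(query)
    return f"(generated answer for: {query})"


demo_cache = InMemoryCache()
first = answer("Does the colorado drive on the water?", demo_cache, fake_llm)
second = answer("Does the colorado drive on the water?", demo_cache, fake_llm)

print(first[1], second[1], len(calls))  # False True 1
```

Because `answer` only relies on `check()` and `store()`, the same function should work when handed the `cache` object built earlier in place of the stand-in.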