# Semantic Cache Middleware with LangChain Agents

This notebook demonstrates how to use `SemanticCacheMiddleware` with LangChain agents using the standard `create_agent` pattern.

## Key Features

- **Semantic matching**: Cache similar prompts, not just exact matches
- **Cost reduction**: Avoid redundant LLM calls for similar queries
- **Latency improvement**: Cached queries return without an LLM call (0.06s versus 4.44s in the run below)
- **Tool-aware caching**: Smart handling of tool-calling workflows
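
To make "semantic matching" concrete, here is a minimal sketch of a distance-threshold lookup. It is a toy model, not the middleware's implementation: `cosine_distance`, `lookup`, and the three-dimensional vectors are hypothetical, and real embeddings come from a vectorizer model with hundreds of dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def lookup(query_vec, cache, threshold):
    """Return the nearest cached response within `threshold`, or None on a miss."""
    best, best_dist = None, threshold
    for vec, response in cache:
        dist = cosine_distance(query_vec, vec)
        if dist <= best_dist:
            best, best_dist = response, dist
    return best


# Hand-made toy embeddings (hypothetical values, not real model output)
cache = [([0.9, 0.1, 0.0], "The capital of France is Paris.")]
similar = [0.85, 0.15, 0.05]   # points in nearly the same direction
different = [0.1, 0.9, 0.2]    # points elsewhere

hit = lookup(similar, cache, threshold=0.15)
miss = lookup(different, cache, threshold=0.15)
```

Here `hit` holds the cached answer and `miss` is `None`. Lowering `threshold` makes the match stricter, which is the role `distance_threshold` plays in the configuration used later.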

## Tool-Aware Caching

When the LLM uses tools, caching behavior depends on whether tools are deterministic:

- **Deterministic tools** (e.g., calculator): Same input always produces same output. Safe to cache.
- **Non-deterministic tools** (e.g., stock prices): Output changes over time. Don't cache.

Configure this via `deterministic_tools` in `SemanticCacheConfig`:
```python
SemanticCacheConfig(
    deterministic_tools=["calculate", "convert_units"],  # Safe to cache after these
)
```
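
The rule can be sketched as a small predicate. This is an illustration of the behavior described in this notebook (cache skipped when any tool result is present unless every tool that ran is listed as deterministic), not the middleware's actual code; `cache_allowed` and `get_stock_price` are hypothetical names.

```python
from typing import Iterable, Optional


def cache_allowed(
    tools_used: Iterable[str],
    deterministic_tools: Optional[list] = None,
) -> bool:
    """Hypothetical helper: is a cache lookup safe given the tools that ran?"""
    used = list(tools_used)
    if not used:
        return True  # no tool results in the conversation
    if deterministic_tools is None:
        return False  # default: any tool result disables the cache
    # Safe only if every tool that ran is declared deterministic
    return all(name in deterministic_tools for name in used)


after_calculator = cache_allowed(["calculate"], ["calculate"])
after_stock_lookup = cache_allowed(["get_stock_price"], ["calculate"])
```

With this rule, `after_calculator` is `True` and `after_stock_lookup` is `False`.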

## Prerequisites

- Redis 8.0+ or Redis Stack (with RedisJSON and RediSearch)
- OpenAI API key
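
Before running the cells, it can help to confirm that the Redis endpoint is reachable. The sketch below uses only the standard library; `parse_redis_url` and `redis_reachable` are hypothetical helpers, and a successful TCP connection does not prove that the JSON and search modules are loaded.

```python
import socket
from urllib.parse import urlparse


def parse_redis_url(url: str) -> tuple:
    """Split a redis:// URL into (host, port), defaulting to localhost:6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or 6379


def redis_reachable(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the Redis endpoint succeeds."""
    host, port = parse_redis_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = parse_redis_url("redis://redis:6379")
```

Calling `redis_reachable("redis://redis:6379")` then returns `True` or `False` depending on whether the server accepts connections.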

## Note on Async Usage

The Redis middleware uses async methods internally. When using it with `create_agent`, call `await agent.ainvoke()` rather than `agent.invoke()`.
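
Notebooks allow `await` at the top level of a cell, but a plain Python script does not. A minimal sketch of the script form follows; `FakeAgent` is a hypothetical stand-in that only mimics the `ainvoke` entry point, so the block runs without Redis or an API key.

```python
import asyncio


class FakeAgent:
    """Stand-in for an agent; only mimics the async `ainvoke` entry point."""

    async def ainvoke(self, payload: dict) -> dict:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return {"messages": payload["messages"] + ["(stub reply)"]}


async def main() -> dict:
    agent = FakeAgent()
    return await agent.ainvoke({"messages": ["What is the capital of France?"]})


# In a script, start the event loop explicitly instead of using a bare `await`
result = asyncio.run(main())
```

In a real script, the agent built with `create_agent` would take the place of `FakeAgent`.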

## Setup

Install required packages and set API keys.

In [1]:
%%capture --no-stderr
%pip install -U langgraph-checkpoint-redis langchain langchain-openai sentence-transformers

In [2]:
import getpass
import os


def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")


_set_env("OPENAI_API_KEY")

OPENAI_API_KEY:  ········


## Using SemanticCacheMiddleware with create_agent

The `SemanticCacheMiddleware` inherits from LangChain's `AgentMiddleware`, so it can be passed directly to `create_agent`.

In [3]:
import time
from langchain.agents import create_agent
from langchain_core.tools import tool
from langgraph.checkpoint.redis import RedisSaver
from langgraph.middleware.redis import SemanticCacheMiddleware, SemanticCacheConfig

REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379")

# Create the semantic cache middleware
cache_middleware = SemanticCacheMiddleware(
    SemanticCacheConfig(
        redis_url=REDIS_URL,
        name="demo_semantic_cache",
        distance_threshold=0.15,  # Lower = stricter matching
        ttl_seconds=3600,  # Cache entries expire after 1 hour
        cache_final_only=True,  # Only cache final responses (not tool calls)
        # deterministic_tools: List tools whose results are deterministic.
        # When set, cache can be used even after these tools execute.
        # If None (default), cache is skipped when any tool results are present.
        deterministic_tools=["calculate"],  # Calculator is deterministic
    )
)

print("SemanticCacheMiddleware created!")
print("- distance_threshold: 0.15 (semantic matching)")
print("- cache_final_only: True (don't cache tool-calling responses)")
print("- deterministic_tools: ['calculate'] (safe to cache after these tools)")

SemanticCacheMiddleware created!
- distance_threshold: 0.15 (semantic matching)
- cache_final_only: True (don't cache tool-calling responses)
- deterministic_tools: ['calculate'] (safe to cache after these tools)


In [4]:
import ast
import operator as op

# Safe math evaluator - no arbitrary code execution
SAFE_OPS = {
    ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
    ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg,
}

def _eval_node(node):
    if isinstance(node, ast.Constant):
        return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
        return SAFE_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in SAFE_OPS:
        return SAFE_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("Unsupported expression")

def safe_eval(expr: str) -> float:
    return _eval_node(ast.parse(expr, mode='eval').body)


# Define some tools for the agent
@tool
def get_weather(location: str) -> str:
    """Get the current weather for a location."""
    # Simulated weather data
    weather_data = {
        "new york": "72°F, Partly Cloudy",
        "san francisco": "65°F, Foggy",
        "london": "58°F, Rainy",
        "tokyo": "80°F, Sunny",
    }
    location_lower = location.lower()
    for city, weather in weather_data.items():
        if city in location_lower:
            return f"Weather in {location}: {weather}"
    return f"Weather data not available for {location}"


@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    try:
        result = safe_eval(expression)
        return f"{expression} = {result}"
    except Exception as e:
        return f"Error: {e}"


tools = [get_weather, calculate]
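
A quick check of what the evaluator accepts and rejects. The block below repeats a condensed copy of the evaluator from the cell above so that it runs on its own; the test expressions are illustrative.

```python
import ast
import operator as op

SAFE_OPS = {
    ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
    ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg,
}


def _eval_node(node):
    # Only constants and whitelisted operators are walked; anything else raises
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
        return SAFE_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in SAFE_OPS:
        return SAFE_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("Unsupported expression")


def safe_eval(expr: str):
    return _eval_node(ast.parse(expr, mode="eval").body)


arithmetic = safe_eval("2 + 3 * 4")

# A function call parses to ast.Call, which is not whitelisted
try:
    safe_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

Because names, attribute access, and calls are never evaluated, the expression string cannot reach arbitrary Python objects. Very large exponents can still be slow to compute, so a production tool may want an additional size limit.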

In [5]:
# Create the agent with semantic cache middleware
agent = create_agent(
    model="gpt-4o-mini",
    tools=tools,
    middleware=[cache_middleware],  # Pass middleware here!
)

print("Agent created with SemanticCacheMiddleware!")

Agent created with SemanticCacheMiddleware!


## Demonstrating Cache Behavior

Let's make some queries and observe how the cache works. The first query will hit the LLM, while semantically similar queries should hit the cache.

**Important**: We use `await agent.ainvoke()` because the middleware is async-first.
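
The cells below repeat the same start/stop timing pattern. As an optional alternative, it could be wrapped in a small helper; `timed` and `_slow_answer` are hypothetical names, and the sketch uses `time.perf_counter()`, which is better suited to measuring intervals than `time.time()`.

```python
import asyncio
import time


async def timed(coro):
    """Await `coro` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def _slow_answer():
    # Stand-in for `agent.ainvoke(...)`; sleeps briefly instead of calling an LLM
    await asyncio.sleep(0.05)
    return "Paris"


answer, elapsed = asyncio.run(timed(_slow_answer()))
```

Inside a notebook cell the equivalent call would be `result, elapsed = await timed(agent.ainvoke(...))`.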

In [6]:
from langchain_core.messages import HumanMessage

# First query - will be a cache miss (hits the LLM)
print("Query 1: 'What is the capital of France?'")
print("="*50)

start = time.time()
result1 = await agent.ainvoke({"messages": [HumanMessage(content="What is the capital of France?")]})
elapsed1 = time.time() - start

print(f"Response: {result1['messages'][-1].content[:200]}...")
print(f"Time: {elapsed1:.2f}s (cache miss - LLM call)")

Query 1: 'What is the capital of France?'


This vectorizer has no async embed method. Falling back to sync.
This vectorizer has no async embed method. Falling back to sync.


Response: The capital of France is Paris....
Time: 4.44s (cache miss - LLM call)


In [7]:
# Second query - semantically similar, should hit cache
print("\nQuery 2: 'Tell me France's capital city'")
print("="*50)

start = time.time()
result2 = await agent.ainvoke({"messages": [HumanMessage(content="Tell me France's capital city")]})
elapsed2 = time.time() - start

print(f"Response: {result2['messages'][-1].content[:200]}...")
print(f"Time: {elapsed2:.2f}s (expected: cache hit - much faster!)")

This vectorizer has no async embed method. Falling back to sync.



Query 2: 'Tell me France's capital city'
Response: The capital of France is Paris....
Time: 0.06s (expected: cache hit - much faster!)


In [8]:
# Third query - different topic, should be cache miss
print("\nQuery 3: 'What is the capital of Germany?'")
print("="*50)

start = time.time()
result3 = await agent.ainvoke({"messages": [HumanMessage(content="What is the capital of Germany?")]})
elapsed3 = time.time() - start

print(f"Response: {result3['messages'][-1].content[:200]}...")
print(f"Time: {elapsed3:.2f}s (cache miss - different topic)")

This vectorizer has no async embed method. Falling back to sync.



Query 3: 'What is the capital of Germany?'


This vectorizer has no async embed method. Falling back to sync.


Response: The capital of Germany is Berlin....
Time: 0.74s (cache miss - different topic)


In [9]:
# Summary
print("\n" + "="*50)
print("SUMMARY")
print("="*50)
print(f"Query 1 (France capital, miss): {elapsed1:.2f}s")
print(f"Query 2 (France capital, hit):  {elapsed2:.2f}s")
print(f"Query 3 (Germany capital, miss): {elapsed3:.2f}s")

if elapsed2 < elapsed1 * 0.5:
    print("\n Cache hit was significantly faster!")
    print(f"  Speedup: {elapsed1/elapsed2:.1f}x")


SUMMARY
Query 1 (France capital, miss): 4.44s
Query 2 (France capital, hit):  0.06s
Query 3 (Germany capital, miss): 0.74s

 Cache hit was significantly faster!
  Speedup: 71.0x


## Tool Queries and Cache Behavior

By default, `cache_final_only=True`, so only final responses (those that contain no tool calls) are cached. Intermediate responses that request a tool call are never stored.
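
A minimal sketch of the "final response" distinction, assuming a response counts as final when it carries no tool calls. `Reply` and `is_final` are hypothetical stand-ins rather than the middleware's types, and the sketch covers only the `cache_final_only` check, not the separate `deterministic_tools` rule.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    """Hypothetical stand-in for a model message."""
    content: str
    tool_calls: list = field(default_factory=list)


def is_final(reply: Reply) -> bool:
    """A reply is final when the model did not request any tool call."""
    return not reply.tool_calls


# First model turn: asks for the calculator instead of answering
tool_step = Reply(content="", tool_calls=[{"name": "calculate", "args": {"expression": "2 + 2"}}])
# Second model turn: answers using the tool result
final_step = Reply(content="2 + 2 = 4")

cacheable = [r.content for r in (tool_step, final_step) if is_final(r)]
```

Only the second turn ends up in `cacheable`.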

In [10]:
# Query that requires a tool call
print("Query with tool: 'What's the weather in Tokyo?'")
print("="*50)

start = time.time()
result = await agent.ainvoke({"messages": [HumanMessage(content="What's the weather in Tokyo?")]})
elapsed = time.time() - start

print(f"Response: {result['messages'][-1].content}")
print(f"Time: {elapsed:.2f}s")
print("\nNote: The tool-calling step is never cached. get_weather is not in deterministic_tools, so this response should bypass the cache as well.")

This vectorizer has no async embed method. Falling back to sync.


Query with tool: 'What's the weather in Tokyo?'


This vectorizer has no async embed method. Falling back to sync.


Response: The weather in Tokyo is currently 80°F and sunny.
Time: 1.48s

Note: The tool-calling step is never cached. get_weather is not in deterministic_tools, so this response should bypass the cache as well.


## Cleanup

In [11]:
# Close the middleware to release Redis connection
await cache_middleware.aclose()
print("Middleware closed.")
print("Demo complete!")

Middleware closed.
Demo complete!
