
![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?width=120)

# LiteLLM Proxy with Redis

This notebook demonstrates how to use [LiteLLM](https://github.com/BerriAI/litellm) with Redis to build a powerful and efficient LLM proxy server with caching & rate limiting capabilities. LiteLLM provides a unified interface for accessing multiple LLM providers while Redis enhances performance of the application in several different ways.

*This recipe will help you understand*:

* **Why** and **how** to implement exact and semantic caching for LLM calls
* **How** to set up rate limiting for your LLM APIs
* **How** LiteLLM integrates with Redis for state management 
* **How to measure** the performance benefits of caching

> **Open in Colab**   ↘︎  
> <a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/gateway/00_litellm_proxy_redis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>



## 1. Environment Setup  

### Install Python Dependencies
Before we begin, we need to make sure our environment is properly set up with all the necessary tools:

**Requirements**:
* Python ≥ 3.9 with the below packages
* OpenAI API key (set as `OPENAI_API_KEY` environment variable)

First, let's install the required packages.

In [6]:
%pip install -q "litellm[proxy]" "redisvl==0.5.2" requests openai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Install Redis Stack


#### For Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### For Alternative Environments
There are many ways to get the necessary redis-stack instance running
1. On cloud, deploy a [FREE instance of Redis in the cloud](https://redis.io/try-free/). Or, if you have your
own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default this notebook connects to the local instance of Redis Stack. **If you have your own Redis Enterprise instance** - replace REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own.

In [11]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"

### Verify Redis Connection

Let's test our Redis connection to make sure it's working properly:

In [13]:
from redis import Redis

client = Redis.from_url(REDIS_URL)
client.ping()

True

### Set OPENAI API Key

In [14]:
import getpass
import os

def _set_env(key: str):
    if key not in os.environ:
        os.environ[key] = getpass.getpass(f"{key}:")


_set_env("OPENAI_API_KEY")

## 2 · Understanding LiteLLM Caching with Redis

LiteLLM Proxy with Redis provides several powerful capabilities that can significantly improve your LLM application performance and reliability:

* **Exact cache (identical prompt)**: Uses Redis `SETEX` with TTL through the `cache:` configuration
* **Semantic cache (similar prompt)**: Uses RediSearch **vector** indexing through the `semantic_cache:` configuration
* **Rate-limit per user/key**: Uses Redis `INCR + EXPIRE` counters through the `rate_limit:` configuration
* **Multi-model routing**: Uses Redis data structures for model configurations

### Why Use Caching for LLMs?

1. **Cost Reduction**: Avoid redundant API calls for identical or similar prompts
2. **Latency Improvement**: Cached responses return in milliseconds vs. seconds
3. **Reliability**: Reduce dependency on external API availability
4. **Rate Limit Management**: Stay within API provider constraints

In this notebook, we'll explore how these features work and measure their impact on performance.

## 3 · Create a Multi-Model Configuration

Let's create a configuration file for LiteLLM Proxy that includes caching, semantic caching, and rate limiting with Redis. This configuration will route requests to two OpenAI models: `gpt-3.5-turbo` and `gpt‑4o‑mini`.

In [15]:
import pathlib, yaml, textwrap, json


cfg = {
    "model_list": [
        {
            "model_name": "openai-old",
            "litellm_params": {
                "model": "gpt-3.5-turbo"
            }
        },
        {
            "model_name": "openai-new",
            "litellm_params": {
                "model": "gpt-4o"
            }
        }
    ],
    "litellm_settings": {
        "set_verbose": True,
        "cache": True,
        "cache_params": {
            "type": "redis",
            "host": REDIS_HOST,
            "port": REDIS_PORT,
            "password": REDIS_PASSWORD
        }
    }
}

path = pathlib.Path("litellm_redis.yml")
path.write_text(yaml.dump(cfg))
print("Configuration saved to:", path)
print("\nConfiguration details:")
print(path.read_text())

Configuration saved to: litellm_redis.yml

Configuration details:
litellm_settings:
  cache: true
  cache_params:
    host: localhost
    password: ''
    port: '6379'
    type: redis
  set_verbose: true
model_list:
- litellm_params:
    model: gpt-3.5-turbo
  model_name: openai-old
- litellm_params:
    model: gpt-4o
  model_name: openai-new



### Launch LiteLLM Proxy

Now, let's start the LiteLLM Proxy server with our configuration. We'll use native Jupyter bash magic commands instead of using Python subprocess:

In [16]:
%%bash --bg
# Start LiteLLM Proxy in the background
echo "Starting LiteLLM Proxy on port 4000..."
litellm --config litellm_redis.yml --port 4000 > litellm_proxy.log 2>&1
echo "Proxy started! Check litellm_proxy.log for details."

Let's wait a few seconds for the proxy to start and check if it's running:

In [17]:
import time, requests
print("Waiting for proxy to start...")
time.sleep(5)

# Check if proxy is running
try:
    response = requests.get("http://localhost:4000/health")
    print(f"Proxy health check status: {response.status_code}")
    print(f"Response: {response.json()}")
except Exception as e:
    print(f"Error checking proxy: {e}")
    print("\nLast few lines from proxy log:")
    !tail -n 5 litellm_proxy.log

Waiting for proxy to start...
Proxy health check status: 200
Response: {'healthy_endpoints': [], 'unhealthy_endpoints': [], 'healthy_count': 0, 'unhealthy_count': 0}


### Create a Helper Function for Making Requests

Let's create a function to make API calls to our proxy and measure response times:

In [18]:
import requests, time, json, textwrap

def chat(prompt, model="gpt-3.5-turbo", user="demo", verbose=True):
    """Send a chat completion request to the LiteLLM proxy"""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "user": user
    }
    
    t0 = time.time()
    resp = requests.post("http://localhost:4000/v1/chat/completions", 
                        json=payload, 
                        timeout=60)
    elapsed = time.time() - t0
    
    if verbose:
        cache_status = resp.headers.get("X-Cache", "MISS")
        semantic_cache = resp.headers.get("X-Semantic-Cache", "MISS")
        print(f"🕒 Response time: {elapsed:.2f}s | Cache: {cache_status} | Semantic cache: {semantic_cache}")
        
        if resp.status_code == 200:
            print(f"🤖 Response: {resp.json()['choices'][0]['message']['content'][:100]}...")
            
    return elapsed, resp

# Test the function
print("Testing the chat function with a simple prompt...")
_, test_resp = chat("Introduce yourself briefly")
print(f"Status code: {test_resp.status_code}")

Testing the chat function with a simple prompt...
🕒 Response time: 0.01s | Cache: MISS | Semantic cache: MISS
Status code: 400


In [19]:
test_resp.content

b'{"error":{"message":"{\'error\': \'/chat/completions: Invalid model name passed in model=gpt-3.5-turbo. Call `/v1/models` to view available models for your key.\'}","type":"None","param":"None","code":"400"}}'

## 4 · Exact Cache Demonstration

Now we'll demonstrate exact caching by sending the same prompt twice. The first request should hit the LLM API, while the second should be served from cache. We'll see this reflected in the response time and cache headers.

In [None]:
print("🧪 Exact Cache Experiment")
print("\n1️⃣ First Request: (expecting cache MISS)")
lat1, res1 = chat("What are three benefits of Redis for LLM applications?")

print("\n2️⃣ Second Request with Identical Prompt: (expecting cache HIT)")
lat2, res2 = chat("What are three benefits of Redis for LLM applications?")

print(f"\n🔍 Performance Analysis:")
print(f"   First request: {lat1:.3f}s")
print(f"   Second request: {lat2:.3f}s")
if lat1 > 0 and lat2 > 0:
    print(f"   Speed improvement: {lat1/lat2:.1f}x faster")
    print(f"   Time saved: {lat1 - lat2:.3f}s")

### Examining the Cached Keys in Redis

Let's look at the keys created in Redis for the exact cache and understand how LiteLLM structures them:

In [None]:
# Get all keys related to LiteLLM cache
cache_keys = list(r.scan_iter(match="litellm:cache:*"))
print(f"Found {len(cache_keys)} cache keys in Redis")

if cache_keys:
    # Look at the first key
    first_key = cache_keys[0]
    print(f"\nExample cache key: {first_key}")
    
    # Get TTL for the key
    ttl = r.ttl(first_key)
    print(f"TTL: {ttl} seconds")
    
    # Get the value (may be large, so limiting output)
    value = r.get(first_key)
    if value:
        print(f"Value type: {type(value).__name__}")
        print(f"Value size: {len(value)} characters")
        try:
            # Try to parse as JSON for better display
            parsed = json.loads(value[:1000] + '...' if len(value) > 1000 else value)
            print(f"Content preview (JSON): {json.dumps(parsed, indent=2)[:300]}...")
        except:
            print(f"Content preview (raw): {value[:100]}...")

### Benchmarking Cached Response Times

Now, let's precisely measure the cached response time using multiple repeated requests:

In [None]:
# Benchmark cached response time with more samples
def benchmark_cached_query(query, runs=5):
    times = []
    print(f"Benchmarking cached query: '{query}'")
    print(f"Running {runs} iterations...")
    
    for i in range(runs):
        start = time.time()
        elapsed, resp = chat(query, verbose=False)
        times.append(elapsed)
        cache_status = resp.headers.get("X-Cache", "MISS")
        print(f"  Run {i+1}: {elapsed:.4f}s | Cache: {cache_status}")
    
    avg_time = sum(times) / len(times)
    print(f"\nAverage response time: {avg_time:.4f}s")
    print(f"Min: {min(times):.4f}s | Max: {max(times):.4f}s")
    return avg_time

# Run the benchmark
benchmark_cached_query("What are three benefits of Redis for LLM applications?")

## 5 · Semantic Cache Demonstration

Semantic caching is more powerful than exact caching because it can identify semantically similar prompts, not just identical ones. This is implemented using vector embeddings and similarity search in Redis.

Let's test it by sending a prompt that is semantically similar (but not identical) to our previous query:

In [None]:
print("🧪 Semantic Cache Experiment")

# First, let's send a new query that will be stored in the semantic cache
print("\n1️⃣ Establishing a baseline query for semantic cache:")
lat1, res1 = chat("Tell me a useful application of Redis for AI systems")

# Now send a semantically similar query
print("\n2️⃣ Testing a semantically similar query:")
lat2, res2 = chat("What's a good use case for Redis in artificial intelligence?")

# Try a completely different query
print("\n3️⃣ Testing an unrelated query (should not hit semantic cache):")
lat3, res3 = chat("How to make chocolate chip cookies?")

print(f"\n🔍 Performance Analysis:")
print(f"   Original query: {lat1:.3f}s")
print(f"   Similar query: {lat2:.3f}s")
print(f"   Unrelated query: {lat3:.3f}s")

sim_cache_hit = "HIT" in res2.headers.get("X-Semantic-Cache", "MISS")
if sim_cache_hit and lat1 > 0 and lat2 > 0:
    print(f"   Speed improvement: {lat1/lat2:.1f}x faster for semantically similar query")

### Examining Semantic Cache Keys

Let's look at the keys and indices created in Redis for the semantic cache:

In [None]:
# Check semantic cache keys
semantic_keys = list(r.scan_iter(match="litellm:semantic*"))
print(f"Found {len(semantic_keys)} semantic cache keys in Redis")

if semantic_keys:
    # Display the first few keys
    for key in semantic_keys[:5]:
        print(f"  - {key}")
    
    # Check for Redis Search indices
    try:
        indices = r.execute_command("FT._LIST")
        print(f"\nRedis Search indices: {indices}")
        
        # Get info about the semantic cache index if it exists
        semantic_index = [idx for idx in indices if "semantic" in idx.lower()]
        if semantic_index:
            index_info = r.execute_command(f"FT.INFO {semantic_index[0]}")
            print(f"\nSemantic Index Info:")
            # Format and display selected info
            info_dict = {index_info[i]: index_info[i+1] for i in range(0, len(index_info), 2) if i+1 < len(index_info)}
            for k in ['num_docs', 'num_terms', 'index_name', 'index_definition']:
                if k in info_dict:
                    print(f"  {k}: {info_dict[k]}")
    except Exception as e:
        print(f"Error accessing Redis Search indices: {e}")

### How Semantic Caching Works

LiteLLM's semantic caching works through these steps:
1. When a query arrives, LiteLLM generates an embedding vector for the query using the configured model
2. This vector is searched against previously stored vectors in Redis using cosine similarity
3. If a match is found with similarity above the threshold (we set 0.9), the cached response is returned
4. If not, the query is sent to the LLM API and the result is cached with its vector

This approach is especially valuable for:
- Applications with many similar but not identical queries
- Customer support systems where questions vary in phrasing but seek the same information
- Educational applications where different students may ask similar questions

## 6 · Multi-Model Routing with LiteLLM Proxy

Our configuration enables access to multiple models through a single endpoint. Let's test both the configured models to verify they work:

In [None]:
print("🧪 Multi-Model Routing Demonstration")

models = ["gpt-3.5-turbo", "gpt-4o-mini"]
results = {}

for model in models:
    print(f"\nTesting model: {model}")
    lat, res = chat("Say hi in two words", model=model)
    
    if res.status_code == 200:
        response_content = res.json()["choices"][0]["message"]["content"]
        results[model] = {
            "latency": lat,
            "response": response_content,
            "model": res.json().get("model", model)
        }
        print(f"✅ Success | Response: '{response_content}'")
    else:
        print(f"❌ Error | Status code: {res.status_code}")
        print(f"Error message: {res.text}")

# Compare the models
if len(results) > 1:
    print("\n📊 Model Comparison:")
    for model, data in results.items():
        print(f"  {model}: {data['latency']:.2f}s - '{data['response']}'")

## 7 · Testing Failure Modes

Let's examine how the proxy handles error conditions, which is important for building robust applications.

In [None]:
print("🧪 Testing Error Handling")

# Test with an unsupported model
print("\n1️⃣ Testing with non-existent model:")
_, bad_model_resp = chat("test", model="gpt-nonexistent-001")
print(f"Status: {bad_model_resp.status_code}")
print(f"Error message: {json.dumps(bad_model_resp.json(), indent=2)}")

### Testing Rate Limiting

The LiteLLM proxy includes rate limiting functionality, which helps protect your API keys from overuse. Let's test this by sending requests rapidly until we hit the rate limit:

In [None]:
print("🧪 Testing Rate Limiting")
print("Sending multiple requests with the same user ID to trigger rate limiting...")

for i in range(5):
    _, r2 = chat(f"Request {i+1}", user="test-rate-limit")
    remaining = r2.headers.get("X-Rate-Limit-Remaining", "unknown")
    limit_reset = r2.headers.get("X-Rate-Limit-Reset", "unknown")
    
    print(f"Request {i+1}: Status {r2.status_code} | Remaining: {remaining} | Reset: {limit_reset}")
    
    if r2.status_code == 429:
        print(f"Rate limit reached after {i+1} requests!")
        print(f"Error response: {json.dumps(r2.json(), indent=2)}")
        break
        
    time.sleep(0.5)  # Small delay to see rate limiting in action

## 8 · Implementation Options

LiteLLM provides multiple ways to implement caching in your application:

### Using LiteLLM Proxy (as shown)

The proxy approach (demonstrated in this notebook) is recommended for production deployments because it:
- Provides a unified API endpoint for all your models
- Centralizes caching, rate-limiting, and fallback logic
- Works with any client that uses the OpenAI API format
- Supports multiple languages and frameworks

### Direct Integration with LiteLLM Python SDK

For Python applications, you can also integrate caching directly using the SDK. See the [LiteLLM Caching documentation](https://docs.litellm.ai/docs/caching/all_caches) for details.

## 9 · Cleanup

Let's stop the LiteLLM proxy server and clean up our environment:

In [None]:
%%bash
# Find and stop the LiteLLM process
echo "Stopping LiteLLM Proxy..."
litellm_pid=$(ps aux | grep "litellm --config" | grep -v grep | awk '{print $2}')
if [ -n "$litellm_pid" ]; then
    kill $litellm_pid
    echo "Stopped LiteLLM Proxy (PID: $litellm_pid)"
else
    echo "LiteLLM Proxy not found running"
fi

# Optionally stop Redis if you started it just for this notebook
# Note: Comment this out if you want to keep Redis running
# redis-cli shutdown

## Summary

In this notebook, we've demonstrated how to:

1. **Set up LiteLLM Proxy** with Redis for caching and rate limiting
2. **Configure exact and semantic caching** to improve performance
3. **Measure the performance benefits** of caching LLM responses
4. **Route requests to multiple models** through a single endpoint
5. **Test error handling and rate limiting** behavior

The benchmarks clearly show that implementing caching with Redis can significantly reduce response times and API costs, making it an essential component of production LLM applications.

For more information, see the [LiteLLM documentation](https://docs.litellm.ai/docs/proxy/caching) and [Redis documentation](https://redis.io/docs/).