
<div style="display: flex; justify-content: space-around; align-items: center;">
  <img src="https://redis.io/wp-content/uploads/2024/04/Logotype.svg" width="150" alt="Redis">
  <img src="https://awsmp-logos.s3.amazonaws.com/seller-xw5kijmvmzasy/c233c9ade2ccb5491072ae232c814942.png" width="200" alt="LiteLLM">
</div>

# LiteLLM Proxy with Redis

This notebook demonstrates how to use [LiteLLM](https://github.com/BerriAI/litellm) with Redis to build an efficient LLM proxy server with caching and rate limiting. LiteLLM provides a unified interface for accessing multiple LLM providers, while Redis improves application performance by serving repeated or similar requests from a cache.

*This recipe will help you understand*:

* **How** to set up LiteLLM as a proxy for different LLM endpoints
* **Why** and **how** to implement exact and semantic caching for LLM calls

**Open in Colab**

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/gateway/00_litellm_proxy_redis.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>



## 1 · Environment Setup  
Before we begin, we need to make sure our environment is properly set up with all the necessary tools and resources.

**Requirements**:
* Python ≥ 3.9 with the packages listed below
* OpenAI API key (set as `OPENAI_API_KEY` environment variable)


### Install Python Dependencies

First, let's install the required packages.

In [None]:
%pip install "litellm[proxy]==1.68.0" "redisvl==0.5.2" requests openai

### Install Redis Stack


#### For Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


#### For Alternative Environments
There are many ways to get the necessary redis-stack instance running:
1. In the cloud, deploy a [FREE instance of Redis in the cloud](https://redis.io/try-free/). Or, if you have your own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default, this notebook connects to a local instance of Redis Stack. **If you have your own Redis Cloud or Redis Enterprise instance**, replace the `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` values with your own.

In [1]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
os.environ["REDIS_URL"] = REDIS_URL
os.environ["REDIS_HOST"] = REDIS_HOST
os.environ["REDIS_PORT"] = REDIS_PORT
os.environ["REDIS_PASSWORD"] = REDIS_PASSWORD
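
The URL above embeds the password verbatim, so a password containing characters such as `/` or `#` may be misparsed. A minimal sketch of a more defensive builder follows. The helper name `build_redis_url` and the host `example.invalid` are hypothetical, and the sketch assumes the client decodes percent-escapes in the password when parsing the URL.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: int, password: str = "", tls: bool = False) -> str:
    """Hypothetical helper: assemble a Redis URL, percent-encoding the password."""
    scheme = "rediss" if tls else "redis"  # rediss:// is the prefix for SSL/TLS endpoints
    # Only emit the ":password@" part when a password is actually set.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url("localhost", 6379))
print(build_redis_url("example.invalid", 18374, password="p@ss/word", tls=True))
```

With no password the helper omits the auth segment entirely instead of producing `redis://:@host:port`.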

### Verify Redis Connection

Let's test our Redis connection to make sure it's working properly:

In [132]:
from redis import Redis

client = Redis.from_url(REDIS_URL)
client.ping()

True

In [133]:
client.flushall()

True

### Set the OpenAI API Key

In [None]:
import getpass
import os

os.environ["LITELLM_LOG"] = "DEBUG"

def _set_env(key: str):
    if key not in os.environ:
        os.environ[key] = getpass.getpass(f"{key}:")

_set_env("OPENAI_API_KEY")


## 2 · Running the LiteLLM Proxy
First, we will define a LiteLLM config that contains:

- a few supported model options
- an exact-match caching configuration using Redis (`type: redis`)

In [234]:
%%writefile litellm_redis.yml
model_list:
- litellm_params:
    api_key: os.environ/OPENAI_API_KEY
    model: gpt-3.5-turbo
    rpm: 30
  model_name: gpt-3.5-turbo
- litellm_params:
    api_key: os.environ/OPENAI_API_KEY
    model: gpt-4o-mini
    rpm: 30
  model_name: gpt-4o-mini
- litellm_params:
    api_key: os.environ/OPENAI_API_KEY
    model: text-embedding-3-small
  model_name: text-embedding-3-small

litellm_settings:
  cache: True
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
    default_in_redis_ttl: 60

Overwriting litellm_redis.yml


Next, we define helper code that starts and stops the **LiteLLM** proxy as a background process on the host machine.

In [235]:
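import subprocess, atexit, os, signal, socket, time, pathlib, textwrap, sys
from typing import Optional


# Optional[...] instead of `X | None` keeps this cell runnable on Python 3.9
_proxy_handle: Optional[subprocess.Popen] = None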


def _is_port_open(port: int) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.25)
        return s.connect_ex(("127.0.0.1", port)) == 0

def start_proxy(
    config_path: str = "litellm_redis.yml",
    port: int = 4000,
    log_path: str = "litellm_proxy.log",
    restart: bool = True,
    timeout: float = 10.0,          # seconds we’re willing to wait
) -> subprocess.Popen:

    global _proxy_handle

    # ── 1. stop running proxy we launched earlier ──
    if _proxy_handle and _proxy_handle.poll() is None:
        if restart:
            _proxy_handle.terminate()
            _proxy_handle.wait(timeout=3)
            time.sleep(1)          # give the OS a breath
        else:
            print(f"LiteLLM already running (PID {_proxy_handle.pid}) — reusing.")
            return _proxy_handle

    # ── 2. ensure the port is free ──
    if _is_port_open(port):
        print(f"Port {port} busy; trying to free it …")
        pids = os.popen(f"lsof -ti tcp:{port}").read().strip().splitlines()
        for pid in pids:
            try:
                os.kill(int(pid), signal.SIGTERM)
            except Exception:
                pass
        time.sleep(1)

    # ── 3. launch proxy ──
    log_file = open(log_path, "w")
    cmd = ["litellm", "--config", config_path, "--port", str(port), "--detailed_debug"]
    _proxy_handle = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)

    atexit.register(lambda: _proxy_handle and _proxy_handle.terminate())

    # ── 4. readiness loop with timeout & crash detection ──
    deadline = time.time() + timeout
    while time.time() < deadline:
        if _is_port_open(port):
            break
        if _proxy_handle.poll() is not None:             # died early
            last_lines = pathlib.Path(log_path).read_text().splitlines()[-20:]
            raise RuntimeError(
                "LiteLLM exited before opening the port:\n" +
                textwrap.indent("\n".join(last_lines), "  ")
            )
        time.sleep(0.25)
    else:
        _proxy_handle.terminate()
        raise RuntimeError(f"LiteLLM proxy did not open port {port} within {timeout}s.")

    print(f"✅ LiteLLM proxy on http://localhost:{port} (PID {_proxy_handle.pid})")
    print(f"   Logs → {pathlib.Path(log_path).resolve()}")
    return _proxy_handle


def stop_proxy() -> None:
    global _proxy_handle
    if _proxy_handle and _proxy_handle.poll() is None:
        _proxy_handle.terminate()
        _proxy_handle.wait(timeout=3)
        print("LiteLLM proxy stopped.")
    _proxy_handle = None

Start up the LiteLLM proxy for the first time.

In [236]:
_proxy_handle = start_proxy()

✅ LiteLLM proxy on http://localhost:4000 (PID 63464)
   Logs → /content/litellm_proxy.log


Now we will add a simple helper function to test out models.

In [237]:
import requests


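def call_model(text: str, model: str = "gpt-4o-mini"):
    """Send one chat message through the proxy; print the reply, ID, model, and latency."""
    r = None  # stays None if no response was received (e.g. connection error)
    try:
        t0 = time.time()
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": text}]
        }
        r = requests.post("http://localhost:4000/chat/completions", json=payload, timeout=30)
        r.raise_for_status()
        data = r.json()
        print(data["choices"][0]["message"]["content"])
        print(f"{data['id']} -- {data['model']} -- latency: {time.time() - t0:.2f}s \n")
        return r
    except Exception as e:
        print(str(e))
        # Only inspect the error body if the proxy actually responded
        if r is not None:
            try:
                body = r.json()
            except ValueError:
                body = {}
            if "error" in body:
                print(body["error"]["message"])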

In [238]:
res = call_model("hello, how are you?")

Hello! I'm just a program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?
chatcmpl-BUdDxEetmH0k6yJkaDLeSshRZmGnz -- gpt-4o-mini-2024-07-18 -- latency: 0.90s 



In [239]:
res = call_model("hello, how are you?", model="gpt-3.5-turbo")

Hello! I'm just a computer program, so I don't have feelings, but I'm here to assist you. How can I help you today?
chatcmpl-BUdDySZjzxB8tCTLkuYDTyPFfKo1P -- gpt-3.5-turbo-0125 -- latency: 0.65s 



In [240]:
# Try a non-supported model!
res = call_model("hello, how are you?", model="claude")

400 Client Error: Bad Request for url: http://localhost:4000/chat/completions
{'error': '/chat/completions: Invalid model name passed in model=claude. Call `/v1/models` to view available models for your key.'}


## 3 · Implement LLM caching with Redis

LiteLLM Proxy with Redis provides two caching modes that can reduce the latency and cost of your LLM application:

* **Exact cache (identical prompt)**: Pulls exact prompt/query matches from Redis with configurable TTL.
* **Semantic cache (similar prompt)**: Uses Redis as a semantic cache powered by **vector search** to determine if a prompt/query is similar enough to a cached entry.

### Why Use Caching for LLMs?

1. **Cost Reduction**: Avoid redundant API calls for identical or similar prompts
2. **Latency Improvement**: Cached responses return in milliseconds vs. seconds
3. **Reliability**: Reduce dependency on external API availability
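
To make the exact-cache idea concrete, here is a toy in-memory sketch: the request is hashed into a key, and an entry is served only while its TTL has not elapsed. The class `ExactCache` and its key scheme are illustrative assumptions, not LiteLLM's actual implementation, and a plain dictionary stands in for Redis.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List, Optional, Tuple


class ExactCache:
    """Toy in-memory stand-in for an exact-match cache with a TTL."""

    def __init__(self, ttl: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def make_key(model: str, messages: List[dict]) -> str:
        # Identical requests serialize identically, so they hash to the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, messages: List[dict]) -> Optional[str]:
        key = self.make_key(model, messages)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return response

    def set(self, model: str, messages: List[dict], response: str) -> None:
        self._store[self.make_key(model, messages)] = (self.clock() + self.ttl, response)


exact_cache = ExactCache(ttl=60)
msgs = [{"role": "user", "content": "what is the capital of france?"}]
print(exact_cache.get("gpt-4o-mini", msgs))  # first lookup: nothing stored yet
exact_cache.set("gpt-4o-mini", msgs, "The capital of France is Paris.")
print(exact_cache.get("gpt-4o-mini", msgs))  # identical request: served from the cache
```

Because the key covers both the model and the messages, the same prompt sent to a different model is a miss in this sketch.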


In [241]:
%%timeit
res = call_model("what is the capital of france?")

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.63s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.03s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

18.6 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Check response equivalence:

In [242]:
res1 = call_model("what is the capital of france?")
res2 = call_model("what is the capital of france?")

assert res1.json() == res2.json()

res1.json()

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 

The capital of France is Paris.
chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s 



{'id': 'chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8',
 'created': 1746640319,
 'model': 'gpt-4o-mini-2024-07-18',
 'object': 'chat.completion',
 'system_fingerprint': 'fp_129a36352a',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'message': {'content': 'The capital of France is Paris.',
    'role': 'assistant',
    'tool_calls': None,
    'function_call': None,
    'annotations': []}}],
 'usage': {'completion_tokens': 8,
  'prompt_tokens': 14,
  'total_tokens': 22,
  'completion_tokens_details': {'accepted_prediction_tokens': 0,
   'audio_tokens': 0,
   'reasoning_tokens': 0,
   'rejected_prediction_tokens': 0},
  'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}},
 'service_tier': 'default'}

## 4 · Semantic caching

Now we'll demonstrate semantic caching by sending similar prompts back to back. The first request should hit the LLM API, while future requests should be served from cache as long as they are similar enough. We'll see this reflected in the response times.

First, we need to stop the running proxy and update the LiteLLM config.

In [243]:
# Stop the proxy process
_proxy_handle.terminate()
_proxy_handle.wait(timeout=4)

-15

In [244]:
%%writefile litellm_redis.yml
model_list:
- litellm_params:
    api_key: os.environ/OPENAI_API_KEY
    model: gpt-3.5-turbo
    rpm: 30
  model_name: gpt-3.5-turbo
- litellm_params:
    api_key: os.environ/OPENAI_API_KEY
    model: gpt-4o-mini
    rpm: 30
  model_name: gpt-4o-mini
- litellm_params:
    api_key: os.environ/OPENAI_API_KEY
    model: text-embedding-3-small
  model_name: text-embedding-3-small

litellm_settings:
  cache: True
  set_verbose: True
  cache_params:
    type: redis-semantic
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
    ttl: 60
    similarity_threshold: 0.90
    redis_semantic_cache_embedding_model: text-embedding-3-small
    redis_semantic_cache_index_name: llmcache

Overwriting litellm_redis.yml


In [245]:
_proxy_handle = start_proxy()

✅ LiteLLM proxy on http://localhost:4000 (PID 63528)
   Logs → /content/litellm_proxy.log


The semantic cache also handles exact-match scenarios (where the characters/tokens are identical). This is more common in a development environment or when a programmatic user provides the input to an LLM call.

In [246]:
%%timeit

call_model("what is the capital city of the United States?")

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 1.35s 

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.37s 

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.53s 

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.47s 

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.36s 

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.24s 

The capital city of the United States is Washington, D.C.
chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.39s 


The additional (and variable) latency per check comes from the OpenAI embedding call, which goes over the network. A more optimized setup would use a more scalable embedding inference service or a local model that doesn't require a network hop.
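
Conceptually, each semantic cache check embeds the incoming prompt and compares that vector with the stored prompt vectors. The sketch below illustrates the comparison with made-up 3-dimensional vectors and a linear scan. The function names are hypothetical, and Redis uses a vector index rather than a scan, so treat this as an illustration of the thresholding logic only.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    similarity_threshold: float = 0.90,
) -> Optional[str]:
    """Return the response of the most similar entry if it clears the threshold."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= similarity_threshold else None


# Made-up "embeddings"; real embedding vectors have far more dimensions.
toy_entries = [
    ([1.0, 0.0, 0.0], "cached answer about the US capital"),
    ([0.0, 1.0, 0.0], "cached answer about the French president"),
]
print(semantic_lookup([0.95, 0.05, 0.0], toy_entries))  # very close to the first entry
print(semantic_lookup([0.6, 0.6, 0.5], toy_entries))    # not close enough to either entry
```

Raising `similarity_threshold` makes the cache stricter (fewer hits, fewer wrong answers), while lowering it trades accuracy for a higher hit rate.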

The semantic cache can also be used for near-exact matches (fuzzy caching) based on semantic meaning. Below are a few scenarios:

In [258]:
texts = [
    "who is the president of France?",
    "who is the country president of France?",
    "who is France's current presidet?",
    "The current president of France is?"
]

for text in texts:
  res = call_model(text)

As of my last update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017. However, please verify with a current source, as political positions can change.
chatcmpl-BUdHNxLLb7HBmnTUUHRQpxWBVhGAI -- gpt-4o-mini-2024-07-18 -- latency: 2.37s 

As of my last knowledge update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017, and was re-elected for a second term in April 2022. Please verify with up-to-date sources, as political situations can change.
chatcmpl-BUdHOz7UCsO4KKKcDfx8ZGv2LJ6dZ -- gpt-4o-mini-2024-07-18 -- latency: 1.38s 

As of my last update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017. However, please verify with a current source, as political positions can change.
chatcmpl-BUdHNxLLb7HBmnTUUHRQpxWBVhGAI -- gpt-4o-mini-2024-07-18 -- latency: 0.65s 

As of my last update in October 2023, the President of France is Emmanuel 

## 5 · Inspect Redis Index with RedisVL
Use the `redisvl` helpers and CLI to inspect the underlying vector index that backs the semantic cache checks in the LiteLLM proxy.

In [248]:
from redisvl.index import SearchIndex

idx = SearchIndex.from_existing(redis_client=client, name="llmcache")

In [249]:
idx.exists()

True

In [250]:
!rvl index info -i llmcache

17:52:13 [RedisVL] INFO   Using Redis address from environment variable, REDIS_URL


Index Information:
╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes     │ Index Options   │   Indexing │
├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤
│ llmcache     │ HASH           │ ['llmcache'] │ []              │          0 │
╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────────┬───────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name          │ Attribute     │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├───────────────┼───────────────┼─────────┼────────────────┼────────────────┼──────────────

### Examining the Cached Keys in Redis

Let's look at the keys created in Redis for the cache and understand how LiteLLM structures them:

In [251]:
import json

from redisvl.redis.utils import convert_bytes
from redisvl.query import FilterQuery

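# Get all keys related to the LiteLLM cache.
# paginate() yields batches of documents, so flatten them into a single list.
cache_keys = [doc for batch in idx.paginate(query=FilterQuery()) for doc in batch]
print(f"Found {len(cache_keys)} cache keys in Redis")

if cache_keys:
    # Look at the first key
    first_key = cache_keys[0]['id']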
    print(f"\nExample cache key: {first_key}")

    # Get TTL for the key
    ttl = client.ttl(first_key)
    print(f"TTL: {ttl} seconds remaining...")

    # Get the stored hash fields (the cached response plus the raw prompt vector)
    value = client.hgetall(first_key)
    if value:
        v = convert_bytes(value)
        print(v)


Found 1 cache keys in Redis

Example cache key: llmcache:e4e4faaeea347b9876d03c4f68b7d981234a3a7a4281590ab4bc0e70dbdaef9e
TTL: 55 seconds remaining...
{'response': '{\'timestamp\': 1746640328.978919, \'response\': \'{"id":"chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ","created":1746640328,"model":"gpt-4o-mini-2024-07-18","object":"chat.completion","system_fingerprint":"fp_dbaca60df0","choices":[{"finish_reason":"stop","index":0,"message":{"content":"The capital city of the United States is Washington, D.C.","role":"assistant","tool_calls":null,"function_call":null,"annotations":[]}}],"usage":{"completion_tokens":14,"prompt_tokens":17,"total_tokens":31,"completion_tokens_details":{"accepted_prediction_tokens":0,"audio_tokens":0,"reasoning_tokens":0,"rejected_prediction_tokens":0},"prompt_tokens_details":{"audio_tokens":0,"cached_tokens":0}},"service_tier":"default"}\'}', 'prompt_vector': b'\xccY/=\xbf0\x00\xbdd\x0f\xa2=X\xa5\xc8=\x1f\t-\xbc\\\x1d\x1b\xbc^\xda\xdb\xbc\x02\xfc<<t\xb4\x80;CI\x1b

## 6 · Implementation Options

LiteLLM provides multiple ways to implement caching in your application:

### Using LiteLLM Proxy (as shown)

The proxy approach (demonstrated in this notebook) is recommended for production deployments because it:
- Provides a unified API endpoint for all your models
- Centralizes caching, rate-limiting, and fallback logic
- Works with any client that uses the OpenAI API format
- Supports multiple languages and frameworks
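
As a small illustration of the OpenAI-format point, the sketch below builds the same chat completion request with only the Python standard library. The helper name `build_chat_request` is hypothetical, the URL and payload shape mirror the `call_model` helper used earlier in this notebook, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "gpt-4o-mini",
    base_url: str = "http://localhost:4000",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request for the proxy."""
    body = json.dumps({"model": model, "messages": [{"role": "user", "content": prompt}]})
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("hello, how are you?")
print(req.get_method(), req.full_url)

# To actually send it (requires the proxy to be running):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```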

### Direct Integration with LiteLLM Python SDK

For Python applications, you can also integrate caching directly using the SDK. See the [LiteLLM Caching documentation](https://docs.litellm.ai/docs/caching/all_caches) for details.

## 7 · Semantic caching with RedisVL directly
In some cases you may want more control over your cache. If so, you can use the RedisVL semantic cache directly in your application code, as shown below:


In [252]:
from redisvl.utils.vectorize import OpenAITextVectorizer
from redisvl.extensions.llmcache import SemanticCache

oai = OpenAITextVectorizer("text-embedding-3-small")

cache = SemanticCache(
    redis_client=client,
    distance_threshold=0.1,
    overwrite=False,
    vectorizer=oai,
)

In [253]:
cache.store(
    prompt="what is the capital city of the United States?",
    response="Washington DC is the capital of the USA."
)

'llmcache:e4e4faaeea347b9876d03c4f68b7d981234a3a7a4281590ab4bc0e70dbdaef9e'

In [255]:
cache.check(prompt="what is the capital of the United States of America?")

[{'entry_id': 'e4e4faaeea347b9876d03c4f68b7d981234a3a7a4281590ab4bc0e70dbdaef9e',
  'prompt': 'what is the capital city of the United States?',
  'response': 'Washington DC is the capital of the USA.',
  'vector_distance': 0.0600312948227,
  'inserted_at': 1746640334.45,
  'updated_at': 1746640334.45,
  'key': 'llmcache:e4e4faaeea347b9876d03c4f68b7d981234a3a7a4281590ab4bc0e70dbdaef9e'}]
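
The `vector_distance` in the result is a cosine distance, which Redis reports as 1 minus the cosine similarity. The sketch below converts the observed distance to a similarity and applies the threshold. The helper names are hypothetical, and the inclusive comparison (`<=`) is an assumption about how the threshold is applied. Under the same assumption, `distance_threshold=0.1` plays the role of the `similarity_threshold: 0.90` used in the proxy config.

```python
def distance_to_similarity(vector_distance: float) -> float:
    """Convert a cosine distance into a cosine similarity (similarity = 1 - distance)."""
    return 1.0 - vector_distance


def is_cache_hit(vector_distance: float, distance_threshold: float = 0.1) -> bool:
    # Assumption: an entry counts as a hit when its distance is within the threshold.
    return vector_distance <= distance_threshold


observed = 0.0600312948227  # vector_distance from the check above
print(round(distance_to_similarity(observed), 4))
print(is_cache_hit(observed))
```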

The prompts below should NOT produce any cache hits, because the only entry in the cache is about the US capital.

In [256]:
texts = [
    "who is the president of France?",
    "who is the country president of France?",
    "who is France's current presidet?",
    "The current president of France is?"
]

for text in texts:
    print(cache.check(prompt=text))

[]
[]
[]
[]


## 8 · Cleanup

Let's stop the LiteLLM proxy server and clean up our environment:

In [259]:
_proxy_handle.terminate()
_proxy_handle.wait(timeout=4)
cache.clear()
client.flushall()

True