From 93fd926062d87163c81357410ef9446101d5a7a3 Mon Sep 17 00:00:00 2001 From: Tyler Hutcherson Date: Thu, 1 May 2025 08:16:58 -0400 Subject: [PATCH 1/2] WIP on LiteLLM example --- .../gateway/00_litellm_proxy_redis.ipynb | 870 ++++++++++++++++++ python-recipes/gateway/litellm_redis.yml | 15 + 2 files changed, 885 insertions(+) create mode 100644 python-recipes/gateway/00_litellm_proxy_redis.ipynb create mode 100644 python-recipes/gateway/litellm_redis.yml diff --git a/python-recipes/gateway/00_litellm_proxy_redis.ipynb b/python-recipes/gateway/00_litellm_proxy_redis.ipynb new file mode 100644 index 0000000..57f288f --- /dev/null +++ b/python-recipes/gateway/00_litellm_proxy_redis.ipynb @@ -0,0 +1,870 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "47c3fefa", + "metadata": {}, + "source": [ + "\n", + "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?width=120)\n", + "\n", + "# LiteLLM Proxy with Redis\n", + "\n", + "This notebook demonstrates how to use [LiteLLM](https://github.com/BerriAI/litellm) with Redis to build a powerful and efficient LLM proxy server with caching & rate limiting capabilities. LiteLLM provides a unified interface for accessing multiple LLM providers while Redis enhances performance of the application in several different ways.\n", + "\n", + "*This recipe will help you understand*:\n", + "\n", + "* **Why** and **how** to implement exact and semantic caching for LLM calls\n", + "* **How** to set up rate limiting for your LLM APIs\n", + "* **How** LiteLLM integrates with Redis for state management \n", + "* **How to measure** the performance benefits of caching\n", + "\n", + "> **Open in Colab** ↘︎ \n", + "> \"Open\n" + ] + }, + { + "cell_type": "markdown", + "id": "06c7b959", + "metadata": {}, + "source": [ + "\n", + "## 1. Environment Setup \n", + "\n", + "### Install Python Dependencies\n", + "Before we begin, we need to make sure our environment is properly set up with all the necessary tools:\n", + "\n", + "**Requirements**:\n", + "* Python ≥ 3.9 with the below packages\n", + "* OpenAI API key (set as `OPENAI_API_KEY` environment variable)\n", + "\n", + "First, let's install the required packages." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "47246c48", + "metadata": { + "id": "pip" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -q \"litellm[proxy]\" \"redisvl==0.5.2\" requests openai" + ] + }, + { + "cell_type": "markdown", + "id": "redis-setup", + "metadata": {}, + "source": [ + "### Install Redis Stack\n", + "\n", + "\n", + "#### For Colab\n", + "Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0db80601", + "metadata": { + "id": "redis-install" + }, + "outputs": [], + "source": [ + "# NBVAL_SKIP\n", + "%%sh\n", + "curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\n", + "echo \"deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/redis.list\n", + "sudo apt-get update > /dev/null 2>&1\n", + "sudo apt-get install redis-stack-server > /dev/null 2>&1\n", + "redis-stack-server --daemonize yes" + ] + }, + { + "cell_type": "markdown", + "id": "b750e779", + "metadata": {}, + "source": [ + "#### For Alternative Environments\n", + "There are many ways to get the necessary redis-stack instance running\n", + "1. On cloud, deploy a [FREE instance of Redis in the cloud](https://redis.io/try-free/). Or, if you have your\n", + "own version of Redis Enterprise running, that works too!\n", + "2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)\n", + "3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`" + ] + }, + { + "cell_type": "markdown", + "id": "177e9fe3", + "metadata": {}, + "source": [ + "### Define the Redis Connection URL\n", + "\n", + "By default this notebook connects to the local instance of Redis Stack. **If you have your own Redis Enterprise instance** - replace REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "be77a1d3", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Replace values below with your own if using Redis Cloud instance\n", + "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\") # ex: \"redis-18374.c253.us-central1-1.gce.cloud.redislabs.com\"\n", + "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\") # ex: 18374\n", + "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\") # ex: \"1TNxTEdYRDgIDKM2gDfasupCADXXXX\"\n", + "\n", + "# If SSL is enabled on the endpoint, use rediss:// as the URL prefix\n", + "REDIS_URL = f\"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}\"" + ] + }, + { + "cell_type": "markdown", + "id": "redis-connection", + "metadata": {}, + "source": [ + "### Verify Redis Connection\n", + "\n", + "Let's test our Redis connection to make sure it's working properly:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "f3ddcabf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from redis import Redis\n", + "\n", + "client = Redis.from_url(REDIS_URL)\n", + "client.ping()" + ] + }, + { + "cell_type": "markdown", + "id": "ce052678", + "metadata": {}, + "source": [ + "### Set OPENAI API Key" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "e21ac07e", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "def _set_env(key: str):\n", + " if key not in os.environ:\n", + " os.environ[key] = getpass.getpass(f\"{key}:\")\n", + "\n", + "\n", + "_set_env(\"OPENAI_API_KEY\")" + ] + }, + { + "cell_type": "markdown", + "id": "fc65bfdd", + "metadata": {}, + "source": [ + "## 2 · Understanding LiteLLM Caching with Redis\n", + "\n", + "LiteLLM Proxy with Redis provides several powerful capabilities that can significantly 
improve your LLM application performance and reliability:\n", + "\n", + "* **Exact cache (identical prompt)**: Uses Redis `SETEX` with TTL through the `cache:` configuration\n", + "* **Semantic cache (similar prompt)**: Uses RediSearch **vector** indexing through the `semantic_cache:` configuration\n", + "* **Rate-limit per user/key**: Uses Redis `INCR + EXPIRE` counters through the `rate_limit:` configuration\n", + "* **Multi-model routing**: Uses Redis data structures for model configurations\n", + "\n", + "### Why Use Caching for LLMs?\n", + "\n", + "1. **Cost Reduction**: Avoid redundant API calls for identical or similar prompts\n", + "2. **Latency Improvement**: Cached responses return in milliseconds vs. seconds\n", + "3. **Reliability**: Reduce dependency on external API availability\n", + "4. **Rate Limit Management**: Stay within API provider constraints\n", + "\n", + "In this notebook, we'll explore how these features work and measure their impact on performance." + ] + }, + { + "cell_type": "markdown", + "id": "9d003168", + "metadata": {}, + "source": [ + "## 3 · Create a Multi-Model Configuration\n", + "\n", + "Let's create a configuration file for LiteLLM Proxy that includes caching, semantic caching, and rate limiting with Redis. This configuration will route requests to two OpenAI models: `gpt-3.5-turbo` and `gpt‑4o‑mini`." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d859197b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Configuration saved to: litellm_redis.yml\n", + "\n", + "Configuration details:\n", + "litellm_settings:\n", + " cache: true\n", + " cache_params:\n", + " host: localhost\n", + " password: ''\n", + " port: '6379'\n", + " type: redis\n", + " set_verbose: true\n", + "model_list:\n", + "- litellm_params:\n", + " model: gpt-3.5-turbo\n", + " model_name: openai-old\n", + "- litellm_params:\n", + " model: gpt-4o\n", + " model_name: openai-new\n", + "\n" + ] + } + ], + "source": [ + "import pathlib, yaml, textwrap, json\n", + "\n", + "\n", + "cfg = {\n", + " \"model_list\": [\n", + " {\n", + " \"model_name\": \"openai-old\",\n", + " \"litellm_params\": {\n", + " \"model\": \"gpt-3.5-turbo\"\n", + " }\n", + " },\n", + " {\n", + " \"model_name\": \"openai-new\",\n", + " \"litellm_params\": {\n", + " \"model\": \"gpt-4o\"\n", + " }\n", + " }\n", + " ],\n", + " \"litellm_settings\": {\n", + " \"set_verbose\": True,\n", + " \"cache\": True,\n", + " \"cache_params\": {\n", + " \"type\": \"redis\",\n", + " \"host\": REDIS_HOST,\n", + " \"port\": REDIS_PORT,\n", + " \"password\": REDIS_PASSWORD\n", + " }\n", + " }\n", + "}\n", + "\n", + "path = pathlib.Path(\"litellm_redis.yml\")\n", + "path.write_text(yaml.dump(cfg))\n", + "print(\"Configuration saved to:\", path)\n", + "print(\"\\nConfiguration details:\")\n", + "print(path.read_text())" + ] + }, + { + "cell_type": "markdown", + "id": "b0e0860e", + "metadata": {}, + "source": [ + "### Launch LiteLLM Proxy\n", + "\n", + "Now, let's start the LiteLLM Proxy server with our configuration. We'll use native Jupyter bash magic commands instead of using Python subprocess:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "48f5a3b4", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash --bg\n", + "# Start LiteLLM Proxy in the background\n", + "echo \"Starting LiteLLM Proxy on port 4000...\"\n", + "litellm --config litellm_redis.yml --port 4000 > litellm_proxy.log 2>&1\n", + "echo \"Proxy started! 
Check litellm_proxy.log for details.\"" + ] + }, + { + "cell_type": "markdown", + "id": "proxy-check", + "metadata": {}, + "source": [ + "Let's wait a few seconds for the proxy to start and check if it's running:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "check-proxy", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Waiting for proxy to start...\n", + "Proxy health check status: 200\n", + "Response: {'healthy_endpoints': [], 'unhealthy_endpoints': [], 'healthy_count': 0, 'unhealthy_count': 0}\n" + ] + } + ], + "source": [ + "import time, requests\n", + "print(\"Waiting for proxy to start...\")\n", + "time.sleep(5)\n", + "\n", + "# Check if proxy is running\n", + "try:\n", + " response = requests.get(\"http://localhost:4000/health\")\n", + " print(f\"Proxy health check status: {response.status_code}\")\n", + " print(f\"Response: {response.json()}\")\n", + "except Exception as e:\n", + " print(f\"Error checking proxy: {e}\")\n", + " print(\"\\nLast few lines from proxy log:\")\n", + " !tail -n 5 litellm_proxy.log" + ] + }, + { + "cell_type": "markdown", + "id": "chat-helper", + "metadata": {}, + "source": [ + "### Create a Helper Function for Making Requests\n", + "\n", + "Let's create a function to make API calls to our proxy and measure response times:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "76cdc5a5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing the chat function with a simple prompt...\n", + "🕒 Response time: 0.01s | Cache: MISS | Semantic cache: MISS\n", + "Status code: 400\n" + ] + } + ], + "source": [ + "import requests, time, json, textwrap\n", + "\n", + "def chat(prompt, model=\"gpt-3.5-turbo\", user=\"demo\", verbose=True):\n", + " \"\"\"Send a chat completion request to the LiteLLM proxy\"\"\"\n", + " payload = {\n", + " \"model\": model,\n", + " \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n", + " \"user\": user\n", + " }\n", + " \n", + " t0 = time.time()\n", + " resp = requests.post(\"http://localhost:4000/v1/chat/completions\", \n", + " json=payload, \n", + " timeout=60)\n", + " elapsed = time.time() - t0\n", + " \n", + " if verbose:\n", + " cache_status = resp.headers.get(\"X-Cache\", \"MISS\")\n", + " semantic_cache = resp.headers.get(\"X-Semantic-Cache\", \"MISS\")\n", + " print(f\"🕒 Response time: {elapsed:.2f}s | Cache: {cache_status} | Semantic cache: {semantic_cache}\")\n", + " \n", + " if resp.status_code == 200:\n", + " print(f\"🤖 Response: {resp.json()['choices'][0]['message']['content'][:100]}...\")\n", + " \n", + " return elapsed, resp\n", + "\n", + "# Test the function\n", + "print(\"Testing the chat function with a simple prompt...\")\n", + "_, test_resp = chat(\"Introduce yourself briefly\")\n", + "print(f\"Status code: {test_resp.status_code}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "d935df11", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "b'{\"error\":{\"message\":\"{\\'error\\': \\'/chat/completions: Invalid model name passed in model=gpt-3.5-turbo. 
Call `/v1/models` to view available models for your key.\\'}\",\"type\":\"None\",\"param\":\"None\",\"code\":\"400\"}}'" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_resp.content" + ] + }, + { + "cell_type": "markdown", + "id": "e121e215", + "metadata": {}, + "source": [ + "## 4 · Exact Cache Demonstration\n", + "\n", + "Now we'll demonstrate exact caching by sending the same prompt twice. The first request should hit the LLM API, while the second should be served from cache. We'll see this reflected in the response time and cache headers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c08699fc", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"🧪 Exact Cache Experiment\")\n", + "print(\"\\n1️⃣ First Request: (expecting cache MISS)\")\n", + "lat1, res1 = chat(\"What are three benefits of Redis for LLM applications?\")\n", + "\n", + "print(\"\\n2️⃣ Second Request with Identical Prompt: (expecting cache HIT)\")\n", + "lat2, res2 = chat(\"What are three benefits of Redis for LLM applications?\")\n", + "\n", + "print(f\"\\n🔍 Performance Analysis:\")\n", + "print(f\" First request: {lat1:.3f}s\")\n", + "print(f\" Second request: {lat2:.3f}s\")\n", + "if lat1 > 0 and lat2 > 0:\n", + " print(f\" Speed improvement: {lat1/lat2:.1f}x faster\")\n", + " print(f\" Time saved: {lat1 - lat2:.3f}s\")" + ] + }, + { + "cell_type": "markdown", + "id": "00bd3fc6", + "metadata": {}, + "source": [ + "### Examining the Cached Keys in Redis\n", + "\n", + "Let's look at the keys created in Redis for the exact cache and understand how LiteLLM structures them:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46eb6aa5", + "metadata": {}, + "outputs": [], + "source": [ + "# Get all keys related to LiteLLM cache\n", + "cache_keys = list(r.scan_iter(match=\"litellm:cache:*\"))\n", + "print(f\"Found {len(cache_keys)} cache keys in Redis\")\n", + "\n", + "if cache_keys:\n", + " # Look at the first key\n", + " first_key = cache_keys[0]\n", + " print(f\"\\nExample cache key: {first_key}\")\n", + " \n", + " # Get TTL for the key\n", + " ttl = r.ttl(first_key)\n", + " print(f\"TTL: {ttl} seconds\")\n", + " \n", + " # Get the value (may be large, so limiting output)\n", + " value = r.get(first_key)\n", + " if value:\n", + " print(f\"Value type: {type(value).__name__}\")\n", + " print(f\"Value size: {len(value)} characters\")\n", + " try:\n", + " # Try to parse as JSON for better display\n", + " parsed = json.loads(value[:1000] + '...' 
if len(value) > 1000 else value)\n", + " print(f\"Content preview (JSON): {json.dumps(parsed, indent=2)[:300]}...\")\n", + " except:\n", + " print(f\"Content preview (raw): {value[:100]}...\")" + ] + }, + { + "cell_type": "markdown", + "id": "8959ff3d", + "metadata": {}, + "source": [ + "### Benchmarking Cached Response Times\n", + "\n", + "Now, let's precisely measure the cached response time using multiple repeated requests:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "007710d7", + "metadata": {}, + "outputs": [], + "source": [ + "# Benchmark cached response time with more samples\n", + "def benchmark_cached_query(query, runs=5):\n", + " times = []\n", + " print(f\"Benchmarking cached query: '{query}'\")\n", + " print(f\"Running {runs} iterations...\")\n", + " \n", + " for i in range(runs):\n", + " start = time.time()\n", + " elapsed, resp = chat(query, verbose=False)\n", + " times.append(elapsed)\n", + " cache_status = resp.headers.get(\"X-Cache\", \"MISS\")\n", + " print(f\" Run {i+1}: {elapsed:.4f}s | Cache: {cache_status}\")\n", + " \n", + " avg_time = sum(times) / len(times)\n", + " print(f\"\\nAverage response time: {avg_time:.4f}s\")\n", + " print(f\"Min: {min(times):.4f}s | Max: {max(times):.4f}s\")\n", + " return avg_time\n", + "\n", + "# Run the benchmark\n", + "benchmark_cached_query(\"What are three benefits of Redis for LLM applications?\")" + ] + }, + { + "cell_type": "markdown", + "id": "67888d4e", + "metadata": {}, + "source": [ + "## 5 · Semantic Cache Demonstration\n", + "\n", + "Semantic caching is more powerful than exact caching because it can identify semantically similar prompts, not just identical ones. This is implemented using vector embeddings and similarity search in Redis.\n", + "\n", + "Let's test it by sending a prompt that is semantically similar (but not identical) to our previous query:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c5ca8ac", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"🧪 Semantic Cache Experiment\")\n", + "\n", + "# First, let's send a new query that will be stored in the semantic cache\n", + "print(\"\\n1️⃣ Establishing a baseline query for semantic cache:\")\n", + "lat1, res1 = chat(\"Tell me a useful application of Redis for AI systems\")\n", + "\n", + "# Now send a semantically similar query\n", + "print(\"\\n2️⃣ Testing a semantically similar query:\")\n", + "lat2, res2 = chat(\"What's a good use case for Redis in artificial intelligence?\")\n", + "\n", + "# Try a completely different query\n", + "print(\"\\n3️⃣ Testing an unrelated query (should not hit semantic cache):\")\n", + "lat3, res3 = chat(\"How to make chocolate chip cookies?\")\n", + "\n", + "print(f\"\\n🔍 Performance Analysis:\")\n", + "print(f\" Original query: {lat1:.3f}s\")\n", + "print(f\" Similar query: {lat2:.3f}s\")\n", + "print(f\" Unrelated query: {lat3:.3f}s\")\n", + "\n", + "sim_cache_hit = \"HIT\" in res2.headers.get(\"X-Semantic-Cache\", \"MISS\")\n", + "if sim_cache_hit and lat1 > 0 and lat2 > 0:\n", + " print(f\" Speed improvement: {lat1/lat2:.1f}x faster for semantically similar query\")" + ] + }, + { + "cell_type": "markdown", + "id": "2566c681", + "metadata": {}, + "source": [ + "### Examining Semantic Cache Keys\n", + "\n", + "Let's look at the keys and indices created in Redis for the semantic cache:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6d5be0e", + "metadata": {}, + "outputs": [], + "source": [ + "# Check semantic cache keys\n", + "semantic_keys 
= list(r.scan_iter(match=\"litellm:semantic*\"))\n", + "print(f\"Found {len(semantic_keys)} semantic cache keys in Redis\")\n", + "\n", + "if semantic_keys:\n", + " # Display the first few keys\n", + " for key in semantic_keys[:5]:\n", + " print(f\" - {key}\")\n", + " \n", + " # Check for Redis Search indices\n", + " try:\n", + " indices = r.execute_command(\"FT._LIST\")\n", + " print(f\"\\nRedis Search indices: {indices}\")\n", + " \n", + " # Get info about the semantic cache index if it exists\n", + " semantic_index = [idx for idx in indices if \"semantic\" in idx.lower()]\n", + " if semantic_index:\n", + " index_info = r.execute_command(f\"FT.INFO {semantic_index[0]}\")\n", + " print(f\"\\nSemantic Index Info:\")\n", + " # Format and display selected info\n", + " info_dict = {index_info[i]: index_info[i+1] for i in range(0, len(index_info), 2) if i+1 < len(index_info)}\n", + " for k in ['num_docs', 'num_terms', 'index_name', 'index_definition']:\n", + " if k in info_dict:\n", + " print(f\" {k}: {info_dict[k]}\")\n", + " except Exception as e:\n", + " print(f\"Error accessing Redis Search indices: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "semantic-cache-explain", + "metadata": {}, + "source": [ + "### How Semantic Caching Works\n", + "\n", + "LiteLLM's semantic caching works through these steps:\n", + "1. When a query arrives, LiteLLM generates an embedding vector for the query using the configured model\n", + "2. This vector is searched against previously stored vectors in Redis using cosine similarity\n", + "3. If a match is found with similarity above the threshold (we set 0.9), the cached response is returned\n", + "4. If not, the query is sent to the LLM API and the result is cached with its vector\n", + "\n", + "This approach is especially valuable for:\n", + "- Applications with many similar but not identical queries\n", + "- Customer support systems where questions vary in phrasing but seek the same information\n", + "- Educational applications where different students may ask similar questions" + ] + }, + { + "cell_type": "markdown", + "id": "4d4cc017", + "metadata": {}, + "source": [ + "## 6 · Multi-Model Routing with LiteLLM Proxy\n", + "\n", + "Our configuration enables access to multiple models through a single endpoint. 
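Because the proxy speaks the OpenAI API format, any OpenAI-compatible client can target it and simply change the `model` field to switch deployments. As a minimal, illustrative sketch (assuming the proxy started above is listening on `localhost:4000` with no master key configured, so the API key is only a placeholder, and `proxy_client` is just a name chosen here):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local LiteLLM proxy instead of api.openai.com
proxy_client = OpenAI(base_url="http://localhost:4000", api_key="sk-placeholder")

# Use the aliases defined in litellm_redis.yml to pick a deployment per request
for alias in ["openai-old", "openai-new"]:
    reply = proxy_client.chat.completions.create(
        model=alias,
        messages=[{"role": "user", "content": "Say hi in two words"}],
    )
    print(alias, "->", reply.choices[0].message.content)
```
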
Let's test both the configured models to verify they work:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c21192be", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"🧪 Multi-Model Routing Demonstration\")\n", + "\n", + "models = [\"gpt-3.5-turbo\", \"gpt-4o-mini\"]\n", + "results = {}\n", + "\n", + "for model in models:\n", + " print(f\"\\nTesting model: {model}\")\n", + " lat, res = chat(\"Say hi in two words\", model=model)\n", + " \n", + " if res.status_code == 200:\n", + " response_content = res.json()[\"choices\"][0][\"message\"][\"content\"]\n", + " results[model] = {\n", + " \"latency\": lat,\n", + " \"response\": response_content,\n", + " \"model\": res.json().get(\"model\", model)\n", + " }\n", + " print(f\"✅ Success | Response: '{response_content}'\")\n", + " else:\n", + " print(f\"❌ Error | Status code: {res.status_code}\")\n", + " print(f\"Error message: {res.text}\")\n", + "\n", + "# Compare the models\n", + "if len(results) > 1:\n", + " print(\"\\n📊 Model Comparison:\")\n", + " for model, data in results.items():\n", + " print(f\" {model}: {data['latency']:.2f}s - '{data['response']}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "0b0e740a", + "metadata": {}, + "source": [ + "## 7 · Testing Failure Modes\n", + "\n", + "Let's examine how the proxy handles error conditions, which is important for building robust applications." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3effa8a0", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"🧪 Testing Error Handling\")\n", + "\n", + "# Test with an unsupported model\n", + "print(\"\\n1️⃣ Testing with non-existent model:\")\n", + "_, bad_model_resp = chat(\"test\", model=\"gpt-nonexistent-001\")\n", + "print(f\"Status: {bad_model_resp.status_code}\")\n", + "print(f\"Error message: {json.dumps(bad_model_resp.json(), indent=2)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cbed4fa7", + "metadata": {}, + "source": [ + "### Testing Rate Limiting\n", + "\n", + "The LiteLLM proxy includes rate limiting functionality, which helps protect your API keys from overuse. 
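A simple way to enforce such limits is to attach a request budget to each deployment in the proxy configuration, for example an `rpm` (requests-per-minute) value under `litellm_params` (the revised config later in this patch sets `rpm: 30`). As an illustrative sketch reusing the `cfg` dict and `path` from the configuration cell above, with example values:

```python
# Add a per-deployment request budget (requests per minute) to each model entry.
# LiteLLM also accepts a `tpm` (tokens per minute) budget in the same place.
for entry in cfg["model_list"]:
    entry["litellm_params"]["rpm"] = 30  # example value; tune to your provider limits

# Re-write litellm_redis.yml; restart the proxy so it picks up the change.
path.write_text(yaml.dump(cfg))
```

With a budget in place, requests beyond the limit should be rejected until the window resets.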
Let's test this by sending requests rapidly until we hit the rate limit:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db464c0f", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"🧪 Testing Rate Limiting\")\n", + "print(\"Sending multiple requests with the same user ID to trigger rate limiting...\")\n", + "\n", + "for i in range(5):\n", + " _, r2 = chat(f\"Request {i+1}\", user=\"test-rate-limit\")\n", + " remaining = r2.headers.get(\"X-Rate-Limit-Remaining\", \"unknown\")\n", + " limit_reset = r2.headers.get(\"X-Rate-Limit-Reset\", \"unknown\")\n", + " \n", + " print(f\"Request {i+1}: Status {r2.status_code} | Remaining: {remaining} | Reset: {limit_reset}\")\n", + " \n", + " if r2.status_code == 429:\n", + " print(f\"Rate limit reached after {i+1} requests!\")\n", + " print(f\"Error response: {json.dumps(r2.json(), indent=2)}\")\n", + " break\n", + " \n", + " time.sleep(0.5) # Small delay to see rate limiting in action" + ] + }, + { + "cell_type": "markdown", + "id": "implementation-alternatives", + "metadata": {}, + "source": [ + "## 8 · Implementation Options\n", + "\n", + "LiteLLM provides multiple ways to implement caching in your application:\n", + "\n", + "### Using LiteLLM Proxy (as shown)\n", + "\n", + "The proxy approach (demonstrated in this notebook) is recommended for production deployments because it:\n", + "- Provides a unified API endpoint for all your models\n", + "- Centralizes caching, rate-limiting, and fallback logic\n", + "- Works with any client that uses the OpenAI API format\n", + "- Supports multiple languages and frameworks\n", + "\n", + "### Direct Integration with LiteLLM Python SDK\n", + "\n", + "For Python applications, you can also integrate caching directly using the SDK. See the [LiteLLM Caching documentation](https://docs.litellm.ai/docs/caching/all_caches) for details." + ] + }, + { + "cell_type": "markdown", + "id": "117e0229", + "metadata": {}, + "source": [ + "## 9 · Cleanup\n", + "\n", + "Let's stop the LiteLLM proxy server and clean up our environment:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b7ce8f7", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "# Find and stop the LiteLLM process\n", + "echo \"Stopping LiteLLM Proxy...\"\n", + "litellm_pid=$(ps aux | grep \"litellm --config\" | grep -v grep | awk '{print $2}')\n", + "if [ -n \"$litellm_pid\" ]; then\n", + " kill $litellm_pid\n", + " echo \"Stopped LiteLLM Proxy (PID: $litellm_pid)\"\n", + "else\n", + " echo \"LiteLLM Proxy not found running\"\n", + "fi\n", + "\n", + "# Optionally stop Redis if you started it just for this notebook\n", + "# Note: Comment this out if you want to keep Redis running\n", + "# redis-cli shutdown" + ] + }, + { + "cell_type": "markdown", + "id": "conclusion", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "In this notebook, we've demonstrated how to:\n", + "\n", + "1. **Set up LiteLLM Proxy** with Redis for caching and rate limiting\n", + "2. **Configure exact and semantic caching** to improve performance\n", + "3. **Measure the performance benefits** of caching LLM responses\n", + "4. **Route requests to multiple models** through a single endpoint\n", + "5. 
**Test error handling and rate limiting** behavior\n", + "\n", + "The benchmarks clearly show that implementing caching with Redis can significantly reduce response times and API costs, making it an essential component of production LLM applications.\n", + "\n", + "For more information, see the [LiteLLM documentation](https://docs.litellm.ai/docs/proxy/caching) and [Redis documentation](https://redis.io/docs/)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/python-recipes/gateway/litellm_redis.yml b/python-recipes/gateway/litellm_redis.yml new file mode 100644 index 0000000..412f9d0 --- /dev/null +++ b/python-recipes/gateway/litellm_redis.yml @@ -0,0 +1,15 @@ +litellm_settings: + cache: true + cache_params: + host: localhost + password: '' + port: '6379' + type: redis + set_verbose: true +model_list: +- litellm_params: + model: gpt-3.5-turbo + model_name: openai-old +- litellm_params: + model: gpt-4o + model_name: openai-new From 2eeadca0f29e6bbb2047814e528d7c9a826c8d59 Mon Sep 17 00:00:00 2001 From: Tyler Hutcherson Date: Wed, 7 May 2025 14:03:49 -0400 Subject: [PATCH 2/2] cleanup and finish the recipe --- .github/ignore-notebooks.txt | 1 + .gitignore | 4 + README.md | 23 +- .../gateway/00_litellm_proxy_redis.ipynb | 2207 ++++++++++------- python-recipes/gateway/litellm_redis.yml | 15 - 5 files changed, 1364 insertions(+), 886 deletions(-) delete mode 100644 python-recipes/gateway/litellm_redis.yml diff --git a/.github/ignore-notebooks.txt b/.github/ignore-notebooks.txt index ba0615c..0fe17bd 100644 --- a/.github/ignore-notebooks.txt +++ b/.github/ignore-notebooks.txt @@ -6,3 +6,4 @@ 01_routing_optimization 02_semantic_cache_optimization spring_ai_redis_rag.ipynb +00_litellm_proxy_redis.ipynb \ No newline at end of file diff --git a/.gitignore b/.gitignore index 47ae183..1a8186b 100644 --- a/.gitignore +++ b/.gitignore @@ -217,6 +217,7 @@ pyrightconfig.json pyvenv.cfg pip-selfcheck.json +# other libs/redis/docs/.Trash* .python-version .idea/* @@ -224,3 +225,6 @@ java-recipes/.* python-recipes/vector-search/beir_datasets python-recipes/vector-search/datasets + +litellm_proxy.log +litellm_redis.yml diff --git a/README.md b/README.md index 16fb4c6..9113ca3 100644 --- a/README.md +++ b/README.md @@ -38,6 +38,13 @@ No faster way to get started than by diving in and playing around with a demo. Need quickstarts to begin your Redis AI journey? **Start here.** +### Non-Python Redis AI Recipes + +#### ☕️ Java + +A set of Java recipes can be found under [/java-recipes](/java-recipes/README.md). + + ### Getting started with Redis & Vector Search | Recipe | Description | @@ -48,11 +55,6 @@ Need quickstarts to begin your Redis AI journey? **Start here.** | [/vector-search/02_hybrid_search.ipynb](/python-recipes/vector-search/02_hybrid_search.ipynb) | Hybrid search techniques with Redis (BM25 + Vector) | | [/vector-search/03_dtype_support.ipynb](/python-recipes/vector-search/03_dtype_support.ipynb) | Shows how to convert a float32 index to float16 or integer dataypes| -### Non-Python Redis AI Recipes - -#### ☕️ Java - -A set of Java recipes can be found under [/java-recipes](/java-recipes/README.md). 
### Retrieval Augmented Generation (RAG) @@ -77,7 +79,7 @@ LLMs are stateless. To maintain context within a conversation chat sessions must | [/llm-session-manager/00_session_manager.ipynb](python-recipes/llm-session-manager/00_llm_session_manager.ipynb) | LLM session manager with semantic similarity | | [/llm-session-manager/01_multiple_sessions.ipynb](python-recipes/llm-session-manager/01_multiple_sessions.ipynb) | Handle multiple simultaneous chats with one instance | -### Semantic Cache +### Semantic Caching An estimated 31% of LLM queries are potentially redundant ([source](https://arxiv.org/pdf/2403.02694)). Redis enables semantic caching to help cut down on LLM costs quickly. | Recipe | Description | @@ -94,6 +96,15 @@ Routing is a simple and effective way of preventing misuses with your AI applica | [/semantic-router/00_semantic_routing.ipynb](python-recipes/semantic-router/00_semantic_routing.ipynb) | Simple examples of how to build an allow/block list router in addition to a multi-topic router | | [/semantic-router/01_routing_optimization.ipynb](python-recipes/semantic-router/01_routing_optimization.ipynb) | Use RouterThresholdOptimizer from redisvl to setup best router config | + +### AI Gateways +AI gateways manage LLM traffic through a centralized, managed layer that can implement routing, rate limiting, caching, and more. + +| Recipe | Description | +| --- | --- | +| [/gateway/00_litellm_proxy_redis.ipynb](python-recipes/gateway/00_litellm_proxy_redis.ipynb) | Getting started with LiteLLM proxy and Redis. | + + ### Agents | Recipe | Description | diff --git a/python-recipes/gateway/00_litellm_proxy_redis.ipynb b/python-recipes/gateway/00_litellm_proxy_redis.ipynb index 57f288f..5116a6b 100644 --- a/python-recipes/gateway/00_litellm_proxy_redis.ipynb +++ b/python-recipes/gateway/00_litellm_proxy_redis.ipynb @@ -1,870 +1,1347 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "47c3fefa", - "metadata": {}, - "source": [ - "\n", - "![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?width=120)\n", - "\n", - "# LiteLLM Proxy with Redis\n", - "\n", - "This notebook demonstrates how to use [LiteLLM](https://github.com/BerriAI/litellm) with Redis to build a powerful and efficient LLM proxy server with caching & rate limiting capabilities. LiteLLM provides a unified interface for accessing multiple LLM providers while Redis enhances performance of the application in several different ways.\n", - "\n", - "*This recipe will help you understand*:\n", - "\n", - "* **Why** and **how** to implement exact and semantic caching for LLM calls\n", - "* **How** to set up rate limiting for your LLM APIs\n", - "* **How** LiteLLM integrates with Redis for state management \n", - "* **How to measure** the performance benefits of caching\n", - "\n", - "> **Open in Colab** ↘︎ \n", - "> \"Open\n" - ] - }, - { - "cell_type": "markdown", - "id": "06c7b959", - "metadata": {}, - "source": [ - "\n", - "## 1. Environment Setup \n", - "\n", - "### Install Python Dependencies\n", - "Before we begin, we need to make sure our environment is properly set up with all the necessary tools:\n", - "\n", - "**Requirements**:\n", - "* Python ≥ 3.9 with the below packages\n", - "* OpenAI API key (set as `OPENAI_API_KEY` environment variable)\n", - "\n", - "First, let's install the required packages." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "47246c48", - "metadata": { - "id": "pip" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install -q \"litellm[proxy]\" \"redisvl==0.5.2\" requests openai" - ] - }, - { - "cell_type": "markdown", - "id": "redis-setup", - "metadata": {}, - "source": [ - "### Install Redis Stack\n", - "\n", - "\n", - "#### For Colab\n", - "Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0db80601", - "metadata": { - "id": "redis-install" - }, - "outputs": [], - "source": [ - "# NBVAL_SKIP\n", - "%%sh\n", - "curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\n", - "echo \"deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/redis.list\n", - "sudo apt-get update > /dev/null 2>&1\n", - "sudo apt-get install redis-stack-server > /dev/null 2>&1\n", - "redis-stack-server --daemonize yes" - ] - }, - { - "cell_type": "markdown", - "id": "b750e779", - "metadata": {}, - "source": [ - "#### For Alternative Environments\n", - "There are many ways to get the necessary redis-stack instance running\n", - "1. On cloud, deploy a [FREE instance of Redis in the cloud](https://redis.io/try-free/). Or, if you have your\n", - "own version of Redis Enterprise running, that works too!\n", - "2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)\n", - "3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`" - ] - }, - { - "cell_type": "markdown", - "id": "177e9fe3", - "metadata": {}, - "source": [ - "### Define the Redis Connection URL\n", - "\n", - "By default this notebook connects to the local instance of Redis Stack. **If you have your own Redis Enterprise instance** - replace REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "be77a1d3", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "# Replace values below with your own if using Redis Cloud instance\n", - "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\") # ex: \"redis-18374.c253.us-central1-1.gce.cloud.redislabs.com\"\n", - "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\") # ex: 18374\n", - "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\") # ex: \"1TNxTEdYRDgIDKM2gDfasupCADXXXX\"\n", - "\n", - "# If SSL is enabled on the endpoint, use rediss:// as the URL prefix\n", - "REDIS_URL = f\"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}\"" - ] - }, - { - "cell_type": "markdown", - "id": "redis-connection", - "metadata": {}, - "source": [ - "### Verify Redis Connection\n", - "\n", - "Let's test our Redis connection to make sure it's working properly:" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "f3ddcabf", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from redis import Redis\n", - "\n", - "client = Redis.from_url(REDIS_URL)\n", - "client.ping()" - ] - }, - { - "cell_type": "markdown", - "id": "ce052678", - "metadata": {}, - "source": [ - "### Set OPENAI API Key" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "e21ac07e", - "metadata": {}, - "outputs": [], - "source": [ - "import getpass\n", - "import os\n", - "\n", - "def _set_env(key: str):\n", - " if key not in os.environ:\n", - " os.environ[key] = getpass.getpass(f\"{key}:\")\n", - "\n", - "\n", - "_set_env(\"OPENAI_API_KEY\")" - ] - }, - { - "cell_type": "markdown", - "id": "fc65bfdd", - "metadata": {}, - "source": [ - "## 2 · Understanding LiteLLM Caching with Redis\n", - "\n", - "LiteLLM Proxy with Redis provides several powerful capabilities that can significantly improve your LLM application performance and reliability:\n", - "\n", - "* **Exact cache (identical prompt)**: Uses Redis `SETEX` with TTL through the `cache:` configuration\n", - "* **Semantic cache (similar prompt)**: Uses RediSearch **vector** indexing through the `semantic_cache:` configuration\n", - "* **Rate-limit per user/key**: Uses Redis `INCR + EXPIRE` counters through the `rate_limit:` configuration\n", - "* **Multi-model routing**: Uses Redis data structures for model configurations\n", - "\n", - "### Why Use Caching for LLMs?\n", - "\n", - "1. **Cost Reduction**: Avoid redundant API calls for identical or similar prompts\n", - "2. **Latency Improvement**: Cached responses return in milliseconds vs. seconds\n", - "3. **Reliability**: Reduce dependency on external API availability\n", - "4. **Rate Limit Management**: Stay within API provider constraints\n", - "\n", - "In this notebook, we'll explore how these features work and measure their impact on performance." - ] - }, - { - "cell_type": "markdown", - "id": "9d003168", - "metadata": {}, - "source": [ - "## 3 · Create a Multi-Model Configuration\n", - "\n", - "Let's create a configuration file for LiteLLM Proxy that includes caching, semantic caching, and rate limiting with Redis. This configuration will route requests to two OpenAI models: `gpt-3.5-turbo` and `gpt‑4o‑mini`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d859197b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Configuration saved to: litellm_redis.yml\n", - "\n", - "Configuration details:\n", - "litellm_settings:\n", - " cache: true\n", - " cache_params:\n", - " host: localhost\n", - " password: ''\n", - " port: '6379'\n", - " type: redis\n", - " set_verbose: true\n", - "model_list:\n", - "- litellm_params:\n", - " model: gpt-3.5-turbo\n", - " model_name: openai-old\n", - "- litellm_params:\n", - " model: gpt-4o\n", - " model_name: openai-new\n", - "\n" - ] - } - ], - "source": [ - "import pathlib, yaml, textwrap, json\n", - "\n", - "\n", - "cfg = {\n", - " \"model_list\": [\n", - " {\n", - " \"model_name\": \"openai-old\",\n", - " \"litellm_params\": {\n", - " \"model\": \"gpt-3.5-turbo\"\n", - " }\n", - " },\n", - " {\n", - " \"model_name\": \"openai-new\",\n", - " \"litellm_params\": {\n", - " \"model\": \"gpt-4o\"\n", - " }\n", - " }\n", - " ],\n", - " \"litellm_settings\": {\n", - " \"set_verbose\": True,\n", - " \"cache\": True,\n", - " \"cache_params\": {\n", - " \"type\": \"redis\",\n", - " \"host\": REDIS_HOST,\n", - " \"port\": REDIS_PORT,\n", - " \"password\": REDIS_PASSWORD\n", - " }\n", - " }\n", - "}\n", - "\n", - "path = pathlib.Path(\"litellm_redis.yml\")\n", - "path.write_text(yaml.dump(cfg))\n", - "print(\"Configuration saved to:\", path)\n", - "print(\"\\nConfiguration details:\")\n", - "print(path.read_text())" - ] - }, - { - "cell_type": "markdown", - "id": "b0e0860e", - "metadata": {}, - "source": [ - "### Launch LiteLLM Proxy\n", - "\n", - "Now, let's start the LiteLLM Proxy server with our configuration. We'll use native Jupyter bash magic commands instead of using Python subprocess:" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "48f5a3b4", - "metadata": {}, - "outputs": [], - "source": [ - "%%bash --bg\n", - "# Start LiteLLM Proxy in the background\n", - "echo \"Starting LiteLLM Proxy on port 4000...\"\n", - "litellm --config litellm_redis.yml --port 4000 > litellm_proxy.log 2>&1\n", - "echo \"Proxy started! 
Check litellm_proxy.log for details.\"" - ] - }, - { - "cell_type": "markdown", - "id": "proxy-check", - "metadata": {}, - "source": [ - "Let's wait a few seconds for the proxy to start and check if it's running:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "check-proxy", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Waiting for proxy to start...\n", - "Proxy health check status: 200\n", - "Response: {'healthy_endpoints': [], 'unhealthy_endpoints': [], 'healthy_count': 0, 'unhealthy_count': 0}\n" - ] - } - ], - "source": [ - "import time, requests\n", - "print(\"Waiting for proxy to start...\")\n", - "time.sleep(5)\n", - "\n", - "# Check if proxy is running\n", - "try:\n", - " response = requests.get(\"http://localhost:4000/health\")\n", - " print(f\"Proxy health check status: {response.status_code}\")\n", - " print(f\"Response: {response.json()}\")\n", - "except Exception as e:\n", - " print(f\"Error checking proxy: {e}\")\n", - " print(\"\\nLast few lines from proxy log:\")\n", - " !tail -n 5 litellm_proxy.log" - ] - }, - { - "cell_type": "markdown", - "id": "chat-helper", - "metadata": {}, - "source": [ - "### Create a Helper Function for Making Requests\n", - "\n", - "Let's create a function to make API calls to our proxy and measure response times:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "76cdc5a5", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Testing the chat function with a simple prompt...\n", - "🕒 Response time: 0.01s | Cache: MISS | Semantic cache: MISS\n", - "Status code: 400\n" - ] + "cells": [ + { + "cell_type": "markdown", + "id": "47c3fefa", + "metadata": { + "id": "47c3fefa" + }, + "source": [ + "\n", + "
\n", + " \"Redis\"\n", + " \"LiteLLM\"\n", + "
\n", + "\n", + "# LiteLLM Proxy with Redis\n", + "\n", + "This notebook demonstrates how to use [LiteLLM](https://github.com/BerriAI/litellm) with Redis to build a powerful and efficient LLM proxy server backed by caching & rate limiting capabilities. LiteLLM provides a unified interface for accessing multiple LLM providers while Redis enhances performance of the application in several different ways.\n", + "\n", + "*This recipe will help you understand*:\n", + "\n", + "* **How** to set up LiteLLM as a proxy for different LLM endpoints\n", + "* **Why** and **how** to implement exact and semantic caching for LLM calls\n", + "\n", + "**Open in Colab**\n", + "\n", + "\"Open\n" + ] + }, + { + "cell_type": "markdown", + "id": "06c7b959", + "metadata": { + "id": "06c7b959" + }, + "source": [ + "\n", + "## 1 · Environment Setup \n", + "Before we begin, we need to make sure our environment is properly set up with all the necessary tools and resources.\n", + "\n", + "**Requirements**:\n", + "* Python ≥ 3.9 with the below packages\n", + "* OpenAI API key (set as `OPENAI_API_KEY` environment variable)\n", + "\n", + "\n", + "### Install Python Dependencies\n", + "\n", + "First, let's install the required packages." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47246c48", + "metadata": { + "id": "47246c48" + }, + "outputs": [], + "source": [ + "%pip install \"litellm[proxy]==1.68.0\" \"redisvl==0.5.2\" requests openai" + ] + }, + { + "cell_type": "markdown", + "id": "redis-setup", + "metadata": { + "id": "redis-setup" + }, + "source": [ + "### Install Redis Stack\n", + "\n", + "\n", + "#### For Colab\n", + "Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0db80601", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0db80601", + "outputId": "e01d1a40-f412-4808-d5f0-4d34fb2204d7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main\n", + "Starting redis-stack-server, database path /var/lib/redis-stack\n" + ] + } + ], + "source": [ + "# NBVAL_SKIP\n", + "%%sh\n", + "curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg\n", + "echo \"deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/redis.list\n", + "sudo apt-get update > /dev/null 2>&1\n", + "sudo apt-get install redis-stack-server > /dev/null 2>&1\n", + "redis-stack-server --daemonize yes" + ] + }, + { + "cell_type": "markdown", + "id": "b750e779", + "metadata": { + "id": "b750e779" + }, + "source": [ + "#### For Alternative Environments\n", + "There are many ways to get the necessary redis-stack instance running\n", + "1. On cloud, deploy a [FREE instance of Redis in the cloud](https://redis.io/try-free/). Or, if you have your\n", + "own version of Redis Enterprise running, that works too!\n", + "2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)\n", + "3. 
With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`" + ] + }, + { + "cell_type": "markdown", + "id": "177e9fe3", + "metadata": { + "id": "177e9fe3" + }, + "source": [ + "### Define the Redis Connection URL\n", + "\n", + "By default this notebook connects to the local instance of Redis Stack. **If you have your own Redis Enterprise instance** - replace REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "be77a1d3", + "metadata": { + "id": "be77a1d3" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Replace values below with your own if using Redis Cloud instance\n", + "REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\") # ex: \"redis-18374.c253.us-central1-1.gce.cloud.redislabs.com\"\n", + "REDIS_PORT = os.getenv(\"REDIS_PORT\", \"6379\") # ex: 18374\n", + "REDIS_PASSWORD = os.getenv(\"REDIS_PASSWORD\", \"\") # ex: \"1TNxTEdYRDgIDKM2gDfasupCADXXXX\"\n", + "\n", + "# If SSL is enabled on the endpoint, use rediss:// as the URL prefix\n", + "REDIS_URL = f\"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}\"\n", + "os.environ[\"REDIS_URL\"] = REDIS_URL\n", + "os.environ[\"REDIS_HOST\"] = REDIS_HOST\n", + "os.environ[\"REDIS_PORT\"] = REDIS_PORT\n", + "os.environ[\"REDIS_PASSWORD\"] = REDIS_PASSWORD" + ] + }, + { + "cell_type": "markdown", + "id": "redis-connection", + "metadata": { + "id": "redis-connection" + }, + "source": [ + "### Verify Redis Connection\n", + "\n", + "Let's test our Redis connection to make sure it's working properly:" + ] + }, + { + "cell_type": "code", + "execution_count": 132, + "id": "f3ddcabf", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f3ddcabf", + "outputId": "162846c8-4add-4de7-9ed6-69e8656ec102" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 132, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from redis import Redis\n", + "\n", + "client = Redis.from_url(REDIS_URL)\n", + "client.ping()" + ] + }, + { + "cell_type": "code", + "execution_count": 133, + "id": "AZmD8eR1lphs", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "AZmD8eR1lphs", + "outputId": "0aaf4533-d239-4ad9-8853-e7192abf78d6" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 133, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.flushall()" + ] + }, + { + "cell_type": "markdown", + "id": "ce052678", + "metadata": { + "id": "ce052678" + }, + "source": [ + "### Set OPENAI API Key" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e21ac07e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e21ac07e", + "outputId": "3a6d5465-35e0-49af-ce1a-54df86898cee" + }, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "os.environ[\"LITELLM_LOG\"] = \"DEBUG\"\n", + "\n", + "def _set_env(key: str):\n", + " if key not in os.environ:\n", + " os.environ[key] = getpass.getpass(f\"{key}:\")\n", + "\n", + "_set_env(\"OPENAI_API_KEY\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "5X9nFyFkPdkV", + "metadata": { + "id": "5X9nFyFkPdkV" + }, + "source": [ + "## 2 · Running the LiteLLM Proxy\n", + "First, we will define a LiteLLM config that contains:\n", + "\n", + "- a few supported model options\n", + "- a semantic caching configuration using Redis" + ] + }, + { + 
"cell_type": "code", + "execution_count": 234, + "id": "pdeAixSUPxT7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pdeAixSUPxT7", + "outputId": "9cbff8c0-7fc8-431a-e93c-ba05698d217e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting litellm_redis.yml\n" + ] + } + ], + "source": [ + "%%writefile litellm_redis.yml\n", + "model_list:\n", + "- litellm_params:\n", + " api_key: os.environ/OPENAI_API_KEY\n", + " model: gpt-3.5-turbo\n", + " rpm: 30\n", + " model_name: gpt-3.5-turbo\n", + "- litellm_params:\n", + " api_key: os.environ/OPENAI_API_KEY\n", + " model: gpt-4o-mini\n", + " rpm: 30\n", + " model_name: gpt-4o-mini\n", + "- litellm_params:\n", + " api_key: os.environ/OPENAI_API_KEY\n", + " model: text-embedding-3-small\n", + " model_name: text-embedding-3-small\n", + "\n", + "litellm_settings:\n", + " cache: True\n", + " cache_params:\n", + " type: redis\n", + " host: os.environ/REDIS_HOST\n", + " port: os.environ/REDIS_PORT\n", + " password: os.environ/REDIS_PASSWORD\n", + " default_in_redis_ttl: 60" + ] + }, + { + "cell_type": "markdown", + "id": "4RqOqBoAHwVD", + "metadata": { + "id": "4RqOqBoAHwVD" + }, + "source": [ + "Now for some helper code that will start/stop **LiteLLM** proxy as a background task here on the host machine." + ] + }, + { + "cell_type": "code", + "execution_count": 235, + "id": "8mml7LhvPxWU", + "metadata": { + "id": "8mml7LhvPxWU" + }, + "outputs": [], + "source": [ + "import subprocess, atexit, os, signal, socket, time, pathlib, textwrap, sys\n", + "\n", + "\n", + "_proxy_handle: subprocess.Popen | None = None\n", + "\n", + "\n", + "def _is_port_open(port: int) -> bool:\n", + " with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:\n", + " s.settimeout(0.25)\n", + " return s.connect_ex((\"127.0.0.1\", port)) == 0\n", + "\n", + "def start_proxy(\n", + " config_path: str = \"litellm_redis.yml\",\n", + " port: int = 4000,\n", + " log_path: str = \"litellm_proxy.log\",\n", + " restart: bool = True,\n", + " timeout: float = 10.0, # seconds we’re willing to wait\n", + ") -> subprocess.Popen:\n", + "\n", + " global _proxy_handle\n", + "\n", + " # ── 1. stop running proxy we launched earlier ──\n", + " if _proxy_handle and _proxy_handle.poll() is None:\n", + " if restart:\n", + " _proxy_handle.terminate()\n", + " _proxy_handle.wait(timeout=3)\n", + " time.sleep(1) # give the OS a breath\n", + " else:\n", + " print(f\"LiteLLM already running (PID {_proxy_handle.pid}) — reusing.\")\n", + " return _proxy_handle\n", + "\n", + " # ── 2. ensure the port is free ──\n", + " if _is_port_open(port):\n", + " print(f\"Port {port} busy; trying to free it …\")\n", + " pids = os.popen(f\"lsof -ti tcp:{port}\").read().strip().splitlines()\n", + " for pid in pids:\n", + " try:\n", + " os.kill(int(pid), signal.SIGTERM)\n", + " except Exception:\n", + " pass\n", + " time.sleep(1)\n", + "\n", + " # ── 3. launch proxy ──\n", + " log_file = open(log_path, \"w\")\n", + " cmd = [\"litellm\", \"--config\", config_path, \"--port\", str(port), \"--detailed_debug\"]\n", + " _proxy_handle = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)\n", + "\n", + " atexit.register(lambda: _proxy_handle and _proxy_handle.terminate())\n", + "\n", + " # ── 4. 
readiness loop with timeout & crash detection ──\n", + " deadline = time.time() + timeout\n", + " while time.time() < deadline:\n", + " if _is_port_open(port):\n", + " break\n", + " if _proxy_handle.poll() is not None: # died early\n", + " last_lines = pathlib.Path(log_path).read_text().splitlines()[-20:]\n", + " raise RuntimeError(\n", + " \"LiteLLM exited before opening the port:\\n\" +\n", + " textwrap.indent(\"\\n\".join(last_lines), \" \")\n", + " )\n", + " time.sleep(0.25)\n", + " else:\n", + " _proxy_handle.terminate()\n", + " raise RuntimeError(f\"LiteLLM proxy did not open port {port} within {timeout}s.\")\n", + "\n", + " print(f\"✅ LiteLLM proxy on http://localhost:{port} (PID {_proxy_handle.pid})\")\n", + " print(f\" Logs → {pathlib.Path(log_path).resolve()}\")\n", + " return _proxy_handle\n", + "\n", + "\n", + "def stop_proxy() -> None:\n", + " global _proxy_handle\n", + " if _proxy_handle and _proxy_handle.poll() is None:\n", + " _proxy_handle.terminate()\n", + " _proxy_handle.wait(timeout=3)\n", + " print(\"LiteLLM proxy stopped.\")\n", + " _proxy_handle = None" + ] + }, + { + "cell_type": "markdown", + "id": "8WSEon9JIRn8", + "metadata": { + "id": "8WSEon9JIRn8" + }, + "source": [ + "Start up the LiteLLM proxy for the first time." + ] + }, + { + "cell_type": "code", + "execution_count": 236, + "id": "jrw2Gu6uPxYr", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "jrw2Gu6uPxYr", + "outputId": "ae65f321-1d4e-49fe-9282-d418f324a5cc" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ LiteLLM proxy on http://localhost:4000 (PID 63464)\n", + " Logs → /content/litellm_proxy.log\n" + ] + } + ], + "source": [ + "_proxy_handle = start_proxy()" + ] + }, + { + "cell_type": "markdown", + "id": "zzOSmL0_IzwF", + "metadata": { + "id": "zzOSmL0_IzwF" + }, + "source": [ + "Now we will add a simple helper method to test out models." + ] + }, + { + "cell_type": "code", + "execution_count": 237, + "id": "9rbN7PiMVAmA", + "metadata": { + "id": "9rbN7PiMVAmA" + }, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "\n", + "def call_model(text: str, model: str = \"gpt-4o-mini\"):\n", + " try:\n", + " t0 = time.time()\n", + " payload = {\n", + " \"model\": model,\n", + " \"messages\": [{\"role\": \"user\", \"content\": text}]\n", + " }\n", + " r = requests.post(\"http://localhost:4000/chat/completions\", json=payload, timeout=30)\n", + " r.raise_for_status()\n", + " print(r.json()[\"choices\"][0][\"message\"][\"content\"])\n", + " print(f\"{r.json()['id']} -- {r.json()['model']} -- latency: {time.time() - t0:.2f}s \\n\")\n", + " return r\n", + " except Exception as e:\n", + " print(str(e))\n", + " if \"error\" in r.json():\n", + " print(r.json()[\"error\"][\"message\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 238, + "id": "KEdfst47VdjN", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KEdfst47VdjN", + "outputId": "0898a5da-b907-4231-c171-ddf6a1043911" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hello! I'm just a program, so I don't have feelings, but I'm here and ready to help you. 
How can I assist you today?\n", + "chatcmpl-BUdDxEetmH0k6yJkaDLeSshRZmGnz -- gpt-4o-mini-2024-07-18 -- latency: 0.90s \n", + "\n" + ] + } + ], + "source": [ + "res = call_model(\"hello, how are you?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 239, + "id": "XJnkyMUDI9xu", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XJnkyMUDI9xu", + "outputId": "bebbc826-60e8-4de9-8ddf-425d7c087cfa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hello! I'm just a computer program, so I don't have feelings, but I'm here to assist you. How can I help you today?\n", + "chatcmpl-BUdDySZjzxB8tCTLkuYDTyPFfKo1P -- gpt-3.5-turbo-0125 -- latency: 0.65s \n", + "\n" + ] + } + ], + "source": [ + "res = call_model(\"hello, how are you?\", model=\"gpt-3.5-turbo\")" + ] + }, + { + "cell_type": "code", + "execution_count": 240, + "id": "79nkkD6cVii2", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "79nkkD6cVii2", + "outputId": "c4ee9d21-3a81-4453-e412-2bd17d4a4372" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "400 Client Error: Bad Request for url: http://localhost:4000/chat/completions\n", + "{'error': '/chat/completions: Invalid model name passed in model=claude. Call `/v1/models` to view available models for your key.'}\n" + ] + } + ], + "source": [ + "# Try a non-supported model!\n", + "res = call_model(\"hello, how are you?\", model=\"claude\")" + ] + }, + { + "cell_type": "markdown", + "id": "fc65bfdd", + "metadata": { + "id": "fc65bfdd" + }, + "source": [ + "## 3 · Implement LLM caching with Redis\n", + "\n", + "LiteLLM Proxy with Redis provides two powerful caching capabilities that can significantly improve your LLM application performance and reliability:\n", + "\n", + "* **Exact cache (identical prompt)**: Pulls exact prompt/query matches from Redis with configurable TTL.\n", + "* **Semantic cache (similar prompt)**: Uses Redis as a semantic cache powered by **vector search** to determine if a prompt/query is similar enough to a cached entry.\n", + "\n", + "### Why Use Caching for LLMs?\n", + "\n", + "1. **Cost Reduction**: Avoid redundant API calls for identical or similar prompts\n", + "2. **Latency Improvement**: Cached responses return in milliseconds vs. seconds\n", + "3. 
**Reliability**: Reduce dependency on external API availability\n" + ] + }, + { + "cell_type": "code", + "execution_count": 241, + "id": "eup_Z0Z_Y493", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "eup_Z0Z_Y493", + "outputId": "d815413e-acc0-4108-8b47-87dfb35cd59f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.63s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.03s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "18.6 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit\n", + "res = call_model(\"what is the capital of france?\")" + ] + }, + { + "cell_type": "markdown", + "id": "GQRkOghoB9-Y", + "metadata": { + "id": "GQRkOghoB9-Y" + }, + "source": [ + "Check response equivalence:" + ] + }, + { + "cell_type": "code", + "execution_count": 242, + "id": "IbfUylGGUhP7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IbfUylGGUhP7", + "outputId": "e56853a1-61b0-4916-fb2b-c1695d922e8f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n", + "The capital of France is Paris.\n", + "chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8 -- gpt-4o-mini-2024-07-18 -- latency: 0.02s \n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "{'id': 'chatcmpl-BUdDz7ZsNbR2PTGbnzgALezkkVvh8',\n", + " 'created': 1746640319,\n", + " 'model': 'gpt-4o-mini-2024-07-18',\n", + " 'object': 'chat.completion',\n", + " 'system_fingerprint': 'fp_129a36352a',\n", + " 'choices': [{'finish_reason': 'stop',\n", + " 'index': 0,\n", + " 'message': {'content': 'The capital of France is Paris.',\n", + " 'role': 'assistant',\n", + " 'tool_calls': None,\n", + " 'function_call': None,\n", + " 'annotations': []}}],\n", + " 'usage': {'completion_tokens': 8,\n", + " 'prompt_tokens': 14,\n", + " 'total_tokens': 22,\n", + " 'completion_tokens_details': {'accepted_prediction_tokens': 0,\n", + " 'audio_tokens': 0,\n", + " 'reasoning_tokens': 0,\n", + " 'rejected_prediction_tokens': 0},\n", + " 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}},\n", + " 'service_tier': 'default'}" + ] + }, + "execution_count": 242, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "res1 = call_model(\"what is the capital of france?\")\n", + "res2 = call_model(\"what is the capital of france?\")\n", + "\n", + "assert 
res1.json() == res2.json()\n", + "\n", + "res1.json()" + ] + }, + { + "cell_type": "markdown", + "id": "e121e215", + "metadata": { + "id": "e121e215" + }, + "source": [ + "## 4 · Semantic caching\n", + "\n", + "Now we'll demonstrate semantic caching by sending similar prompts back to back. The first request should hit the LLM API, while future requests should be served from cache as long as they are similar enough. We'll see this reflected in the response times.\n", + "\n", + "First, we need to stop the running proxy and update the LiteLLM config." + ] + }, + { + "cell_type": "code", + "execution_count": 243, + "id": "iX5F90uWCpuY", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "iX5F90uWCpuY", + "outputId": "6ba29c04-a9f1-48f0-ae59-8fd059419fa7" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "-15" + ] + }, + "execution_count": 243, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Stop the proxy process\n", + "_proxy_handle.terminate()\n", + "_proxy_handle.wait(timeout=4)" + ] + }, + { + "cell_type": "code", + "execution_count": 244, + "id": "MpcYlHdSCvQE", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MpcYlHdSCvQE", + "outputId": "666254d5-4d3e-4af2-e003-60a0c70ae29c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting litellm_redis.yml\n" + ] + } + ], + "source": [ + "%%writefile litellm_redis.yml\n", + "model_list:\n", + "- litellm_params:\n", + " api_key: os.environ/OPENAI_API_KEY\n", + " model: gpt-3.5-turbo\n", + " rpm: 30\n", + " model_name: gpt-3.5-turbo\n", + "- litellm_params:\n", + " api_key: os.environ/OPENAI_API_KEY\n", + " model: gpt-4o-mini\n", + " rpm: 30\n", + " model_name: gpt-4o-mini\n", + "- litellm_params:\n", + " api_key: os.environ/OPENAI_API_KEY\n", + " model: text-embedding-3-small\n", + " model_name: text-embedding-3-small\n", + "\n", + "litellm_settings:\n", + " cache: True\n", + " set_verbose: True\n", + " cache_params:\n", + " type: redis-semantic\n", + " host: os.environ/REDIS_HOST\n", + " port: os.environ/REDIS_PORT\n", + " password: os.environ/REDIS_PASSWORD\n", + " ttl: 60\n", + " similarity_threshold: 0.90\n", + " redis_semantic_cache_embedding_model: text-embedding-3-small\n", + " redis_semantic_cache_index_name: llmcache" + ] + }, + { + "cell_type": "code", + "execution_count": 245, + "id": "9Ak-jWcXC6dq", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9Ak-jWcXC6dq", + "outputId": "eec709e6-075a-4c23-b6d4-c2ed59a4fd02" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ LiteLLM proxy on http://localhost:4000 (PID 63528)\n", + " Logs → /content/litellm_proxy.log\n" + ] + } + ], + "source": [ + "_proxy_handle = start_proxy()" + ] + }, + { + "cell_type": "markdown", + "id": "4sf49YkOnhww", + "metadata": { + "id": "4sf49YkOnhww" + }, + "source": [ + "Semantic cache can handle exact match scenarios (where the characters/tokens are identical). This would happen more in a development environment or in cases where a programmatic user is providing input to an LLM call." 
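+    ,
+    "\n",
+    "Rather than eyeballing latencies, you can confirm a semantic cache hit programmatically: LiteLLM serves the stored completion back verbatim, so the response `id` matches the one cached earlier (as the repeated `chatcmpl-…` ids in the demos below show). The sketch that follows is a minimal illustration, not part of the proxy setup — it assumes the proxy from the previous cell is running on `localhost:4000`, reuses the same `/chat/completions` endpoint as `call_model`, and the helper name `ids_for` plus the second phrasing are purely illustrative. Whether the paraphrase is actually served from cache depends on it clearing the `similarity_threshold` of 0.90 configured above.\n",
+    "\n",
+    "```python\n",
+    "import requests\n",
+    "\n",
+    "def ids_for(prompts, model=\"gpt-4o-mini\"):\n",
+    "    \"\"\"Return the completion id for each prompt via the local LiteLLM proxy (illustrative helper).\"\"\"\n",
+    "    ids = []\n",
+    "    for p in prompts:\n",
+    "        r = requests.post(\n",
+    "            \"http://localhost:4000/chat/completions\",\n",
+    "            json={\"model\": model, \"messages\": [{\"role\": \"user\", \"content\": p}]},\n",
+    "            timeout=30,\n",
+    "        )\n",
+    "        r.raise_for_status()\n",
+    "        ids.append(r.json()[\"id\"])\n",
+    "    return ids\n",
+    "\n",
+    "# Two phrasings of the same question. If the second clears the 0.90\n",
+    "# similarity threshold, both responses come from the cache entry created\n",
+    "# by the first call, so the ids match.\n",
+    "ids = ids_for([\n",
+    "    \"what is the capital city of the United States?\",\n",
+    "    \"what's the US capital city?\",\n",
+    "])\n",
+    "print(\"semantic cache hit:\", ids[0] == ids[1])\n",
+    "```\n",
+    "\n",
+    "Comparing ids is a stricter check than comparing latencies, since network jitter alone can make a fresh call look fast."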
+ ] + }, + { + "cell_type": "code", + "execution_count": 246, + "id": "c08699fc", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "c08699fc", + "outputId": "1ef29ae8-6fd6-4cff-909f-0da1874dbe60" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 1.35s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.37s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.53s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.47s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.36s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.24s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.39s \n", + "\n", + "The capital city of the United States is Washington, D.C.\n", + "chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ -- gpt-4o-mini-2024-07-18 -- latency: 0.28s \n", + "\n", + "379 ms ± 94.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit\n", + "\n", + "call_model(\"what is the capital city of the United States?\")" + ] + }, + { + "cell_type": "markdown", + "id": "mQTzCNvCFHRJ", + "metadata": { + "id": "mQTzCNvCFHRJ" + }, + "source": [ + "Additional (or variable) latency here per check is due to using OpenAI embeddings which makes calls over the network. A more optimized solution would be to use a more scalable embedding inference system OR a localized model that doesn't require a network hop.\n", + "\n", + "The semantic cache can also be used for near exact matches (fuzzy caching) based on semantic meaning. Below are a few scenarios:" + ] + }, + { + "cell_type": "code", + "execution_count": 258, + "id": "v5lkpxafr7ot", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "v5lkpxafr7ot", + "outputId": "c00f3c88-e72d-4195-fd64-84bccf2ae185" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "As of my last update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017. However, please verify with a current source, as political positions can change.\n", + "chatcmpl-BUdHNxLLb7HBmnTUUHRQpxWBVhGAI -- gpt-4o-mini-2024-07-18 -- latency: 2.37s \n", + "\n", + "As of my last knowledge update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017, and was re-elected for a second term in April 2022. Please verify with up-to-date sources, as political situations can change.\n", + "chatcmpl-BUdHOz7UCsO4KKKcDfx8ZGv2LJ6dZ -- gpt-4o-mini-2024-07-18 -- latency: 1.38s \n", + "\n", + "As of my last update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017. 
However, please verify with a current source, as political positions can change.\n", + "chatcmpl-BUdHNxLLb7HBmnTUUHRQpxWBVhGAI -- gpt-4o-mini-2024-07-18 -- latency: 0.65s \n", + "\n", + "As of my last update in October 2023, the President of France is Emmanuel Macron. He has been in office since May 14, 2017. However, please verify with a current source, as political positions can change.\n", + "chatcmpl-BUdHNxLLb7HBmnTUUHRQpxWBVhGAI -- gpt-4o-mini-2024-07-18 -- latency: 0.60s \n", + "\n" + ] + } + ], + "source": [ + "texts = [\n", + " \"who is the president of France?\",\n", + " \"who is the country president of France?\",\n", + " \"who is France's current presidet?\",\n", + " \"The current president of France is?\"\n", + "]\n", + "\n", + "for text in texts:\n", + " res = call_model(text)" + ] + }, + { + "cell_type": "markdown", + "id": "-akCGqYkqGVs", + "metadata": { + "id": "-akCGqYkqGVs" + }, + "source": [ + "## 5 · Inspect Redis Index with RedisVL\n", + "Use the `redisvl` helpers and CLI to investigate more about the underlying vector index that supports the checks within the LiteLLM proxy." + ] + }, + { + "cell_type": "code", + "execution_count": 248, + "id": "RntBqIlipyHA", + "metadata": { + "id": "RntBqIlipyHA" + }, + "outputs": [], + "source": [ + "from redisvl.index import SearchIndex\n", + "\n", + "idx = SearchIndex.from_existing(redis_client=client, name=\"llmcache\")" + ] + }, + { + "cell_type": "code", + "execution_count": 249, + "id": "tHVIHkXCqU7V", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "tHVIHkXCqU7V", + "outputId": "f68ad535-0f9d-4467-e0c7-bbf9ca271915" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 249, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "idx.exists()" + ] + }, + { + "cell_type": "code", + "execution_count": 250, + "id": "8mNvmr7op-B-", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8mNvmr7op-B-", + "outputId": "ea0535f7-e6fa-490e-8a8d-288572d7170d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[32m17:52:13\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Using Redis address from environment variable, REDIS_URL\n", + "\n", + "\n", + "Index Information:\n", + "╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮\n", + "│ Index Name │ Storage Type │ Prefixes │ Index Options │ Indexing │\n", + "├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤\n", + "│ llmcache │ HASH │ ['llmcache'] │ [] │ 0 │\n", + "╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯\n", + "Index Fields:\n", + "╭───────────────┬───────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮\n", + "│ Name │ Attribute │ Type │ Field Option │ Option Value │ Field Option │ Option Value │ Field Option │ Option Value │ Field Option │ Option Value │\n", + "├───────────────┼───────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────────────────┼────────────────┤\n", + "│ prompt │ prompt │ TEXT │ WEIGHT │ 1 │ │ │ │ │ │ │\n", + "│ response │ response │ TEXT │ WEIGHT │ 1 │ │ │ │ │ │ │\n", + "│ inserted_at │ inserted_at │ NUMERIC │ │ │ │ │ │ │ │ │\n", + "│ updated_at │ updated_at │ NUMERIC │ │ │ │ │ │ │ │ │\n", + "│ prompt_vector │ 
prompt_vector │ VECTOR │ algorithm │ FLAT │ data_type │ FLOAT32 │ dim │ 1536 │ distance_metric │ COSINE │\n", + "╰───────────────┴───────────────┴─────────┴────────────────┴────────────────┴────────────────┴────────────────┴────────────────┴────────────────┴─────────────────┴────────────────╯\n" + ] + } + ], + "source": [ + "!rvl index info -i llmcache" + ] + }, + { + "cell_type": "markdown", + "id": "00bd3fc6", + "metadata": { + "id": "00bd3fc6" + }, + "source": [ + "### Examining the Cached Keys in Redis\n", + "\n", + "Let's look at the keys created in Redis for the cache and understand how LiteLLM structures them:" + ] + }, + { + "cell_type": "code", + "execution_count": 251, + "id": "46eb6aa5", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "46eb6aa5", + "outputId": "bfae071a-b8c4-44bd-8672-0bbddc170027" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found 1 cache keys in Redis\n", + "\n", + "Example cache key: llmcache:e4e4faaeea347b9876d03c4f68b7d981234a3a7a4281590ab4bc0e70dbdaef9e\n", + "TTL: 55 seconds remaining...\n", + "{'response': '{\\'timestamp\\': 1746640328.978919, \\'response\\': \\'{\"id\":\"chatcmpl-BUdE8A9yQyijCBN4Agg5QJxsrifUJ\",\"created\":1746640328,\"model\":\"gpt-4o-mini-2024-07-18\",\"object\":\"chat.completion\",\"system_fingerprint\":\"fp_dbaca60df0\",\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"message\":{\"content\":\"The capital city of the United States is Washington, D.C.\",\"role\":\"assistant\",\"tool_calls\":null,\"function_call\":null,\"annotations\":[]}}],\"usage\":{\"completion_tokens\":14,\"prompt_tokens\":17,\"total_tokens\":31,\"completion_tokens_details\":{\"accepted_prediction_tokens\":0,\"audio_tokens\":0,\"reasoning_tokens\":0,\"rejected_prediction_tokens\":0},\"prompt_tokens_details\":{\"audio_tokens\":0,\"cached_tokens\":0}},\"service_tier\":\"default\"}\\'}', 'prompt_vector': b'\\xccY/=\\xbf0\\x00\\xbdd\\x0f\\xa2=X\\xa5\\xc8=\\x1f\\t-\\xbc\\\\\\x1d\\x1b\\xbc^\\xda\\xdb\\xbc\\x02\\xfc<@\\xbc\\xe8h\\xb4<\\xaf\\x8bn\\xbc\\x91Ad\\xbcP\\xf2\\xf0;}$\\xe6\\xbc\\xf2V\\x11\\xbdk\\x03>\\xbc\\xe6l\\x91\\xbd\\xaf\\xcc\\xe5\\xbc\\xaa\\x15\\x17<\\x90\\xc3\\x05\\xbc\\xb4\\x83\\xe7\\xb9\\t\\xaf\\x14=\\xe9\\'=\\xbc\\xc8\\xe1\\x0f<\\xf6P\\x1f\\xbb^\\xda\\x0e\\xbd\\x8c\\x8a\\xe2\\xb9\\xfb\\x07n;\\x7f\\xe1\\x8c\\xbcts\\x89=\\x95zT\\xbb&<\\xab\\xbb\\xe6l\\x11=h\\x89\\xd6\\xbc\\x9b\\xaf\\x9a\\xbb\\xfe\\x01/=\\xba\\xf9$\\xbdSn\\xa0\\xbb\\xad\\x8f\\xcb\\xb9\\xa7Z89\\xbds\\x0c<\\xa6\\xdcs<\\xf4\\x93+=v0\\xca\\xbb[\\xe0\\x00<\\xbf\\xb0s\\xbc1\\xa8\\xe6;\\xda\\x80\\xc9\\xbd(\\xf9\\x1e<\\xb6\\xc04\\xbdSn 
;\\x91A\\x97\\xbd\\xc1m\\x9a;\\xd2O`<\\xd8\\x84\\xa6:xmd=c\\x91\\x10\\xbc\\xe3\\xb1\\xff\\xbc\\xc9\\x9e\\x03=\\xdfx\\xc2\\xbc\\x1d\\xcc\\x92\\xbaQ1\\x86<\\x88Q%\\xbc\\xaf\\xcc\\xe5:ts\\x89\\xbc\\xc9_!\\xbd\\x8c\\x8a\\xe2\\xbc\\x82\\xdb\\xe7\\xbc\\xa6\\x9b/=\\xe3p;\\xba\\xdf\\xf8\\x1b\\xbc\\xef\\x1bY\\xbb%\\xbe\\x99\\xbc\\x9f\\xa7`\\xbd\\xbd\\xb4\\x03<\\xb2\\xc6&\\xbdc\\xd2\\x87\\xbc\\xc2*[<\\x85UO<\\x18\\x15\\x91\\xbbL9\\x8d<\\xe9\\'\\xbd;aTC\\xbbN\\xf6M={\\xe7\\xcb\\xbc\\xf2\\x17\\xaf\\xbb\\x055z\\xbc@\\x0e\\x16<\\xb5B\\xf0<=\\x14\\x08\\xbcc\\x91\\x90\\xbcR\\xaf\\x97<\\x1a\\x114=\\x13^\\x0f=\\xdd|\\x1f\\xbd|\\xa6\\xd4\\xbc\\xfd\\xc4\\x14\\xbd\\xb4\\x83\\x9a\\xbcO\\xb5\\x89\\xba..2=\\':c\\xbc\\x96\\xf8\\xe5<\\xdc\\xfe\\x8d<\\xb9:i\\xbd\\x1b\\xd0<\\xbd`\\x97\\x82;\\xd0\\x92\\x1f;\\x03zN\\xbc+\\xf3\\xac\\xbb\\xe4\\xaf\\x9d;\\xeb#\\x93\\xbd\\x9f\\xa7`:\\xb1\\x89\\x0c\\xbd\\xa5^\\x15<=\\x94\\xae\\xbc\\xb3\\xc4\\xde<\\x1c\\rW\\xc0<\\xb0\\xca\\x03<\\x9c-,=\\xc6\\xa4B\\xbc3e\\x8dS\\xb7<\\xba\\xf9\\xf1\\xbb\\xe7\\xa9\\xf8\\x12@\\xc0;\\xb3F\\x00\\xbd-\\xb0\\xed\\xbbJ\\xbd\\xdd<0k\\xcc<\\x7f\\xe1\\x0c=\\xc2\\xeb+;_\\x99\\x97<\\x16X\\x9d<\\x83\\xd9\\x05\\xbd5\"\\xce\\xbb\\x87\\x92\\xe9\\xbc\\xd2\\x0e\\xe9S7=\\x8a\\xcd\\xa1<\\xf2\\x17/\\xbc\\x98\\xb5\\x0c=9\\x1a\\xc7;\\xacR1S\\xb7<\\xead\\x8a 0 and lat2 > 0:\n", - " print(f\" Speed improvement: {lat1/lat2:.1f}x faster\")\n", - " print(f\" Time saved: {lat1 - lat2:.3f}s\")" - ] - }, - { - "cell_type": "markdown", - "id": "00bd3fc6", - "metadata": {}, - "source": [ - "### Examining the Cached Keys in Redis\n", - "\n", - "Let's look at the keys created in Redis for the exact cache and understand how LiteLLM structures them:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46eb6aa5", - "metadata": {}, - "outputs": [], - "source": [ - "# Get all keys related to LiteLLM cache\n", - "cache_keys = list(r.scan_iter(match=\"litellm:cache:*\"))\n", - "print(f\"Found {len(cache_keys)} cache keys in Redis\")\n", - "\n", - "if cache_keys:\n", - " # Look at the first key\n", - " first_key = cache_keys[0]\n", - " print(f\"\\nExample cache key: {first_key}\")\n", - " \n", - " # Get TTL for the key\n", - " ttl = r.ttl(first_key)\n", - " print(f\"TTL: {ttl} seconds\")\n", - " \n", - " # Get the value (may be large, so limiting output)\n", - " value = r.get(first_key)\n", - " if value:\n", - " print(f\"Value type: {type(value).__name__}\")\n", - " print(f\"Value size: {len(value)} characters\")\n", - " try:\n", - " # Try to parse as JSON for better display\n", - " parsed = json.loads(value[:1000] + '...' 
if len(value) > 1000 else value)\n", - " print(f\"Content preview (JSON): {json.dumps(parsed, indent=2)[:300]}...\")\n", - " except:\n", - " print(f\"Content preview (raw): {value[:100]}...\")" - ] - }, - { - "cell_type": "markdown", - "id": "8959ff3d", - "metadata": {}, - "source": [ - "### Benchmarking Cached Response Times\n", - "\n", - "Now, let's precisely measure the cached response time using multiple repeated requests:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "007710d7", - "metadata": {}, - "outputs": [], - "source": [ - "# Benchmark cached response time with more samples\n", - "def benchmark_cached_query(query, runs=5):\n", - " times = []\n", - " print(f\"Benchmarking cached query: '{query}'\")\n", - " print(f\"Running {runs} iterations...\")\n", - " \n", - " for i in range(runs):\n", - " start = time.time()\n", - " elapsed, resp = chat(query, verbose=False)\n", - " times.append(elapsed)\n", - " cache_status = resp.headers.get(\"X-Cache\", \"MISS\")\n", - " print(f\" Run {i+1}: {elapsed:.4f}s | Cache: {cache_status}\")\n", - " \n", - " avg_time = sum(times) / len(times)\n", - " print(f\"\\nAverage response time: {avg_time:.4f}s\")\n", - " print(f\"Min: {min(times):.4f}s | Max: {max(times):.4f}s\")\n", - " return avg_time\n", - "\n", - "# Run the benchmark\n", - "benchmark_cached_query(\"What are three benefits of Redis for LLM applications?\")" - ] - }, - { - "cell_type": "markdown", - "id": "67888d4e", - "metadata": {}, - "source": [ - "## 5 · Semantic Cache Demonstration\n", - "\n", - "Semantic caching is more powerful than exact caching because it can identify semantically similar prompts, not just identical ones. This is implemented using vector embeddings and similarity search in Redis.\n", - "\n", - "Let's test it by sending a prompt that is semantically similar (but not identical) to our previous query:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5c5ca8ac", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"🧪 Semantic Cache Experiment\")\n", - "\n", - "# First, let's send a new query that will be stored in the semantic cache\n", - "print(\"\\n1️⃣ Establishing a baseline query for semantic cache:\")\n", - "lat1, res1 = chat(\"Tell me a useful application of Redis for AI systems\")\n", - "\n", - "# Now send a semantically similar query\n", - "print(\"\\n2️⃣ Testing a semantically similar query:\")\n", - "lat2, res2 = chat(\"What's a good use case for Redis in artificial intelligence?\")\n", - "\n", - "# Try a completely different query\n", - "print(\"\\n3️⃣ Testing an unrelated query (should not hit semantic cache):\")\n", - "lat3, res3 = chat(\"How to make chocolate chip cookies?\")\n", - "\n", - "print(f\"\\n🔍 Performance Analysis:\")\n", - "print(f\" Original query: {lat1:.3f}s\")\n", - "print(f\" Similar query: {lat2:.3f}s\")\n", - "print(f\" Unrelated query: {lat3:.3f}s\")\n", - "\n", - "sim_cache_hit = \"HIT\" in res2.headers.get(\"X-Semantic-Cache\", \"MISS\")\n", - "if sim_cache_hit and lat1 > 0 and lat2 > 0:\n", - " print(f\" Speed improvement: {lat1/lat2:.1f}x faster for semantically similar query\")" - ] - }, - { - "cell_type": "markdown", - "id": "2566c681", - "metadata": {}, - "source": [ - "### Examining Semantic Cache Keys\n", - "\n", - "Let's look at the keys and indices created in Redis for the semantic cache:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6d5be0e", - "metadata": {}, - "outputs": [], - "source": [ - "# Check semantic cache keys\n", - "semantic_keys 
= list(r.scan_iter(match=\"litellm:semantic*\"))\n", - "print(f\"Found {len(semantic_keys)} semantic cache keys in Redis\")\n", - "\n", - "if semantic_keys:\n", - " # Display the first few keys\n", - " for key in semantic_keys[:5]:\n", - " print(f\" - {key}\")\n", - " \n", - " # Check for Redis Search indices\n", - " try:\n", - " indices = r.execute_command(\"FT._LIST\")\n", - " print(f\"\\nRedis Search indices: {indices}\")\n", - " \n", - " # Get info about the semantic cache index if it exists\n", - " semantic_index = [idx for idx in indices if \"semantic\" in idx.lower()]\n", - " if semantic_index:\n", - " index_info = r.execute_command(f\"FT.INFO {semantic_index[0]}\")\n", - " print(f\"\\nSemantic Index Info:\")\n", - " # Format and display selected info\n", - " info_dict = {index_info[i]: index_info[i+1] for i in range(0, len(index_info), 2) if i+1 < len(index_info)}\n", - " for k in ['num_docs', 'num_terms', 'index_name', 'index_definition']:\n", - " if k in info_dict:\n", - " print(f\" {k}: {info_dict[k]}\")\n", - " except Exception as e:\n", - " print(f\"Error accessing Redis Search indices: {e}\")" - ] - }, - { - "cell_type": "markdown", - "id": "semantic-cache-explain", - "metadata": {}, - "source": [ - "### How Semantic Caching Works\n", - "\n", - "LiteLLM's semantic caching works through these steps:\n", - "1. When a query arrives, LiteLLM generates an embedding vector for the query using the configured model\n", - "2. This vector is searched against previously stored vectors in Redis using cosine similarity\n", - "3. If a match is found with similarity above the threshold (we set 0.9), the cached response is returned\n", - "4. If not, the query is sent to the LLM API and the result is cached with its vector\n", - "\n", - "This approach is especially valuable for:\n", - "- Applications with many similar but not identical queries\n", - "- Customer support systems where questions vary in phrasing but seek the same information\n", - "- Educational applications where different students may ask similar questions" - ] - }, - { - "cell_type": "markdown", - "id": "4d4cc017", - "metadata": {}, - "source": [ - "## 6 · Multi-Model Routing with LiteLLM Proxy\n", - "\n", - "Our configuration enables access to multiple models through a single endpoint. 
Let's test both the configured models to verify they work:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c21192be", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"🧪 Multi-Model Routing Demonstration\")\n", - "\n", - "models = [\"gpt-3.5-turbo\", \"gpt-4o-mini\"]\n", - "results = {}\n", - "\n", - "for model in models:\n", - " print(f\"\\nTesting model: {model}\")\n", - " lat, res = chat(\"Say hi in two words\", model=model)\n", - " \n", - " if res.status_code == 200:\n", - " response_content = res.json()[\"choices\"][0][\"message\"][\"content\"]\n", - " results[model] = {\n", - " \"latency\": lat,\n", - " \"response\": response_content,\n", - " \"model\": res.json().get(\"model\", model)\n", - " }\n", - " print(f\"✅ Success | Response: '{response_content}'\")\n", - " else:\n", - " print(f\"❌ Error | Status code: {res.status_code}\")\n", - " print(f\"Error message: {res.text}\")\n", - "\n", - "# Compare the models\n", - "if len(results) > 1:\n", - " print(\"\\n📊 Model Comparison:\")\n", - " for model, data in results.items():\n", - " print(f\" {model}: {data['latency']:.2f}s - '{data['response']}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "0b0e740a", - "metadata": {}, - "source": [ - "## 7 · Testing Failure Modes\n", - "\n", - "Let's examine how the proxy handles error conditions, which is important for building robust applications." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3effa8a0", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"🧪 Testing Error Handling\")\n", - "\n", - "# Test with an unsupported model\n", - "print(\"\\n1️⃣ Testing with non-existent model:\")\n", - "_, bad_model_resp = chat(\"test\", model=\"gpt-nonexistent-001\")\n", - "print(f\"Status: {bad_model_resp.status_code}\")\n", - "print(f\"Error message: {json.dumps(bad_model_resp.json(), indent=2)}\")" - ] - }, - { - "cell_type": "markdown", - "id": "cbed4fa7", - "metadata": {}, - "source": [ - "### Testing Rate Limiting\n", - "\n", - "The LiteLLM proxy includes rate limiting functionality, which helps protect your API keys from overuse. 
Let's test this by sending requests rapidly until we hit the rate limit:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db464c0f", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"🧪 Testing Rate Limiting\")\n", - "print(\"Sending multiple requests with the same user ID to trigger rate limiting...\")\n", - "\n", - "for i in range(5):\n", - " _, r2 = chat(f\"Request {i+1}\", user=\"test-rate-limit\")\n", - " remaining = r2.headers.get(\"X-Rate-Limit-Remaining\", \"unknown\")\n", - " limit_reset = r2.headers.get(\"X-Rate-Limit-Reset\", \"unknown\")\n", - " \n", - " print(f\"Request {i+1}: Status {r2.status_code} | Remaining: {remaining} | Reset: {limit_reset}\")\n", - " \n", - " if r2.status_code == 429:\n", - " print(f\"Rate limit reached after {i+1} requests!\")\n", - " print(f\"Error response: {json.dumps(r2.json(), indent=2)}\")\n", - " break\n", - " \n", - " time.sleep(0.5) # Small delay to see rate limiting in action" - ] - }, - { - "cell_type": "markdown", - "id": "implementation-alternatives", - "metadata": {}, - "source": [ - "## 8 · Implementation Options\n", - "\n", - "LiteLLM provides multiple ways to implement caching in your application:\n", - "\n", - "### Using LiteLLM Proxy (as shown)\n", - "\n", - "The proxy approach (demonstrated in this notebook) is recommended for production deployments because it:\n", - "- Provides a unified API endpoint for all your models\n", - "- Centralizes caching, rate-limiting, and fallback logic\n", - "- Works with any client that uses the OpenAI API format\n", - "- Supports multiple languages and frameworks\n", - "\n", - "### Direct Integration with LiteLLM Python SDK\n", - "\n", - "For Python applications, you can also integrate caching directly using the SDK. See the [LiteLLM Caching documentation](https://docs.litellm.ai/docs/caching/all_caches) for details." - ] - }, - { - "cell_type": "markdown", - "id": "117e0229", - "metadata": {}, - "source": [ - "## 9 · Cleanup\n", - "\n", - "Let's stop the LiteLLM proxy server and clean up our environment:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9b7ce8f7", - "metadata": {}, - "outputs": [], - "source": [ - "%%bash\n", - "# Find and stop the LiteLLM process\n", - "echo \"Stopping LiteLLM Proxy...\"\n", - "litellm_pid=$(ps aux | grep \"litellm --config\" | grep -v grep | awk '{print $2}')\n", - "if [ -n \"$litellm_pid\" ]; then\n", - " kill $litellm_pid\n", - " echo \"Stopped LiteLLM Proxy (PID: $litellm_pid)\"\n", - "else\n", - " echo \"LiteLLM Proxy not found running\"\n", - "fi\n", - "\n", - "# Optionally stop Redis if you started it just for this notebook\n", - "# Note: Comment this out if you want to keep Redis running\n", - "# redis-cli shutdown" - ] - }, - { - "cell_type": "markdown", - "id": "conclusion", - "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "In this notebook, we've demonstrated how to:\n", - "\n", - "1. **Set up LiteLLM Proxy** with Redis for caching and rate limiting\n", - "2. **Configure exact and semantic caching** to improve performance\n", - "3. **Measure the performance benefits** of caching LLM responses\n", - "4. **Route requests to multiple models** through a single endpoint\n", - "5. 
**Test error handling and rate limiting** behavior\n", - "\n", - "The benchmarks clearly show that implementing caching with Redis can significantly reduce response times and API costs, making it an essential component of production LLM applications.\n", - "\n", - "For more information, see the [LiteLLM documentation](https://docs.litellm.ai/docs/proxy/caching) and [Redis documentation](https://redis.io/docs/)." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/python-recipes/gateway/litellm_redis.yml b/python-recipes/gateway/litellm_redis.yml deleted file mode 100644 index 412f9d0..0000000 --- a/python-recipes/gateway/litellm_redis.yml +++ /dev/null @@ -1,15 +0,0 @@ -litellm_settings: - cache: true - cache_params: - host: localhost - password: '' - port: '6379' - type: redis - set_verbose: true -model_list: -- litellm_params: - model: gpt-3.5-turbo - model_name: openai-old -- litellm_params: - model: gpt-4o - model_name: openai-new