# Using RouteLLM to Optimize LLM Usage

RouteLLM is a flexible framework for serving and evaluating LLM routers, designed to maximize performance while minimizing cost.

Key features:

* Seamless integration: Acts as a drop-in replacement for the OpenAI client or runs as an OpenAI-compatible server, routing simpler queries to cheaper models.

* Pre-trained routers out of the box: The project reports cost reductions of up to 85% while preserving 95% of GPT-4 performance on widely used benchmarks like MT-Bench.

* Cost-effective quality: The project also reports matching the performance of leading commercial offerings while being over 40% cheaper.

* Extensible and customizable: Easily add new routers, tune thresholds, and compare performance across multiple benchmarks.

In this tutorial, we’ll walk through how to:

* Load and use a pre-trained router.

* Calibrate it for your own use case.

* Test routing behavior on different types of prompts.

## Installing the dependencies

In [6]:
!pip install "routellm[serve,eval]"

Collecting routellm[eval,serve]
  Downloading routellm-0.2.0-py3-none-any.whl.metadata (14 kB)
Collecting litellm (from routellm[eval,serve])
  Downloading litellm-1.75.4-py3-none-any.whl.metadata (40 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.8/40.8 kB 1.6 MB/s eta 0:00:00
Collecting pandarallel (from routellm[eval,serve])
  Downloading pandarallel-1.6.5.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting sglang (from routellm[eval,serve])
  Downloading sglang-0.4.10.post2-py3-none-any.whl.metadata (27 kB)
Collecting shortuuid (from routellm[eval,serve])
  Downloading shortuuid-1.0.13-py3-none-any.whl.metadata (5.8 kB)
Collecting openai (from routellm[eval,serve])
  Downloading openai-1.99.6-py3-none-any.whl.metadata (29 kB)
Collecting python-dotenv>=0.2.0 (from litellm->routellm[eval,serve])
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting setproctitle (from sglang->routellm[eva

## Loading OpenAI API Key
To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

RouteLLM leverages LiteLLM to support chat completions from a wide range of both open-source and closed-source models. You can check out the list of providers at https://litellm.vercel.app/docs/providers if you want to use some other model.

In [2]:
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Enter OpenAI API Key: ··········
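
When the notebook is run non-interactively (for example as a script), a `getpass` prompt blocks execution. The following is a minimal sketch, assuming the key may already be exported in the environment; `ensure_api_key` is a hypothetical helper written for illustration and is not part of RouteLLM or LiteLLM.

```python
import os
from getpass import getpass


def ensure_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not a RouteLLM API.
    """
    key = os.environ.get(var_name)
    if not key:
        # Fall back to an interactive prompt, as in the cell above.
        key = getpass(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` once before creating the controller keeps the interactive behaviour of the cell above while letting pre-configured environments skip the prompt.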


## Downloading Config File
RouteLLM uses a configuration file to locate pretrained router checkpoints and the datasets they were trained on.
This file tells the system where to find the models that decide whether to send a query to the strong or weak model.

### Do I need to edit it?
For most users — no. The default config already points to well-trained routers (mf, bert, causal_llm) that work out of the box.
You only need to change it if you plan to:

* Train your own router on a custom dataset.

* Replace the routing algorithm entirely with a new one.

For this tutorial, we’ll keep the config as is and simply:

* Set our strong and weak model names in code.

* Add our API keys for the chosen providers.

* Use a calibrated threshold to balance cost and quality.

In [50]:
!wget https://raw.githubusercontent.com/lm-sys/RouteLLM/main/config.example.yaml

--2025-08-10 14:22:39--  https://raw.githubusercontent.com/lm-sys/RouteLLM/main/config.example.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 417 [text/plain]
Saving to: ‘config.example.yaml’


2025-08-10 14:22:39 (14.1 MB/s) - ‘config.example.yaml’ saved [417/417]



## Initializing the RouteLLM Controller
In this code block, we import the necessary libraries and initialize the RouteLLM Controller, which will manage how prompts are routed between models. We specify routers=["mf"] to use the Matrix Factorization router, a pretrained decision model that predicts whether a query should be sent to the strong or weak model.

The strong_model parameter is set to **"gpt-5"**, a high-quality but more expensive model, while the weak_model parameter is set to **"o4-mini"**, a faster and cheaper alternative. For each incoming prompt, the router predicts how likely the strong model is to give the better answer, compares that score against a threshold, and picks a model accordingly: simple tasks go to the cheaper model, while more challenging ones get the stronger model’s capabilities.

This configuration allows you to balance cost efficiency and response quality without manual intervention.
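
The decision rule described above can be sketched in a few lines. This is an illustration of the idea only, not RouteLLM's internal code: `choose_model` is a hypothetical function, and treating a score exactly equal to the threshold as "strong" is an assumption.

```python
def choose_model(
    win_rate: float,
    threshold: float,
    strong_model: str = "gpt-5",
    weak_model: str = "o4-mini",
) -> str:
    """Pick the strong model when the predicted win rate reaches the threshold.

    Illustrative sketch; the >= comparison at the boundary is an assumption.
    """
    return strong_model if win_rate >= threshold else weak_model


# A higher threshold sends fewer prompts to the strong model.
print(choose_model(0.30, threshold=0.24))
print(choose_model(0.13, threshold=0.24))
```

Raising the threshold lowers cost at the risk of weaker answers on borderline prompts; lowering it does the opposite.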

In [25]:
import os
import pandas as pd
from routellm.controller import Controller

client = Controller(
    routers=["mf"],  # Matrix Factorization router
    strong_model="gpt-5",
    weak_model="o4-mini"
)


In [37]:
!python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.1 --config config.example.yaml

For 10.0% strong model calls for mf, threshold = 0.24034


This command runs RouteLLM’s threshold calibration process for the Matrix Factorization (mf) router. The --strong-model-pct 0.1 argument tells the system to find the threshold value that routes roughly 10% of queries to the strong model (and the rest to the weak model).

Using the --config config.example.yaml file for model and router settings, the calibration reported:

**For 10% strong model calls with mf, the calibrated threshold is 0.24034.**

This means that any query whose router-predicted win rate (the estimated probability that the strong model gives the better answer) is above 0.24034 will be sent to the strong model, while those below it will go to the weak model. The 10% target holds on the data used for calibration, so the share routed to the strong model on your own prompts may differ.
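
Conceptually, calibration amounts to picking a cut-off on a set of router scores so that the desired fraction falls at or above it. The sketch below shows that idea on made-up scores; `calibrate_threshold` here is a hypothetical function and should not be read as the implementation behind `routellm.calibrate_threshold`.

```python
def calibrate_threshold(scores: list[float], strong_model_pct: float) -> float:
    """Return a cut-off such that roughly `strong_model_pct` of scores are >= it."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < strong_model_pct <= 1:
        raise ValueError("strong_model_pct must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of prompts that should go to the strong model (at least one).
    n_strong = max(1, round(strong_model_pct * len(ranked)))
    return ranked[n_strong - 1]


# Made-up scores for illustration only; real calibration uses router
# predictions on a reference dataset.
sample_scores = [0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.31, 0.40]
demo_threshold = calibrate_threshold(sample_scores, strong_model_pct=0.1)
share = sum(s >= demo_threshold for s in sample_scores) / len(sample_scores)
print(demo_threshold, share)
```

Because the cut-off depends on the score distribution it was computed from, recalibrating on prompts that resemble your own traffic gives a more faithful split.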

## Defining the threshold & prompts variables
Here, we define a diverse set of test prompts designed to cover a range of complexity levels.
They include simple factual questions (likely to be routed to the weak model), medium reasoning tasks (borderline threshold cases), and high-complexity or creative requests (more suited for the strong model), along with code generation tasks to test technical capabilities.

In [45]:
threshold = 0.24034

prompts = [
    # Easy factual (likely weak model)
    "Who wrote the novel 'Pride and Prejudice'?",
    "What is the largest planet in our solar system?",

    # Medium reasoning (borderline cases)
    "If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?",
    "Explain why the sky appears blue during the day and red/orange during sunset.",

    # High complexity / creative (likely strong model)
    "Write a 6-line rap verse about climate change using internal rhyme.",
    "Summarize the differences between supervised, unsupervised, and reinforcement learning with examples.",

    # Code generation
    "Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.",
    "Generate SQL to find the top 3 highest-paying customers from a 'sales' table."
]


## Evaluating Win Rate
The following code calculates the win rate for each test prompt using the mf router, showing the likelihood that the strong model will outperform the weak model.
Based on the calibrated threshold of 0.24034, two prompts —

**"If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?"** (0.303087)

**"Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces."** (0.272534)

— exceed the threshold and would be routed to the strong model.
All other prompts remain below the threshold, meaning they would be served by the weaker, cheaper model.

In [47]:
win_rates = client.batch_calculate_win_rate(prompts=pd.Series(prompts), router="mf")

# Store results in DataFrame
_df = pd.DataFrame({
    "Prompt": prompts,
    "Win_Rate": win_rates
})

# Show full text without truncation
pd.set_option('display.max_colwidth', None)

In [48]:
_df

| | Prompt | Win_Rate |
|---|---|---|
| 0 | Who wrote the novel 'Pride and Prejudice'? | 0.175543 |
| 1 | What is the largest planet in our solar system? | 0.129442 |
| 2 | If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM? | 0.303087 |
| 3 | Explain why the sky appears blue during the day and red/orange during sunset. | 0.08488 |
| 4 | Write a 6-line rap verse about climate change using internal rhyme. | 0.135652 |
| 5 | Summarize the differences between supervised, unsupervised, and reinforcement learning with examples. | 0.109009 |
| 6 | Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces. | 0.272534 |
| 7 | Generate SQL to find the top 3 highest-paying customers from a 'sales' table. | 0.133232 |


These results also help in fine-tuning the routing strategy — by analyzing the win rate distribution, we can adjust the threshold to better balance cost savings and performance.
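
To see how the split shifts with the threshold, the win rates reported above can be replayed against a few candidate values. This is a small sketch in plain Python; `strong_indices` is a hypothetical helper, the candidate thresholds other than 0.24034 are arbitrary, and the `>=` comparison at the boundary is an assumption.

```python
# Win rates copied from the results table above (row order preserved).
table_win_rates = [
    0.175543, 0.129442, 0.303087, 0.08488,
    0.135652, 0.109009, 0.272534, 0.133232,
]


def strong_indices(win_rates: list[float], threshold: float) -> list[int]:
    """Return the row indices whose win rate reaches the threshold."""
    return [i for i, w in enumerate(win_rates) if w >= threshold]


for candidate in (0.24034, 0.15, 0.10):
    idx = strong_indices(table_win_rates, candidate)
    print(f"threshold={candidate}: {len(idx)}/{len(table_win_rates)} to strong model, rows {idx}")
```

On these eight prompts, lowering the threshold from 0.24034 to 0.10 moves the strong-model share from 2 prompts to 7, which illustrates how sensitive cost is to this single parameter.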

## Routing Prompts Through Calibrated Matrix Factorization (MF) Router
This code iterates over the list of test prompts and sends each one to the RouteLLM controller using the calibrated mf router with the specified threshold (router-mf-{threshold}).

For each prompt, the router decides whether to use the strong or weak model based on the calculated win rate.

The response includes both the generated output and the actual model that was selected by the router.

These details — the prompt, model used, and generated output — are stored in the results list for later analysis.

In [39]:
results = []
for prompt in prompts:
    response = client.chat.completions.create(
        model=f"router-mf-{threshold}",
        messages=[{"role": "user", "content": prompt}]
    )
    message = response.choices[0].message["content"]
    model_used = response.model  # RouteLLM returns the model actually used

    results.append({
        "Prompt": prompt,
        "Model Used": model_used,
        "Output": message
    })


## Displaying the Results

In [41]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(results)
df

Unnamed: 0,Prompt,Model Used,Output
0,Who wrote the novel 'Pride and Prejudice'?,o4-mini-2025-04-16,The novel “Pride and Prejudice” was written by the English author Jane Austen. It was first published in 1813 under the original title “First Impressions.”
1,What is the largest planet in our solar system?,o4-mini-2025-04-16,"The largest planet in our solar system is Jupiter. \n\nKey facts about Jupiter: \n• Diameter: about 142,984 km (≈11 times that of Earth) \n• Mass: roughly 1.90 × 10^27 kg (over 300 times Earth’s mass) \n• Composition: primarily hydrogen and helium (a gas giant) \n• Notable features: the Great Red Spot (a giant storm), faint ring system, and at least 79 known moons (including Ganymede, the largest moon in the solar system)."
2,"If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?",gpt-5-2025-08-07,210 km\n\nCalculation:\n- Time from 3:00 PM to 6:30 PM = 3.5 hours\n- Distance = 60 km/h × 3.5 h = 210 km
3,Explain why the sky appears blue during the day and red/orange during sunset.,o4-mini-2025-04-16,"Sunlight is made of all visible colors, but as it passes through Earth’s atmosphere it doesn’t all reach your eyes the same way. Two key effects explain why the sky looks blue by day and red/orange at sunrise or sunset:\n\n1. Rayleigh scattering and the blue sky \n • Air molecules (mostly nitrogen and oxygen) are much smaller than the wavelength of visible light. \n • Rayleigh scattering says that the amount of scattering varies inversely with the fourth power of wavelength (∝1/λ⁴). \n • Short (blue, ~450 nm) wavelengths scatter about ten times more than long (red, ~700 nm) wavelengths. \n • When the Sun is high, sunlight travels a relatively short path through air, so blue light is scattered in all directions, filling the sky with that color. \n\n2. Long path length at sunrise/sunset and the reds and oranges \n • Near the horizon, sunlight must traverse a much thicker “slice” of atmosphere. \n • Blue and green light suffer so much scattering out of the direct beam that very little remains on the path to your eye. \n • Red and orange (longer wavelengths) are scattered much less, so more of that light reaches you directly, giving the Sun—and the surrounding sky—a warm reddish hue. \n\n3. Role of aerosols and dust \n • Particles larger than molecules (dust, water droplets, pollution) scatter all wavelengths more equally (Mie scattering), often intensifying reds and pinks at dawn and dusk. \n\nIn summary, daytime blue comes from efficient scattering of short‐wavelength light; sunrise/sunset reds and oranges occur because the longer atmospheric path removes most of the blues and greens, leaving the reds to dominate."
4,Write a 6-line rap verse about climate change using internal rhyme.,o4-mini-2025-04-16,"Heat beats streets, sweat sweeps through concrete, retreat’s defeat \nOceans rising, surprising tides decide how we survive \nStorms swarm the norm, warn of harm born to alarm \nGreen dreams gleam in clean streams, redeeming seams of schemes \nSolar scholars holler, collaring rays, powering brighter days \nHands band for the land, stand grand, demand we expand our plan"
5,"Summarize the differences between supervised, unsupervised, and reinforcement learning with examples.",o4-mini-2025-04-16,"Here’s a concise comparison of the three major paradigms in machine learning:\n\n1. Supervised Learning \n • Goal: Learn a mapping from inputs X to outputs Y using labeled examples. \n • Feedback: Direct, per‐example “correct” answers (labels). \n • Common tasks: \n – Classification (e.g. spam vs. non‐spam email) \n – Regression (e.g. predicting house prices) \n • Examples: \n – Image-classification networks trained on photos labeled “cat” or “dog.” \n – A model that predicts tomorrow’s temperature from historical weather data. \n\n2. Unsupervised Learning \n • Goal: Discover hidden structure or patterns in unlabeled data. \n • Feedback: None (no explicit labels). \n • Common tasks: \n – Clustering (e.g. segmenting customers into market‐segments via k-means) \n – Dimensionality reduction (e.g. PCA for feature compression) \n – Anomaly detection (e.g. flagging credit-card fraud) \n • Examples: \n – Grouping similar news articles by topic when you have no topic labels. \n – Reducing the number of features to visualize high-dimensional data. \n\n3. Reinforcement Learning (RL) \n • Goal: Learn a policy that maximizes cumulative reward in an environment. \n • Feedback: Scalar reward signal (often delayed), no direct “correct” action. \n • Common tasks: \n – Control (e.g. robotic arm manipulation) \n – Game playing (e.g. AlphaZero in chess and Go) \n – Resource management (e.g. dynamic pricing, traffic‐signal control) \n • Examples: \n – An agent learning to play Pong by trial and error, receiving +1 for winning a point and –1 for losing one. \n – A self-driving car learning to navigate safely through rewards for staying on the road and penalties for collisions. \n\nKey distinctions at a glance: \n• Data: supervised uses labeled data; unsupervised uses unlabeled; RL interacts with an environment. \n• Feedback: supervised gets exact labels; unsupervised gets none; RL gets only reward signals. \n• Objective: supervised predicts outputs; unsupervised finds structure; RL discovers optimal actions over time."
6,"Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.",gpt-5-2025-08-07,"def is_palindrome(text: str) -> bool:\n """"""\n Return True if text is a palindrome, ignoring punctuation and spaces (case-insensitive).\n """"""\n filtered = [ch.lower() for ch in text if ch.isalnum()]\n return filtered == filtered[::-1]"
7,Generate SQL to find the top 3 highest-paying customers from a 'sales' table.,o4-mini-2025-04-16,"Here’s the simplest version in MySQL (and Postgres) – sum each customer’s spend, sort descending, then LIMIT 3:\n\n```sql\nSELECT\n customer_id,\n SUM(amount) AS total_spent\nFROM sales\nGROUP BY customer_id\nORDER BY total_spent DESC\nLIMIT 3;\n```\n\n————————————————————————————\n\nIf you’re on SQL Server:\n\n```sql\nSELECT TOP 3\n customer_id,\n SUM(amount) AS total_spent\nFROM sales\nGROUP BY customer_id\nORDER BY total_spent DESC;\n```\n\n————————————————————————————\n\nIn Oracle 12c+ (or any DB that supports ANSI FETCH):\n\n```sql\nSELECT\n customer_id,\n SUM(amount) AS total_spent\nFROM sales\nGROUP BY customer_id\nORDER BY total_spent DESC\nFETCH FIRST 3 ROWS ONLY;\n```\n\nIn pre-12c Oracle you can wrap in a subquery:\n\n```sql\nSELECT customer_id, total_spent\nFROM (\n SELECT\n customer_id,\n SUM(amount) AS total_spent\n FROM sales\n GROUP BY customer_id\n ORDER BY total_spent DESC\n)\nWHERE ROWNUM <= 3;\n```\n\n————————————————————————————\n\nIf you need to handle ties or want an analytic-function approach (works in most engines):\n\n```sql\nSELECT customer_id, total_spent\nFROM (\n SELECT\n customer_id,\n SUM(amount) AS total_spent,\n ROW_NUMBER() OVER (ORDER BY SUM(amount) DESC) AS rn\n FROM sales\n GROUP BY customer_id\n) t\nWHERE rn <= 3;\n```"


In the results, the prompts at rows 2 and 6 (the train problem and the palindrome function) exceeded the threshold win rate and were therefore routed to the strong model, gpt-5, while the remaining six were handled by the weaker o4-mini.
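
A quick tally of the "Model Used" column makes the split explicit. The sketch below hard-codes the values from the results table so it runs on its own; with the notebook's `df` in scope, `df["Model Used"].value_counts()` should give the same counts.

```python
from collections import Counter

# "Model Used" values copied from the results table above, in row order.
models_used = [
    "o4-mini-2025-04-16", "o4-mini-2025-04-16", "gpt-5-2025-08-07",
    "o4-mini-2025-04-16", "o4-mini-2025-04-16", "o4-mini-2025-04-16",
    "gpt-5-2025-08-07", "o4-mini-2025-04-16",
]

counts = Counter(models_used)
# Share of prompts answered by the strong model (names starting with "gpt-5").
strong_share = sum(n for m, n in counts.items() if m.startswith("gpt-5")) / len(models_used)

for model, n in counts.most_common():
    print(f"{model}: {n}")
print(f"strong-model share: {strong_share:.0%}")
```

Here 2 of 8 prompts (25%) went to the strong model even though the threshold was calibrated for about 10%; with such a small, hand-picked sample that gap is not surprising, and a larger set of representative prompts would give a better estimate.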