# Router Agent Benchmark
This notebook implements a benchmark that uses a ground-truth dataset to evaluate AI models acting as **Router Agents**.

## Setup LLM/SLM Provider
In this example I'll use LangChain with OpenAI as the LLM provider. Switch to your own provider as needed (AWS Bedrock with boto3, direct Groq calls, etc.).

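To guarantee that the result is exactly one of the three category words, I'll define an enum and use it in the output parser. If the model answers with anything else, parsing fails with an error. Note that `PydanticOutputParser` does not retry by itself, so add a retry step (for example `Runnable.with_retry()`) if you want automatic retries.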

In [30]:
from enum import Enum
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

class CategoryEnum(str, Enum):
    MATH = "math"
    HISTORY = "history"
    JOKE = "joke"

class CategoryModel(BaseModel):
    category_result: CategoryEnum = Field(description="The agent's category result")

# TODO: should be a function and receive the model as a parameter
parser = PydanticOutputParser(pydantic_object=CategoryModel)

prompt = ChatPromptTemplate.from_messages([
    ("system", """
        You are a Router Agent specialized in classifying user questions into specific categories.
        Your function is to analyze the user's question and determine which category best applies:
            - **math**: Questions related to mathematical calculations, arithmetic operations, algebra, geometry, statistics, or any problem that requires mathematical reasoning or numerical computation.
            - **history**: Questions about historical events, historical figures, important dates, historical periods, wars, revolutions, or any topic related to the past and history.
            - **joke**: Requests to tell jokes, make humor, or any request related to humorous entertainment, regardless of language.
        
        **Important instructions:**
            1. Carefully analyze the intention and content of the question
            2. Identify the most appropriate category based on the main purpose of the question
            3. Return EXCLUSIVELY the category in the specified format, without additional text or explanations
            4. Be accurate and consistent in classification

        Expected output format:
        {format_instructions}
    """),
    ("human", """
        Classify the following question into one of the available categories (math, history, or joke):
        Question: {input}""")
    ])

def call_prompt(model_name: str, input: str):
    model = ChatOpenAI(
        model=model_name,
        temperature=0,
    )
    
    chain = prompt | model | parser
    
    return chain.invoke({
        "input": input,
        "format_instructions": parser.get_format_instructions()
    })
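
If you want explicit control over what happens when a model answers outside the three categories, a retry loop around the enum validation could look like the sketch below. It is provider-free and only illustrates the idea: `classify_with_retry`, `fake_model`, and `max_attempts` are hypothetical names, and the enum is redeclared so the block runs on its own.

```python
from enum import Enum
from typing import Callable


class CategoryEnum(str, Enum):
    MATH = "math"
    HISTORY = "history"
    JOKE = "joke"


def classify_with_retry(
    ask: Callable[[str], str], question: str, max_attempts: int = 3
) -> CategoryEnum:
    """Call `ask` until it returns a valid category or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        raw = ask(question).strip().lower()
        try:
            return CategoryEnum(raw)  # raises ValueError for any other word
        except ValueError as error:
            last_error = error
    raise ValueError(f"No valid category after {max_attempts} attempts") from last_error


# Hypothetical stand-in for a model: invalid on the first call, valid on the second.
answers = iter(["I think it is sports", "Math"])


def fake_model(question: str) -> str:
    return next(answers)


result = classify_with_retry(fake_model, "What is 2 + 2?")
print(result.value)
```

In a real chain the `ask` callable would wrap the model call, and the retry budget should stay small, since every retry adds to the measured latency.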

## Prepare Groundtruth Data
Load the ground-truth data from a JSON file into memory.

In [31]:
import json 

groundtruth_file_path = "groundtruth.json"

with open(groundtruth_file_path, "r", encoding="utf-8") as file:
    groundtruth_data = json.load(file)
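
The benchmark loop reads `case["input"]` and `case["expected_output"]` from each entry, so the file is presumably a JSON array of objects with those two keys. The sketch below shows that assumed shape together with a small sanity check. The sample entries and the `validate_groundtruth` helper are hypothetical and are not taken from the actual `groundtruth.json`.

```python
import json

ALLOWED_CATEGORIES = {"math", "history", "joke"}

# Hypothetical sample in the assumed file layout.
sample_json = """
[
  {"input": "What is 12 times 8?", "expected_output": "math"},
  {"input": "Who was the first Roman emperor?", "expected_output": "history"},
  {"input": "Tell me a joke about cats", "expected_output": "joke"}
]
"""


def validate_groundtruth(cases: list) -> list:
    """Return a list of readable problems; an empty list means the data looks usable."""
    problems = []
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {index}: not an object")
            continue
        # Both keys must be present and hold non-empty strings.
        for key in ("input", "expected_output"):
            if not isinstance(case.get(key), str) or not case[key].strip():
                problems.append(f"case {index}: missing or empty '{key}'")
        # The expected label must be one of the categories the router can return.
        expected = case.get("expected_output")
        if isinstance(expected, str) and expected.strip() and expected not in ALLOWED_CATEGORIES:
            problems.append(f"case {index}: unknown category '{expected}'")
    return problems


sample_cases = json.loads(sample_json)
problems = validate_groundtruth(sample_cases)
print(problems)
```

Running a check like this before the benchmark avoids counting a mislabeled case as a model failure.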

## Define Models to Test

Define the list of models to test. The names must match the identifiers used by your provider (OpenAI, in my case).

In [32]:
models_to_test = [
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-5.2",
    "gpt-5-mini",
    "gpt-5-nano"
]

## Call Models and Collect Metrics
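
For reference, the metrics collected in this section reduce to a mean over boolean matches and percentiles over per-call latencies. The sketch below reproduces them in plain Python on made-up numbers, using the same linear interpolation between ranks that `np.percentile` applies by default. The `percentile` helper and the sample values are hypothetical and only serve to show the arithmetic.

```python
import math


def percentile(values: list, q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between the closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100  # fractional position in the sorted list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up measurements for five calls (seconds) and their match flags.
latencies = [0.5, 0.1, 0.4, 0.2, 0.3]
is_matches = [True, True, False, True, True]

success_rate = sum(is_matches) / len(is_matches) * 100
p50 = percentile(latencies, 50)
p90 = percentile(latencies, 90)
average = sum(latencies) / len(latencies)

print(
    f"success={success_rate:.2f}% "
    f"p50={p50 * 1000:.0f} ms p90={p90 * 1000:.0f} ms avg={average * 1000:.0f} ms"
)
```

With only a few dozen cases per model, the upper percentiles are driven by a handful of slow calls, so P90 and P95 are best read as rough indicators.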

In [33]:
from tqdm import tqdm
import pandas as pd
import time
import numpy as np

class ModelResult(BaseModel):
    model_name: str
    success_rate: float
    latency_p50: float
    latency_p90: float
    latency_p95: float
    latency_average: float

model_results = []
all_dataframes = []

for model in models_to_test:
    print(f"Testing {model} model")
    inputs = []
    expected_outputs = []
    model_outputs = []
    is_matches = []
    latencies = []

    
    for case in tqdm(groundtruth_data):
        inputs.append(case["input"])
        expected_outputs.append(case["expected_output"])

        # perf_counter is monotonic, so it is better suited to timing than time.time().
        # The measurement also includes building the client inside call_prompt.
        start = time.perf_counter()
        result = call_prompt(model, case["input"])
        latencies.append(time.perf_counter() - start)

        model_output_str = str(result.category_result.value)
        model_outputs.append(model_output_str)
        is_matches.append(model_output_str == case["expected_output"])

    df = pd.DataFrame({
        "Input": inputs,
        "Expected output": expected_outputs,
        "Model output": model_outputs,
        "Is correct": is_matches
    })
    all_dataframes.append(df)  # keep per-model details for later inspection

    model_results.append(
        ModelResult(
            model_name=model,
            success_rate=df['Is correct'].mean() * 100,
            latency_p50=np.percentile(latencies, 50),
            latency_p90=np.percentile(latencies, 90),
            latency_p95=np.percentile(latencies, 95),
            latency_average=sum(latencies) / len(latencies)
        )
    )

print("\n" + "="*70)
print(" " * 20 + "BENCHMARK RESULTS")
print("="*70 + "\n")

for model in model_results:
    print(f"┌{'─'*68}┐")
    print(f"│ Model: {model.model_name:<59} │")
    print(f"├{'─'*68}┤")
    
    success_str = f"Success Rate:    {model.success_rate:>6.2f}%"
    avg_str = f"Latency Average: {model.latency_average * 1000:>8.2f} ms"
    p50_str = f"Latency P50:     {model.latency_p50 * 1000:>8.2f} ms"
    p90_str = f"Latency P90:     {model.latency_p90 * 1000:>8.2f} ms"
    p95_str = f"Latency P95:     {model.latency_p95 * 1000:>8.2f} ms"
    
    print(f"│ {success_str:<66} │")
    print(f"│ {avg_str:<66} │")
    print(f"│ {p50_str:<66} │")
    print(f"│ {p90_str:<66} │")
    print(f"│ {p95_str:<66} │")
    print(f"└{'─'*68}┘\n")

Testing gpt-4o model
100%|██████████| 69/69 [00:49<00:00,  1.40it/s]
Testing gpt-4o-mini model
100%|██████████| 69/69 [00:42<00:00,  1.61it/s]
Testing gpt-5.2 model
100%|██████████| 69/69 [01:07<00:00,  1.02it/s]
Testing gpt-5-mini model
100%|██████████| 69/69 [01:59<00:00,  1.73s/it]
Testing gpt-5-nano model
100%|██████████| 69/69 [03:31<00:00,  3.07s/it]

                    BENCHMARK RESULTS

┌────────────────────────────────────────────────────────────────────┐
│ Model: gpt-4o                                                      │
├────────────────────────────────────────────────────────────────────┤
│ Success Rate:    100.00%                                           │
│ Latency Average:   710.78 ms                                       │
│ Latency P50:       632.00 ms                                       │
│ Latency P90:       999.47 ms                                       │
│ Latency P95:      1076.29 ms                                       │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│ Model: gpt-4o-mini                                                 │
├────────────────────────────────────────────────────────────────────┤
│ Success Rate:    100.00%                                           │
│ Latency Average:   619.55 ms      


