# Router Agent Benchmark
This notebook implements a benchmark that uses the groundtruthing technique to evaluate AI models that function as **Router Agents**.

## Setup LLM/SLM Provider
In this example I'll use LangChain with OpenAI as the LLM provider; switch to your own provider (AWS Bedrock with boto3, direct Groq calls, etc.) as needed.

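To guarantee that the result is only one of the three category words, I'll define an enum and use it in the output parser. (If the model answers with anything else, parsing fails with an `OutputParserException` instead of passing a bad label through. `PydanticOutputParser` does not retry on its own, so automatic retries have to be added explicitly, for example with the chain's `with_retry` method.)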

In [None]:
from enum import Enum
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

class CategoryEnum(str, Enum):
    MATH = "math"
    HISTORY = "history"
    JOKE = "joke"

class CategoryModel(BaseModel):
    category_result: CategoryEnum = Field(description="The agent's category result")

parser = PydanticOutputParser(pydantic_object=CategoryModel)

prompt = ChatPromptTemplate.from_messages([
    ("system", """
        You are a Router Agent specialized in classifying user questions into specific categories.
        Your function is to analyze the user's question and determine which category best applies:
            - **math**: Questions related to mathematical calculations, arithmetic operations, algebra, geometry, statistics, or any problem that requires mathematical reasoning or numerical computation.
            - **history**: Questions about historical events, historical figures, important dates, historical periods, wars, revolutions, or any topic related to the past and history.
            - **joke**: Requests to tell jokes, make humor, or any request related to humorous entertainment, regardless of language.
        
        **Important instructions:**
            1. Carefully analyze the intention and content of the question
            2. Identify the most appropriate category based on the main purpose of the question
            3. Return EXCLUSIVELY the category in the specified format, without additional text or explanations
            4. Be accurate and consistent in classification

        Expected output format:
        {format_instructions}
    """),
    ("human", """
        Classify the following question into one of the available categories (math, history, or joke):
        Question: {input}""")
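    ]).partial(
        # Pre-fill {format_instructions} so callers only need to supply "input".
        format_instructions=parser.get_format_instructions(),
    )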

model = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
)

chain = prompt | model | parser

CategoryModel(category_result=<CategoryEnum.MATH: 'math'>)
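
The enum constraint and the retry behaviour described in this section can be tried offline, without calling any provider. The following is a minimal sketch using only the standard library: `parse_category`, `classify_with_retry`, and `fake_model` are hypothetical names introduced for illustration, and `fake_model` is a stand-in for a real LLM call, not part of LangChain or OpenAI.

```python
from enum import Enum
from typing import Callable, Optional


class CategoryEnum(str, Enum):
    MATH = "math"
    HISTORY = "history"
    JOKE = "joke"


def parse_category(raw: str) -> CategoryEnum:
    """Map raw model text to a CategoryEnum member.

    Lookup by value raises ValueError for anything outside the three labels.
    """
    return CategoryEnum(raw.strip().lower())


def classify_with_retry(
    ask_model: Callable[[str], str],
    question: str,
    max_attempts: int = 3,
) -> CategoryEnum:
    """Call ask_model until its answer parses, up to max_attempts times."""
    last_error: Optional[ValueError] = None
    for _ in range(max_attempts):
        try:
            return parse_category(ask_model(question))
        except ValueError as exc:
            last_error = exc  # invalid label: ask again
    raise RuntimeError(
        f"no valid category after {max_attempts} attempts"
    ) from last_error


# Hypothetical stand-in for an LLM: the first reply is invalid, the second is valid.
replies = iter(["I think it is algebra", " Math "])


def fake_model(question: str) -> str:
    return next(replies)


result = classify_with_retry(fake_model, "What is 2 + 2?")
print(result.value)
```

The explicit loop keeps the retry policy visible and bounded. In the LangChain chain above, a comparable effect would come from wrapping the chain with its retry helper rather than from the parser itself.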

## Prepare Groundtruth Data
Load the groundtruth data from a JSON file into memory.

In [None]:
import json

groundtruth_file_path = "groundtruth.json"

with open(groundtruth_file_path, "r", encoding="utf-8") as file:
    groundtruth_data = json.load(file)
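
Before running a benchmark against the loaded data, it can help to check that every record carries a valid label. The sketch below assumes a file shape that the notebook does not specify: a JSON list of objects with the hypothetical keys `question` and `expected_category`. Adjust the key names to match your own `groundtruth.json`. It writes a small sample file to a temporary directory so that it runs without the real data.

```python
import json
import tempfile
from pathlib import Path

# The three labels the Router Agent is allowed to return.
VALID_CATEGORIES = {"math", "history", "joke"}


def validate_groundtruth(records) -> list:
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    if not isinstance(records, list):
        return ["top-level JSON value is not a list"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: not an object")
            continue
        question = record.get("question")  # assumed key name
        label = record.get("expected_category")  # assumed key name
        if not isinstance(question, str) or not question.strip():
            problems.append(f"record {index}: missing or empty question")
        if not isinstance(label, str) or label not in VALID_CATEGORIES:
            problems.append(f"record {index}: invalid category {label!r}")
    return problems


# Illustrative sample data, not taken from the real groundtruth file.
sample = [
    {"question": "What is 12 * 7?", "expected_category": "math"},
    {"question": "Who was the first Roman emperor?", "expected_category": "history"},
    {"question": "Tell me a joke", "expected_category": "poetry"},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "groundtruth_sample.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    with open(sample_path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

problems = validate_groundtruth(loaded)
print(problems)
```

Catching a mislabeled record at load time keeps a data error from being counted later as a model error.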