# Evaluating Agents' Intermediate Steps


### 1. Theory: Beyond the Final Answer - Evaluating an Agent's Process

An **LLM Agent** is a system that uses a Large Language Model as its reasoning engine to make decisions and take actions. It's not just a simple input-to-output model; it can use a suite of tools (like search engines, calculators, or APIs) to gather information and solve complex problems. 

Because of this, evaluating an agent solely on its final answer is often insufficient. A correct answer might be a lucky guess, or it might have been reached through an inefficient or costly sequence of actions. To build robust and reliable agents, we must also evaluate their **intermediate steps**, or their "trajectory." This means assessing the agent's decision-making process: *Did it choose the right tool for the job? Did it use the tools in a logical order? Was it efficient?*

[![Example Agent Trace](./img/agent_trace.png)](https://smith.langchain.com/public/6d50f517-115f-4c14-97b2-2e19b15efca7/r)

This notebook provides a walkthrough on how to create a **custom run evaluator** in LangSmith to assess an agent based on its sequence of tool calls. We will compare the agent's actual actions against a predefined, expected sequence to ensure it is behaving as intended.

The process involves these key steps:

- **Prepare a specialized dataset**: The dataset will contain not just input questions and reference answers, but also the expected sequence of tool calls.
- **Define the agent**: We will build an agent with a specific set of tools.
- **Construct a custom evaluator**: We will write a Python function to compare the agent's actual tool trajectory with the expected one.
- **Run the evaluation**: We will use LangSmith to run the agent against the dataset and apply our custom evaluator.

By the end of this guide, you'll have a clear understanding of how to evaluate the internal workings of your agents, leading to more predictable and reliable applications.

### 2. Prerequisites and Setup

First, we'll install the necessary Python packages for this tutorial.

In [None]:
# The '%pip install' magic installs Python packages from within the notebook.
# The -U flag upgrades each listed package to its latest version.
# %pip install -U langchain langchain-openai langchain-community langsmith duckduckgo-search python-dotenv python-dateutil

Next, we configure our environment variables. This is a secure way to provide API keys to our application.

- **`LANGCHAIN_API_KEY`**: Your secret key for authenticating with LangSmith.
- **`OPENAI_API_KEY`**: Your secret key for the OpenAI API, required for the agent's LLM.
- **`LANGCHAIN_ENDPOINT`**: This URL directs all LangChain tracing data to the LangSmith platform.

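**Action Required**: The cell below reads `LANGSMITH_API_KEY` and `OPENAI_API_KEY` from a `.env` file in the working directory, so add your actual keys to that file before running it.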

In [1]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"  # Set the LangSmith API endpoint.
# The .env file stores the LangSmith key as LANGSMITH_API_KEY; copy it to LANGCHAIN_API_KEY.
# Note: os.getenv returns None for a missing variable, and assigning None to os.environ raises a TypeError.
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY")  # Set your LangSmith API key.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")  # Set your OpenAI API key.
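
Because a missing key only surfaces as a `TypeError` or a failed API call, it can help to check the environment explicitly before continuing. The following is a minimal sketch; `find_missing_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration, and only the variable names come from this notebook.

```python
import os

# Variable names used in this notebook; the helper below is illustrative only.
REQUIRED_VARS = ("LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY", "OPENAI_API_KEY")


def find_missing_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = find_missing_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.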

## Step 1: Prepare the Dataset

To evaluate an agent's trajectory, our dataset needs a special structure. In addition to the input question (stored under the `question` key) and a `reference` (the ground-truth final answer), each example includes a new field: **`expected_steps`**. This field contains a list of tool names in the precise order we expect the agent to call them.

This `expected_steps` list will serve as the ground truth for our custom trajectory evaluator. Note that for simple queries that don't require a tool, we can provide an empty list `[]`.
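
Since the evaluator compares tool names by exact string equality, a misspelled name in `expected_steps` would silently score every run as 0. A small pre-upload check can catch this. The sketch below is an assumption-laden illustration: `validate_outputs` and `KNOWN_TOOLS` are hypothetical helpers, and the tool names are the ones the agent in Step 2 exposes.

```python
# Tool names the agent in this notebook exposes (see Step 2).
KNOWN_TOOLS = {"duck_duck_go", "check_calendar"}


def validate_outputs(outputs: dict, known_tools: set = KNOWN_TOOLS) -> list:
    """Return a list of problems found in one example's outputs (empty if none)."""
    problems = []
    if not isinstance(outputs.get("reference"), str):
        problems.append("'reference' must be a string")
    steps = outputs.get("expected_steps")
    if not isinstance(steps, list):
        problems.append("'expected_steps' must be a list")
    else:
        # Flag any tool name the agent would never be able to call.
        unknown = [step for step in steps if step not in known_tools]
        if unknown:
            problems.append(f"unknown tool names: {unknown}")
    return problems


# An empty list of steps is valid: it means "answer directly, no tools".
print(validate_outputs({"reference": "Hello!", "expected_steps": []}))
```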

In [2]:
import uuid # Import the uuid library to generate unique identifiers.

from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.

# Define the list of questions and their corresponding outputs.
questions = [
    (
        "Why was a $10 calculator app one of the best-rated Nintendo Switch games?",
        {
            "reference": "It became an internet meme due to its high price point.", # The ground-truth final answer.
            "expected_steps": ["duck_duck_go"], # The expected sequence of tool calls.
        },
    ),
    (
        "hi",
        {
            "reference": "Hello, how can I assist you?", # The expected direct response.
            "expected_steps": [],  # Expect a direct response with no tools used.
        },
    ),
    (
        "Who is Dejan Trajkov?",
        {
            "reference": "Macedonian Professor, Immunologist and Physician",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "Who won the 2023 U23 world wrestling champs (men's freestyle 92 kg)",
        {
            "reference": "Muhammed Gimri from Turkey",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "What's my first meeting on Friday?",
        {
            "reference": 'Your first meeting is 8:30 AM for "Team Standup"',
            "expected_steps": ["check_calendar"],  # Only expect the calendar tool to be used.
        },
    ),
]

uid = uuid.uuid4() # Generate a new unique identifier.
dataset_name = f"Agent Eval Example {uid}" # Create a unique name for the dataset.
# Create the dataset on the LangSmith platform.
ds = client.create_dataset(
    dataset_name=dataset_name,
    description="An example agent evals dataset using search and calendar checks.",
)
# Create the examples in the dataset.
client.create_examples(
    inputs=[{"question": q[0]} for q in questions], # The inputs are a list of question dictionaries.
    outputs=[q[1] for q in questions], # The outputs are a list of the corresponding output dictionaries.
    dataset_id=ds.id, # Link these examples to the dataset we just created.
)

{'example_ids': ['7c0955df-98da-4517-ad77-713f0638c238',
  '21a5b2e1-ce2f-4f3f-a9ca-ce995ba134a0',
  'bc82f54c-58ac-49a2-96f2-7a6adee245ba',
  'f835fcfe-0547-43dd-b48b-8f383ec3ff5b',
  'b44ec761-91a7-412b-be2c-622d3377f76a'],
 'count': 5}

## Step 2: Define the Agent

An agent is composed of several key parts:
- **LLM**: The core reasoning engine that decides which action to take.
- **Tools**: The functions the agent can call to interact with the outside world (e.g., search, database lookups).
- **Prompt**: A template that instructs the LLM on its role, provides it with the available tools, and formats the conversation history.
- **Agent Executor**: The runtime that orchestrates the agent's loop. It calls the agent, executes the chosen tool, passes the result back to the agent, and repeats until a final answer is produced.
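
To make the executor's loop concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: `Action`, `Finish`, `run_loop`, and `scripted_policy` are hypothetical stand-ins, and the "LLM" is a scripted function. It only illustrates how `(action, observation)` pairs accumulate into the trajectory that we evaluate later.

```python
from typing import NamedTuple


class Action(NamedTuple):
    """A decision to call a tool (stand-in for an agent action)."""
    tool: str
    tool_input: str


class Finish(NamedTuple):
    """A decision to stop and return a final answer."""
    output: str


def run_loop(policy, tools, question, max_steps=5):
    """Ask the policy for a decision, run the chosen tool, repeat until it finishes."""
    steps = []  # Accumulates (action, observation) tuples: the trajectory.
    for _ in range(max_steps):
        decision = policy(question, steps)
        if isinstance(decision, Finish):
            return {"output": decision.output, "intermediate_steps": steps}
        observation = tools[decision.tool](decision.tool_input)
        steps.append((decision, observation))
    return {"output": "Stopped: step limit reached.", "intermediate_steps": steps}


def scripted_policy(question, steps):
    """Toy replacement for the LLM: call the calendar once, then answer."""
    if not steps:
        return Action("check_calendar", "2024-01-05")
    return Finish(f"Your first meeting is {steps[-1][1][0]}")


toy_tools = {"check_calendar": lambda date: ["8:30 : Team Standup"]}
result = run_loop(scripted_policy, toy_tools, "What's my first meeting on Friday?")
print([action.tool for action, _ in result["intermediate_steps"]])
```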

In this example, we will create an agent that has access to two tools:
1. A `DuckDuckGoSearchResults` tool for general web searches.
2. A custom `check_calendar` tool, which we create using the `@tool` decorator, to simulate checking a user's calendar.

In [3]:
from dateutil.parser import parse # A utility to parse date strings into datetime objects.
from langchain.agents import AgentExecutor, create_openai_tools_agent # Import core agent components.
from langchain_openai import ChatOpenAI # The OpenAI chat model wrapper.
from langchain_community.tools import DuckDuckGoSearchResults # The DuckDuckGo search tool.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder # Prompting utilities.
from langchain_core.tools import tool # The decorator for creating custom tools.


# The '@tool' decorator easily turns a Python function into a LangChain tool.
@tool
def check_calendar(date: str) -> list:
    """Check the user's calendar for meetings on the specified datetime (in ISO format).""" # The docstring is used as the tool's description for the agent.
    date_time = parse(date) # Parse the input date string.
    # This is a mock implementation to demonstrate the concept.
    if date_time.weekday() == 4: # 4 corresponds to Friday.
        return [
            "8:30 : Team Standup",
            "9:00 : 1 on 1",
            "9:45 : Design review",
        ]
    return ["Focus time"] # Return a default for other days.


# Define the main function that creates and runs our agent.
def agent(inputs: dict):
    # Initialize the LLM. We use a model that's good at function calling.
    llm = ChatOpenAI(
        model="gpt-3.5-turbo-16k",
        temperature=0, # Set temperature to 0 for more deterministic, repeatable outputs.
    )
    # Define the list of tools the agent has access to.
    tools = [
        DuckDuckGoSearchResults(
            name="duck_duck_go" # Give the tool a specific name.
        ),
        check_calendar, # Our custom calendar tool.
    ]
    # Define the prompt template for the agent.
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful assistant."),
            MessagesPlaceholder(variable_name="agent_scratchpad"), # Placeholder for intermediate steps.
            ("user", "{question}"), # Placeholder for the user's input question.
        ]
    )
    # Create the runnable agent component.
    runnable_agent = create_openai_tools_agent(llm, tools, prompt)

    # Create the Agent Executor, which orchestrates the agent's runs.
    executor = AgentExecutor(
        agent=runnable_agent,
        tools=tools,
        handle_parsing_errors=True, # Gracefully handle any parsing errors.
        return_intermediate_steps=True, # CRITICAL: This must be True to get the trajectory for evaluation.
    )
    # Invoke the executor with the inputs.
    return executor.invoke(inputs)

## Step 3: Define the Custom Evaluator

Now we'll create our custom evaluator. This is a simple Python function that will receive the agent's run trace (`Run` object) and the corresponding dataset entry (`Example` object).

The logic of our `intermediate_step_correctness` function is straightforward:
1.  Extract the list of intermediate steps from the agent's final output. Because we set `return_intermediate_steps=True`, the output will contain an `"intermediate_steps"` key.
2.  Process this list to get only the names of the tools that were called, creating the agent's actual trajectory.
3.  Retrieve the `"expected_steps"` list from the example's outputs.
4.  Compare the actual trajectory to the expected trajectory.
5.  Return a dictionary with a unique `key` for the evaluation metric and a `score` (1 for a perfect match, 0 otherwise).

In [4]:
from typing import Optional # Import typing hints.

from langsmith.schemas import Example, Run # Import the Run and Example schemas from LangSmith.


# Define the custom evaluator function.
def intermediate_step_correctness(run: Run, example: Optional[Example] = None) -> dict:
    if run.outputs is None: # A safety check to ensure the run has completed and has outputs.
        raise ValueError("Run outputs cannot be None")
    # Get the 'intermediate_steps' from the agent's output, defaulting to an empty list if not found.
    intermediate_steps = run.outputs.get("intermediate_steps") or []
    # The intermediate_steps list contains tuples of (AgentAction, observation).
    # We only care about the action's 'tool' attribute.
    # This list comprehension extracts the tool name for each step.
    trajectory = [action.tool for action, _ in intermediate_steps]
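    # Retrieve the ground-truth trajectory from our dataset example.
    if example is None or example.outputs is None: # The comparison needs a dataset example.
        raise ValueError("An example with 'expected_steps' is required")
    expected_trajectory = example.outputs["expected_steps"]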
    # Perform a simple equality check between the actual and expected trajectories.
    score = int(trajectory == expected_trajectory)
    # Return the result in the format required by LangSmith.
    return {"key": "Intermediate steps correctness", "score": score}
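
The comparison logic can be sanity-checked offline, without LangSmith or any API calls. The sketch below re-implements only the core comparison as a hypothetical `trajectory_score` function and uses `SimpleNamespace` objects as stand-ins for agent actions, on the assumption that all the evaluator needs from an action is its `tool` attribute.

```python
from types import SimpleNamespace


def trajectory_score(outputs: dict, expected_steps: list) -> int:
    """Return 1 if the tools called match `expected_steps` exactly and in order, else 0."""
    steps = outputs.get("intermediate_steps") or []
    trajectory = [action.tool for action, _ in steps]
    return int(trajectory == expected_steps)


def fake_step(tool_name: str, observation: str = "stub observation"):
    """Build an (action, observation) tuple shaped like an intermediate step."""
    return (SimpleNamespace(tool=tool_name), observation)


search_run = {"output": "stub answer", "intermediate_steps": [fake_step("duck_duck_go")]}
direct_run = {"output": "Hello!", "intermediate_steps": []}

print(trajectory_score(search_run, ["duck_duck_go"]))    # Matching trajectory.
print(trajectory_score(search_run, ["check_calendar"]))  # Wrong tool.
print(trajectory_score(direct_run, []))                  # No tools expected, none used.
```

Note that exact matching is strict: a run that calls the right tool twice scores 0 against an expected single call.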

## Step 4: Run the Evaluation

Now we can put everything together and run the evaluation. We will use two evaluators:
1.  Our custom `intermediate_step_correctness` function to check the tool trajectory.
2.  A standard `"qa"` evaluator to check if the agent's final answer is correct based on the `reference` label.

Because our dataset's `outputs` field is a dictionary with multiple keys (`reference` and `expected_steps`), the standard `qa` evaluator doesn't know which key to use as the ground-truth answer. We must provide a `prepare_data` helper function to explicitly map the `prediction` to the agent's final `output` and the `reference` to the `reference` key in our example's outputs.

In [6]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate # Import the evaluation functions.


# Define a data preparation function for the standard QA evaluator.
def prepare_data(run: Run, example: Example) -> dict:
    # This function creates the specific dictionary that the 'qa' evaluator expects.
    return {
        "input": example.inputs["question"], # The original question.
        "prediction": run.outputs["output"], # The agent's final answer.
        "reference": example.outputs["reference"], # The ground-truth final answer.
    }


# Create an instance of the standard QA evaluator, passing our data preparation function.
qa_evaluator = LangChainStringEvaluator("qa", prepare_data=prepare_data)

# Run the full evaluation.
chain_results = evaluate(
    agent, # The agent function to be tested.
    data=dataset_name, # The name of our dataset in LangSmith.
    # A list containing both our custom evaluator and the standard QA evaluator.
    evaluators=[intermediate_step_correctness, qa_evaluator],
    experiment_prefix="Agent Eval Example", # A prefix for the experiment name in LangSmith.
    max_concurrency=1, # Run sequentially as some agents/tools may not be thread-safe.
)

View the evaluation results for experiment: 'Agent Eval Example-2a07d2eb' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/e79d4c23-37d0-43b9-abd2-b913e3bbbc94/compare?selectedSessions=7aefc064-da35-4b32-a967-4e583e5df9bc




5it [00:10,  2.14s/it]


## Conclusion

Congratulations! You have successfully configured and run a comprehensive evaluation of an LLM agent. You've not only checked its final answers for correctness but have also validated its internal decision-making process by comparing its tool usage against an expected trajectory.

This technique of evaluating intermediate steps is crucial for building agents that are not only accurate but also efficient, predictable, and aligned with your intended logic.

The simple, direct comparison evaluator we built is a great starting point. For more complex scenarios where multiple tool sequences could be valid, you can extend this concept by using an LLM-powered evaluator to grade the agent's trajectory. LangChain's [TrajectoryEvalChain](https://python.langchain.com/docs/guides/productionization/evaluation/trajectory/trajectory_eval#evaluate-trajectory) provides an excellent off-the-shelf solution for this more advanced use case.
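
Short of an LLM-powered grader, a middle ground is to allow several accepted trajectories and award partial credit for near misses. The following is one possible sketch using the standard library's `difflib.SequenceMatcher`. The function names and the idea of storing multiple accepted sequences per example are assumptions for illustration, not part of the dataset built above.

```python
from difflib import SequenceMatcher


def trajectory_similarity(actual: list, expected: list) -> float:
    """Return a similarity in [0, 1] between two tool-name sequences."""
    if not actual and not expected:
        return 1.0  # No tools expected and none used counts as a perfect match.
    return SequenceMatcher(None, actual, expected).ratio()


def best_match_score(actual: list, accepted: list) -> float:
    """Score `actual` against several accepted trajectories and keep the best."""
    return max(trajectory_similarity(actual, expected) for expected in accepted)


# A redundant second search earns partial credit instead of a flat 0.
print(trajectory_similarity(["duck_duck_go", "duck_duck_go"], ["duck_duck_go"]))

# Either tool is acceptable here, so using the calendar is a full match.
print(best_match_score(["check_calendar"], [["duck_duck_go"], ["check_calendar"]]))
```

A float score like this could be returned from a custom evaluator in place of the 0/1 value, at the cost of a metric that is harder to interpret at a glance.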