# Tool Selection Evaluation


### 1. Theory: The Importance of Tool Selection

Modern LLM-powered agents gain their power by interacting with the outside world through a set of tools (e.g., APIs, functions, databases). **Tool selection** is the critical process where the agent, based on a user's query, decides which tool (or tools) is most appropriate to use to generate a response. The quality of this selection is paramount to the agent's effectiveness.

The agent's ability to choose the right tool depends almost entirely on the quality of the tool's **name and description**. A vague or misleading description will cause the agent to make mistakes, leading to incorrect actions, wasted API calls, and poor user experiences.

This notebook provides a complete walkthrough of how to:
1.  **Evaluate** an agent's tool selection capability using a custom precision metric.
2.  **Automatically improve** the tool descriptions by using another LLM to analyze the failure cases.
3.  **Re-evaluate** to see if the improvements had a positive effect.
4.  **Validate** the results on a held-out test set to ensure the improvements are generalizable.

We will use a subset of the [ToolBench](https://github.com/OpenBMB/ToolBench/tree/master) dataset, which is specifically designed for this kind of task.

### 2. Prerequisites and Setup

First, we'll install the necessary Python packages for this tutorial.

In [None]:
# # The '%pip install' command installs python packages from the notebook.
# # The -U flag ensures we get the latest versions of the langchain and openai libraries.
# %pip install -U langchain langchain_openai

Next, we configure our environment variables. Reading API keys from the environment (or from a `.env` file, as in the second cell below) keeps them out of the notebook's source.

- **`LANGCHAIN_API_KEY`**: Your secret key for authenticating with LangSmith.
- **`LANGCHAIN_ENDPOINT`**: This URL directs all LangChain tracing data to the LangSmith platform.
- **`LANGCHAIN_PROJECT`**: (Optional) This sets the project name in LangSmith, helping to organize your runs. If not set, it defaults to `"default"`.

**Action Required**: You must replace `"YOUR API KEY"` with your actual key.

In [1]:
import os # Import the 'os' module to interact with the operating system.

# Update with your API URL if using a hosted instance of LangSmith.
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key.
# Optional: Set a project name in LangSmith to group related runs.
os.environ["LANGCHAIN_PROJECT"] = "Tool Selection"

In [2]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



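# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.
# Fail early with a clear message if a key is missing. Assigning the None returned by
# os.getenv() to os.environ would otherwise raise an unhelpful TypeError.
for required in ("LANGSMITH_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(required):
        raise RuntimeError(f"{required} is not set; add it to your .env file.")
# Copy the LangSmith key to the variable name that LangChain reads.
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]
# OPENAI_API_KEY is already in the environment after load_dotenv, so no copy is needed.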

### 3. Prepare the Dataset

We will use a public dataset hosted on LangSmith. This dataset contains examples where the input is a user query, and the output is the list of tools that are expected to be called. We will clone this dataset into our own LangSmith account so we can run evaluations against it.

In [3]:
# The URL of the public LangSmith dataset we will use for development and initial evaluation.
dev_dataset_url = (
    "https://smith.langchain.com/public/bdf7611c-3420-4c71-a492-42715a32d61e/d"
)
# A descriptive name for our local copy of the dataset.
dataset_name = "Tool Selection (Logistics) dev"

In [4]:
import langsmith # Import the langsmith library.

client = langsmith.Client() # Instantiate the LangSmith client.

# Use the client to clone the public dataset into your own LangSmith workspace.
client.clone_public_dataset(dev_dataset_url)

Dataset(name='Tool Selection (Logistics) dev', description=None, data_type=<DataType.kv: 'kv'>, id=UUID('63c7d3fa-cffb-4a3e-832c-e5ba400b0792'), created_at=datetime.datetime(2025, 8, 5, 15, 16, 54, 370560, tzinfo=datetime.timezone.utc), modified_at=datetime.datetime(2025, 8, 5, 15, 16, 54, 370560, tzinfo=datetime.timezone.utc), example_count=0, session_count=0, last_session_start_time=None, inputs_schema=None, outputs_schema=None, transformations=None)

### 4. Define Metrics

To measure how well the agent selects tools, we need a quantitative metric. We will calculate the **precision** of the selected tools. In this context, precision answers the question: *"Of all the tools the agent chose, what fraction were correct?"*

We will implement this as a custom evaluator in LangSmith. The logic is as follows:

1. Get the set of predicted tools and the set of expected tools.
2. Find the intersection of these two sets, which gives us the "True Positives".
3. Calculate Precision = (Number of True Positives) / (Total Number of Predicted Tools).

This gives us a score between 0 and 1 for each run, which we can then average across the dataset.

In [5]:
from typing import Set # Import the Set type hint.

from langchain.smith import RunEvalConfig # Import the evaluation configuration class.
from langsmith.evaluation import run_evaluator # Import the decorator for creating custom evaluators.


# The '@run_evaluator' decorator registers this function as a custom evaluator in LangSmith.
@run_evaluator
def selected_tools_precision(run, example):
    # Get the expected tool calls from the dataset example.
    expected = example.outputs["expected"]
    # Get the predicted tool calls from the agent's run output.
    predicted = run.outputs["output"]
    # The expected format is a list of lists; flatten it and convert to a set for efficient operations.
    expected: Set[str] = {tool for tools in expected for tool in tools}
    # The predicted format is a list of dictionaries; extract the tool name ('type') and convert to a set.
    predicted: Set[str] = {tool["type"] for tool in predicted}
    # Find the common tools between predicted and expected sets (the intersection).
    true_positives = predicted & expected

    # Handle the edge case where no tools were predicted.
    if len(predicted) == 0:
        if len(expected) > 0:
            score = 0 # It should have predicted tools but didn't.
        else:
            score = 1 # It correctly predicted no tools.
    else:
        # Calculate precision: the ratio of correct predictions to total predictions.
        score = len(true_positives) / len(predicted)
    # Return the score in the format LangSmith requires.
    return {"key": "tool_selection_precision", "score": score}


# Create an evaluation configuration object that includes our custom evaluator.
eval_config = RunEvalConfig(
    custom_evaluators=[selected_tools_precision],
)
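
To sanity-check the precision logic without any LangSmith objects, the same calculation can be sketched as a plain function. This is a minimal illustration, not part of the evaluator above: the helper name `tool_precision` is hypothetical, and the tool names in the examples are used purely for illustration.

```python
from typing import Iterable, Set


def tool_precision(predicted: Iterable[str], expected: Iterable[Iterable[str]]) -> float:
    """Precision of predicted tool names against a list-of-lists of expected names."""
    predicted_set: Set[str] = set(predicted)
    # Flatten the list-of-lists of expected tools into a single set.
    expected_set: Set[str] = {tool for group in expected for tool in group}
    if not predicted_set:
        # Nothing predicted: correct only if nothing was expected either.
        return 1.0 if not expected_set else 0.0
    return len(predicted_set & expected_set) / len(predicted_set)


# Hypothetical examples, for illustration only.
print(tool_precision(["CEPBrazil"], [["CEPBrazil"]]))                       # 1.0
print(tool_precision(["CEPBrazil", "TurkeyPostalCodes"], [["CEPBrazil"]]))  # 0.5
print(tool_precision([], [["CEPBrazil"]]))                                  # 0.0
print(tool_precision([], [[]]))                                             # 1.0
```

Note that precision does not penalize missed tools: an agent that calls only one of two expected tools still scores 1.0.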

### 5. Create the Function-Calling Model (V1)

We will now create the agent, which in this case is a simple function-calling chain. The tools are defined in an external JSON file, which we will load. We then use LangChain's `.bind_tools()` method to make the LLM aware of the available tools and their schemas. This is a streamlined way to create a tool-using agent.

In [7]:
import json # Import the json library for file handling.

# Open the JSON file containing the tool definitions.
with open("./data/tools.json") as f:
    # Load the JSON content into a Python list.
    tools = json.load(f)

# Display the first tool as an example of the structure.
tools[0]

{'type': 'function',
 'function': {'name': 'TransportistasdeArgentina',
  'description': 'Quote for postcode in OCA e-Pack.',
  'parameters': {'type': 'object',
   'properties': {'postCodeDst': {'type': 'number',
     'description': 'Postcode Destination'},
    'cuit': {'type': 'string',
     'description': 'CUIT of your account in OCA e-Pack'},
    'operativa': {'type': 'string',
     'description': 'Operativa number of your account in OCA e-Pack'},
    'cost': {'type': 'number', 'description': 'Cost of products in ARS'},
    'postCodeSrc': {'type': 'number', 'description': 'Postcode Source'},
    'volume': {'type': 'number', 'description': 'Volume in cm3'},
    'weight': {'type': 'number', 'description': 'Weight in KG'}},
   'required': ['postCodeDst',
    'cuit',
    'operativa',
    'cost',
    'postCodeSrc',
    'volume',
    'weight']}}}

In [None]:
from langchain_core.output_parsers.openai_tools import JsonOutputToolsParser # Parser for the tool-calling output.
from langchain_core.prompts import ChatPromptTemplate # Prompt templating utility.
from langchain_openai import ChatOpenAI # OpenAI chat model wrapper.

model = "gpt-3.5-turbo" # Specify the model to use.

# Define a simple prompt template for the assistant.
assistant_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant. Respond to the user's query using the provided tools",
        ),
        ("user", "{query}"),
    ]
)


# Initialize the LLM and bind the tools to it. This tells the LLM which functions it can call.
llm = ChatOpenAI(model=model).bind_tools(tools)

# Create the final chain using LangChain Expression Language (LCEL).
chain = assistant_prompt | llm | JsonOutputToolsParser()

### 6. Evaluate (V1)

Now we run our first evaluation, testing our initial chain against the development dataset. LangSmith will orchestrate the process, and our custom `selected_tools_precision` evaluator will score each run.

In [29]:
# Run the evaluation on the dataset.
test_results = client.run_on_dataset(
    dataset_name=dataset_name, # The name of our development dataset.
    llm_or_chain_factory=chain, # The agent chain to be tested.
    evaluation=eval_config, # The evaluation configuration with our custom evaluator.
    verbose=True, # Print progress and links to the results.
    project_metadata={
        "model": model, # Tag the run with the model name.
        "tool_variant": 0, # Tag the run with our tool description version.
    },
)

View the evaluation results for project 'clear-jet-37' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/462d8386-60c8-4cb3-84eb-6efeae3a1293/compare?selectedSessions=8b95a94e-c05f-4ecf-b749-aeaef3ff3327

View all tests for Dataset Tool Selection (Logistics) dev at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/462d8386-60c8-4cb3-84eb-6efeae3a1293
[------------------------------------------------->] 100/100

| statistic | feedback.tool_selection_precision | error | execution_time | run_id |
|---|---|---|---|---|
| count | 100.0 | 0.0 | 100.0 | 100 |
| unique | | 0.0 | | 100 |
| top | | | | 827e2f98-bcb1-4940-aa16-5a7d0eca80ff |
| freq | | | | 1 |
| mean | 0.636667 | | 1.417737 | |
| std | 0.370322 | | 0.581734 | |
| min | 0.0 | | 0.468482 | |
| 25% | 0.333333 | | 1.141958 | |
| 50% | 0.5 | | 1.331713 | |
| 75% | 1.0 | | 1.576078 | |

After evaluating, the best practice is to manually review the failure cases in LangSmith, identify patterns, and thoughtfully update the tool descriptions or ground-truth labels. The dataset we are using is noisy, so some labels may be incorrect.
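
One way to start looking for patterns during manual review is to count which tools are most often selected wrongly and which are most often missed. The sketch below assumes the failure cases have already been collected into simple records; the `sample_failures` data and the `count_wrong_selections` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Dict, List, Tuple

# HYPOTHETICAL failure records; in practice these would come from the evaluation results.
sample_failures: List[Dict[str, List[str]]] = [
    {"predicted": ["CEPBrazil", "TurkeyPostalCodes"], "expected": ["CEPBrazil"]},
    {"predicted": ["TurkeyPostalCodes"], "expected": ["PridnestroviePost"]},
    {"predicted": ["PackAndSend"], "expected": ["PridnestroviePost"]},
]


def count_wrong_selections(records: List[Dict[str, List[str]]]) -> Tuple[Counter, Counter]:
    """Count tools that were selected but not expected, and expected but not selected."""
    wrongly_selected: Counter = Counter()
    missed: Counter = Counter()
    for record in records:
        predicted, expected = set(record["predicted"]), set(record["expected"])
        wrongly_selected.update(predicted - expected)
        missed.update(expected - predicted)
    return wrongly_selected, missed


wrongly_selected, missed = count_wrong_selections(sample_failures)
print("Most often selected in error:", wrongly_selected.most_common())
print("Most often missed:", missed.most_common())
```

Tools near the top of either list are reasonable first candidates for a description rewrite or a label check.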

For demonstration purposes, we will now show a more automated (but less reliable) approach to improving the tool descriptions.

### 7. Automated Improvement of Tool Descriptions

Here, we'll use an LLM to try to fix the agent's mistakes. We will ask a second LLM to act as a "documentation assistant" that analyzes the failure cases from our first evaluation and suggests better tool descriptions. This process follows a **Map-Reduce-Distill** pattern:

1.  **Map**: For each run that failed (precision < 1), we will *map* it to a suggested documentation update. We'll create an "improver chain" that takes the failure case (query, predicted tools, expected tools) and generates a new, improved description for the tool(s) it thinks were misused.
2.  **Reduce**: We will group all the suggested description updates by tool name. A single tool might have multiple suggested improvements from different failure cases.
3.  **Distill**: For each tool, we will use a final "distill chain" to synthesize the list of candidate descriptions into a single, cohesive, improved description.
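
The three steps above can be sketched with plain Python before any LLM is involved. In this minimal illustration the `suggest` and `distill` callables are stand-ins for the LLM chains, and `fake_suggest`, `fake_distill`, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Suggestion = Tuple[str, str]  # (tool_name, proposed_description)


def map_reduce_distill(
    failures: List[dict],
    suggest: Callable[[dict], List[Suggestion]],
    distill: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    # Map: each failure case yields zero or more suggested description updates.
    suggestions = [s for failure in failures for s in suggest(failure)]
    # Reduce: group the suggested descriptions by tool name.
    by_tool: Dict[str, List[str]] = defaultdict(list)
    for name, description in suggestions:
        by_tool[name].append(description)
    # Distill: merge each tool's candidates into one final description.
    return {name: distill(name, candidates) for name, candidates in by_tool.items()}


# Stand-ins for the LLM chains (HYPOTHETICAL, for illustration only).
def fake_suggest(failure: dict) -> List[Suggestion]:
    wrong = [t for t in failure["predicted"] if t not in failure["expected"]]
    return [(tool, f"Not for: {failure['query']}") for tool in wrong]


def fake_distill(name: str, candidates: List[str]) -> str:
    return " ".join(sorted(set(candidates)))


demo_failures = [
    {"query": "track a parcel", "predicted": ["CEPBrazil"], "expected": ["PridnestroviePost"]},
    {"query": "hotel search", "predicted": ["CEPBrazil"], "expected": []},
]
demo_result = map_reduce_distill(demo_failures, fake_suggest, fake_distill)
print(demo_result)
```

Keeping the reduce step as ordinary Python means only the map and distill steps need LLM calls.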

Let's start by defining the `improver_chain`.

In [9]:
from typing import List # Import typing hints.

from langchain_core.prompts import ChatPromptTemplate # Prompt templating utility.
from pydantic import BaseModel, Field # Pydantic models for structured output.
from langchain_core.runnables import chain as as_runnable # Runnable utilities.
from langchain_openai import ChatOpenAI # OpenAI chat model wrapper.


# Define the Pydantic schema for a single tool description update.
class FunctionUpdate(BaseModel):
    name: str = Field(
        description="The name of the tool whose description you'd like to update"
    )
    updated_description: str = Field(
        description="The updated description that would make it clear when and why to invoke this function."
    )


# Define the top-level Pydantic model for the LLM's structured output.
class ImproveToolDocumentation(BaseModel):
    """Called to update the docstrings and other information about a given tool
    so that the user has an easier time selecting."""

    updates: List[FunctionUpdate] = Field(
        description="The updates to make, one for each tool description you'd like to change"
    )


# Define the prompt for the 'improver' LLM. It takes the full API list and a specific failure case.
improver_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an API documentation assistant tasked with meticulously improving the descriptions of our API docs."
            " Our AI assistant is trying to assist users by calling APIs, but it continues to invoke the wrong ones."
            " You must improve their documentation to remove ambiguity so that the assistant will no longer make any mistakes.\n\n"
            "##Valid APIs\nBelow are the existing APIs the assistant is choosing between:\n```apis.json\n{apis}\n```\n\n"
            "## Failure Case\nBelow is a user query, expected API calls, and actual API calls."
            " Use this failure case to make motivated doc changes.\n\n```failure_case.json\n{failure}\n```",
        ),
        (
            "user",
            "Respond with the updated tool descriptions to clear up"
            " whatever ambiguity caused the failure case above."
            " Feel free to mention what it is NOT appropriate for (if that's causing issues.), like 'don't use this for x'."
            " The updated description should reflect WHY the assistant got it wrong in the first place.",
        ),
    ]
)

# Initialize an LLM and configure it to produce structured output matching our Pydantic model.
llm = ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(ImproveToolDocumentation)

# Create the final 'improver' chain.
improver_chain = improver_prompt | llm




Now, we process the results of our first evaluation to create the inputs for our `improver_chain`.

In [49]:
# Convert the list of all available tools to a JSON string for inclusion in the prompt.
apis = json.dumps(tools, indent=2)

In [50]:
# Convert the test results into a pandas DataFrame for easy manipulation.
df = test_results.to_dataframe()
# Filter the DataFrame to only include rows where the tool selection was not perfect (score < 1).
df = df[df["feedback.tool_selection_precision"] < 1]


# Define a function to format each row of the DataFrame into the input format for the improver_chain.
def format_inputs(series):
    return {
        "apis": apis,
        "failure": json.dumps(
            {
                "query": series["inputs.query"],
                "predicted": [out["type"] for out in series["output"]],
                "expected": series["reference.expected"][0],
            }
        ),
    }


# Apply the formatting function to each failure case to create a list of inputs.
improver_inputs = df.apply(format_inputs, axis=1).tolist()

#### Map errors to updates

We run the `improver_chain` in a batch to efficiently generate suggested updates for all failure cases.

In [51]:
# This is the 'Map' step: run the improver_chain on all the failure cases.
all_updates = improver_chain.batch(improver_inputs, return_exceptions=True)
# Filter out any potential errors from the batch run to ensure we only have valid update objects.
all_updates = [u for u in all_updates if isinstance(u, ImproveToolDocumentation)]

#### Reduce updates per tool

Next, we perform the 'Reduce' step by grouping all the suggested descriptions by the tool name.

In [52]:
from collections import defaultdict # Import defaultdict for easy aggregation.

# Create a dictionary to hold lists of suggested updates for each tool.
toolwise_updates = defaultdict(list)
# Iterate through the list of all generated updates.
for updates in all_updates:
    # Each item can contain updates for multiple tools.
    for tool_update in updates.updates:
        # Append the suggested description to the list for that tool name.
        toolwise_updates[tool_update.name].append(tool_update.updated_description)

#### Distill updates into a final description

Finally, we perform the 'Distill' step. We create a new chain that takes the list of candidate descriptions for a single tool and synthesizes them into one final, improved description.

In [53]:
# Define the prompt for the 'distiller' LLM.
distill_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an API documentation assistant tasked with meticulously improving the descriptions of our API docs."
            " Our AI assistant is trying to help users by calling APIs, but it continues to invoke the wrong ones."
            " You are tasked with updating the {target_api} description.\n\n"
            "## Current APIs\n"
            "Below is a list of the current APIs and descriptions.\n"
            "```apis.json\n{apis}\n```\n\n"
            "## Candidates\n"
            " Here are some candidate description improvements:\n{candidates}\n"
            " Consider the above feedback in your updated description.",
        ),
        (
            "user",
            "Respond with the updated description for the {target_api} API."
            " It should distill and incorporate the candidate descriptions to"
            " clear up whatever ambiguity is causing our AI assistant to fail.",
        ),
    ]
).partial(apis=apis) # Pre-fill the 'apis' part of the prompt.

# Initialize an LLM configured to produce a single 'FunctionUpdate' object.
distill_llm = ChatOpenAI(model=model).with_structured_output(FunctionUpdate)

# Create the final 'distill' chain.
distill_chain = distill_prompt | distill_llm

In [54]:
# Prepare the inputs for the distill_chain.
distill_inputs = [
    {
        "target_api": name,
        "candidates": "\n".join(["- " + c for c in candidates]),
    }
    for name, candidates in toolwise_updates.items()
]

In [55]:
# Run the distill chain in a batch to generate the final descriptions for all tools.
updated_descriptions = distill_chain.batch(distill_inputs)

In [56]:
# Convert the list of updated descriptions into a simple name: description dictionary.
updates_dict = {upd.name: upd.updated_description for upd in updated_descriptions}
# Display the final, improved descriptions.
updates_dict

{'TransportistasdeArgentina': 'Get a shipping quote for sending products within Argentina using OCA e-Pack. Provide destination and source postcodes, CUIT, operativa number, cost, volume, and weight details for accurate pricing.',
 'TurkeyPostalCodes': 'Retrieve Turkish plate numbers (1 to 81) based on the city code. This API is specifically designed to provide details about Turkish plates and is not intended for tracking packages or obtaining postal codes for cities in Argentina.',
 'CEPBrazil': 'Retrieve address details based on a Brazilian CEP number. This function is NOT intended for tracking package locations or statuses, tracking travel documents, or providing non-address related information. Use this API specifically for address lookup using CEP numbers in Brazil.',
 'PridnestroviePost': 'Get track information by providing a track number for international shipments. Use this API specifically for tracking packages and shipments.',
 'PackAndSend': 'If you have a Pack & Send Refere

Now we create a new list of tools (`new_tools`) with our automatically generated descriptions.

In [57]:
from copy import deepcopy # Import deepcopy to create a new, independent copy of our tools list.

# Create a deep copy of the original tools to avoid modifying them in place.
new_tools = deepcopy(tools)
# Iterate through the new list of tools.
for tool in new_tools:
    # Get the name of the current tool.
    name = tool["function"]["name"]
    # Check if we have generated an updated description for this tool.
    if name in updates_dict:
        # Get the new description.
        updated = updates_dict[name]
        # Overwrite the old description with the new, improved one.
        tool["function"]["description"] = updated
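
Because the new descriptions are machine-generated, it is worth reading them side by side with the originals before re-evaluating. The helper below is a minimal sketch of such a comparison; `changed_descriptions`, `ExampleUnchangedTool`, and the sample lists are hypothetical names introduced for illustration, and the tool dictionaries are trimmed to the fields the helper reads.

```python
from typing import Dict, List, Tuple


def changed_descriptions(old_tools: List[dict], new_tools: List[dict]) -> Dict[str, Tuple[str, str]]:
    """Map each tool name to (old_description, new_description) where the two differ."""
    old_by_name = {t["function"]["name"]: t["function"]["description"] for t in old_tools}
    changes: Dict[str, Tuple[str, str]] = {}
    for tool in new_tools:
        name = tool["function"]["name"]
        new_description = tool["function"]["description"]
        old_description = old_by_name.get(name)
        if old_description is not None and old_description != new_description:
            changes[name] = (old_description, new_description)
    return changes


# Trimmed sample data; the second tool is HYPOTHETICAL and left unchanged on purpose.
sample_old_tools = [
    {"function": {"name": "TransportistasdeArgentina",
                  "description": "Quote for postcode in OCA e-Pack."}},
    {"function": {"name": "ExampleUnchangedTool", "description": "Left untouched."}},
]
sample_new_tools = [
    {"function": {"name": "TransportistasdeArgentina",
                  "description": "Get a shipping quote for sending products within Argentina using OCA e-Pack."}},
    {"function": {"name": "ExampleUnchangedTool", "description": "Left untouched."}},
]

sample_changes = changed_descriptions(sample_old_tools, sample_new_tools)
for tool_name, (before, after) in sample_changes.items():
    print(f"{tool_name}\n  before: {before}\n  after:  {after}")
```

In the notebook, the same helper could be called with the original `tools` list and `new_tools`.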

### 8. Re-Evaluate (V2)

Now that we have our improved tool descriptions, we'll create a new agent chain (`updated_chain`) that uses them. We will then re-run the evaluation on the *same development dataset* to see if our automated changes led to a better precision score.

In [58]:
# Create a new LLM instance and bind our `new_tools` with the updated descriptions.
llm = ChatOpenAI(model=model).bind_tools(new_tools)

# Create the V2 chain, using the same prompt and parser but the new LLM.
updated_chain = assistant_prompt | llm | JsonOutputToolsParser()

In [59]:
model = "gpt-3.5-turbo" # Re-specify the model name.

# Run the evaluation again with the updated chain.
updated_test_results = client.run_on_dataset(
    dataset_name=dataset_name, # Use the same development dataset.
    llm_or_chain_factory=updated_chain, # Pass our new V2 chain.
    evaluation=eval_config, # Use the same evaluation config.
    project_metadata={
        "model": model,
        # Update the version number for our tool descriptions.
        "tool_variant": 2,
    },
    verbose=True,
)

View the evaluation results for project 'ordinary-step-81' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/462d8386-60c8-4cb3-84eb-6efeae3a1293/compare?selectedSessions=a4204d34-4d08-42fa-a84d-19b850ad920e

View all tests for Dataset Tool Selection (Logistics) dev at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/462d8386-60c8-4cb3-84eb-6efeae3a1293
[------------------------------------------------->] 99/100

Chain failed for example 033fd6d7-6c80-4ef2-ab26-e4116e4da24a with inputs {'query': "I'm planning a family vacation to Brazil and I need to find a hotel in Rio de Janeiro. Can you provide me with a list of available hotels in Rio de Janeiro downtown? Additionally, I would like to know the current health status of the CEP Brazil API and if it's functioning properly."}
Error Type: InternalServerError, Message: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_35b74dde88208be45493f9827dc88674 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}


[------------------------------------------------->] 100/100

The metrics in LangSmith show a slight improvement, though with only 100 examples the gain may be within run-to-run noise. This automated approach is a good starting point, but manual analysis and refinement are often necessary for significant gains.
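
To get a rough sense of how large a change would need to be before it stands out from noise, one can compare the difference in mean precision against its standard error. The sketch below uses the V1 statistics reported in the evaluation summary above; the V2 numbers are hypothetical placeholders, and the unpaired normal approximation is a simplification (both runs score the same examples, so a paired comparison would be more precise).

```python
import math


def standard_error(std: float, n: int) -> float:
    """Standard error of a mean computed from n scores with sample standard deviation std."""
    return std / math.sqrt(n)


def difference_is_clear(mean_a: float, std_a: float, n_a: int,
                        mean_b: float, std_b: float, n_b: int,
                        z: float = 1.96) -> bool:
    """Rough check: is |mean_b - mean_a| larger than z standard errors of the difference?"""
    se_diff = math.sqrt(standard_error(std_a, n_a) ** 2 + standard_error(std_b, n_b) ** 2)
    return abs(mean_b - mean_a) > z * se_diff


# V1 statistics as reported in the evaluation summary above.
v1_mean, v1_std, v1_n = 0.636667, 0.370322, 100
# HYPOTHETICAL V2 statistics, for illustration only; read the real ones from LangSmith.
v2_mean, v2_std, v2_n = 0.66, 0.37, 100

print(round(standard_error(v1_std, v1_n), 4))  # 0.037
print(difference_is_clear(v1_mean, v1_std, v1_n, v2_mean, v2_std, v2_n))  # False
```

With a standard error of about 0.037 on each mean, a gain of a few points of precision on 100 examples is not enough on its own to conclude that V2 is better.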

### 9. Final Validation on a Test Set

So far, we have been developing and evaluating on a single dataset (our "dev" set). This process, often called **hill climbing**, can lead to **overfitting**: our new tool descriptions might be overly tailored to the specific examples in the dev set and may not perform as well on new, unseen data.

To get an unbiased measure of whether our V2 chain is actually better than V1, we must benchmark both versions on a **held-out test set**: data that was not used at any point during the development and improvement process. A better score on this test set is much stronger evidence that the new tool descriptions generalize, rather than merely fitting the dev set.

In [60]:
# Define the URLs for the different data splits (dev, test, train).
dataset_urls = {
    # Dev is the same as the one we've been using.
    "dev": dev_dataset_url,
    # The URL for the held-out test set.
    "test": "https://smith.langchain.com/public/a5fd6197-36ed-4d06-993a-89929dded399/d",
    # The URL for a training set (which could be used for fine-tuning).
    "train": "https://smith.langchain.com/public/cf5a1de8-68f0-4170-9bcc-f263c1abb063/d",
}

In [61]:
import langsmith # Import the langsmith library.

client = langsmith.Client() # Instantiate the LangSmith client.

# Clone the public test dataset into our workspace.
client.clone_public_dataset(dataset_urls["test"])

In [None]:
# Define the name of our newly cloned test dataset.
test_dataset_name = "Tool Selection (Logistics) test"

# Iterate through both the original chain (V1) and the updated chain (V2),
# pairing each with the tool_variant tag it was given on the dev set.
for tool_variant, target_chain in [(0, chain), (2, updated_chain)]:
    # Run each chain on the test dataset.
    client.run_on_dataset(
        dataset_name=test_dataset_name,
        llm_or_chain_factory=target_chain, # The chain under test in this iteration.
        evaluation=eval_config, # The evaluation configuration.
        project_metadata={
            "model": model,
            # Record which tool description version produced this run.
            "tool_variant": tool_variant,
        },
    )

View the evaluation results for project 'definite-coach-89' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/ddc1bcf7-c3fb-4669-824d-eb2e23af93d0/compare?selectedSessions=2b7204c8-7f07-4c2c-b798-d9005a059ce0

View all tests for Dataset Tool Selection (Logistics) test at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/ddc1bcf7-c3fb-4669-824d-eb2e23af93d0
[------------------------------------------------->] 234/234

View the evaluation results for project 'sparkling-doctor-64' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/ddc1bcf7-c3fb-4669-824d-eb2e23af93d0/compare?selectedSessions=b9d4bd07-d96b-4da8-97df-279158ffafa1

View all tests for Dataset Tool Selection (Logistics) test at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/ddc1bcf7-c3fb-4669-824d-eb2e23af93d0
[------------------------------------------------> ] 231/234