In [88]:
%load_ext autoreload
%autoreload 2


# Week 6 - Systematically Improving Your RAG Application

## Evaluating Tool Calling

We want to evaluate how well our model calls tools. To do so, we'll use two main metrics:

1. Precision: of the tools the model called, how many were relevant to the query
2. Recall: of the tools that were relevant to the query, how many the model called

Ideally, both metrics are as high as possible, but we prioritize precision over recall because calling irrelevant tools wastes resources: it increases the latency and cost of our application and degrades the user experience.

We'll first build some intuition for how to evaluate our model's performance, then look at a simple case study, and finally move on to practical tips for scaling up the number of tools our application can call and improving reliability.

## Raycast Natural Language Extensions

Raycast is an application that lets you launch custom shortcuts and integrations on your computer. It combines integrations with tools such as Jira, Airtable and Google, among many others, and will be launching an easy way to prompt extensions with natural language.

For instance, given the command `@calendar when's my next meeting?`, Raycast will execute a series of installed commands that fetch all of your meetings and then return the time of the next meeting after the current time. This lets users interact with their system quickly and efficiently.

In this notebook series, we'll look at how we might prototype a similar application. We'll do so over 3 notebooks.

1. **Evaluation**: Using a simple set of tools, we'll calculate precision and recall and see how to use these two metrics to evaluate the tools a model has called relative to a set of expected tool calls.

2. **Synthetic Data**: Once we've built some intuition for how this evaluation works, we'll generate a synthetic dataset of queries that we can use to evaluate how well a model translates natural language queries into Raycast commands. We'll then measure the impact of different techniques like few-shot examples and synthetic data generation to cover weak spots in our application.

3. **Dynamic Retrieval**: Finally, we'll look at how we might scale up the number of tools we have in our application by dynamically retrieving relevant tools for a given query. We'll then compare this against our original few-shot and static baselines and see how we can improve model performance.

# Understanding Model Performance

In this section, we'll look at how to evaluate how well a model calls the right tool. We'll do so in 3 steps:

1. **Metrics**: We'll first look at precision and recall and why we want to use them to evaluate our model's performance
2. **Tool Calling**: We'll then see how we can evaluate the performance of our model using these metrics by writing simple assertions and unit tests
3. **Parallel Tool Calling**: We'll then see whether parallel tool calling can improve the latency of our application and the performance of our model

## Precision and Recall

Precision and recall measure the quality of a model's output from two angles: precision penalizes calling tools that weren't needed, while recall penalizes missing tools that were.

Let's see how we can manually calculate these metrics.

In [89]:
# Tools that our model called
model_tool_call = [
    "GET_CALENDAR_EVENTS",
    "CREATE_REMINDER",
    "SEND_EMAIL",
]

# Tools that we expected our model to call
expected_tool_call = [
    "GET_CALENDAR_EVENTS",
]


def calculate_precision(model_tool_call, expected_tool_call):
    """
    Calculate precision: (true positives) / (true positives + false positives)
    Precision = (relevant tools called) / (total tools called)
    """
    if len(model_tool_call) == 0:
        return 0.0  # No tools called means no true positives, so we score precision as 0
    
    relevant_results = sum(1 for tool in model_tool_call if tool in expected_tool_call)
    return round(relevant_results / len(model_tool_call), 2)

def calculate_recall(model_tool_call, expected_tool_call):
    """
    Calculate recall: (true positives) / (true positives + false negatives)
    Recall = (relevant tools called) / (total relevant tools)
    """
    if len(expected_tool_call) == 0:
        return 1.0  # Perfect recall if no tools were expected
    
    if len(model_tool_call) == 0:
        return 0.0  # No recall if no tools were called

    relevant_results = sum(1 for tool in expected_tool_call if tool in model_tool_call)
    return round(relevant_results / len(expected_tool_call), 2)


precision, recall = (
    calculate_precision(model_tool_call, expected_tool_call),
    calculate_recall(model_tool_call, expected_tool_call),
)

precision, recall

(0.33, 1.0)

We can see that for this specific case, we had two tools that were called that were irrelevant to the user's query - `CREATE_REMINDER` and `SEND_EMAIL`. For a production application, we'd want to avoid this.

We did achieve a perfect recall - but remember here that a perfect recall can also be achieved by calling every single tool in our application. We want to minimise the amount of wasted computation. Let's see another example of how to compute these metrics.
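To make the "call everything" failure mode concrete, here is a minimal sketch. The tool catalogue `ALL_TOOLS` (including `SEARCH_FILES` and `PLAY_MUSIC`) and the helper `precision_recall` are hypothetical names introduced only for this illustration; the helper uses sets and is not the notebook's own implementation.

```python
# Hypothetical catalogue of every tool the application exposes (assumption for illustration).
ALL_TOOLS = [
    "GET_CALENDAR_EVENTS",
    "CREATE_REMINDER",
    "SEND_EMAIL",
    "SEARCH_FILES",
    "PLAY_MUSIC",
]
expected = ["GET_CALENDAR_EVENTS"]


def precision_recall(called, expected):
    """Set-based precision and recall for one query."""
    called_set, expected_set = set(called), set(expected)
    hits = len(called_set & expected_set)
    precision = hits / len(called_set) if called_set else 0.0
    recall = hits / len(expected_set) if expected_set else 1.0
    return round(precision, 2), round(recall, 2)


# Calling every tool guarantees that nothing relevant is missed...
call_everything = precision_recall(ALL_TOOLS, expected)
print(call_everything)  # (0.2, 1.0): perfect recall, but only 1 in 5 calls was useful
```

Recall alone cannot tell this strategy apart from a well-behaved model, which is why we always read it together with precision.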

In [90]:
# Tools that our model called
model_tool_call = [
    "GET_CALENDAR_EVENTS",
]

# Tools that we expected our model to call
expected_tool_call = [
    "GET_CALENDAR_EVENTS",
    "CREATE_REMINDER",
]

precision, recall = (
    calculate_precision(model_tool_call, expected_tool_call),
    calculate_recall(model_tool_call, expected_tool_call),
)

precision, recall

(1.0, 0.5)

While recall drops to 0.5 here because we didn't call the `CREATE_REMINDER` tool, precision rises to 1. This is preferable to the previous case where we called two irrelevant tools.

Therefore, what we want to do is maximise precision while keeping recall high. Ideally, **all of the tools we call are relevant** and we **call as many of the relevant tools as possible**. This is quite distinct from RAG, where we want to maximise recall while relying on the model's ability to filter out irrelevant information.
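One way to fold this preference into a single number is the F-beta score, where a `beta` below 1 weights precision more heavily than recall. This is a sketch of a standard formula rather than something this notebook uses elsewhere; the function name `f_beta` and the choice of `beta=0.5` are assumptions made for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # Both precision and recall are 0
    return (1 + beta**2) * precision * recall / denominator


# The two worked examples from this section
first_example = f_beta(0.33, 1.0)  # two irrelevant calls, nothing missed
second_example = f_beta(1.0, 0.5)  # no irrelevant calls, one tool missed

print(round(first_example, 2), round(second_example, 2))  # 0.38 0.83
```

With this weighting, the second example scores higher than the first, which matches the preference for precision described above.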

## Defining our Tools

We want a set of test cases that we can use to measure the precision and recall of our model's tool calling in response to a user query.

We'll build this up in 3 steps:

1. We'll first define some tools that a simple personal assistant chatbot might use
2. We'll then define a set of test cases and corresponding expected tool calls
3. Lastly, we'll evaluate how well our model performs on these test cases using simple precision and recall metrics


In [91]:
from datetime import datetime
from typing import Literal, Union

from pydantic import BaseModel


class SendEmail(BaseModel):
    email: str
    subject: str
    body: str


class GetCalendarEvents(BaseModel):
    calendar: Literal["work", "personal"]
    start_date: datetime
    end_date: datetime


class CreateReminder(BaseModel):
    title: str
    description: str
    due_date: datetime


class ToolCalls(BaseModel):
    calls: list[
        Union[
            SendEmail,
            GetCalendarEvents,
            CreateReminder,
        ]
    ]


In [92]:
import instructor
from openai import AsyncOpenAI
from asyncio import Semaphore
import time


client = instructor.from_openai(AsyncOpenAI())


async def generate_tool_calls(query: str, sem: Semaphore):
    async with sem:
        start = time.time()
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that can call tools in response to user requests.",
                },
                {"role": "user", "content": query},
            ],
            response_model=ToolCalls,
        )
        end = time.time()
        return {
            "response": resp,
            "time": end - start,
        }


In [93]:
import asyncio

tests = [
    # Single tool queries
    ["Send an email to john@example.com about the project update", [SendEmail]],
    ["What meetings do I have scheduled for tomorrow?", [GetCalendarEvents]],
    ["Set a reminder for my dentist appointment next week", [CreateReminder]],
    # Two tool combinations
    [
        "Check my calendar for next week's meetings and set reminders for each one",
        [GetCalendarEvents, CreateReminder],
    ],
    [
        "Look up my team meeting schedule and send the agenda to all participants",
        [GetCalendarEvents, SendEmail],
    ],
    [
        "Set a reminder for the client call and send a confirmation email to the team",
        [CreateReminder, SendEmail],
    ],
]

sem = asyncio.Semaphore(10)
coros = [generate_tool_calls(query, sem) for query, _ in tests]

results = await asyncio.gather(*coros)

In [94]:
import pandas as pd

df = pd.DataFrame(
    [
        {
            "query": test_item[0],
            "expected_tools": [tool.__name__ for tool in test_item[1]],
            "actual_tools": list(set([type(tool).__name__ for tool in result["response"].calls])),
            "time": round(result["time"],2),
        }
        for test_item, result in zip(tests, results)
    ]
)

df["precision"] = df.apply(
    lambda x: calculate_precision(x["actual_tools"], x["expected_tools"]), axis=1
)
df["recall"] = df.apply(
    lambda x: calculate_recall(x["actual_tools"], x["expected_tools"]), axis=1
)
df["CORRECT"] = df.apply(
    # Compare as sets so that the order of the tool names doesn't matter
    lambda x: "Y" if set(x["expected_tools"]) == set(x["actual_tools"]) else "N",
    axis=1,
)

df


| | query | expected_tools | actual_tools | time | precision | recall | CORRECT |
|---|---|---|---|---|---|---|---|
| 0 | Send an email to john@example.com about the pr... | [SendEmail] | [SendEmail] | 1.7 | 1.0 | 1.0 | Y |
| 1 | What meetings do I have scheduled for tomorrow? | [GetCalendarEvents] | [GetCalendarEvents] | 1.39 | 1.0 | 1.0 | Y |
| 2 | Set a reminder for my dentist appointment next... | [CreateReminder] | [CreateReminder] | 1.31 | 1.0 | 1.0 | Y |
| 3 | Check my calendar for next week's meetings and... | [GetCalendarEvents, CreateReminder] | [GetCalendarEvents] | 1.46 | 1.0 | 0.5 | N |
| 4 | Look up my team meeting schedule and send the ... | [GetCalendarEvents, SendEmail] | [GetCalendarEvents] | 1.31 | 1.0 | 0.5 | N |
| 5 | Set a reminder for the client call and send a ... | [CreateReminder, SendEmail] | [SendEmail, CreateReminder] | 1.49 | 1.0 | 1.0 | Y |


In [95]:
round(df["recall"].mean().item(), 2), round(df["precision"].mean().item(), 2), round(df["time"].mean().item(), 2)

(0.83, 1.0, 1.44)

We can see a few things here:

1. Our model has high precision. In every test case, each tool it decided to call was relevant to the user's query.
2. It has lower recall when a query needs two tools. For both `Check my calendar for next week's meetings and set reminders for each one` and `Look up my team meeting schedule and send the agenda to all participants`, it only called `GetCalendarEvents` and missed the follow-up `CreateReminder` or `SendEmail` call.

Let's now explore other ways to evaluate our model's performance. These are simple examples, so we expect the model to do well in most cases. But as we increase the number of tools in our application, we'll see how precision and recall are affected.



In [96]:
# Per tool: mean row-level precision over queries where the tool was called,
# and mean row-level recall over queries where the tool was expected
all_tools = set()
for tools in df["expected_tools"] + df["actual_tools"]:
    all_tools.update(tools)

stats = []
for tool in all_tools:
    tool_stats = {
        "Tool": tool,
        "Precision": df[df["actual_tools"].apply(lambda x: tool in x)][
            "precision"
        ].mean(),
        "Recall": df[df["expected_tools"].apply(lambda x: tool in x)]["recall"].mean(),
    }
    stats.append(tool_stats)

tool_df = pd.DataFrame(stats).set_index("Tool").round(2)
tool_df


| Tool | Precision | Recall |
|---|---|---|
| SendEmail | 1.0 | 0.83 |
| GetCalendarEvents | 1.0 | 0.67 |
| CreateReminder | 1.0 | 0.83 |
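Note that the table above averages the per-query scores of every query that involves a tool, so a tool is penalized when a *different* tool in the same query is missed. An alternative is to count true positives, false positives and false negatives per tool. The sketch below does this over the (expected, actual) pairs copied from the baseline results above; the variable names are illustrative and this is not the method the notebook uses.

```python
from collections import Counter

# (expected_tools, actual_tools) pairs copied from the baseline results above
rows = [
    (["SendEmail"], ["SendEmail"]),
    (["GetCalendarEvents"], ["GetCalendarEvents"]),
    (["CreateReminder"], ["CreateReminder"]),
    (["GetCalendarEvents", "CreateReminder"], ["GetCalendarEvents"]),
    (["GetCalendarEvents", "SendEmail"], ["GetCalendarEvents"]),
    (["CreateReminder", "SendEmail"], ["SendEmail", "CreateReminder"]),
]

tp, fp, fn = Counter(), Counter(), Counter()
for expected, actual in rows:
    expected_set, actual_set = set(expected), set(actual)
    for tool in actual_set & expected_set:
        tp[tool] += 1  # called and expected
    for tool in actual_set - expected_set:
        fp[tool] += 1  # called but not expected
    for tool in expected_set - actual_set:
        fn[tool] += 1  # expected but not called

per_tool = {}
for tool in sorted(set(tp) | set(fp) | set(fn)):
    called = tp[tool] + fp[tool]
    wanted = tp[tool] + fn[tool]
    per_tool[tool] = {
        "precision": round(tp[tool] / called, 2) if called else None,
        "recall": round(tp[tool] / wanted, 2) if wanted else None,
    }

for tool, scores in per_tool.items():
    print(tool, scores)
```

Counted this way, `GetCalendarEvents` has a recall of 1.0 because it was called every time it was expected, while `SendEmail` and `CreateReminder` each drop to 0.67 because each was missed once.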


## Parallel Tool Calling

In our previous example, we had to wait for the model's entire response before we could call our tools. This meant that each tool call had to wait for prior tool calls to complete before it could be generated.

With parallel tool calling, we can sidestep this and generate multiple tool calls in a single request. We can then use our evals to benchmark how this change affects both latency and model performance.

Let's see how we can do so.

In [97]:
from pydantic import BaseModel
from typing import Literal
from datetime import datetime


class SendEmail(BaseModel):
    email: str
    subject: str
    body: str


class GetCalendarEvents(BaseModel):
    calendar: Literal["work", "personal"]
    start_date: datetime
    end_date: datetime


class CreateReminder(BaseModel):
    title: str
    description: str
    due_date: datetime

In [98]:
import openai
import instructor
from typing import Iterable
from rich import print

client = instructor.from_openai(
    openai.AsyncOpenAI(), mode=instructor.Mode.PARALLEL_TOOLS
)  

function_calls = await client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You must always use tools"},
        {
            "role": "user",
            "content": "Can you fetch my calendar events for the next week and send an email to John(john@example.com) about the meeting we have tomorrow?",
        },
    ],
    response_model=Iterable[GetCalendarEvents | SendEmail | CreateReminder],  
)

for fc in function_calls:
    print(fc)

Let's now see how we can adapt our previous unit test to evaluate the performance of our model with parallel tool calling.

In [99]:
client = instructor.from_openai(AsyncOpenAI(), mode=instructor.Mode.PARALLEL_TOOLS)

async def generate_parallel_tool_calls(query: str, sem: Semaphore):
    async with sem:
        start = time.time()
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You must always use tools"},
                {
                    "role": "user",
                    "content": query,
                },
            ],
            response_model=Iterable[GetCalendarEvents | SendEmail | CreateReminder],  
        )
        end = time.time()

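        try:
            # Consume the iterable of parsed tool calls returned by instructor
            tools = list(resp)
        except Exception:
            # If the response can't be consumed as tool calls, count it as no tools called
            tools = []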

        return {
            "response": tools,
            "time": end - start,
        }


In [104]:
import asyncio

tests = [
    # Single tool queries
    ["Send an email to john@example.com about the project update", [SendEmail]],
    ["What meetings do I have scheduled for tomorrow?", [GetCalendarEvents]],
    ["Set a reminder for my dentist appointment next week", [CreateReminder]],
    # Two tool combinations
    [
        "Check my calendar for next week's meetings and set reminders for each one",
        [GetCalendarEvents, CreateReminder],
    ],
    [
        "Look up my team meeting schedule and send the agenda to all participants",
        [GetCalendarEvents, SendEmail],
    ],
    [
        "Set a reminder for the client call and send a confirmation email to the team",
        [CreateReminder, SendEmail],
    ],
]

sem = asyncio.Semaphore(10)
coros = [generate_parallel_tool_calls(query, sem) for query, _ in tests]

results = await asyncio.gather(*coros)

In [105]:
import pandas as pd

df = pd.DataFrame(
    [
        {
            "query": test_item[0],
            "expected_tools": [tool.__name__ for tool in test_item[1]],
            "actual_tools": list(set([type(tool).__name__ for tool in result["response"]])),
            "time": round(result["time"],2),
        }
        for test_item, result in zip(tests, results)
    ]
)

df["precision"] = df.apply(
    lambda x: calculate_precision(x["actual_tools"], x["expected_tools"]), axis=1
)
df["recall"] = df.apply(
    lambda x: calculate_recall(x["actual_tools"], x["expected_tools"]), axis=1
)
df["CORRECT"] = df.apply(
    # Compare as sets so that the order of the tool names doesn't matter
    lambda x: "Y" if set(x["expected_tools"]) == set(x["actual_tools"]) else "N",
    axis=1,
)

df


| | query | expected_tools | actual_tools | time | precision | recall | CORRECT |
|---|---|---|---|---|---|---|---|
| 0 | Send an email to john@example.com about the pr... | [SendEmail] | [] | 1.18 | 0.0 | 0.0 | N |
| 1 | What meetings do I have scheduled for tomorrow? | [GetCalendarEvents] | [GetCalendarEvents] | 1.64 | 1.0 | 1.0 | Y |
| 2 | Set a reminder for my dentist appointment next... | [CreateReminder] | [] | 1.34 | 0.0 | 0.0 | N |
| 3 | Check my calendar for next week's meetings and... | [GetCalendarEvents, CreateReminder] | [GetCalendarEvents] | 1.66 | 1.0 | 0.5 | N |
| 4 | Look up my team meeting schedule and send the ... | [GetCalendarEvents, SendEmail] | [GetCalendarEvents] | 1.89 | 1.0 | 0.5 | N |
| 5 | Set a reminder for the client call and send a ... | [CreateReminder, SendEmail] | [SendEmail, CreateReminder] | 2.19 | 1.0 | 1.0 | Y |


In [106]:
# Per tool: mean row-level precision over queries where the tool was called,
# and mean row-level recall over queries where the tool was expected
all_tools = set()
for tools in df["expected_tools"] + df["actual_tools"]:
    all_tools.update(tools)

stats = []
for tool in all_tools:
    # Get rows where tool was actually called
    tool_called_mask = df["actual_tools"].apply(lambda x: tool in x)
    # Get rows where tool was expected to be called
    tool_expected_mask = df["expected_tools"].apply(lambda x: tool in x)
    
    tool_stats = {
        "Tool": tool,
        "Precision": df[tool_called_mask]["precision"].mean() if tool_called_mask.any() else 0.0,  # Set to 0 if never called
        "Recall": df[tool_expected_mask]["recall"].mean(),
    }
    stats.append(tool_stats)

tool_df = pd.DataFrame(stats).set_index("Tool").round(2)
tool_df


| Tool | Precision | Recall |
|---|---|---|
| SendEmail | 1.0 | 0.5 |
| GetCalendarEvents | 1.0 | 0.67 |
| CreateReminder | 1.0 | 0.5 |


In [107]:
round(df["recall"].mean().item(), 2), round(df["precision"].mean().item(), 2), round(df["time"].mean().item(), 2)

(0.5, 0.67, 1.65)

# Conclusion

## Parallel Tool calling vs Tool Calling


| Metric           | Tool Calling (Baseline) | Parallel Tool Calling |
|------------------|-------------------------|-----------------------|
| **Precision**    | 1.00                    | 0.67 (-33.00%)        |
| **Recall**       | 0.83                    | 0.50 (-39.76%)        |
| **Avg Time (s)** | 1.44                    | 1.65 (+14.58%)        |

We can observe the following differences between the two methods:

1. **Latency**: Average latency actually increases with parallel tool calling (1.44s to 1.65s), which is surprising because we'd expect parallel tool calling to be faster. With only six queries and a single run each, though, this difference may well be noise.
2. **Precision and Recall**: Precision drops by about 33% and recall by about 40% with parallel tool calling. The precision drop comes entirely from two queries where we ended up with no tool calls at all (which we score as a precision of 0) rather than from calling irrelevant tools, and the same two misses drag down recall.
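The relative changes can be recomputed from the averages printed by the two summary cells above (each tuple is ordered recall, precision, time). The helper `relative_change` below is a hypothetical name for this sketch, and it works from the rounded averages rather than the raw per-query values.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change from baseline to new, rounded to 2 decimal places."""
    return round((new - baseline) / baseline * 100, 2)


# Averages printed above: baseline (0.83, 1.0, 1.44), parallel (0.5, 0.67, 1.65),
# both in the order (recall, precision, time)
baseline = {"precision": 1.0, "recall": 0.83, "time": 1.44}
parallel = {"precision": 0.67, "recall": 0.5, "time": 1.65}

changes = {metric: relative_change(baseline[metric], parallel[metric]) for metric in baseline}
print(changes)  # {'precision': -33.0, 'recall': -39.76, 'time': 14.58}
```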

If we look at the specific test cases above, we can see that when we switch from tool calling to parallel tool calling, our model runs into the following issues.

1. We end up with no tool calls for two of the single-tool queries (sending an email and setting a reminder), which the baseline handled correctly
2. It still struggles to call tools in combination with one another, calling only `GetCalendarEvents` for the two calendar queries that need a follow-up action

Even so, the comparison is valuable because we now have objective metrics that measure the difference in performance between our two tool calling strategies.

These metrics come in handy whenever we iterate on the prompt, the response model or even the language model itself.
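One way to put these metrics to work while iterating is a simple regression check that fails when average precision or recall falls below a floor. The sketch below is an illustration under stated assumptions: the helper names `evaluate` and `passes` and the thresholds (0.9 precision, 0.8 recall) are hypothetical, and the rows are copied from the two result tables above.

```python
from statistics import mean


def evaluate(rows):
    """Return (mean precision, mean recall) over (expected, actual) pairs."""
    precisions, recalls = [], []
    for expected, actual in rows:
        expected_set, actual_set = set(expected), set(actual)
        hits = len(expected_set & actual_set)
        precisions.append(hits / len(actual_set) if actual_set else 0.0)
        recalls.append(hits / len(expected_set) if expected_set else 1.0)
    return mean(precisions), mean(recalls)


def passes(rows, min_precision=0.9, min_recall=0.8):
    """True if both averages meet the (hypothetical) thresholds."""
    precision, recall = evaluate(rows)
    return precision >= min_precision and recall >= min_recall


expected_tools = [
    ["SendEmail"],
    ["GetCalendarEvents"],
    ["CreateReminder"],
    ["GetCalendarEvents", "CreateReminder"],
    ["GetCalendarEvents", "SendEmail"],
    ["CreateReminder", "SendEmail"],
]
baseline_actual = [
    ["SendEmail"], ["GetCalendarEvents"], ["CreateReminder"],
    ["GetCalendarEvents"], ["GetCalendarEvents"], ["SendEmail", "CreateReminder"],
]
parallel_actual = [
    [], ["GetCalendarEvents"], [],
    ["GetCalendarEvents"], ["GetCalendarEvents"], ["SendEmail", "CreateReminder"],
]

baseline_rows = list(zip(expected_tools, baseline_actual))
parallel_rows = list(zip(expected_tools, parallel_actual))

print(passes(baseline_rows))  # True
print(passes(parallel_rows))  # False
```

A check like this can run after every change to the prompt, response model or language model, so a regression shows up as a failed assertion rather than a surprise in production.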

## What's next?


In this notebook, we looked at two metrics, precision and recall, that we can use to evaluate our model's tool calling capabilities.

We then used these metrics to evaluate the two different tool calling strategies - tool calling and parallel tool calling and talked a bit about how we might use the results to decide which one to go with.

Now that we've built some intuition for how we might evaluate our model's tool calling capabilities, we'll start looking at how these metrics can be applied as we scale up the number of tools that we have in our application.

Specifically, we'll start thinking about questions such as

- What sort of trade offs do we see when we scale up the number of tools that we have in our application?
- How do we decide which tools to include in our application?

and more when we start looking at some of the extensions and tools that Raycast provides.