# Week 6 - Systematically Improving Your Rag Application

> If you haven't ran the previous notebook `1. Evaluate Tools.ipynb`, please do so before continuing. A lot of the code in this notebook will be based off the evaluation methods that we cover in that notebook.

# Generating Our Dataset

We've defined a list of commands that we want to evaluate in our dataset in the `commands.yaml` file. In this specific notebook, we'll be generating a synthetic dataset of user requests that we can use to evaluate our mode's ability to call the commands that we've defined.

Before we start generating our dataset, let's first understand what our model might struggle with. When it comes to deciding what tools to call, there are a few common failure points that a personal assistant might face.

1. **A Lack Of Context** : It's common for users to use multiple note taking apps. We have Notion for company documents, Obsidian for personal notes and Apple Notes for quick notes that are more ephemeral. But it's difficult for a language model to know exactly which app to use when the user makes a request like `Sarah just got back to me on the project that our campaign was approved. can you make a note in our project page about it?` vs `I'm going shopping for groceries later, can you remind me that i need to buy milk, eggs and bread?`

We might guess that the user might want notion but we can't assume that.

2. **Multi-Step Tasks** : When we start scaling out our application to support calling multiple applications, we might find that our model struggles with multi-step tasks. For instance, given the request `create a new feature branch for the UI refactor and link all the tickets you've found about onboarding issues in the PR description`, we might find that our model struggles to call the required tools in the correct order or even call the tools at all.

Being able to simulate these failure points in our synthetic dataset will help us to understand how our model performs in these scenarios. In this notebook, we'll do so in a few steps

1. **Identifying Failure Points** : We'll first start by looking at how our model performs with simple/multi-step tasks to see how these failure points manifest themselves and start thinking about how we can generate synthetic queries that challenge our model which mimic these failure points

2. **Generating Synthetic Questions** : We'll start by generating some initial queries. Once we're satisfied with the quality of the queries, we'll then generate more queries by sampling from these generated queries. We'll then use these queries to benchmark the precision and recall of our model's function calls and establish an initial baseline

3. **Improving Performance** : Once we've established a simple baseline, we'll explore a few different strategies to improve the performance of our model. These include better descriptions of our commands, better prompts and some hard-coded few shot examples.

Throughout this process, we'll be using `braintrust` to log all of our experiments so that it's easy for us to see the results of our experiments and share them with the rest of the team. 

Once we've done so in this notebook, we'll then proceed to evaluate other strategies such as dynamic retrieval of tools and few shot examples in a subsequent notebook.

## Identifying Failure Points

Let's start by loading in our commands from the `commands.yaml` file. This contains a list of extensions and commands that we've manually cleaned and extracted from the Raycast Documentation.

We've also added a small `description` field here that explains what our hypothetical users uses the extension for. This will help us to generate more challenging queries that require the model to understand the context of the user's request.

When we generate synthetic queries, we want to randomly sample the commands to generate a diverse set of queries. Therefore, we want to flatten the dictionary of commands into a single flattened list.

In [5]:
import yaml
from rich import print

with open("commands.yaml", "r") as f:
    commands = yaml.safe_load(f)


print(commands["extensions"]["obsidian"])

In [2]:
import yaml


def read_and_flatten_commands(file_path: str):
    # Load and parse the commands.yaml file
    with open(file_path, "r") as file:
        commands_data = yaml.safe_load(file)

    # Extract extensions
    extensions = commands_data["extensions"]
    commands = []

    for extension_name, extension_data in extensions.items():
        for command in extension_data["commands"]:
            commands.append(
                {
                    "command_name": command["name"],
                    "extension_name": extension_name,
                    "extension_description": extension_data["description"],
                    "command_description": command["description"],
                }
            )

    return commands


commands = read_and_flatten_commands("commands.yaml")
commands[0]

{'command_name': 'searchNote',
 'extension_name': 'obsidian',
 'extension_description': "I use obsidian to take down notes when I'm studying or taking online courses. I've been taking a variety of different courses on topics such as machine learning, marketing, copywriting etc. I'll use this specific extension when I want to search for notes on a specific topic or ask questions about a topic. Sometimes I might also want to create a quick dailyNote to log my progress for the day.",
 'command_description': 'Full-text search across all vault notes and their metadata'}

Now that we've loaded in the commands, let's start by seeing how our model gets confused by the commands. Let's imagine we have three commands that we want to evalute

- `obsidian.search`
- `apple-notes.search`
- `notion.search`

Without much context or description of what the extension's command does, we might expect our model to get confused. This is ok because it provides a simple baseline to begin with. Let's see this in action below where our model gets the tool call wrong for every single query.


In [168]:
import instructor
import openai
from typing import Literal
from rich import print

client = instructor.from_openai(openai.OpenAI())

commands = ["obsidian.search", "apple-notes.search", "notion.search"]
queries = [
    [
        "Seems like the missing ingredient for the hamburger recipe was msg, let's add that in to our Beef Cheeseburger recipe note",
        ["apple-notes.search"],
    ],
    ["What did I write about LSTMs previously?", ["obsidian.search"]],
    [
        "Can you note down that Jeanine is the new person in charge of the onboarding team in our team docs?",
        ["notion.search"],
    ],
]

for query, expected_tool in queries:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that can call tools to help you with your tasks. You have access to the following tools: "
                + str(commands),
            }
        ],
        response_model=Literal[
            "obsidian.search", "apple-notes.search", "notion.search"
        ],
    )
    print(f"{query} - Expected: {expected_tool} - Actual: {response}")

We can see that our model struggles to call the correct tool for each query. Implicitly, it seems to call obsidian for all of the queries. This would be quite unacceptible for a personal assistant.

How about multi-step tasks? Let's imagine we have the following commands added to our existing commands

- `gmail.send_email`
- `gmail.search`
- `hubspot.create_contact`
- `hubspot.search`
- `hubspot.update_contact`

Let's see how our model performs with a few commands

In [15]:
import instructor
import openai
from typing import Literal

client = instructor.from_openai(openai.OpenAI())

commands = [
    "obsidian.search",
    "apple-notes.search",
    "notion.search",
    "gmail.search_email",
    "gmail.send_email",
    "gmail.search_contact",
    "gmail.search",
    "hubspot.create_contact",
    "hubspot.search_contact",
    "hubspot.update_contact",
]
queries = [
    [
        "Hey can you send an email to Sarah about the new project we've been approved for and cc our client on it? The POC is hubert from Nvidia",
        ["gmail.send_email", "hubspot.search_contact"],
    ],
    [
        "Send Rupert the booking confirmation for our Shinjuku hotel booking. It's inside my personal email at josh@gmail.com",
        ["gmail.search_email", "gmail.search_contact", "gmail.send_email"],
    ],
]

for query, expected_tool in queries:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that can call tools to help you with your tasks. You have access to the following tools: "
                + str(commands),
            }
        ],
        response_model=list[
            Literal[
                "obsidian.search",
                "apple-notes.search",
                "notion.search",
                "gmail.send_email",
                "gmail.search_contact",
                "gmail.search",
                "hubspot.create_contact",
                "hubspot.search_contact",
                "hubspot.update_contact",
            ]
        ],
    )
    print(f"\n\n{query} - " f"\nExpected: {expected_tool}" f"\nActual: {response}\n")

The performance is worse for the multi-step tasks. 

For the first and second example, we can see that it completely forgets to call the `gmail.send_email` task. It seems to default to search_contact for both gmail and hubspot by default instead of intuiting that hubspot should be called for professional contacts and gmail should be called for personal contacts.

This is something that a few shot example could help with potentially or by users providing more context (either through a prompt or by updating the command/extension descriptions manually in their application).

Now that we've identified some of these failure points, let's start generating our dataset.

## Generating Initial Queries

Let's think about what we want to generate

1. **Tool Calls** : Our queries should require 1-2 tools to be called
2. **As user commands** : Most of these queries will be in the form of user queries (Eg. `send an email to Sarah about the Orion project and cc james`) that are more direct rather than conversational in nature (Eg. `Hey I hope you're doing well on your end. I was thinking that with the new updates to the orion project, we should probbaly cc sarah, what do you think?`)
3. **Diverse** : We want to generate queries that span a diverse set of tone and lengths (Eg. we want queries that span from `send email sarah` to `hey can you send an email to Sarah, she's on my personal email`)

Each test query will be a json object with the following fields

```json
{
    "query": str
    "labels": list[str] # This is the `extension_name.command_name` of the tools that are called in the query - this helps us avoid issues where different extensions might have command names that are the same
}
```

We'll write all of our queries to a `queries.jsonl` file and use that to store the queries we've generated. Since it's relatively simple to look at these queries in the default text editor, we'll just use that to eyeball and delete queries that don't make the cut.


In [19]:
import yaml

# Load and parse the commands.yaml file
with open("commands.yaml", "r") as file:
    commands_data = yaml.safe_load(file)

# Extract extensions
extensions = commands_data["extensions"]
commands = []

for extension_name, extension_data in extensions.items():
    for command in extension_data["commands"]:
        commands.append(
            {
                "command_name": command["name"],
                "extension_name": extension_name,
                "extension_description": extension_data["description"],
                "command_description": command["description"],
            }
        )

commands[0]

{'command_name': 'searchNote',
 'extension_name': 'obsidian',
 'extension_description': "I use obsidian to take down notes when I'm studying or taking online courses. I've been taking a variety of different courses on topics such as machine learning, marketing, copywriting etc. I'll use this specific extension when I want to search for notes on a specific topic or ask questions about a topic. Sometimes I might also want to create a quick dailyNote to log my progress for the day.",
 'command_description': 'Full-text search across all vault notes and their metadata'}

In [71]:
from pydantic import BaseModel, field_validator, ValidationInfo
from asyncio import Semaphore
import random


class SyntheticQuery(BaseModel):
    chain_of_thought: str
    query: str
    commands: list[str]

    @field_validator("commands")
    def validate_commands(cls, v: list[str], info: ValidationInfo):
        context = info.context
        commands = context["commands"]
        command_names = set(
            [
                f"{command['extension_name']}.{command['command_name']}"
                for command in commands
            ]
        )
        for command in v:
            if command not in command_names:
                raise ValueError(f"Command {command} not found in commands")

        return v


client = instructor.from_openai(openai.AsyncOpenAI())


async def generate_synthetic_query(
    client: instructor.AsyncInstructor, commands: list[dict], sem: Semaphore
):
    tone = ["natural", "curt and concise", "impatient", "slightly verbose"]
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": """
                Generate a natural user message that would require using one or more of the following commands:
                <commands>
                {% for command in commands %}
                - {{ command['extension_name'] }}.{{ command['command_name'] }} : {{ command['command_description'] }}
                {% endfor %}
                </commands>

                Requirements for the generated message:
                - Do not explicitly mention any extension names or command names
                - The message should clearly require using at least one of the commands' functionality
                - Keep the message natural and conversational. It should be in the form of a message, remember that users are lazy and will type very short messages if possible.
                - Focus on what the user is trying to achieve
                - Have a tone of {{ tone }} with a length that's around {{ length }} words
                - Make sure to have specific details in the query (Eg. the course, client name, project task achieved, specific feature being rolled out). These should be realistic (Eg. ACME Corp is a horrible name but Nike is a good one)
                - The message can require multiple commands if it makes sense for the user's goal

                <samples>
                1. Can you help send an email to Sarah about the new project we've been approved for and cc our client on it? The POC is hubert from Nvidia
                2. Create a new calendar event for the 18th of December at 10am
                3. What's the temperature now in Taipei?
                </samples>
                """,
                }
            ],
            context={
                "commands": commands,
                "tone": random.choice(tone),
                "length": random.randint(5, 20),
            },
            response_model=SyntheticQuery,
        )
        return {
            "query": resp.query,
            "labels": resp.commands,
        }


In [72]:
from tqdm.asyncio import tqdm_asyncio as asyncio
import json

sem = Semaphore(10)
with open("queries.jsonl", "a+") as f:
    chosen_commands = random.sample(commands, random.randint(1, 5))
    coros = [generate_synthetic_query(client, chosen_commands, sem) for _ in range(10)]
    results = await asyncio.gather(*coros)
    for result in results:
        f.write(json.dumps(result) + "\n")

100%|██████████| 10/10 [00:08<00:00,  1.24it/s]


This will generate some initial queries and add them to a `queries.jsonl` file. In my case, this generated the following queries.

```bash
{
    "query": "Create a note for today's meeting summary with Intel. Seems like they're keen on expanding in Asia and getting more into AI",
    "labels": ["notion.quickCapture"]
}
{
    "query": "Can you compile all of the open PRs that have been tagged under the UI refactor project and add them to a new page in Notion?",
    "labels": [
        "jira.searchTicket",
        "github.searchPullRequests", 
        "notion.createPage"
    ]
}
```

Now that we've generated these queries, let's use them to seed our prompt with some few shot examples by randomly sampling a few queries from the `queries.jsonl` file each time.

In [73]:
import random

client = instructor.from_openai(openai.AsyncOpenAI())


async def generate_synthetic_query(
    client: instructor.AsyncInstructor,
    commands: list[dict],
    queries: list[dict],
    sem: Semaphore,
):
    tone = ["natural", "curt and concise", "impatient", "slightly verbose"]
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": """
                Generate a natural user message that would require using one or more of the following commands:
                <commands>
                {% for command in commands %}
                - {{ command['extension_name'] }}.{{ command['command_name'] }} : {{ command['command_description'] }}
                {% endfor %}
                </commands>

                Requirements for the generated message:
                - Do not explicitly mention any extension names or command names
                - The message should clearly require using at least one of the commands' functionality
                - Keep the message natural and conversational. It should be in the form of a message, remember that users are lazy and will type very short messages if possible.
                - Focus on what the user is trying to achieve
                - Have a tone of {{ tone }} with a length that's around {{ length }} words
                - Make sure to have specific details in the query (Eg. the course, client name, project task achieved, specific feature being rolled out). These should be realistic (Eg. ACME Corp is a horrible name but Nike is a good one)
                - The message can require multiple commands if it makes sense for the user's goal

                <samples>
                {% for query in queries %}
                - {{ query['query'] }} : {{ query['labels'] }}
                {% endfor %}
                </samples>
                """,
                }
            ],
            context={
                "commands": commands,
                "tone": random.choice(tone),
                "length": random.randint(5, 20),
                "queries": queries,
            },
            response_model=SyntheticQuery,
        )
        return {
            "query": resp.query,
            "labels": resp.commands,
        }


In [149]:
sem = Semaphore(10)
queries = [json.loads(line) for line in open("queries.jsonl", "r")]

with open("queries.jsonl", "a+") as f:
    coros = []

    for _ in range(15):
        chosen_commands = random.sample(commands, random.randint(1, 10))
        chosen_queries = random.sample(queries, random.randint(1, 10))
        coros.append(
            generate_synthetic_query(client, chosen_commands, chosen_queries, sem)
        )

    results = await asyncio.gather(*coros)
    for result in results:
        f.write(json.dumps(result) + "\n")

100%|██████████| 15/15 [00:05<00:00,  2.83it/s]


## Benchmarking Precision and Recall

Now that we've generated our dataset, we can start benchmarking the precision and recall of our model. We'll start by loading in the queries that we've generated and then seeing what the precion and recall of our model is. We've provided a function to evaluate the precision and recall of our model in a `helper.py` file that we defined in the`1. Evaluate Tools.ipynb` notebook.

In [140]:
from helpers import calculate_precision, calculate_recall


preds = ["a", "b", "c"]
labels = ["a"]

calculate_precision(preds, labels), calculate_recall(preds, labels)

(0.33, 1.0)

We can see here that precision is 0.33 since only 1 item is relevant but 3 items were predicted. Recall is 1 since all of the relevant items ( in labels ) appeared in the predictions.

In [151]:
import json

def validate_queries(queries: list[dict],commands: list[dict]):
    valid_commands = set([f"{command['extension_name']}.{command['command_name']}" for command in commands])
    for query in queries:
        for label in query["labels"]:
            if label not in valid_commands:
                raise ValueError(f"Command {label} not found in commands")
    return True


queries = [json.loads(line) for line in open("queries.jsonl", "r")]
commands = read_and_flatten_commands("commands.yaml")
validate_queries(queries, commands)

True

In [152]:
from pydantic import BaseModel, field_validator, ValidationInfo
import instructor


class Commands(BaseModel):
    commands: list[str]

    @field_validator("commands")
    def validate_commands(cls, v: list[str], info: ValidationInfo):
        context = info.context
        commands = context["commands"]
        command_names = set(
            [
                f"{command['extension_name']}.{command['command_name']}"
                for command in commands
            ]
        )
        for command in v:
            if command not in command_names:
                raise ValueError(f"Command {command} not found in commands")

        return v


async def generate_commands(
    query: str, client: instructor.AsyncInstructor, commands: list[dict]
):
    resp =  await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can call tools to help you with your tasks. Only use the following commands provided below.

                <commands>
                {% for command in commands %}
                - {{ command['extension_name'] }}.{{ command['command_name'] }}
                {% endfor %}
                </commands>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands


In [159]:
from braintrust import Eval, Score
import openai

def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(kwargs["expected"], output),
        ),
        Score(
            name="recall",
            score=calculate_recall(kwargs["expected"], output),
        ),
    ]

client = instructor.from_openai(openai.AsyncOpenAI())   
commands = read_and_flatten_commands("commands.yaml")

async def task(query: str, hooks):
    return await generate_commands(query, client, commands)

base_case = await Eval(
    "function-calling",
    data=lambda: [
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    max_concurrency=10,
    scores=[evaluate_braintrust],
)


Experiment week-6-1734443274 is running at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1734443274
function-calling (data): 44it [00:00, 189475.75it/s]


function-calling (tasks):   0%|          | 0/44 [00:00<?, ?it/s]


week-6-1734443274 compared to week-6-1734443256:
70.82% (-06.82%) 'recall'    score	(2 improvements, 5 regressions)
67.80% (-06.43%) 'precision' score	(2 improvements, 6 regressions)

0.75s (-09.25%) 'duration'	(17 improvements, 27 regressions)

See results for week-6-1734443274 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1734443274


# Improving Performance

## Adding Detailed Descriptions

One of the common failure points that we identified was that given a simple function name, our model would struggle to call the correct tool. Let's try to see if we can improve the performance of our model by adding more detailed descriptions to our commands.

Luckily, we've already done so in the `commands.yaml` file. Let's see if this improves the performance of our model. Since all this requires is modifying the jinja templating in our original file, minimal changes are required.

In [158]:
async def generate_commands_with_descriptions(
    query: str, client: instructor.AsyncInstructor, commands: list[dict]
):
    resp =  await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute command(s) listed below to help answer a user query.

                Only use the following commands provided below. A description of each command and its name is provided to help you understand what executing the command will do.
                
                <commands>
                {% for command in commands %}
                - Command Name : {{ command['extension_name'] }}.{{ command['command_name'] }}
                  Description  : ({{ command['command_description'] }})
                {% endfor %}
                </commands>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands

client = instructor.from_openai(openai.AsyncOpenAI())   
commands = read_and_flatten_commands("commands.yaml")

async def task(query: str, hooks):
    return await generate_commands_with_descriptions(query, client, commands)

descriptions = await Eval(
    "function-calling",
    data=lambda: [
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    max_concurrency=10,
    scores=[evaluate_braintrust],
)


Experiment week-6-1734443256 is running at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1734443256
function-calling (data): 44it [00:00, 24136.72it/s]


function-calling (tasks):   0%|          | 0/44 [00:00<?, ?it/s]


week-6-1734443256 compared to week-6-1734443170:
77.64% (-10.61%) 'recall'    score	(3 improvements, 7 regressions)
74.23% (-11.00%) 'precision' score	(1 improvements, 7 regressions)

0.84s (+11.78%) 'duration'	(30 improvements, 14 regressions)

See results for week-6-1734443256 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1734443256


## Adding Few Shot Examples

If we were to deploy this in production, we could also imagine that we'd want to add some few shot examples. We want to do so because it allows the model to be able to use those few shot examples to understand the context of the user's request.

In a production application, we might obtain these examples by saving user queries and the tools they selected in the end in a local vector database. When the user makes a request, we'd then retrieve the most similar examples and include them in the prompt. 

One piece of low hanging fruit is providing context for which note taking app to choose - `obsidian`, `notion` or `apple-notes`. We'll add a few shot example to our prompt for each so our model can use this context to make a better decision.


In [160]:
async def generate_commands_with_few_shots(
    query: str, client: instructor.AsyncInstructor, commands: list[dict]
):
    resp =  await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute command(s) listed below to help answer a user query.

                Only use the following commands provided below. A description of each command and its name is provided to help you understand what executing the command will do.
                
                <commands>
                {% for command in commands %}
                - Command Name : {{ command['extension_name'] }}.{{ command['command_name'] }}
                  Description  : ({{ command['command_description'] }})
                {% endfor %}
                </commands>

                <examples>
                User: Create a new page for today's meeting with Asus
                Assistant: [notion.createPage]

                User: quick brain dump on the ASUS project to be added to the existing notes. I think they've got a huge amount of plans involving them expanding in europe. Specifically what I think is important in this specific case is that we need to work on building that relationship early on so that we become the company they always talk to. 
                Assistant: [notion.quickCapture]

                User: Get my notes on CS1423 from the last lecture
                Assistant: [obsidian.search]

                User: Can you send an email to Sarah about the new project we've been approved for and cc our client on it? The POC is hubert from Nvidia
                Assistant: [gmail.send_email, hubspot.search_contact]

                User: Can you fetch the onboarding documents for the interns to use?
                Assistant: [notion.search]

                User: Create a new document to summarize the actionables from the meeting with Asus
                Assistant: [notion.createPage]

                User: Can you pull up the meeting notes from the meeting with Thoughtworks?
                Assistant: [notion.search]

                User: What items did I need to buy for dinner tonight?
                Assistant: [apple-notes.search]

                User: Fetch my notes on CS4123 from the last lecture
                Assistant: [obsidian.search]
                </examples>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands

client = instructor.from_openai(openai.AsyncOpenAI())   
commands = read_and_flatten_commands("commands.yaml")

async def task(query: str, hooks):
    return await generate_commands_with_few_shots(query, client, commands)

few_shots = await Eval(
    "function-calling",
    data=lambda: [
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    max_concurrency=10,
    scores=[evaluate_braintrust],
)


Experiment week-6-1734443301 is running at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1734443301
function-calling (data): 44it [00:00, 36880.37it/s]


function-calling (tasks):   0%|          | 0/44 [00:00<?, ?it/s]


week-6-1734443301 compared to week-6-1734443274:
88.25% (+17.43%) 'recall'    score	(11 improvements, 4 regressions)
87.50% (+19.70%) 'precision' score	(12 improvements, 3 regressions)

1.24s (+48.82%) 'duration'	(20 improvements, 24 regressions)

See results for week-6-1734443301 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1734443301


In [166]:
import pandas as pd

results = [
    ["Base Case" ,base_case],
    ["Descriptions" ,descriptions],
    ["Few Shots" ,few_shots]
]

scores = []
for label, item in results:
    item_score = {
        metric: item.summary.scores[metric].score  for metric in item.summary.scores
    }
    scores.append({
        "Approach": label,
        **item_score
    })

pd.DataFrame(scores).round(2)

Unnamed: 0,Approach,recall,precision
0,Base Case,0.71,0.68
1,Descriptions,0.78,0.74
2,Few Shots,0.88,0.88


## Results

| Approach | Recall | Precision |
| -------- | --------- | ------ |
| Original | 0.71| 0.68 |
| With Descriptions | 0.78 | 0.74 |
| With Few Shots | 0.88 | 0.88 |

We can see that adding descriptions and few shots to our prompt has improved the precision and recall of our model. This is because it increases the context of the user's request and gives the model more examples to learn from. 

In the next notebook, we'll see how we can fetch relevant examples and tools from a vector database while ensuring performance remains high. 