In [1]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your Rag Application

> If you haven't ran the previous notebook `1. Evaluate Tools.ipynb`, please do so before continuing. A lot of the code in this notebook will be based off the evaluation methods that we cover in that notebook.

In this notebook, we'll look at how we can boost our RAG model's ability to select and execute the correct tooling based off user commands. 

We'll do so in 3 steps

1. **Generating Synthetic Queries**: We'll first start by generating some initial queries to benchmark our model's tool calling ability and identify failure points that our model might struggle with
2. **Benchmarking Model Performance**: Next, we'll use our initial queries to establish a baseline of our model's tool calling ability
3. **Improving Model Performance**: Finally, we'll look at how we can improve the performance of our model through the use of few shot examples and system prompts that a user can provide to the model to describe his/her usage patterns.

By starting off with a synthetic dataset, we can gain insight into specific areas where our model might struggle and use that to iteratively improve our model's tool calling ability. With enough user data, we'll also be able to blend in user queries into our dataset to make it more realistic.

In [3]:
import json
from pydantic import BaseModel, computed_field


class Command(BaseModel):
    extension_name: str
    command_name: str
    command_description: str

    @computed_field
    def key(self) -> str:
        return f"{self.extension_name}.{self.command_name}"


def load_commands(commands_path: str):
    with open(commands_path, "r") as f:
        return [
            Command(
                extension_name=command["extension_name"],
                command_name=command["source_name"],
                command_description=command["description"],
            )
            for command in json.load(f)
        ]


commands = load_commands("raw_commands.json")
print(len(commands))

54


Let's first create a dataset that tests single command calls

In [5]:
import instructor
import google.generativeai as genai
from pydantic import field_validator, ValidationInfo
import random
import re

client = instructor.from_gemini(
    genai.GenerativeModel("gemini-1.5-flash-latest"), use_async=True
)


class UserQuery(BaseModel):
    query: str

    @field_validator("query")
    def validate_query(cls, v, info: ValidationInfo):
        application = info.context["command"].extension_name
        if application.lower() in v.lower():
            raise ValueError(
                f"Do not mention the application name {application} in the query. Rewrite the query of {v} so that it does not mention the application name or add specific details that allude to it instead. This includes web links that might be in the query."
            )

        return re.sub(r"\\{1,}", "", v.replace('"', "'").strip())


async def generate_question(client, command: Command, commands: list[Command]):
    length = random.randint(
        10, 50
    )  # Increased max length to allow for more conversational queries
    tone = random.choice(
        [
            "formal_request (Eg. Could you please assist me in...)",  # "Could you please assist me in..."
            "imperative command (Eg. add a new event to the calendar for tomorrow at 10am with Jeffrey)",  # "hey can u help me with..."
            "voice_transcript (Eg. yeah um so I need to like...)",  # "yeah um so I need to like..."
        ]
    )
    modification = random.choice(
        [
            "perfect_english",
            "typos_and_misspellings",  # "recieved", "tommorow", etc
            "missing_punctuation",  # no periods or commas
            "abbreviated_words",  # "pls", "thx", "u", "tmrw"
            "voice_to_text_errors",  # misheard words, run-on sentences
            "random_capitalization",  # "Can You Help ME with"
            "extra_spaces_or_no_spaces",  # "help  me    with" or "helpme with"
        ]
    )

    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                Generate a user message that requires the following tool to be called. This should be in the imperative (Eg. "Create a new issue for the problems we're facing integrating next auth in the new deployment of the frontend") instead of a question (Can you help me to...)

                <command>
                Name: {{ command.key }}
                Description: {{ command.command_description }}
                </command>

                Here are all of the commands that the user has available to them. Make sure that you user query is formulated specifically to call the right command.

                <other_commands>
                {% for command in commands %}
                    <command>
                    Name: {{ command.key }}
                    Description: {{ command.command_description }}
                    </command>
                {% endfor %}
                </other_commands>

                Do not mention the extension name of {{ command.extension_name }} or the command name in the query where posible of adding specific details that allude to it instead.

                - The query should be {{ length }} words long with a {{tone}} tone. Note that the query should be written with {{ modification }}
                - Be specific with the query and give specific details. For instance, send a slack message is a bad example since don't have any of the information about the channel to call. A better example here is #general update that we're moving ahead with the second prototype for the new frontend auth feature.
                - Be creative - Eg. when generating a user query for get-unread-notifications, don't just say hey get me my open pull requests. Notifications are more broad than just pull requests - they include things like comments, mentions , security alerts and deployment status etc. So we might say "Can you help me check if I have any PRs to review and whether there are any security alerts for the repo?"

                Here are some bad examples. Refer to them when generating the query but do not copy the specific packages or examples that we're using here, make sure to invent something new.
                <bad examples>
                Eg. Query: Report a new issue: the new frontend auth prototype is moving to the second version
                    label: jira.create-issue
                
                Evaluation: This is not a great example because the query is so general. A better query here instead might be "create new issue for the problems we're facing integrating next auth in the new deployment of the frontend" because we give specific packages that are involved and the exact project it's from.
                </examples>

                Here's some context for how each individual extension is used. Make sure to refer closely and use the context to formulate the query.
                <context>
                - Slack is used for joining specific channels to get support for help. Eg. Send to Modal-support : having issues with deployment. Send to comfy-ui : having issues with the UI.
                - Microsoft Teams is used for work related communication - Eg. Schedule a quarterly review with Sarah, ping my team that I'll be 5 minutes late to standup
                - Discord is used for personal gaming and online communities -> Eg. ask the squad if anyone's up for some raids tonight? Tell Thomas i'll be on in 10 for Helldivers
                - Github is used for code related communication - Eg. Show me my PRs, create a pull request for the new feature and for code discussions about specific implementation
                - Jira is used to track the specific task/issue/bugs. Eg. Find all issues with the summary containing 'Bug Fix' and status 'Open' plz
                - Confluence is used for work related documenation - Eg. what's the onboarding process for new employees?
                - Apple Notes is used for personal note taking - Eg. Help create a note about the shopping list for the christmas dinner - need 3 eggs, some ham, and some cheese
                - Obsidian is used for personal knowledge base - Eg. Help me find the notes on the latest research paper on large language models
                - Notion is used for planning things like trips ( create pages as the default but let's use tables to track things like spending/other expenses that benefit from a table format)
                </context>
                """,
            }
        ],
        context={
            "command": command,
            "commands": commands,
            "length": length,
            "tone": tone,
            "modification": modification,
        },
        response_model=UserQuery,
    )
    return {"query": response.query, "labels": [command.key]}


await generate_question(client, commands[0], commands)

{'query': 'Create an issue: Problem integrating Next Auth into the new frontend deployment; authentication failures. Assign to John Smith.',
 'labels': ['jira.create-issue']}

In [6]:
import random
from tqdm.asyncio import tqdm_asyncio as asyncio

questions = await asyncio.gather(
    *[
        generate_question(client, command, commands)
        for command in random.sample(commands, 10)
    ]
)

I0000 00:00:1736257627.811979  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
100%|██████████| 10/10 [00:02<00:00,  3.76it/s]


In [7]:
with open("queries.jsonl", "w") as f:
    for question in questions:
        f.write(json.dumps(question) + "\n")

Now let's measure the performance of our model on these initial queries. We'll do so by loading in the queries and then using the `generate_commands` function to generate the commands that the model should call.

In [16]:
import json


def load_queries(commands: list[Command], query_path: str):
    valid_commands = set(command.key for command in commands)
    with open(query_path, "r") as f:
        queries = [json.loads(line) for line in f]
        for query in queries:
            for label in query["labels"]:
                if label not in valid_commands:
                    raise ValueError(f"Command {label} not found in commands")
    return queries


commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

In [60]:
class Commands(BaseModel):
    chain_of_thought: str
    commands: list[str]

    @field_validator("commands")
    def validate_commands(cls, v: list[str], info: ValidationInfo):
        context = info.context
        commands = context["commands"]
        command_names = set([command.key for command in commands])
        for command in v:
            if command not in command_names:
                raise ValueError(f"Command {command} not found in commands")

        assert len(v) >= 1, "At least one command must be called"

        return v


async def generate_commands(
    query: str, client: instructor.AsyncInstructor, commands: list[Command]
):
    resp = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant. You must call a tool in response to the user query below.
                
                Refer to the command description to identify the right command to call and make sure to return the command name in your final response.

                Here is some information on the system

                <commands>
                {% for command in commands %}
                    <command>
                    Name: {{ command.key }}
                    Description: {{ command.description}}
                    </command>
                {% endfor %}
                </commands>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands


In [61]:
from braintrust import Eval, Score
from helpers import calculate_precision, calculate_recall
import openai


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


client = instructor.from_gemini(
    genai.GenerativeModel("gemini-1.5-flash-latest"), use_async=True
)
commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

async def task(query: str, hooks):
    return await generate_commands(query, client, commands)


base_case = await Eval(
    "function-calling",
    data=lambda: [
        {
            "input": row["query"],
            "expected": row["labels"],
        }
        for row in queries
    ],
    task=task,
    max_concurrency=10,
    scores=[evaluate_braintrust],
)


I0000 00:00:1736260567.542956  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736260567.555241  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736260567.566884  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736260567.576443  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736260567.588280  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736260567.597531  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736260567.606499  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
Experiment week-6-1736260569 is running at https://www.braintrust.dev/app/567/p/function-calling/experim

function-calling (tasks):   0%|          | 0/14 [00:00<?, ?it/s]


week-6-1736260569 compared to week-6-1736260526:
65.50% (+07.14%) 'recall'    score	(2 improvements, 1 regressions)
65.64% (-03.43%) 'precision' score	(2 improvements, 1 regressions)

1736260569.22s start
1736260570.65s end
1.42s (+29.75%) 'duration'	(7 improvements, 7 regressions)

See results for week-6-1736260569 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1736260569


Now that we've got a good baseline, let's generate more queries and see if we can improve our model's performance. We'll do so by reading from `queries.jsonl`, finding examples for the specific command we're interested in and then choosing another 3-4 other examples that are from other commands.

This will help us to be able to introduce some degree of diversity in our dataset and generate more tricky queries that test the model's ability to call the right command.

In [65]:
from collections import defaultdict

commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

command_to_query = defaultdict(list)
for query in queries:
    for label in query["labels"]:
        command_to_query[label].append(query["query"])
all_queries = [query["query"] for query in queries]

With these new queries, we can then generate more complex queries that test the model's ability to call the right command.

In [69]:
import instructor
import google.generativeai as genai
import random
from rich import print

client = instructor.from_gemini(
    genai.GenerativeModel("gemini-2.0-flash-exp"), use_async=True
)

class MultiCommandQuery(BaseModel):
    query: str
    commands: list[str]

    @field_validator("commands")
    def validate_commands(cls, v: list[str], info: ValidationInfo):
        context = info.context
        commands = context["commands"]
        command_names = set([command.key for command in commands])
        for command in v:
            if command not in command_names:
                raise ValueError(f"Command {command} not found in commands")
        return v

async def generate_question_with_examples(
    client,
    command: Command,
    commands: list[Command],
    query_to_examples: dict,
    queries: list[str],
):
    length = random.randint(
        10, 20
    )  # Increased max length to allow for more conversational queries
    tone = random.choice(
        [
            "formal_request",  # "Could you please assist me in..."
            "imperative command",  # "hey can u help me with..."
            "voice_transcript",  # "yeah um so I need to like..."
        ]
    )

    command_examples = query_to_examples[command.key]
    selected_examples = random.sample(command_examples, min(3, len(command_examples)))

    other_examples = random.sample(
        [query for query in queries if query not in selected_examples], 3
    )
    selected_counter_examples = random.sample(other_examples, 3)

    response = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                Generate a hypothetical user query that requires the following command to be called in conjunction with at most 2 other commands that you are given below. This should make logical sense

                eg. user has a bug that they need to log in confluence + jira
                Eg. user wants to pull out his shopping list for the wine he likes and then see if it's on amazon
                Eg. user wants to search for the latest research paper on large language models and then create a new note in their obsidian vault

                Users for instance would not
                - track issues in github and jira at the same time. Instead they might start a discussion in github but then create a jira issue to track it
                
                This is the command you must include
                <command>
                Name: {{ command.key }}
                Description: {{ command.command_description }}
                </command>

                Here are the other commands that the user has available to them which you can combine with the command above to be called.

                <other_commands>
                {% for command in commands %}
                    <command>
                    Name: {{ command.key }}
                    Description: {{ command.command_description }}
                    </command>
                {% endfor %}
                </other_commands>

                Here are some examples of how the user might ask for this command. Refer closely to these examples for how to formulate the user command but do not copy any of the details in these examples. Come up with new examples that are similar but not exactly the same.
                <examples>
                {% for query in selected_examples %}
                    - {{ query }}
                {% endfor %}
                </examples>

                Here are some contrastive examples of how the user will not ask for this command. Look closely at these examples and make sure that your query is not similar to any of these.

                <counter_examples>
                {% for query in selected_counter_examples %}
                    - {{ query }}
                {% endfor %}
                </counter_examples>

                Do not mention the extension name of {{ command.extension_name }} or the command name in the query where posible of adding specific details that allude to it instead.

                - The query should be {{ length }} words long with a {{tone}} tone. The query should be written casually and naturally.
                - Be specific with the query and give specific details. For instance, send a slack message is a bad example since don't have any of the information about the channel to call. A better example here is #general update that we're moving ahead with the second prototype for the new frontend auth feature.
                - Be creative - Eg. when generating a user query for get-unread-notifications, don't just say hey get me my open pull requests. Notifications are more broad than just pull requests - they include things like comments, mentions , security alerts and deployment status etc. So we might say "Can you help me check if I have any PRs to review and whether there are any security alerts for the repo?"

                Here's some context for how each individual extension is used. Make sure to refer closely and use the context to formulate the query.
                <context>
                - Slack is used for joining specific channels to get support for help. Eg. Send to Modal-support : having issues with deployment. Send to comfy-ui : having issues with the UI.
                - Microsoft Teams is used for work related communication - Eg. Schedule a quarterly review with Sarah, ping my team that I'll be 5 minutes late to standup
                - Discord is used for personal gaming and online communities -> Eg. ask the squad if anyone's up for some raids tonight? Tell Thomas i'll be on in 10 for Helldivers
                - Github is used for code related communication - Eg. Show me my PRs, create a pull request for the new feature and for code discussions about specific implementation
                - Jira is used to track the specific task/issue/bugs. Eg. Find all issues with the summary containing 'Bug Fix' and status 'Open' plz
                - Confluence is used for work related documenation - Eg. what's the onboarding process for new employees?
                - Apple Notes is used for personal note taking - Eg. Help create a note about the shopping list for the christmas dinner - need 3 eggs, some ham, and some cheese
                - Obsidian is used for personal knowledge base - Eg. Help me find the notes on the latest research paper on large language models
                - Notion is used for planning things like trips ( create pages as the default but let's use tables to track things like spending/other expenses that benefit from a table format)
                </context>
                """,
            }
        ],
        context={
            "command": command,
            "commands": commands,
            "length": length,
            "tone": tone,
            "selected_examples": selected_examples,
            "selected_counter_examples": selected_counter_examples,
        },
        response_model=MultiCommandQuery,
    )
    return {"query": response.query, "labels": response.commands}


print(await generate_question_with_examples(client, commands[0], commands, command_to_query, all_queries))

In [72]:
import random
from tqdm.asyncio import tqdm_asyncio as asyncio

questions = await asyncio.gather(
    *[
        generate_question_with_examples(client, command, commands, command_to_query, all_queries)
        for command in random.sample(commands, 5)
    ]
)
with open("queries.jsonl", "a") as f:
    for question in questions:
        f.write(json.dumps(question) + "\n")

100%|██████████| 5/5 [00:01<00:00,  3.47it/s]


In [74]:
async def generate_commands_with_system_prompt(
    query: str, client: instructor.AsyncInstructor, commands: list[Command]
):
    resp = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant. You must call a tool in response to the user query below.
                
                Refer to the command description to identify the right command to call and make sure to return the command name in your final response.

                Here is some information on the system

                <commands>
                {% for command in commands %}
                    <command>
                    Name: {{ command.key }}
                    Description: {{ command.description}}
                    </command>
                {% endfor %}
                </commands>

                Here are some instructions on how to choose the right extension and command in response to the user query below.
                <instructions>
                Code Review
                - Jira is for tracking issues 
                - Github is for everything else, so direct discussions, pull requests etc there. But all github issues are going to be tracked in jira

                Note taking apps:
                - Apple Notes: Shopping lists and quick notes
                - Obsidian: Personal knowledge base (MOOCs, learning resources)
                - Confluence: All work documentation and wikis ( so in case there's an incident report, it goes here)
                - Notion: Trip planning (use pages by default, tables for tracking expenses)

                When available, use AI features to auto-generate content for new pages/documents
                </instructions>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands


In [None]:
async def generate_commands_with_system_prompt(
    query: str, client: instructor.AsyncInstructor, commands: list[Command]
):
    resp = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant. You must call a tool in response to the user query below.
                
                Refer to the command description to identify the right command to call and make sure to return the command name in your final response.

                Here is some information on the system

                <commands>
                {% for command in commands %}
                    <command>
                    Name: {{ command.key }}
                    Description: {{ command.description}}
                    </command>
                {% endfor %}
                </commands>

                Here are some instructions on how to choose the right extension and command in response to the user query below.
                <instructions>
                Code Review
                - Jira is for tracking issues 
                - Github is for everything else, so direct discussions, pull requests etc there. But all github issues are going to be tracked in jira

                Note taking apps:
                - Apple Notes: Shopping lists and quick notes
                - Obsidian: Personal knowledge base (MOOCs, learning resources)
                - Confluence: All work documentation and wikis ( so in case there's an incident report, it goes here)
                - Notion: Trip planning (use pages by default, tables for tracking expenses)

                When available, use AI features to auto-generate content for new pages/documents
                </instructions>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands

async def generate_commands_with_system_prompt_and_examples(
    query: str, client: instructor.AsyncInstructor, commands: list[Command]
):
    resp = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant. You must call a tool in response to the user query below.
                
                Refer to the command description to identify the right command to call and make sure to return the command name in your final response.

                Here is some information on the system

                <commands>
                {% for command in commands %}
                    <command>
                    Name: {{ command.key }}
                    Description: {{ command.description}}
                    </command>
                {% endfor %}
                </commands>

                Here are some instructions on how to choose the right extension and command in response to the user query below.
                <instructions>
                Code Review
                - Jira is for tracking issues 
                - Github is for everything else, so direct discussions, pull requests etc there. But all github issues are going to be tracked in jira

                Note taking apps:
                - Apple Notes: Shopping lists and quick notes
                - Obsidian: Personal knowledge base (MOOCs, learning resources)
                - Confluence: All work documentation and wikis ( so in case there's an incident report, it goes here)
                - Notion: Trip planning (use pages by default, tables for tracking expenses)

                When available, use AI features to auto-generate content for new pages/documents
                </instructions>

                Here are some examples of how the user might ask for different commands

                <examples>
                - Query: Find my PR tagged with the issue number A2231
                  Expected: github.search-issues
                
                - Query : grab pinned shopping list in apple notes and search amazon for the items
                  Expected: apple-notes.menu-bar, amazon-search.index
                
                - Query : Can you check if the workflow for the PR I just created failed?
                  Expected: github.my-pull-requests, github.workflow-runs
                </examples>
                """,
            },
            {"role": "user", "content": query},
        ],
        response_model=Commands,
        context={"commands": commands},
    )
    return resp.commands


In [76]:
from braintrust import Eval, Score
from helpers import calculate_precision, calculate_recall
import openai


def evaluate_braintrust(input, output, **kwargs):
    return [
        Score(
            name="precision",
            score=calculate_precision(output, kwargs["expected"]),
        ),
        Score(
            name="recall",
            score=calculate_recall(output, kwargs["expected"]),
        ),
    ]


client = instructor.from_gemini(
    genai.GenerativeModel("gemini-1.5-flash-latest"), use_async=True
)
commands = load_commands("raw_commands.json")
queries = load_queries(commands, "queries.jsonl")

results = []

for select_command, label in [
    (generate_commands, "base"),
    (generate_commands_with_system_prompt, "system prompt"),
    (generate_commands_with_system_prompt_and_examples, "system prompt and examples"),
]:

    async def task(query: str, hooks):
        return await select_command(query, client, commands)

    result = await Eval(
        "function-calling",
        data=lambda: [
            {
                "input": row["query"],
                "expected": row["labels"],
            }
            for row in queries
        ],
        task=task,
        max_concurrency=10,
        scores=[evaluate_braintrust],
    )
    scores = {k: result.summary.scores[k].score for k in result.summary.scores}
    results.append(
        {
            "label": label,
            **scores,
        }
    )


I0000 00:00:1736262009.331828  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262009.345252  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262009.358910  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262009.369393  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262009.382507  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262009.392530  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262009.401599  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
Experiment week-6-1736262011 is running at https://www.braintrust.dev/app/567/p/function-calling/experim

function-calling (tasks):   0%|          | 0/18 [00:00<?, ?it/s]


week-6-1736262011 compared to week-6-1736261331:
55.56% (-08.33%) 'recall'    score	(1 improvements, 3 regressions)
66.78% (-11.00%) 'precision' score	(2 improvements, 3 regressions)

1736262012.07s start
1736262013.98s end
1.91s (+87.47%) 'duration'	(2 improvements, 16 regressions)

See results for week-6-1736262011 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1736262011


I0000 00:00:1736262018.920904  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262018.968558  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262018.981860  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262018.992721  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262019.005649  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262019.015496  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1736262019.025426  903017 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
Experiment week-6-1736262020 is running at https://www.braintrust.dev/app/567/p/function-calling/experim

function-calling (tasks):   0%|          | 0/18 [00:00<?, ?it/s]


week-6-1736262020 compared to week-6-1736262011:
66.67% (+11.11%) 'recall'    score	(4 improvements, 1 regressions)
80.56% (+13.78%) 'precision' score	(4 improvements, 2 regressions)

1736262020.70s start
1736262022.02s end
1.31s (-59.57%) 'duration'	(14 improvements, 4 regressions)

See results for week-6-1736262020 at https://www.braintrust.dev/app/567/p/function-calling/experiments/week-6-1736262020


In [122]:
import pandas as pd

pd.DataFrame(results).round(2)


Unnamed: 0,label,recall,precision
0,base,0.69,0.62
1,system prompt,0.72,0.66
