In [1]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your Rag Application

> If you haven't already, please run the [1. Evaluate Tools](1. Evaluate Tools.ipynb) notebook to understand how to evaluate your application's tool calling capabilities. That will help you understand what we're doing here in this notebook.

## Generating Synthetic Data

We want to evaluate the recall of our application's tool calling capabilities. 

To do so, we'll generate a synthetic dataset of queries using a small subset of Raycast's extensions. Similar to our previous notebook, we'll use a `.yaml` file to store the extensions we support so that it's easy to add or remove extensions from our function calls.

We'll do so in 3 steps

1. **Extracting Commands** : We'll first start by extracting out a list of commands from a short list of raycast extensions before writing them to a yaml file.
2. **Generating Queries** : We'll then generate some synthetic queries for each of these commands
3. **Benchmark** : Lastly, we'll benchmark our application's tool calling capabilities against this dataset and measure the precision and recall of our application for these queries to get an initial baseline.

# Creating Our Dataset

## Extracting Commands

Luckily for us, Raycast exposes the commands for each extension in a package.json file that we can download. These have the following structure

```json
{
    "name": "google-search",
    "title": "Google Search",
    "description": "Google search with autosuggestions",
    "commands":[
        {
            "name": "index",
            "title": "Google Search",
            "description": "Google search with autosuggestions",
            "mode": "view"
        },
        // ... other commands go here
    ],
    // ... other metadata goes here
}
```

We'll take these commands and parse them into a structured yaml file that will be in the following format

```
commands:
    - extension_name: 'google-search'
      command_name: 'index'
      command_description: 'Google search with autosuggestions'
      plugin_description: 'Google search with autosuggestions'
```

which makes it easy for us to read it in and use it to make a function call.


In [2]:
VOICE_COMMAND_EXTENSIONS = [
    "google-search",
    "jira-search",
    "notion",
    "apple-notes",
    "obsidian",
    "calendar",
    "wikipedia",
    "youtube",
    "amazon-search",
    "reddit-search",
    "spotify-player",
    "1password",
]

In [3]:
import os
import requests


def ensure_directory(directory: str):
    """Ensure the given directory exists."""
    if not os.path.exists(directory):
        os.makedirs(directory)


def fetch_package_json(extension_name: str) -> str:
    """Generate the URL to fetch the package.json of the extension."""
    return f"https://raw.githubusercontent.com/raycast/extensions/main/extensions/{extension_name}/package.json"


def get_commands(extension_name: str) -> bool:
    """Download the package.json for a specific extension and convert to YAML."""
    url = fetch_package_json(extension_name)
    response = requests.get(url)
    if response.status_code == 200:
        # Parse JSON response
        package_data = response.json()

        # Extract commands and format for YAML
        commands = []
        for command in package_data.get("commands", []):
            commands.append(
                {
                    "extension_name": extension_name,
                    "command_name": command["name"],
                    "command_description": command.get(
                        "description", "No description provided"
                    ),
                    "plugin_description": package_data.get(
                        "description", "No description provided"
                    ),
                }
            )

        return commands
    else:
        print(f"Failed to download package.json for {extension_name}")
        return False


We can then extract the commands from the `package.json` files and download them into a single `json` file that we can use to generate our synthetic dataset.

In [4]:
import yaml

# Get commands for each extension and flatten the list
commands = []
for extension in VOICE_COMMAND_EXTENSIONS:
    extension_commands = get_commands(extension)
    if extension_commands:
        commands.extend(extension_commands)

# Write commands to YAML file
ensure_directory("data")
with open("./commands.yaml", "w") as f:
    yaml.dump(
        {"commands": commands}, f, indent=4, default_flow_style=False, sort_keys=False
    )


## Generating Queries

We'll first measure the recall of our application's tool calling capabilities for a single command using `gpt-4o` and see how our application performs then.

Once we've done so, we'll move on to combining different query commands together to see how our recall and precision change with more complex queries.

In [32]:
import yaml

with open("./commands.yaml", "r") as f:
    commands = yaml.load(f, Loader=yaml.FullLoader)["commands"]

commands[0]

{'extension_name': 'google-search',
 'command_name': 'index',
 'command_description': 'Google search with autosuggestions',
 'plugin_description': 'Google search with autosuggestions'}

In [33]:
extension_commands = {}

for command in commands:
    if command["extension_name"] not in extension_commands:
        extension_commands[command["extension_name"]] = {}
    if command["command_name"] in extension_commands[command["extension_name"]]:
        raise ValueError(
            f"Command {command['command_name']} already exists for {command['extension_name']}"
        )
    extension_commands[command["extension_name"]][command["command_name"]] = command

In [36]:
from pydantic import BaseModel
import openai
import anthropic
import instructor
from rich import print
from asyncio import Semaphore


class Query(BaseModel):
    chain_of_thought: str
    user_query: str


async def generate_query(
    client: instructor.AsyncInstructor, sem: Semaphore, chosen_command: str, extension_commands: dict
) -> Query:
    async with sem:
        return await client.chat.completions.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=4096,
            messages=[
                {
                    "role": "user", 
                    "content":"""
                    You're going to be given a list of commands and extensions that we support in our application. Your job is to generate a user query that would require using a specific command that we support. 

                    Rules
                    - This must be a natural sounding user query that would require using the specific command that we support. This should be a question that a user would ask.
                    - Do not use the command name or extension name directly in the user query where possible. For instance, if there's a command that can look for directions using google maps, don't say "Hey can you help pull up Google Maps and help find directions to the nearest restaurant". Instead, say "how do I get to Formosa Chang?".
                    - If the command has specific properties or requirements, incorporate them naturally into the request without explicitly naming them.
                    - Avoid mentioning the specific application name in the user query. For instance, don't say "Hey can you help me find a restaurant using Yelp". Instead, say "Hey can you help me find a restaurant?".
                    - If possible, let's try to make it a query that has a bit of ambiguity

                    Here are the extensions and commands that we support:
                    <extensions>
                    {% for extension in extension_commands %}
                    {% for extension_command in extension_commands[extension] %}
                    - {{ extension }}.{{ extension_command }} : {{ extension_commands[extension][extension_command] }}
                    {% endfor %}
                    {% endfor %}
                    </extensions>

                    The specific command that we want to execute is {{ chosen_command }}. Make sure that the user query is a natural sounding question that would require using the command. 
                    """,
                },
                
            ],
            context={"extension_commands": extension_commands, "chosen_command": chosen_command},
            response_model=Query,
        )


sem = Semaphore(1)
print(
    await generate_query(instructor.from_anthropic(anthropic.AsyncAnthropic()), sem, commands[1], extension_commands)
)

Let's now generate a list of these commands for each of our extensions.

In [37]:
from tqdm.asyncio import tqdm_asyncio as asyncio


sem = Semaphore(10)
coros = [
    generate_query(instructor.from_anthropic(anthropic.AsyncAnthropic()), sem, command, extension_commands)
    for command in commands
]

queries = await asyncio.gather(*coros)

100%|██████████| 71/71 [00:54<00:00,  1.31it/s]


In [38]:
from pydantic import BaseModel, ValidationInfo, model_validator


class Command(BaseModel):
    command: str
    extension: str
    arguments: list[str]

    @model_validator(mode="after")
    def validate_arguments(self, info: ValidationInfo) -> "Command":
        context = info.context
        extension_commands = context["extension_commands"]

        if self.extension not in extension_commands:
            raise ValueError(
                f"Extension {self.extension} not found in any of the provided extensions"
            )

        if self.command not in extension_commands[self.extension]:
            raise ValueError(f"Command {self.command} not found for {self.extension}")

        return self


class ExecutionPlan(BaseModel):
    chain_of_thought: str
    commands: list[Command]

In [39]:
async def generate_execution_plan(
    client: instructor.AsyncInstructor, sem: Semaphore, query: str
) -> ExecutionPlan:
    async with sem:
        return await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """
                    You're an intelligent assistant that has access to a given list of commands that can be executed from an extension.

                    Your job is to decide on the commands that will be executed in response to a user's query.

                    Here are the extensions and commands that you have access to:
                    {% for extension in extension_commands %}
                    {% for extension_command in extension_commands[extension] %}
                    - {{ extension }}.{{ extension_command }} : {{ extension_commands[extension][extension_command] }}
                    {% endfor %}
                    {% endfor %}

                    The user's query is: {{ query }}
                    """,
                }
            ],
            context={"extension_commands": extension_commands, "query": query},
            response_model=ExecutionPlan,
        )

In [40]:
from rich import print


client = instructor.from_openai(openai.AsyncOpenAI())
sem = Semaphore(10)
print(await generate_execution_plan(client, sem, "What's the capital of Taiwan?"))

Let's now run this for all of our queries

In [29]:
import json

# Save our existing queries to a json file
with open("./single_queries.json", "w") as f:
    for query, command in zip(queries, commands):
        f.write(json.dumps({"query": query.user_query, "command_name": command["command_name"], "extension_name": command["extension_name"]}) + "\n")


In [44]:
with open("./single_queries.json", "r") as f:
    queries = [json.loads(line) for line in f]


In [54]:
query_and_expected = [
    [query["query"], [f'{query["extension_name"]}.{query["command_name"]}']]
    for query in queries
]

results = await asyncio.gather(
    *[generate_execution_plan(client, sem, query) for query in queries]
)

100%|██████████| 71/71 [00:12<00:00,  5.47it/s]


In [55]:
from helpers import calculate_precision, calculate_recall
import pandas as pd

df = pd.DataFrame(
    [
        {
            "query": test_item[0],
            "expected_tools": test_item[1],
            "actual_tools": [f"{item.extension}.{item.command}" for item in result.commands],
        }
        for test_item, result in zip(query_and_expected, results)
    ]
)

df["precision"] = df.apply(
    lambda x: calculate_precision(x["actual_tools"], x["expected_tools"]), axis=1
)
df["recall"] = df.apply(
    lambda x: calculate_recall(x["actual_tools"], x["expected_tools"]), axis=1
)
df["CORRECT"] = df.apply(
    lambda x: "Y" if x["expected_tools"] == x["actual_tools"] else "N", axis=1
)

df.head(10)


Unnamed: 0,query,expected_tools,actual_tools,precision,recall,CORRECT
0,What was that recent breakthrough in fusion energy that scientists achieved? I heard it was something about net energy gain but I'm not sure about the details.,[google-search.index],[google-search.index],1.0,1.0,Y
1,"I'm reading an article about space exploration, and it mentions something called ""Lagrange points"". What exactly are those?",[google-search.searchSelectText],[google-search.searchSelectText],1.0,1.0,Y
2,Can you help me find all the bug reports related to the login feature in our mobile app project?,[jira-search.issue],[jira-search.issue],1.0,1.0,Y
3,"Can you help me find the ""Product Roadmap"" board? I heard our team is using it for planning, but I can't seem to locate it.",[jira-search.board],[jira-search.board],1.0,1.0,Y
4,Can you help me find the saved search for all reported bugs in our current sprint?,[jira-search.filter],[jira-search.filter],1.0,1.0,Y
5,Can you help me find the project I'm supposed to be working on? I think it had something to do with the new mobile app we're developing.,[jira-search.project],[jira-search.project],1.0,1.0,Y
6,"I'm looking at some Jira tickets, but the images seem outdated. Is there a way to make sure I'm seeing the latest versions?",[jira-search.image],[jira-search.image],1.0,1.0,Y
7,Can you add a new task to my project tracker for the upcoming team meeting presentation?,[notion.create-database-page],[notion.create-database-page],1.0,1.0,Y
8,Can you help me find my project notes about the new marketing campaign?,[notion.search-page],[notion.search-page],1.0,1.0,Y
9,Can you quickly jot down this idea for my project? I just thought of a great feature we should add: a dark mode option for better nighttime viewing.,[notion.quick-capture],[notion.quick-capture],1.0,1.0,Y


In [56]:
df["recall"].mean(), df["precision"].mean()


(np.float64(0.971830985915493), np.float64(0.8990140845070422))

For single command queries, we can see that our model performs reasonably well with a recall of 0.96 and a precision of 0.93. Let's see if there are specific commands that are harder to call.

In [57]:
# Create a table showing precision and recall per tool
all_tools = set()
for tools in df["expected_tools"] + df["actual_tools"]:
    all_tools.update(tools)

stats = []
for tool in all_tools:
    # Get rows where tool was actually called
    tool_called_mask = df["actual_tools"].apply(lambda x: tool in x)
    # Get rows where tool was expected to be called
    tool_expected_mask = df["expected_tools"].apply(lambda x: tool in x)
    
    tool_stats = {
        "Tool": tool,
        "Precision": df[tool_called_mask]["precision"].mean() if tool_called_mask.any() else 0.0,  # Set to 0 if never called
        "Recall": df[tool_expected_mask]["recall"].mean(),
    }
    stats.append(tool_stats)

tool_df = pd.DataFrame(stats).set_index("Tool").round(2).sort_values("Precision")
tool_df.head(10)


Unnamed: 0_level_0,Precision,Recall
Tool,Unnamed: 1_level_1,Unnamed: 2_level_1
spotify-player.nowPlayingMenuBar,0.0,0.0
obsidian.searchMedia,0.0,0.0
spotify-player.dislike,0.33,1.0
spotify-player.removeSongFromPlaylist,0.42,1.0
apple-notes.new,0.5,1.0
spotify-player.nowPlaying,0.5,1.0
obsidian.openVaultCommand,0.5,1.0
spotify-player.volume,0.5,1.0
spotify-player.startDJ,0.5,1.0
obsidian.dailyNoteCommand,0.5,1.0


Let's look at the specific queries that had a low precision

In [60]:
pd.set_option('display.max_colwidth', None)
df[df['CORRECT'] == 'N']

Unnamed: 0,query,expected_tools,actual_tools,precision,recall,CORRECT
12,Can you help me jot down some quick thoughts about my upcoming vacation plans?,[apple-notes.new],"[apple-notes.new, apple-notes.add-text]",0.5,1.0,N
19,Can you help me access my research notes from last semester?,[obsidian.openVaultCommand],"[obsidian.openVaultCommand, obsidian.searchNoteCommand]",0.5,1.0,N
21,Can you help me jot down my thoughts for today?,[obsidian.dailyNoteCommand],"[obsidian.dailyNoteCommand, obsidian.dailyNoteAppendCommand]",0.5,1.0,N
24,Can you help me find that podcast episode I saved last week about artificial intelligence?,[obsidian.searchMedia],[spotify-player.search],0.0,0.0,N
40,What song is playing right now?,[spotify-player.nowPlayingMenuBar],[spotify-player.nowPlaying],0.0,0.0,N
43,I'm not really feeling this song right now. Can you take it off my workout mix?,[spotify-player.removeSongFromPlaylist],"[spotify-player.removeSongFromPlaylist, spotify-player.togglePlayPause]",0.5,1.0,N
47,I really don't enjoy this song that's playing right now. Can you mark it so it doesn't come up in my recommendations again?,[spotify-player.dislike],"[spotify-player.dislike, spotify-player.removeSongFromPlaylist, spotify-player.next]",0.33,1.0,N
48,"Hey, I want to listen to some music on my living room speakers. Can you help me switch to those?",[spotify-player.devices],"[spotify-player.devices, spotify-player.devices, spotify-player.devices, spotify-player.devices, spotify-player.devices]",1.0,1.0,N
49,Can you turn down the music a bit? It's a little too loud for my liking.,[spotify-player.volume],"[spotify-player.volumeDown, spotify-player.volume]",0.5,1.0,N
51,Can you turn down the music? It's a bit too loud for my liking.,[spotify-player.volume25],"[spotify-player.volume25, spotify-player.togglePlayPause]",0.5,1.0,N


Let's do a bit of analysis on the queries that had a low precision

Firstly, if we look at row number 49 with the query `Can you turn down the music a bit? It's a little too loud for my liking.`, it's calling the `volumeDown` command with the `volume` button. This doesn't make sense since we can already set the volume to a specific level with volume and so the `volumeDown` command is redundant.

```yaml
-   extension_name: spotify-player
    command_name: volume
    command_description: Set the volume to an arbitrary percent.
    plugin_description: Spotify's most common features, now at your fingertips. Search
        for music and podcasts, browse your library, and control the playback. Glance
        at what's currently playing directly from the menu bar.

-   extension_name: spotify-player
    command_name: volumeDown
    command_description: Turn the volume down by 10%.
    plugin_description: Spotify's most common features, now at your fingertips. Search
        for music and podcasts, browse your library, and control the playback. Glance
        at what's currently playing directly from the menu bar.
```

This is a good example of a response by the model that doesn't make sense.

Secondly, if we look at query number 24 - `Can you help me find that podcast episode I saved last week about artificial intelligence`, it's calling the `yourLibrary` command instead of `searchMedia`. 

If we look at our `commands.yml` file, we can see that `yourLibrary` is a command that shows you your saved artists, albums, songs, playlists, and podcasts while `searchMedia` is a command that allows you to search for podcasts.

```yaml
-   extension_name: spotify-player
    command_name: yourLibrary
    command_description: See your saved artists, albums, songs, playlists, and podcasts.
        Similar to the "Search" command, it includes a category dropdown and contextual
        actions.
    plugin_description: Spotify's most common features, now at your fingertips. Search
        for music and podcasts, browse your library, and control the playback. Glance
        at what's currently playing directly from the menu bar.

-   extension_name: obsidian
    command_name: searchMedia
    command_description: Search for media in your vault
    plugin_description: Control Obsidian with Raycast
```

In this specific case, perhaps we should switch out the query itself and ask about other kind of media files here. If we look at the Obsidian documentation, we can see that it supports searching for images, videos, and other media files. Therefore it might be better to ask about other kind of media files.


> Supported File Types
> 
> Markdown: .md
> JSON Canvas: .canvas (Learn more)
> Images: .avif, .bmp, .gif, .jpeg, .jpg, .png, .svg, .webp
> Audio: .flac, .m4a, .mp3, .ogg, .wav, .webm, .3gp
> Video: .mkv, .mov, .mp4, .ogv, .webm
> PDF: .pdf

Therefore a better query might have been `Can you help me make a list of the PDFs I've saved on Reinforcement learning in my ml vault ?`. This is a better query since (1) We don't explicitly mention Obsidian and (2) it tests the `searchMedia` command with a lot more specificity.

Now that we've done a bit of analysis on the queries that had a low precision, let's increase the complexity of our queries by combining multiple commands together and see how our recall and precision changes.