In [6]:
%load_ext autoreload
%autoreload 2

# Week 6 - Systematically Improving Your RAG Application

> If you haven't already, please run the [1. Evaluate Tools](1. Evaluate Tools.ipynb) notebook to understand how to evaluate your application's tool calling capabilities. That will help you understand what we're doing here in this notebook.

## Generating Synthetic Data

We want to evaluate the recall of our application's tool calling capabilities. 

To do so, we'll generate a synthetic dataset of queries using a small subset of Raycast's extensions. Similar to our previous notebook, we'll use a `.yaml` file to store the extensions we support so that it's easy to add or remove extensions from our function calls.

We'll do so in three steps:

1. **Extracting Commands** : We'll start by extracting a list of commands from a short list of Raycast extensions and writing them to a YAML file.
2. **Generating Queries** : We'll then generate some synthetic queries for each of these commands.
3. **Benchmark** : Lastly, we'll benchmark our application's tool calling capabilities against this dataset and measure the precision and recall of our application for these queries to get an initial baseline.
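
To make the benchmark step concrete, here is a minimal sketch of how precision and recall could be computed for a single query, treating the expected and predicted tool calls as sets. The helper name `precision_recall` and the command identifiers in the example are illustrative assumptions, not part of this notebook's code.

```python
def precision_recall(expected: set[str], predicted: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query's predicted tool calls."""
    true_positives = len(expected & predicted)
    # Precision: what fraction of the predicted calls were actually expected?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the expected calls did the model predict?
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical example: the query needed two commands and the model chose two,
# but only one of its choices was correct.
expected = {"google-search.index", "notion.create-database-page"}
predicted = {"google-search.index", "wikipedia.search"}
precision, recall = precision_recall(expected, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Averaging these per-query values over the whole synthetic dataset is one way to get a single baseline number for each metric.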

# Creating Our Dataset

## Extracting Commands

Luckily for us, Raycast exposes the commands for each extension in a `package.json` file that we can download. These files have the following structure:

```json
{
    "name": "google-search",
    "title": "Google Search",
    "description": "Google search with autosuggestions",
    "commands":[
        {
            "name": "index",
            "title": "Google Search",
            "description": "Google search with autosuggestions",
            "mode": "view"
        },
        // ... other commands go here
    ],
    // ... other metadata goes here
}
```

We'll take these commands and parse them into a structured YAML file in the following format:

```yaml
commands:
    - extension_name: 'google-search'
      command_name: 'index'
      command_description: 'Google search with autosuggestions'
      plugin_description: 'Google search with autosuggestions'
```

which makes it easy for us to read it in and use it to make a function call.


In [2]:
VOICE_COMMAND_EXTENSIONS = [
    "google-search",
    "jira-search",
    "notion",
    "apple-notes",
    "obsidian",
    "calendar",
    "wikipedia",
    "youtube",
    "amazon-search",
    "reddit-search",
    "spotify-player",
    "1password",
]

In [5]:
import os
import requests
import json


def ensure_directory(directory: str):
    """Ensure the given directory exists."""
    if not os.path.exists(directory):
        os.makedirs(directory)


def fetch_package_json(extension_name: str) -> str:
    """Generate the URL to fetch the package.json of the extension."""
    return f"https://raw.githubusercontent.com/raycast/extensions/main/extensions/{extension_name}/package.json"


def get_commands(extension_name: str) -> list[dict]:
    """Download the package.json for an extension and return its commands as flat records."""
    url = fetch_package_json(extension_name)
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Failed to download package.json for {extension_name}")
        # Return an empty list so callers always receive the same type
        return []

    # Parse JSON response
    package_data = response.json()

    # Extract commands and format for YAML
    commands = []
    for command in package_data.get("commands", []):
        commands.append(
            {
                "extension_name": extension_name,
                "command_name": command["name"],
                "command_description": command.get(
                    "description", "No description provided"
                ),
                "plugin_description": package_data.get(
                    "description", "No description provided"
                ),
            }
        )

    return commands


We can then extract the commands from the `package.json` files and write them to a single YAML file that we can use to generate our synthetic dataset.

In [47]:
import yaml

# Get commands for each extension and flatten the list
commands = []
for extension in VOICE_COMMAND_EXTENSIONS:
    extension_commands = get_commands(extension)
    if extension_commands:
        commands.extend(extension_commands)

# Write commands to YAML file
ensure_directory("data")
with open("./commands.yaml", "w") as f:
    yaml.dump(
        {"commands": commands}, f, indent=4, default_flow_style=False, sort_keys=False
    )


## Generating Queries

We'll first measure the recall of our application's tool calling on queries that each target a single command, using `gpt-4o`, to see how it performs.

Once we've done so, we'll move on to combining different commands in a single query to see how our recall and precision change with more complex queries.

In [13]:
with open("./commands.yaml", "r") as f:
    commands = yaml.safe_load(f)["commands"]

commands[0]

{'extension_name': 'google-search',
 'command_name': 'index',
 'command_description': 'Google search with autosuggestions',
 'plugin_description': 'Google search with autosuggestions'}

In [43]:
from pydantic import BaseModel
import openai
import instructor
from rich import print
from asyncio import Semaphore


class Query(BaseModel):
    chain_of_thought: str
    user_query: str


async def generate_query(
    client: instructor.AsyncInstructor, sem: Semaphore, command: dict
) -> Query:
    async with sem:
        return await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You're going to be given some information about a command that an extension provides. Read the information carefully and then generate a message that a user would ask to execute the specific extension. Good examples (e.g. Can you help me to find a restaurant that can seat 10 pax in Taipei?). Try to avoid using the command or extension name in the message.",
                },
                {
                    "role": "user",
                    "content": f"Here is the information about the command provided by the extension: {command}",
                },
            ],
            response_model=Query,
        )


sem = Semaphore(1)
print(
    await generate_query(instructor.from_openai(openai.AsyncOpenAI()), sem, commands[0])
)

Let's now generate a synthetic query for each of the commands we extracted.

In [44]:
from tqdm.asyncio import tqdm_asyncio as asyncio


sem = Semaphore(10)
coros = [
    generate_query(instructor.from_openai(openai.AsyncOpenAI()), sem, command)
    for command in commands
]

queries = await asyncio.gather(*coros)

100%|██████████| 7/7 [00:01<00:00,  3.93it/s]
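
The semaphore is what keeps the number of in-flight requests bounded while `gather` schedules every coroutine at once. The sketch below illustrates that pattern with a stand-in coroutine instead of a real API call; `fake_generate`, `run_all`, and the `state` dictionary are hypothetical names introduced only for this illustration.

```python
import asyncio


async def fake_generate(sem: asyncio.Semaphore, command: str, state: dict) -> str:
    """Stand-in for an API call that records how many calls run at once."""
    async with sem:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # simulate network latency
        state["running"] -= 1
        return f"query for {command}"


async def run_all(commands: list[str], limit: int) -> tuple[list[str], int]:
    """Run every task concurrently, with at most `limit` inside the semaphore."""
    sem = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    results = await asyncio.gather(*(fake_generate(sem, c, state) for c in commands))
    return results, state["peak"]


# In a script we need asyncio.run; inside a notebook cell, use
# `results, peak = await run_all(...)` instead, since a loop is already running.
results, peak = asyncio.run(run_all([f"cmd-{i}" for i in range(7)], limit=3))
print(f"{len(results)} results, peak concurrency {peak}")
```

`gather` returns results in the same order as the input coroutines, so each query can be paired with the command that produced it by position.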


In [59]:
from pydantic import BaseModel, ValidationInfo, model_validator


class Command(BaseModel):
    command: str
    extension: str
    arguments: list[str]

    @model_validator(mode="after")
    def validate_arguments(self, info: ValidationInfo) -> "Command":
        context = info.context
        extension_commands = context["extension_commands"]

        if self.extension not in extension_commands:
            raise ValueError(
                f"Extension {self.extension} not found in any of the provided extensions"
            )

        if self.command not in extension_commands[self.extension]:
            raise ValueError(f"Command {self.command} not found for {self.extension}")

        return self


class ExecutionPlan(BaseModel):
    chain_of_thought: str
    commands: list[Command]

In [54]:
extension_commands = {}

for command in commands:
    if command["extension_name"] not in extension_commands:
        extension_commands[command["extension_name"]] = {}
    if command["command_name"] in extension_commands[command["extension_name"]]:
        raise ValueError(
            f"Command {command['command_name']} already exists for {command['extension_name']}"
        )
    extension_commands[command["extension_name"]][command["command_name"]] = command
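
The nested dictionary built above maps each extension name to its commands, which makes it cheap to check whether an (extension, command) pair exists. A minimal sketch of that lookup on a few made-up records follows; the sample records, the `lookup` variable, and the `is_known` helper are assumptions for illustration, and real records also carry the description fields.

```python
# Hypothetical records in the same shape as the entries in commands.yaml
# (description fields omitted for brevity).
sample_commands = [
    {"extension_name": "google-search", "command_name": "index"},
    {"extension_name": "wikipedia", "command_name": "search"},
    {"extension_name": "wikipedia", "command_name": "random"},
]

# Build extension -> command -> record, mirroring the structure used above.
lookup: dict[str, dict[str, dict]] = {}
for record in sample_commands:
    lookup.setdefault(record["extension_name"], {})[record["command_name"]] = record


def is_known(extension: str, command: str) -> bool:
    """Check whether an (extension, command) pair exists in the lookup."""
    return command in lookup.get(extension, {})


print(is_known("wikipedia", "search"))
print(is_known("wikipedia", "index"))
```

This is the same membership test the `Command` validator relies on, so the validator needs `extension` to be a single hashable string rather than a list.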

In [55]:
async def generate_execution_plan(
    client: instructor.AsyncInstructor, sem: Semaphore, query: str
) -> ExecutionPlan:
    async with sem:
        return await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """
                    You're an intelligent assistant that has access to a given list of commands that can be executed from an extension.

                    Your job is to decide on the commands that will be executed in response to a user's query.

                    Here are the extensions and commands that you have access to:
                    {% for extension in extension_commands %}
                    {% for extension_command in extension_commands[extension] %}
                    - {{ extension }}.{{ extension_command }} : {{ extension_commands[extension][extension_command] }}
                    {% endfor %}
                    {% endfor %}

                    The user's query is: {{ query }}
                    """,
                }
            ],
            context={"extension_commands": extension_commands, "query": query},
            response_model=ExecutionPlan,
        )

In [58]:
from rich import print


client = instructor.from_openai(openai.AsyncOpenAI())
sem = Semaphore(10)
print(await generate_execution_plan(client, sem, "What's the capital of Taiwan?"))

InstructorRetryException: 'commands'