# Week 6 - Systematically Improving Your RAG Application

> If you haven't run the previous notebook `1. Evaluate Tools.ipynb`, please do so before continuing. Much of the code in this notebook builds on the evaluation methods covered there.

In this notebook, we'll be generating a synthetic dataset of user requests to evaluate our model's ability to call the right tool given a user request. This will help us understand how our model performs in this scenario and identify failure points that our model might struggle with.

We'll do so in 3 steps:

1. **Failure Modes** : We'll first think about the specific failure modes that our model might struggle with and show examples where it might get confused

2. **Generating Synthetic Queries** : We'll then generate synthetic queries for each command that specifically target these failure modes

3. **Benchmarking** : We'll then use a naive approach to see how our model performs on this initial set of queries

Once we've done so, we'll have an initial baseline of how our model performs, and we can start exploring ways to improve on it in the next notebook.

We've downloaded a list of commands ahead of time into a `raw_commands.json` file. It contains a set of commands from the `raycast` application as well as some additional commands that we've added to the application.

We'll load these commands into a list of `Command` objects as seen below. We'll store the following fields:

- `command_description` : A short description of the command from the extension's documentation
- `extension_name` : The name of the extension that the command belongs to
- `command_name` : The name of the command as it appears in the `raycast` extension

To ensure that each command has a unique identifier, we'll concatenate the `extension_name` and `command_name` fields to form a unique key. This avoids a situation where multiple commands share the same name across different extensions.

E.g. two commands that are both simply named `search` would be very confusing and hard to test, whereas `obsidian.search` and `apple-notes.search` are unambiguous. Let's see it in action below.


In [17]:
from pydantic import BaseModel, computed_field
import json


class Command(BaseModel):
    extension_name: str
    command_name: str
    command_description: str

    @computed_field
    def key(self) -> str:
        return f"{self.extension_name}.{self.command_name}"


def load_commands(file_path: str) -> list[Command]:
    with open(file_path, "r") as file:
        return [
            Command(
                extension_name=command["extension_name"],
                command_name=command["source_name"],
                command_description=command["description"],
            )
            for command in json.load(file)
        ]


commands = load_commands("raw_commands.json")
len(commands)

72
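As a quick sanity check on the key scheme described above, the sketch below counts how often each identifier occurs. `find_duplicate_keys` is a hypothetical helper written for this illustration, and the command names in the example are made up rather than taken from `raw_commands.json`.

```python
from collections import Counter


def find_duplicate_keys(keys: list[str]) -> list[str]:
    """Return every key that appears more than once, sorted alphabetically."""
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)


# Bare command names collide across extensions (illustrative names only)
print(find_duplicate_keys(["search", "search", "add-text"]))  # ['search']

# Extension-qualified keys do not
print(
    find_duplicate_keys(
        ["obsidian.search", "apple-notes.search", "apple-notes.add-text"]
    )
)  # []
```

With the loaded data, the same check would be `find_duplicate_keys([command.key for command in commands])`, which should return an empty list if the keys really are unique.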

## Identifying Failure Modes

When it comes to deciding what tools to call for a personal assistant, there are two main failure points that our model might struggle with.    

1. **A Lack Of Context** : It's common for users to use multiple note taking apps. We have Notion for company documents, Obsidian for personal notes and Apple Notes for quick notes that are more ephemeral. But it's difficult for a language model to know exactly which app to use when the user makes a request like `Sarah just got back to me on the project that our campaign was approved. can you make a note in our project page about it?` vs `I'm going shopping for groceries later, can you remind me that i need to buy milk, eggs and bread?`

2. **Multi-Step Tasks** : When we start scaling out our application to support calling multiple applications, we might find that our model struggles with multi-step tasks. For instance, given the request `create a new feature branch for the UI refactor and link all the tickets you've found about onboarding issues in the PR description`, we might find that our model struggles to call the required tools in the correct order or even call the tools at all.

Let's explore these in more detail.


### Lack Of Context

Let's imagine we have three commands that we want to evaluate:

- `obsidian.search`
- `apple-notes.search`
- `notion.search`

Without much context or description of what the extension's command does, we might expect our model to get confused. 

For instance, it's perfectly valid for a user to use apple-notes for every single note they take, thus resulting in every other extension here being useless. Let's see this in action below where our model is provided with these commands and asked to call the correct tool given a user request.

We'll represent our tool calls as a list of commands that the model has selected. 

For now, we'll only be evaluating whether the model has selected the correct tool or not as an initial step. When you implement this for your use case, you'll also want to evaluate the arguments that the model has selected for each specific command.

In [18]:
from pydantic import BaseModel, field_validator, ValidationInfo


class UserCommandArgument(BaseModel):
    title: str
    value: str


class UserCommand(BaseModel):
    key: str
    arguments: list[UserCommandArgument]


class SelectedCommands(BaseModel):
    selected_commands: list[UserCommand]

    @field_validator("selected_commands")
    def validate_selected_commands(cls, v, info: ValidationInfo):
        if len(v) == 0:
            raise ValueError("You must select at least one command to be executed")

        commands: list[Command] = info.context["commands"]
        valid_command_keys = [command.key for command in commands]
        invalid_keys = [
            command.key for command in v if command.key not in valid_command_keys
        ]
        if invalid_keys:
            raise ValueError(
                f"Commands {invalid_keys} are not valid commands. Valid commands that can be used are {valid_command_keys}"
            )

        return v


In response to a user query like `fetch me my notes on CS325 tagged as important`, we might expect our model to select the `obsidian.search` command.

In this case, the model would call the command with the arguments `title: CS325` and `tag: important`, which translates to the following command call:

```python
SelectedCommands(
    selected_commands=[
        UserCommand(
            key="obsidian.search",
            arguments=[
                UserCommandArgument(title="title", value="CS325"), 
                UserCommandArgument(title="tag", value="important")
            ]
        )
    ]
)
```

This could then be executed as a command in the `raycast` application with its own validation logic and return the results to the user. 
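This notebook only scores which tool was selected, but the arguments of a call like the one above could be checked in a similar spirit. The following is a minimal sketch, assuming arguments have been flattened into plain `title -> value` dictionaries. `compare_arguments` and the "selected" values are hypothetical and are not part of the notebook's evaluation code.

```python
def compare_arguments(
    selected: dict[str, str], expected: dict[str, str]
) -> dict[str, list[str]]:
    """Report which argument titles are missing, mismatched, or unexpected."""
    missing = sorted(title for title in expected if title not in selected)
    mismatched = sorted(
        title
        for title in expected
        if title in selected and selected[title] != expected[title]
    )
    unexpected = sorted(title for title in selected if title not in expected)
    return {"missing": missing, "mismatched": mismatched, "unexpected": unexpected}


# Expected arguments from the CS325 example; the selected ones are made up
expected_args = {"title": "CS325", "tag": "important"}
selected_args = {"title": "CS325", "tag": "urgent", "folder": "school"}

report = compare_arguments(selected_args, expected_args)
print(report)
```

A `UserCommand` could be flattened for this check with `{argument.title: argument.value for argument in command.arguments}`. Exact string equality is a strict choice; free-text arguments would likely need a looser comparison.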

We're also able to modify the list of valid commands on demand by reading a shared list of commands from the `ValidationInfo` object which we can access in both our validation and prompt formatting logic.

Let's see this in action below where we provide our model with the list of commands we've provided in the `raw_commands.json` file and ask it to select the correct tool given a user request.

In [20]:
import instructor
import google.generativeai as genai
from rich import print

queries = [
    [
        "Seems like the missing ingredient for the hamburger recipe was msg, let's add that in to our Beef Cheeseburger recipe note",
        ["apple-notes.add-text"],
    ],
    ["What did I write about LSTMs previously?", ["obsidian.searchNoteCommand"]],
    [
        "Can you note down that Jeanine is the new person in charge of the onboarding team in our team docs?",
        ["confluence.add-text"],
    ],
]

client = instructor.from_gemini(genai.GenerativeModel("models/gemini-1.5-flash-latest"))


for query, expected_tool in queries:
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        response_model=SelectedCommands,
        context={"commands": commands},
    )
    print(
        f"\nQuery: {query}\nSelected commands: {[command.key for command in response.selected_commands]}\nExpected tool: {expected_tool}\n{'-' * 50}"
    )


Without any context, our model struggles to decide which tool to call in response to the user request. In fact, it gets only one of our three examples above right.

This is a good indication that in order for our model to be able to call the right tool, we need to provide it with more context.

### Multi-Step Tasks

What about multi-step tasks, where we might need to call several commands, either chained in a specific order or in parallel?

Let's see this in action below with a few requests that each need more than one command.

In [23]:
queries = [
    [
        "Let's create a new release post about our latest deployment, also make sure to link the specific issues that were fixed in the latest sprint and send a message the #engineering channel to let them know about it",
        ["confluence-search.new-blog", "jira.active-sprints", "teams.sendMessage"],
    ],
    [
        "find weather taiwan dec and generate shopping list for it",
        ["google-search.index", "apple-notes.ai"],
    ],
]

client = instructor.from_gemini(genai.GenerativeModel("models/gemini-1.5-flash-latest"))


for query, expected_tool in queries:
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that can execute commands in response to a user query. You have access to the following commands:
                
                <commands>
                {% for command in commands %}
                - {{ command.key }} : {{ command.command_description }}
                {% endfor %}
                </commands>

                You must select at least one command to be called.
                """,
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        response_model=SelectedCommands,
        context={"commands": commands},
    )
    print(
        f"\nQuery: {query}\nSelected commands: {[command.key for command in response.selected_commands]}\nExpected tool: {expected_tool}\n{'-' * 50}"
    )


We can see that our model's performance is slightly worse here.

In the first case, it wrongly decides that we should send the message using the `discord.sendMessage` command instead of the `teams.sendMessage` command. Additionally, it doesn't call the `jira.active-sprints` command as expected.

In the second case, it's able to call the `google-search.index` command but then struggles to call the `apple-notes.ai` command to generate the shopping list from the search results.
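One way to put numbers on misses like these is to compare the set of selected keys against the expected ones. The sketch below uses precision and recall, which is an assumption about the metric rather than something this notebook prescribes. `score_selection` is a hypothetical helper, and the `selected` list is an illustrative guess at a model response, not recorded output.

```python
def score_selection(selected: list[str], expected: list[str]) -> dict[str, float]:
    """Precision and recall of the selected command keys against the expected keys."""
    selected_set, expected_set = set(selected), set(expected)
    true_positives = len(selected_set & expected_set)
    precision = true_positives / len(selected_set) if selected_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    return {"precision": precision, "recall": recall}


expected = ["confluence-search.new-blog", "jira.active-sprints", "teams.sendMessage"]
# Hypothetical response: wrong messaging app and no Jira call
selected = ["confluence-search.new-blog", "discord.sendMessage"]

scores = score_selection(selected, expected)
print(scores)
```

Here one of the two selected commands is correct (precision 1/2) and one of the three expected commands was found (recall 1/3). Because sets are used, this ignores the order in which commands are called, so it would not catch a correct set of tools run in the wrong sequence.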



## Generating Synthetic Queries

Now that we've identified some of the initial failure modes that our model might struggle with, let's start generating synthetic queries that specifically test these failure modes.

We'll do so by writing out a brief prompt that describes how our user uses the individual extensions and then generate a list of queries that are based off these extensions. Once we've generated an initial list of queries, we'll then use them to prompt our model to generate more queries.

By doing so, we'll be able to generate more diverse and unique queries for our dataset.
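The second step, feeding existing queries back into the prompt, could look something like the sketch below. `format_examples`, its parameters, and the `query`/`labels` row shape are assumptions made for illustration. The two example rows reuse queries from the failure mode cells earlier in this notebook.

```python
import random


def format_examples(queries: list[dict], k: int = 3, seed: int = 0) -> str:
    """Pick up to k existing queries and render them as example lines for a prompt."""
    rng = random.Random(seed)  # seeded so the same examples are picked on each run
    chosen = rng.sample(queries, min(k, len(queries)))
    return "\n".join(
        f"- {row['query']} (commands: {', '.join(row['labels'])})" for row in chosen
    )


existing = [
    {
        "query": "What did I write about LSTMs previously?",
        "labels": ["obsidian.searchNoteCommand"],
    },
    {
        "query": "find weather taiwan dec and generate shopping list for it",
        "labels": ["google-search.index", "apple-notes.ai"],
    },
]

print(format_examples(existing, k=2))
```

The resulting string could be placed in an extra section of the generation prompt, together with an instruction to produce queries that differ from the examples shown.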

In [35]:
user_behaviour = """
Currently our user uses the following extensions for the following purposes

- Confluence is used for company documentation, posts and notes. Note that we should use a post when it's a one time event or announcement (eg. Feature Release ) and a page when we'd like to keep it around for a longer period (Eg. Onboarding Document, Team Handbook, Incident Reports that we want to refer to down the line). Use filters for common queries/views that I need to refer to 
- Notion is used for planning trips, tracking expenses and other forms of long term planning. 
- Apple notes are used for quick notes that are more one-off. Examples of these notes includes recipes, shopping lists, notes about a movie we want to watch, things to note down etc, todo lists, reminders etc
- Obsidian is used for personal notes and reflections on a wide range of topics (Eg. Classes we've taken, books we've read, notes about a lecture we went to etc)

- Google Search is used for searching the web for information
- iMessage is used for sending messages to friends and family. These messages are short informal and mostly about the weather, plans for the weekend, coordinating certain events, looking up appointments etc
- Discord is used for gaming - so we'll use it for sending messages that are related to gaming and coordinating these gaming sessions with friends
- Teams is used for sending messages for work stuff - we might use it to send messages to a channel or to a specific person in response to certain work related projects, requests, developments etc

- Github is used for tracking pull requests, collaborating with other developers, running tests and deploying code (Eg. What's the update on the CI, is there a new release of the app, what's the status of the new feature branch, any new security vulnerabilities that we flagged, any new PRs to review etc)
- We use Jira to track outstanding bugs and issues that users have reported and we need to work on. Often times we'll be tracking the issue in jira and then creating PRs in github to fix it
"""

We'll choose a few extension commands randomly and then generate our initial list of queries based off these extension commands. 

Once we've done so, we'll append the generated queries to a `queries.jsonl` file, with one JSON object per line containing the `query` text and a `labels` list of the command keys it should trigger.

In [47]:
import random
from pydantic import BaseModel, field_validator, ValidationInfo
from rich import print
import re


class UserQuery(BaseModel):
    chain_of_thought: str
    user_query: str
    commands: list[UserCommand]

    @field_validator("user_query")
    def validate_valid_query(cls, v):
        return re.sub(r'\\{1,}"', "'", v)

    @field_validator("commands")
    def validate_commands(cls, v, info: ValidationInfo):
        commands: list[Command] = info.context["commands"]
        valid_command_keys = [command.key for command in commands]
        invalid_keys = [
            command.key for command in v if command.key not in valid_command_keys
        ]
        if invalid_keys:
            raise ValueError(
                f"Commands {invalid_keys} are not valid commands. Valid commands that can be used are {valid_command_keys}"
            )
        return v


async def generate_query(
    client: instructor.AsyncInstructor,
    command: Command,
    commands: list[Command],
    user_behaviour: str,
) -> UserQuery:
    query_length = random.randint(10, 30)

    return await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": """
                Generate a hypothetical user message that is about {{ length }} words long, uses the following command, and uses at most 2 more commands from the list of commands below. Make sure to use the specific command name as the key for the command.

                command_name: {{command.key}}    
                description: {{command.command_description}}

                Here are a list of other commands that you can use in conjunction with the above command 

                <commands>
                {% for command in commands %}
                <command>
                    <command_name>{{ command.key }}</command_name>
                    <command_description>{{ command.command_description }}</command_description>
                </command>
                {% endfor %}
                </commands>

                Here is a rough description of how our user uses the application

                <user_behaviour>
                {{ user_behaviour }}
                </user_behaviour>

                Think carefully about what this specific command is used for, how it differs from other commands available in the same extension and other commands in the application. Lastly, consider how we could use this command in conjunction with other commands based off the user behaviour listed above. 

                Once you've done so, remember to generate a user message that uses the command in a way that is consistent with the user behaviour listed above and is written in the imperative as a demand/request.
                """,
            },
        ],
        context={
            "command": command,
            "commands": commands,
            "user_behaviour": user_behaviour,
            "length": query_length,
        },
        response_model=UserQuery,
    )


client = instructor.from_gemini(
    genai.GenerativeModel("models/gemini-1.5-flash-latest"), use_async=True
)

print(await generate_query(client, random.choice(commands), commands, user_behaviour))

In [48]:
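from tqdm.asyncio import tqdm_asyncio

# tqdm_asyncio.gather works like asyncio.gather but also shows a progress bar.
# Importing it under its own name avoids shadowing the standard `asyncio` module.
queries = await tqdm_asyncio.gather(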
    *[
        generate_query(client, random.choice(commands), commands, user_behaviour)
        for _ in range(5)
    ]
)

with open("queries.jsonl", "a") as file:
    for query in queries:
        file.write(
            json.dumps(
                {
                    "query": query.user_query,
                    "labels": [command.key for command in query.commands],
                }
            )
            + "\n"
        )


100%|██████████| 5/5 [00:05<00:00,  1.13s/it]
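Because the cell above opens `queries.jsonl` in append mode, re-running it adds new rows on top of the old ones. A minimal sketch for reading the file back might look like this, assuming rows with the `query` and `labels` fields written above. `load_queries` is a hypothetical helper, and dropping rows with a repeated `query` string is a design choice rather than a requirement.

```python
import json
from pathlib import Path


def load_queries(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines and repeated queries."""
    seen: set[str] = set()
    rows: list[dict] = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        row = json.loads(line)
        if row["query"] in seen:
            continue  # keep only the first occurrence of each query
        seen.add(row["query"])
        rows.append(row)
    return rows
```

Calling `load_queries("queries.jsonl")` would then return a list of `{"query": ..., "labels": [...]}` dictionaries ready to be scored against the model's selections.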


## Benchmarking

Now that we've generated our initial list of queries, let's see how our model performs when we only provide the command name and description in its context.

We'll use `braintrust` here to store and log the performance of our model.