# Homework Assignment 2: Recipe Bot Error Analysis

This notebook shows you how to run the second homework example using Galileo.

## Configuration

To run this notebook, you need a Galileo account, along with an LLM integration that the experiment uses to generate responses.

1. If you don't have a Galileo account, head to [app.galileo.ai/sign-up](https://app.galileo.ai/sign-up) and sign up for a free account
1. Once you have signed up, you will need to configure an LLM integration. Head to the [integrations page](https://app.galileo.ai/settings/integrations) and configure your integration of choice. The notebook assumes you are using OpenAI, but has details on what to change if you are using a different LLM.
1. Create a Galileo API key from the [API keys page](https://app.galileo.ai/settings/api-keys)
1. In this folder is an example `.env` file called `.env.example`. Copy this file to `.env`, and set the value of `GALILEO_API_KEY` to the API key you just created.
1. If you are using a custom Galileo deployment inside your organization, then set the `GALILEO_CONSOLE_URL` environment variable to your console URL. If you are using [app.galileo.ai](https://app.galileo.ai), such as with the free tier, then you can leave this commented out.
1. This code uses OpenAI to generate some values. Update the `OPENAI_API_KEY` value in the `.env` file with your OpenAI API key. If you are using another LLM, you will need to update the code to reflect this.


In [None]:
# Install the galileo, python-dotenv, and pydantic packages into the current Jupyter kernel
%pip install "galileo[openai]" python-dotenv pydantic

## Environment setup

To use Galileo, we need to load the API key from the `.env` file.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check that the GALILEO_API_KEY environment variable is set
if not os.getenv("GALILEO_API_KEY"):
    raise ValueError("GALILEO_API_KEY environment variable is not set. Please set it in your .env file.")

Next we need to ensure there is a Galileo project set up.

In [None]:
from galileo.projects import create_project, get_project

PROJECT_NAME = "AI Evals Course - Homework 2"
project = get_project(name=PROJECT_NAME)
if project is None:
    project = create_project(name=PROJECT_NAME)

print(f"Using project: {project.name} (ID: {project.id})")

In this notebook, you will be using the LLM integration you set up in Galileo to run an experiment, as well as calling OpenAI directly to generate some data. The default model used is GPT-5.1, and this assumes you have configured an OpenAI integration.

If you have another integration set up, or want to use a different model, update this value.

In [None]:
MODEL="gpt-5.1"

## Part 1: Generate Test Queries

### Pick your dimensions

Pick the dimensions that matter for your test queries, such as cuisine, dietary restrictions, or meal type. Then add example values, ideally three values for each dimension.

Update the code below to reflect these dimensions and example values.

In [None]:
# Define the dimensions for the recipe generation task, along with some example values
# Update this to reflect the dimensions you want to test
dimensions = [
    {
        "name": "cuisine",
        "values": ["Italian", "Chinese", "Mexican"]
    },
    {
        "name": "dietary restrictions",
        "values": ["Vegetarian", "Vegan", "Gluten-Free", "Diabetic"]
    },
    {
        "name": "meal type",
        "values": ["Breakfast", "Lunch", "Dinner", "Snack"]
    }
]

### Create combinations

You can use an LLM to generate queries using combinations of the different dimensions.

In [None]:
import json
from openai import OpenAI
from pprint import pprint

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Create a prompt to generate test queries using the dimensions
prompt = f"""Generate 20 unique combinations of the following dimensions:

{dimensions}

Ensure these combinations cover a diverse range of scenarios, and are realistic.

Return JSON with an array of objects. These objects contain all the dimension names as keys, and the selected value for that dimension as the value.
"""

# Get the response from OpenAI
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that generates combinations of values. Return only a valid JSON array."},
        {"role": "user", "content": prompt}
    ]
)

# Extract the response and parse the JSON
combinations = json.loads(response.choices[0].message.content)
print("Generated Combinations:\n")
pprint(combinations)
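
Because the LLM chooses the combinations, it can help to know the full space it is sampling from. The following is a minimal sketch, not part of the original homework, that enumerates every possible combination with `itertools.product`. The `example_dimensions` and `all_combinations` names are assumptions introduced here so the snippet runs on its own without touching the notebook's variables.

```python
from itertools import product

# Example dimensions, redefined here so the snippet is self-contained
example_dimensions = [
    {"name": "cuisine", "values": ["Italian", "Chinese", "Mexican"]},
    {"name": "dietary restrictions", "values": ["Vegetarian", "Vegan", "Gluten-Free", "Diabetic"]},
    {"name": "meal type", "values": ["Breakfast", "Lunch", "Dinner", "Snack"]},
]

# Build one dict per combination, keyed by dimension name
names = [d["name"] for d in example_dimensions]
all_combinations = [
    dict(zip(names, values))
    for values in product(*(d["values"] for d in example_dimensions))
]

print(f"{len(all_combinations)} possible combinations")
print(all_combinations[0])
```

Comparing the LLM's 20 combinations against this list is one way to spot invented values or dimensions that were never covered.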

### Turn Combinations into Queries

Now we have our combinations, we can use these with an LLM to create queries.

In [None]:
# Create a prompt to generate test queries using the combinations
prompt = f"""Generate 7 queries that can be run against a recipe generation model using the following combinations of criteria. Each row in this JSON represents a different combination of criteria, so pick 7 rows randomly from this list, and use them to generate the queries.

{combinations}

Ensure these queries are realistic queries that a user might ask a recipe generation model. The user may be experienced with interacting with an LLM, or may not be, so vary the complexity of the queries. Also vary based on a selection of ages, writing styles, tones, or skills with English.

Return JSON with an array of queries. This JSON should be a simple array of strings, with each string being a different query.
"""

# Get the response from OpenAI
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that generates test queries. Return only a valid JSON array of strings."},
        {"role": "user", "content": prompt}
    ]
)

# Get the list of queries
queries = [str(q) for q in json.loads(response.choices[0].message.content)]
print("Generated Queries:\n")
pprint(queries)
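
The cells above pass the model's reply straight to `json.loads`, which raises an error if the model wraps its answer in a Markdown code fence. The following is a minimal sketch of a more forgiving parser. The `parse_json_response` helper and the `FENCE` constant are hypothetical names introduced here, not part of the Galileo or OpenAI SDKs.

```python
import json
import re

# Three backticks, built programmatically to keep this snippet easy to embed
FENCE = "`" * 3

def parse_json_response(text: str):
    """Parse JSON from an LLM reply, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    # Match a reply that consists entirely of one fenced block, e.g. a json fence
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.fullmatch(pattern, cleaned, flags=re.DOTALL)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)

# Works for both a bare JSON array and a fenced one
print(parse_json_response('["a", "b"]'))
print(parse_json_response(FENCE + 'json\n["a", "b"]\n' + FENCE))
```

If you adopt something like this, you could call it in place of `json.loads(response.choices[0].message.content)`.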

## Part 2: Find and Categorize Errors

### Run Your Bot

Just like in the [previous homework](../hw1/README.md), we will be using an experiment in Galileo to run the recipe bot and generate the output. This bot is a simple LLM call, so we can replicate this using an experiment with a prompt and dataset.

Let's start by creating some unique names for the prompt and dataset.

In [None]:
from datetime import datetime

current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

PROMPT_NAME = f"Homework 2 Prompt - {current_time}"
DATASET_NAME = f"Homework 2 Dataset - {current_time}"

print(f"Prompt name: {PROMPT_NAME}")
print(f"Dataset name: {DATASET_NAME}")

Now let's generate the prompt in Galileo. This uses the basic prompt from the recipe chatbot starting example, so make sure to update this to reflect the prompt you generated in [homework 1](../hw1/hw1.ipynb).

In [None]:
from galileo import Message, MessageRole
from galileo.prompts import create_prompt, delete_prompt, get_prompt

# Define a system prompt. It is this prompt you need to configure
system_prompt = """
You are an expert chef recommending delicious and useful recipes. Present only one recipe at a time. If the user doesn't specify what ingredients they have available, assume only basic ingredients are available. Be descriptive in the steps of the recipe, so it is easy to follow. Have variety in your recipes, don't just recommend the same thing over and over. You MUST suggest a complete recipe; don't ask follow-up questions. Mention the serving size in the recipe. If not specified, assume 2 people.
"""

# Start by getting the prompt if it already exists.
# If it does, we can delete it and re-create, if not we create it.
prompt = get_prompt(name=PROMPT_NAME)

if prompt is not None:
    print(f"Prompt already exists with ID: {prompt.id}, deleting it to re-create.")
    prompt = delete_prompt(name=PROMPT_NAME)

prompt = create_prompt(
    name=PROMPT_NAME,
    template=[
        Message(
            role=MessageRole.system,
            content=system_prompt,
        ),
        Message(role=MessageRole.user, content="{{input}}"),
    ],
)

# Output a link to view the prompt in Galileo
print(f"Prompt created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/prompts/{prompt.id}")

Next we can generate the dataset using the queries you generated in the last section.

In [None]:
from galileo.datasets import get_dataset, create_dataset, delete_dataset

# Create a dataset from the generated queries. If the dataset already exists, we will delete it and re-create it.
dataset = get_dataset(
    name=DATASET_NAME
)

if dataset is not None:
    print(f"Dataset already exists with ID: {dataset.id}, deleting it to re-create.")
    dataset = delete_dataset(
        name=DATASET_NAME
    )

dataset = create_dataset(
    name=DATASET_NAME,
    content=[{"input": q} for q in queries],
)

print(f"Dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{dataset.id}")

Now we can use the prompt and dataset to generate the responses by running an experiment. Once the experiment has started, a link to the experiment will be in the output. Follow this link to monitor the progress of the experiment.

In [None]:
from galileo.experiments import run_experiment
from galileo.resources.models import PromptRunSettings

# Create the experiment prompt run settings to define the model
# Update the model_alias to the model you want to use for the experiment
prompt_run_settings = PromptRunSettings(
    model_alias=MODEL
)

current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
experiment_name = f"Homework 2 Experiment - {current_time}"

# Run the experiment using the prompt and dataset we created
results = run_experiment(
    experiment_name,
    dataset=dataset,
    prompt_template=prompt,
    project=PROJECT_NAME,
    prompt_settings=prompt_run_settings,
)

print(f"Experiment has started. You can view the experiment at {results['link']}")

### Open coding

Once the experiment is complete, it's time to use open coding: adding detailed notes to each trace, looking for errors or anything unusual. Galileo allows you to define annotations that you can then add to each trace, and in this case you will use an annotation for open coding.

#### Define your annotations

Annotations are defined up front, so that anyone can consistently annotate a trace. Annotations can be text, categories, scores, or star or thumbs up/down ratings. You can also provide criteria for the annotation, so that a domain expert can describe how an annotator should approach the data. Annotations also give the annotator a place to note details or the rationale for their annotation.

To define an annotation:

1. Select the **Annotations** section from the Galileo sidebar for the "AI Evals Course - Homework 2" project. Then select **Add your first annotation**.

    <div>
    <img src="./images/annotations-sidebar.webp" width="550"/>
    </div>

1. Name the annotation "Open coding", and set the **Annotation type** to **Text**. Then select **Save to project**.

    <div>
    <img src="./images/create-annotation.webp" width="500"/>
    </div>

This annotation is now ready to use in your project.

#### Review and annotate your experiment

When you ran the experiment earlier in this notebook, traces were generated for each query, showing the input that was sent to the LLM and the output it returned. You can now view those traces, and use the annotation you just defined to open code each trace.

1. Open the experiment using the link that was output earlier, or by selecting it in your project in Galileo. You should see 7 traces.

    <div>
    <img src="./images/experiment-traces.webp" width="800"/>
    </div>

1. Select a trace to see the details. You will see a tree showing the trace, with a single LLM span that represents the LLM call using the prompt and the relevant row from the dataset you created earlier. With the trace selected you will also see the input and output for that trace, with the user prompt sent to the LLM, and the response.

    <div>
    <img src="./images/experiment-selected-trace.webp" width="800"/>
    </div>

    If you want to see the system prompt as well, select the LLM span.

    <div>
    <img src="./images/experiment-selected-trace-llm-span.webp" width="570"/>
    </div>

1. With the trace selected, make sure the **Metrics Pane** is visible, and select **Annotations**.

    <div>
    <img src="./images/annotations-metric-pane.webp" width="570"/>
    </div>

1. Enter your annotation, noting any errors, inconsistencies, or any relevant information about how the output could be improved. Enter complete sentences with details as these will be sent to an LLM later to build failure modes.

    The annotation is automatically saved.
1. Continue the process for the rest of the traces. You can quickly navigate between traces using the arrows at the top.

    <div>
    <img src="./images/traces-navigation.webp" width="250"/>
    </div>


### Build your taxonomy

These annotations form the basis of the failure modes you now need to define. You can use these annotations to come up with two or more failure modes.

An easy way to get started doing this is to use an LLM to review the open codes and suggest failure modes. You can then use the LLM to define the failure mode of each trace based on the annotations. This way you can leverage an AI to look for patterns and groupings across your data.

#### Export your data

The easiest way to quickly visualize and create failure modes is to export the annotations with the input and output data, and then pass this export to an LLM to create failure modes.

1. From the experiment, select all the traces using the checkbox at the start of the column.

    <div>
    <img src="./images/selected-traces.webp" width="800"/>
    </div>

1. Select the **Export** button.
1. From the **Export Data** dialog, make sure the **Input**, **Output**, and **Open Coding** columns are selected. Give the exported file a name, then select the **Export** button to save the experiment's traces as a CSV file.

    <div>
    <img src="./images/export-dialog.webp" width="500"/>
    </div>

#### Build the failure modes using an LLM

Once you have exported your data, you can use an LLM to create a set of failure modes, and assign these to each row of the export.

Start by loading the file you exported. Update the `EXPORT_FILE` constant below to point to the file path.

In [None]:
# Update this to reflect the file you exported your results to
EXPORT_FILE = "./export.csv"

Now load the file, extracting the input, output, and Open Coding columns.

In [None]:
import csv

exported_data = []
with open(EXPORT_FILE, mode="r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    if not reader.fieldnames:
        raise ValueError("CSV has no header")

    required = ["input", "output", "feedback/Open Coding"]
    missing = [c for c in required if c not in reader.fieldnames]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    for row in reader:
        item = {
            "input": (row.get("input") or "").strip(),
            "output": (row.get("output") or "").strip(),
            "open_coding": (row.get("feedback/Open Coding") or "").strip(),
        }
        exported_data.append(item)
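
Before sending the export to an LLM, it can be worth checking that every trace actually has an open coding note. The following is a minimal sketch, assuming rows in the same shape as `exported_data`. The `sample_rows` values and the `find_unannotated` helper are hypothetical and exist only to make the snippet runnable on its own.

```python
# Hypothetical rows in the same shape as `exported_data`
sample_rows = [
    {"input": "Quick vegan lunch?", "output": "(recipe text)", "open_coding": "No serving size given."},
    {"input": "gluten free dinner pls", "output": "(recipe text)", "open_coding": ""},
]

def find_unannotated(rows):
    """Return the inputs of rows whose open coding note is empty."""
    return [row["input"] for row in rows if not row["open_coding"]]

missing_notes = find_unannotated(sample_rows)
print(f"{len(missing_notes)} of {len(sample_rows)} rows have no open coding note")
for query in missing_notes:
    print(f"  - {query}")
```

In the notebook you would pass `exported_data` instead of `sample_rows`, then go back to Galileo to annotate any traces that were missed.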


Next we can pass this to an LLM and ask it to create the failure modes for us.

In [None]:
# Create a prompt to generate the failure modes
prompt = f"""You are analyzing failures in a recipe bot. This bot has the following system prompt:

<system prompt>
{system_prompt}
</system prompt>

Analyze the following interactions with the bot. This data contains the input sent to the bot, the bot's output, and human open coding feedback about the failure modes observed.

{exported_data}

Based on patterns you see in the data, generate a list of the top five most common failure modes observed in the bot's outputs. For each failure mode provide:
- A one or two word name
- A brief description of the failure mode
- A list of all the inputs from the data that exhibit this failure mode. You can just provide the name of the recipes, not the entire input.

Return this as a JSON array in this format:

[
    {{
        "name": "Failure Mode Name",
        "description": "Brief description of the failure mode.",
        "examples": ["Chicken pasta", "Ramen"]
    }}
]
"""

# Get the response from OpenAI
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes failure modes in AI applications."},
        {"role": "user", "content": prompt}
    ]
)

# Parse the response as JSON
failure_modes = json.loads(response.choices[0].message.content)
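
The prompt asks for a specific JSON shape, but the model is not guaranteed to follow it. The following is a minimal sketch of a structural check you could run before using the result. The `validate_failure_modes` helper and the sample data are hypothetical, introduced here for illustration.

```python
def validate_failure_modes(modes):
    """Check each failure mode has the keys the prompt asked for; return a list of problems."""
    if not isinstance(modes, list):
        return ["Expected a JSON array of failure modes"]
    problems = []
    for i, fm in enumerate(modes):
        if not isinstance(fm, dict):
            problems.append(f"Item {i} is not an object")
            continue
        for key in ("name", "description", "examples"):
            if key not in fm:
                problems.append(f"Item {i} is missing '{key}'")
        if not isinstance(fm.get("examples", []), list):
            problems.append(f"Item {i} has non-list 'examples'")
    return problems

# Hypothetical failure mode in the format requested by the prompt
sample = [
    {
        "name": "Missing Servings",
        "description": "The recipe does not state a serving size.",
        "examples": ["Ramen"],
    }
]
print(validate_failure_modes(sample))  # an empty list means no problems were found
```

Calling this with `failure_modes` and re-running the LLM cell when problems are reported is one simple way to handle a malformed reply.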

Finally we can output the failure modes in a more formatted fashion, saving them to a markdown file called `failure_mode_taxonomy.md`.

In [None]:
from pathlib import Path

md_lines = [
    "# Failure Mode Taxonomy",
    "",
    "This document outlines the failure modes observed or anticipated for the Recipe Chatbot. Each failure mode includes a title, a concise definition, and illustrative examples.",
    ""
]
for i, fm in enumerate(failure_modes, start=1):
    md_lines.append(f"## Failure Mode {i}: {fm['name']}")
    md_lines.append("")
    md_lines.append(f"* **Definition**: {fm['description']}")
    md_lines.append("* **Illustrative Examples**:")
    for j, ex in enumerate(fm.get("examples", []), start=1):
        md_lines.append(f"  {j}. {ex}")
    md_lines.append("")

content = "\n".join(md_lines)
Path("failure_mode_taxonomy.md").write_text(content, encoding="utf-8")
print("Wrote failure_mode_taxonomy.md")

## Track it

The final step is to track the new failure modes, assigning them as appropriate to the different rows in the experiment. We can do this by creating a new annotation that has a category type, then annotating each row with the relevant category.

### Create the annotation

1. Create a new annotation in the project. Call it "Failure Mode"
1. Set the **Annotation type** to **Categories**
1. Add categories for each of the failure modes

    <div>
    <img src="./images/annotation-failure-mode.webp" width="500"/>
    </div>

1. Save the annotation to the project

### Annotate the experiment

Once the annotation has been created, you can work through the experiment, assigning the relevant failure modes to each trace. You can select one or more failure modes as applicable.

<div>
<img src="./images/trace-with-failure-mode.webp" width="1000"/>
</div>
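
Once every trace has a failure mode, counting how often each one occurs shows which failures to prioritize. The following is a minimal sketch using `collections.Counter`. The `labelled_traces` data, its `failure_modes` key, and the failure mode names are all hypothetical; how you load the labels (for example, from a second export) depends on your setup and is not shown here.

```python
from collections import Counter

# Hypothetical traces, each with zero or more assigned failure modes
labelled_traces = [
    {"input": "Quick vegan lunch?", "failure_modes": ["Missing Servings"]},
    {"input": "gluten free dinner pls", "failure_modes": ["Missing Servings", "Dietary Violation"]},
    {"input": "Easy Italian breakfast", "failure_modes": []},
]

# Count each failure mode once per trace it was assigned to
counts = Counter(
    mode for trace in labelled_traces for mode in trace["failure_modes"]
)

for mode, n in counts.most_common():
    print(f"{mode}: {n} of {len(labelled_traces)} traces")
```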

