# Homework Assignment 3: LLM-as-Judge for Recipe Bot Evaluation

This notebook shows you how to run the third homework example using Galileo. In this homework, you write your own LLM-as-a-judge prompt to check whether the recipe chatbot's response follows the user's dietary restrictions.

This homework has three possible starting points. This notebook takes option three: starting from an already labelled data set. If you want to start from one of the other options, work through it to create your own labelled data set, then use that in place of the pre-created one.

## Configuration

To run this notebook, you need a Galileo account, along with an LLM integration that the experiment uses to generate responses.

1. If you don't have a Galileo account, head to [app.galileo.ai/sign-up](https://app.galileo.ai/sign-up) and sign up for a free account.
1. Once you have signed up, you will need to configure an LLM integration. Head to the [integrations page](https://app.galileo.ai/settings/integrations) and configure your integration of choice. The notebook assumes you are using OpenAI, but has details on what to change if you are using a different LLM.
1. Create a Galileo API key from the [API keys page](https://app.galileo.ai/settings/api-keys).
1. This folder contains an example `.env` file called `.env.example`. Copy this file to `.env`, and set the value of `GALILEO_API_KEY` to the API key you just created.
1. If you are using a custom Galileo deployment inside your organization, set the `GALILEO_CONSOLE_URL` environment variable to your console URL. If you are using [app.galileo.ai](https://app.galileo.ai), such as with the free tier, you can leave this commented out.
1. This code uses OpenAI to generate some values. Update the `OPENAI_API_KEY` value in the `.env` file with your OpenAI API key. If you are using another LLM, you will need to update the code to reflect this.


In [16]:
# Install the galileo, python-dotenv, and pydantic packages into the current Jupyter kernel
%pip install "galileo[openai]" python-dotenv pydantic

Note: you may need to restart the kernel to use updated packages.


## Environment setup

To use Galileo, we need to load the API key from the `.env` file.

In [17]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check that the GALILEO_API_KEY environment variable is set
if not os.getenv("GALILEO_API_KEY"):
    raise ValueError("GALILEO_API_KEY environment variable is not set. Please set it in your .env file.")
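
The same check can be extended to every key this notebook relies on. The following is a minimal sketch, assuming that both `GALILEO_API_KEY` and `OPENAI_API_KEY` are required for the default OpenAI setup. `missing_env_vars` is a hypothetical helper written for illustration and is not part of the Galileo SDK.

```python
import os

# Keys this notebook expects in the .env file (assumption: both are required
# when using the default OpenAI setup).
REQUIRED_ENV_VARS = ["GALILEO_API_KEY", "OPENAI_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report every missing key at once instead of failing on the first one.
missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Reporting all missing keys in one pass avoids fixing the `.env` file one error at a time.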

Next we need to ensure there is a Galileo project set up.

In [18]:
from galileo.projects import create_project, get_project

PROJECT_NAME = "AI Evals Course - Homework 3"
project = get_project(name=PROJECT_NAME)
if project is None:
    project = create_project(name=PROJECT_NAME)

print(f"Using project: {project.name} (ID: {project.id})")

Using project: AI Evals Course - Homework 3 (ID: a56824ba-12e4-4d41-9a0a-57deee0b84ea)


In this notebook, you will be using the LLM integration you set up in Galileo to run an experiment, as well as calling OpenAI directly to generate some data. The default model used is GPT-5.1, and this assumes you have configured an OpenAI integration.

If you have another integration set up, or want to use a different model, update this value.

In [19]:
MODEL = "gpt-5.1"

## Step 2: Split your data (skipping step 1)

We are starting at step 2, using the already labelled data set. We'll load the data set and divide it into pass and fail sets. Each set is run as a separate experiment, which makes it easier to see the true positive and true negative traces.

In [20]:
import json
from urllib.request import urlopen

# Load the labelled traces
source_path = "https://raw.githubusercontent.com/ai-evals-course/recipe-chatbot/refs/heads/main/homeworks/hw3/reference_files/labeled_traces.jsonl"

# Open and read the labelled traces into a JSON array
with urlopen(source_path) as resp:
    lines = (ln.decode("utf-8") for ln in resp)
    labelled_traces = [json.loads(line) for line in lines]

# Divide into pass and fail sets. These are defined by the label property as PASS or FAIL.
passed_traces = [trace for trace in labelled_traces if trace["label"] == "PASS"]
failed_traces = [trace for trace in labelled_traces if trace["label"] == "FAIL"]

print(f"Total traces: {len(labelled_traces)}")
print(f"Passed traces: {len(passed_traces)}")
print(f"Failed traces: {len(failed_traces)}")

Total traces: 101
Passed traces: 75
Failed traces: 26
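
Before splitting, it can help to confirm that every trace has the fields the rest of this notebook reads. The list comprehensions above silently drop any trace whose label is neither `PASS` nor `FAIL`, so a quick check makes that visible. This is a minimal sketch: `find_invalid_traces` is a hypothetical helper, and the example rows are made up for illustration.

```python
# Fields the rest of this notebook reads from each labelled trace.
REQUIRED_FIELDS = ("query_id", "query", "response", "reasoning", "label")
VALID_LABELS = {"PASS", "FAIL"}


def find_invalid_traces(traces):
    """Return (index, problem) pairs for traces with missing fields or an unknown label."""
    problems = []
    for index, trace in enumerate(traces):
        missing = [field for field in REQUIRED_FIELDS if field not in trace]
        if missing:
            problems.append((index, f"missing fields: {', '.join(missing)}"))
        elif trace["label"] not in VALID_LABELS:
            problems.append((index, f"unexpected label: {trace['label']!r}"))
    return problems


# Hypothetical rows for illustration only; in the notebook, pass `labelled_traces` instead.
example_traces = [
    {"query_id": "q1", "query": "...", "response": "...", "reasoning": "...", "label": "PASS"},
    {"query_id": "q2", "query": "...", "response": "...", "label": "MAYBE"},
]
print(find_invalid_traces(example_traces))
```

An empty list means every trace ends up in either the pass set or the fail set.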


Now split each set randomly into approximately 10% train, 40% dev, and 50% test.

In [21]:
import random

def split_traces(traces, train_frac=0.10, dev_frac=0.40, seed=42, name=None):
    rng = random.Random(seed)
    shuffled = traces.copy()
    rng.shuffle(shuffled)
    total = len(shuffled)
    train_size = int(total * train_frac)
    dev_size = int(total * dev_frac)
    train = shuffled[:train_size]
    dev = shuffled[train_size:train_size + dev_size]
    test = shuffled[train_size + dev_size:]
    label = f"{name} " if name else ""
    print(f"{label}split: Train={len(train)} Dev={len(dev)} Test={len(test)} (total={total})")
    return train, dev, test

# Split passed and failed traces into train/dev/test sets
passed_train, passed_dev, passed_test = split_traces(passed_traces, name="Passed")
failed_train, failed_dev, failed_test = split_traces(failed_traces, name="Failed")

Passed split: Train=7 Dev=30 Test=38 (total=75)
Failed split: Train=2 Dev=10 Test=14 (total=26)
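
The sizes above are not exactly 10/40/50 because the train and dev sizes are truncated with `int()`, and the test set receives whatever is left. The sketch below reproduces only that arithmetic; `split_sizes` is a hypothetical helper that mirrors the size calculation in `split_traces` without shuffling any data.

```python
def split_sizes(total, train_frac=0.10, dev_frac=0.40):
    """Mirror the size arithmetic in split_traces: truncate train and dev, give the rest to test."""
    train_size = int(total * train_frac)
    dev_size = int(total * dev_frac)
    test_size = total - train_size - dev_size
    return train_size, dev_size, test_size


print(split_sizes(75))  # (7, 30, 38)
print(split_sizes(26))  # (2, 10, 14)
```

For the 26 failed traces this leaves only 2 training examples, which limits the number of fail examples available for few-shot prompting.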


To make it easier to view the data, let's upload the training sets as datasets in Galileo. First, let's create unique names for these datasets.

In [22]:
from datetime import datetime

current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

PASSED_TRAINING_SET_NAME = f"Homework 3 Passed training set - {current_time}"
FAILED_TRAINING_SET_NAME = f"Homework 3 Failed training set - {current_time}"

print(f"Passed training set name: {PASSED_TRAINING_SET_NAME}")
print(f"Failed training set name: {FAILED_TRAINING_SET_NAME}")

Passed training set name: Homework 3 Passed training set - 2025-12-17 19:17:17
Failed training set name: Homework 3 Failed training set - 2025-12-17 19:17:17


Next we create the actual datasets, with the query, response, and some additional information as metadata, such as the reasoning behind the label.

A link to the datasets is output after they are created, so you can view these rows.

In [None]:
from galileo.datasets import get_dataset, create_dataset, delete_dataset

def create_training_set_dataset(dataset_name, rows):

    # Look up any existing dataset with this name. If one exists, we will delete it and re-create it.
    dataset = get_dataset(
        name=dataset_name
    )

    if dataset is not None:
        print(f"Dataset already exists with ID: {dataset.id}, deleting it to re-create.")
        # The return value is not needed; `dataset` is reassigned by create_dataset below
        delete_dataset(
            name=dataset_name
        )

    dataset = create_dataset(
        name=dataset_name,
        content=[
            {
                "input": row["query"],
                "output": row["response"],
                "metadata": {
                    "query_id": row["query_id"],
                    "reasoning": row["reasoning"],
                },
            } for row in rows
        ],
    )

    return dataset

# Create the passed and failed training set datasets
passed_dataset = create_training_set_dataset(
    dataset_name=PASSED_TRAINING_SET_NAME,
    rows=passed_train
)
failed_dataset = create_training_set_dataset(
    dataset_name=FAILED_TRAINING_SET_NAME,
    rows=failed_train
)

# Build the console base URL once, falling back to the hosted app when GALILEO_CONSOLE_URL is not set
console_url = os.environ.get("GALILEO_CONSOLE_URL", "https://app.galileo.ai/").removesuffix("/")
print(f"Passed training dataset created. You can view it at {console_url}/datasets/{passed_dataset.id}")
print(f"Failed training dataset created. You can view it at {console_url}/datasets/{failed_dataset.id}")

Dataset already exists with ID: 2c7c66c3-e8fb-447c-a67d-7b55bd2b9382, deleting it to re-create.
Dataset already exists with ID: 2f515ce2-ec3e-46c3-a462-086205166da7, deleting it to re-create.
Dataset created. You can view it at https://app.galileo.ai/datasets/d91a6ceb-d516-4ca0-99bc-bee2992a1081
Dataset created. You can view it at https://app.galileo.ai/datasets/3ef1ef48-2f5b-4459-b49a-fe6b95a7f49b


## Step 3: Write your judge prompt

Now that we have the train, dev, and test splits, we can build the LLM-as-a-judge prompt.

Update the `custom_metric_prompt` below with your judge prompt. Remember to include:
- The task and criterion
- Clear Pass/Fail definitions
- 2-3 few-shot examples from your Train set, each with input, output, reasoning, and pass/fail label. Refer to the datasets created in the previous section for these.

The metric should return `true` if the output follows the dietary restrictions defined in the input, and `false` otherwise.

This prompt will be used by Galileo to evaluate the outputs. Refer to the [LLM-as-a-judge prompt engineering guide in the Galileo documentation](https://v2docs.galileo.ai/concepts/metrics/custom-metrics/prompt-engineering) for more guidance on how to structure a good LLM-as-a-judge prompt.
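
One way to prepare the few-shot examples is to render a handful of training rows as text and paste the result into the prompt. This is a minimal sketch, not a required format: `format_few_shot_examples` is a hypothetical helper, the example rows are invented for illustration, and it assumes that a `PASS` label corresponds to the metric returning `true`.

```python
def format_few_shot_examples(rows):
    """Render labelled rows as numbered few-shot examples for the judge prompt."""
    blocks = []
    for number, row in enumerate(rows, start=1):
        # Assumption: a PASS label corresponds to the metric returning true.
        expected = "true" if row["label"] == "PASS" else "false"
        blocks.append(
            f"{number}.\n"
            f"Input: {row['query']}\n"
            f"Output: {row['response']}\n"
            f"Reasoning: {row['reasoning']}\n"
            f"Expected result: {expected}"
        )
    return "\n\n".join(blocks)


# Hypothetical rows for illustration; in the notebook, pick rows from
# `passed_train` and `failed_train` instead.
example_rows = [
    {"query": "Vegan dinner idea?", "response": "Lentil curry with coconut milk.",
     "reasoning": "No animal products are used.", "label": "PASS"},
    {"query": "Vegan dessert?", "response": "Cheesecake with cream cheese.",
     "reasoning": "Cream cheese is dairy.", "label": "FAIL"},
]
print(format_few_shot_examples(example_rows))
```

Including at least one pass and one fail example gives the judge a reference point for both outcomes.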

In [None]:
# The prompt for the custom dietary adherence metric.
# Make sure to fill in the examples section with relevant examples from the training datasets, with both pass and fail examples.
custom_metric_prompt = """
You are an expert nutritionist and dietary specialist evaluating whether recipe responses properly adhere to specified dietary restrictions.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat, moderate protein
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
- Pescatarian: No meat except fish and seafood
- Kosher: Follows Jewish dietary laws (no pork, shellfish, mixing meat/dairy)
- Halal: Follows Islamic dietary laws (no pork, alcohol, proper slaughter)
- Nut-free: No tree nuts or peanuts
- Low-carb: Significantly reduced carbohydrates (typically <50g per day)
- Sugar-free: No added sugars or high-sugar ingredients
- Raw vegan: Vegan foods not heated above 118°F (48°C)
- Whole30: No grains, dairy, legumes, sugar, alcohol, or processed foods
- Diabetic-friendly: Low glycemic index, controlled carbohydrates
- Low-sodium: Reduced sodium content for heart health

Rubric:
- true: The recipe in the output clearly adheres to the dietary preferences defined in the input with appropriate ingredients and preparation methods
- false: The recipe in the output contains ingredients or methods that violate the dietary preferences defined in the input
- Consider both explicit ingredients and cooking methods

Here are some examples of how to evaluate dietary adherence:

1.
2.
3.
"""