# Homework Assignment 3: LLM-as-Judge for Recipe Bot Evaluation

This notebook shows you how to run the third homework example using Galileo. This homework involves creating your own LLM-as-a-judge prompt to validate whether the response from the recipe chatbot follows the user's dietary restrictions.

This homework example has three possible starting points, and this notebook is taking option three, starting from an already labelled data set. If you want to start with the other options, work through them to create the labelled data set, then use that instead of the pre-created labelled data set.

## Configuration

To be able to run this notebook, you need to have a Galileo account set up, along with an LLM integration to run an experiment to generate responses.

1. If you don't have a Galileo account, head to [app.galileo.ai/sign-up](https://app.galileo.ai/sign-up) and sign up for a free account
1. Once you have signed up, you will need to configure an LLM integration. Head to the [integrations page](https://app.galileo.ai/settings/integrations) and configure your integration of choice. The notebook assumes you are using OpenAI, but has details on what to change if you are using a different LLM.
1. Create a Galileo API key from the [API keys page](https://app.galileo.ai/settings/api-keys)
1. In this folder is an example `.env` file called `.env.example`. Copy this file to `.env`, and set the value of `GALILEO_API_KEY` to the API key you just created.
1. If you are using a custom Galileo deployment inside your organization, then set the `GALILEO_CONSOLE_URL` environment variable to your console URL. If you are using [app.galileo.ai](https://app.galileo.ai), such as with the free tier, then you can leave this commented out.
1. This code uses OpenAI to generate some values. Update the `OPENAI_API_KEY` value in the `.env` file with your OpenAI API key. If you are using another LLM, you will need to update the code to reflect this.


In [None]:
# Install the galileo, python-dotenv, and pydantic packages into the current Jupyter kernel
%pip install "galileo[openai]" python-dotenv pydantic

## Environment setup

To use Galileo, we need to load the API key from the `.env` file.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check that the GALILEO_API_KEY environment variable is set
if not os.getenv("GALILEO_API_KEY"):
    raise ValueError("GALILEO_API_KEY environment variable is not set. Please set it in your .env file.")
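
The configuration steps also ask for `OPENAI_API_KEY`, which the check above does not cover. As a minimal sketch, a small helper can report every missing variable at once. The helper name `missing_env_vars` is hypothetical and is not part of Galileo or python-dotenv.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Passing a plain dict keeps this example independent of your real environment
example_env = {"GALILEO_API_KEY": "placeholder-value"}
print(missing_env_vars(["GALILEO_API_KEY", "OPENAI_API_KEY"], env=example_env))
```

Calling it with no `env` argument checks the real process environment after `load_dotenv()` has run.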

Next we need to ensure there is a Galileo project set up.

In [None]:
from galileo.projects import create_project, get_project

PROJECT_NAME = "AI Evals Course - Homework 3"
project = get_project(name=PROJECT_NAME)
if project is None:
    project = create_project(name=PROJECT_NAME)

print(f"Using project: {project.name} (ID: {project.id})")

In this notebook, you will be using the LLM integration you set up in Galileo to run an experiment, as well as calling OpenAI directly to generate some data. The default model used is GPT-5.1, and this assumes you have configured an OpenAI integration.

If you have another integration set up, or want to use a different model, update this value.

In [None]:
MODEL = "gpt-5.1"

## Step 2: Split your data (skipping step 1)

We are starting at step 2, using the already labelled data set. We'll start by loading the data set, then divide it into pass and fail sets. Each set is run as a separate experiment, which makes it easier to see the true positive and true negative traces.

In [None]:
import json
from urllib.request import urlopen

# Load the labelled traces
source_path = "https://raw.githubusercontent.com/ai-evals-course/recipe-chatbot/refs/heads/main/homeworks/hw3/reference_files/labeled_traces.jsonl"

# Open and read the labelled traces into a JSON array
with urlopen(source_path) as resp:
    lines = (ln.decode("utf-8") for ln in resp)
    labelled_traces = [json.loads(line) for line in lines]

# Divide into pass and fail sets. These are defined by the label property as PASS or FAIL.
passed_traces = [trace for trace in labelled_traces if trace["label"] == "PASS"]
failed_traces = [trace for trace in labelled_traces if trace["label"] == "FAIL"]

print(f"Total traces: {len(labelled_traces)}")
print(f"Passed traces: {len(passed_traces)}")
print(f"Failed traces: {len(failed_traces)}")

Now for each set, split into 10% train, 40% dev, and 50% test. Do this randomly.

In [None]:
import random

def split_traces(traces, train_frac=0.10, dev_frac=0.40, seed=42, name=None):
    rng = random.Random(seed)
    shuffled = traces.copy()
    rng.shuffle(shuffled)
    total = len(shuffled)
    train_size = int(total * train_frac)
    dev_size = int(total * dev_frac)
    train = shuffled[:train_size]
    dev = shuffled[train_size:train_size + dev_size]
    test = shuffled[train_size + dev_size:]
    label = f"{name} " if name else ""
    print(f"{label}split: Train={len(train)} Dev={len(dev)} Test={len(test)} (total={total})")
    return train, dev, test

# Split passed and failed traces into train/dev/test sets
passed_train, passed_dev, passed_test = split_traces(passed_traces, name="Passed")
failed_train, failed_dev, failed_test = split_traces(failed_traces, name="Failed")
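
As a quick check on the arithmetic, the sketch below reproduces the size calculation from `split_traces` for a hypothetical set of 57 traces (the count is made up for illustration). Because `int()` truncates, the train and dev sizes round down and the test set absorbs the remainder, so it can end up slightly larger than 50%.

```python
def split_sizes(total, train_frac=0.10, dev_frac=0.40):
    """Return (train, dev, test) sizes using the same truncating arithmetic as split_traces."""
    train_size = int(total * train_frac)
    dev_size = int(total * dev_frac)
    test_size = total - train_size - dev_size
    return train_size, dev_size, test_size

# 57 is a hypothetical trace count, not the size of the real data set
print(split_sizes(57))
```

With very small sets the train split can be empty (for example, fewer than 10 traces gives a train size of 0), which is worth checking before picking few-shot examples.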

To make it easier to view the data, let's upload these as datasets in Galileo. First let's create unique names for these datasets.

In [None]:
from datetime import datetime

current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

PASSED_TRAINING_SET_NAME = f"Homework 3 Passed training set - {current_time}"
PASSED_DEV_SET_NAME = f"Homework 3 Passed dev set - {current_time}"
PASSED_TEST_SET_NAME = f"Homework 3 Passed test set - {current_time}"
FAILED_TRAINING_SET_NAME = f"Homework 3 Failed training set - {current_time}"
FAILED_DEV_SET_NAME = f"Homework 3 Failed dev set - {current_time}"
FAILED_TEST_SET_NAME = f"Homework 3 Failed test set - {current_time}"

print(f"Passed training set name: {PASSED_TRAINING_SET_NAME}")
print(f"Passed dev set name: {PASSED_DEV_SET_NAME}")
print(f"Passed test set name: {PASSED_TEST_SET_NAME}")
print(f"Failed training set name: {FAILED_TRAINING_SET_NAME}")
print(f"Failed dev set name: {FAILED_DEV_SET_NAME}")
print(f"Failed test set name: {FAILED_TEST_SET_NAME}")

Next we create the actual datasets, with the query, response, and some additional information as metadata, such as the reasoning behind the label.

A link to the datasets is output after they are created, so you can view these rows.

In [None]:
from galileo.datasets import get_dataset, create_dataset, delete_dataset

def create_or_replace_dataset(dataset_name, rows):

    # Create a dataset from the given rows. If a dataset with this name already exists, delete it and re-create it.
    dataset = get_dataset(
        name=dataset_name
    )

    if dataset is not None:
        print(f"Dataset already exists with ID: {dataset.id}, deleting it to re-create.")
        dataset = delete_dataset(
            name=dataset_name
        )

    dataset = create_dataset(
        name=dataset_name,
        content=[
            {
                "input": row["query"],
                "output": row["response"],
                "metadata": {
                    "query_id": row["query_id"],
                    "reasoning": row["reasoning"],
                },
            } for row in rows
        ],
    )

    return dataset

# Create the datasets
passed_training_dataset = create_or_replace_dataset(
    dataset_name=PASSED_TRAINING_SET_NAME,
    rows=passed_train
)
passed_dev_dataset = create_or_replace_dataset(
    dataset_name=PASSED_DEV_SET_NAME,
    rows=passed_dev
)
passed_test_dataset = create_or_replace_dataset(
    dataset_name=PASSED_TEST_SET_NAME,
    rows=passed_test
)
failed_training_dataset = create_or_replace_dataset(
    dataset_name=FAILED_TRAINING_SET_NAME,
    rows=failed_train
)
failed_dev_dataset = create_or_replace_dataset(
    dataset_name=FAILED_DEV_SET_NAME,
    rows=failed_dev
)
failed_test_dataset = create_or_replace_dataset(
    dataset_name=FAILED_TEST_SET_NAME,
    rows=failed_test
)

print("Training datasets - refer to these when getting examples for your prompt:")
print(f"Passed training dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{passed_training_dataset.id}")
print(f"Failed training dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{failed_training_dataset.id}")

print("Dev datasets - these will be used to test your prompt during development:")
print(f"Passed dev dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{passed_dev_dataset.id}")
print(f"Failed dev dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{failed_dev_dataset.id}")

print("Test datasets - these will be used to evaluate your final prompt:")
print(f"Passed test dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{passed_test_dataset.id}")
print(f"Failed test dataset created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/datasets/{failed_test_dataset.id}")
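
The console URL expression is repeated in every `print` call above. One way to factor it out is sketched below. The helper name `console_link` is hypothetical, and the `datasets/<id>` path simply mirrors the links printed above.

```python
import os

def console_link(path, default="https://app.galileo.ai/"):
    """Build a console link from GALILEO_CONSOLE_URL (or the default) and a relative path."""
    base = os.environ.get("GALILEO_CONSOLE_URL", default).removesuffix("/")
    return f"{base}/{path.lstrip('/')}"

# "abc123" is a placeholder ID for illustration only
print(console_link("datasets/abc123"))
```

Like the original code, this relies on `str.removesuffix`, which requires Python 3.9 or later.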

## Step 3: Write your judge prompt

Now we have the train, dev, and test data sets, we can build the LLM-as-a-judge prompt.

Update the `custom_metric_prompt` below with your judge prompt. Remember to include:
- The task and criterion
- Clear Pass/Fail definitions
- 2-3 few-shot examples from your Train set with input, output, reasoning, and pass/fail label. Refer to the datasets created in the last section for these.

For the expected output, the metric should return `true` if the output follows the dietary restrictions defined in the input, and `false` otherwise. You do not need to ask for reasoning, as Galileo handles this automatically.

This prompt will be used by Galileo to evaluate the outputs. Refer to the [LLM-as-a-judge prompt engineering guide in the Galileo documentation](https://v2docs.galileo.ai/concepts/metrics/custom-metrics/prompt-engineering) for more guidance on how to structure a good LLM-as-a-judge prompt.

> Instead of creating this metric in code, you can also create and test it in the Galileo console, including using Prompt Assist to get the prompt created for you. You can read more in the [Galileo custom LLM-as-a-judge metrics documentation](https://v2docs.galileo.ai/concepts/metrics/custom-metrics/custom-metrics-ui-llm#create-a-new-llm-as-a-judge-metric-in-the-galileo-console).

In [None]:
# The prompt for the custom dietary adherence metric.
# Make sure to fill in the examples section with relevant examples from the training datasets, with both pass and fail examples.
custom_metric_prompt = """
You are an expert nutritionist and dietary specialist evaluating whether recipe responses properly adhere to specified dietary restrictions.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat, moderate protein
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
- Pescatarian: No meat except fish and seafood
- Kosher: Follows Jewish dietary laws (no pork, shellfish, mixing meat/dairy)
- Halal: Follows Islamic dietary laws (no pork, alcohol, proper slaughter)
- Nut-free: No tree nuts or peanuts
- Low-carb: Significantly reduced carbohydrates (typically <50g per day)
- Sugar-free: No added sugars or high-sugar ingredients
- Raw vegan: Vegan foods not heated above 118°F (48°C)
- Whole30: No grains, dairy, legumes, sugar, alcohol, or processed foods
- Diabetic-friendly: Low glycemic index, controlled carbohydrates
- Low-sodium: Reduced sodium content for heart health

Rubric:
- true: The recipe in the output clearly adheres to the dietary preferences defined in the input with appropriate ingredients and preparation methods
- false: The recipe in the output contains ingredients or methods that violate the dietary preferences defined in the input
- Consider both explicit ingredients and cooking methods

Here are some examples of how to evaluate dietary adherence:

true:

User asks for vegan pasta → Bot suggests nutritional yeast instead of parmesan
User asks for gluten-free bread → Bot uses almond flour and xanthan gum
User asks for keto dinner → Bot provides cauliflower rice with high-fat protein

false:

User asks for vegan pasta → Bot includes honey (not vegan)
User asks for gluten-free bread → Bot uses regular soy sauce (contains wheat)
User asks for keto dinner → Bot includes sweet potato (too many carbs)
"""

Once you are happy with the prompt, it can be used to create a custom metric in Galileo. This will be a boolean metric that operates at the trace level, so it assesses the end-to-end flow from user question to answer, including any agents and tools that a recipe bot might use in the process. The `cot_enabled` parameter turns on reasoning, so you get an explanation with each result. The metric uses the model from the `MODEL` constant you set earlier, and `num_judges=3` runs the prompt against the LLM three times to get a consensus.

In [None]:
from galileo.metrics import create_custom_llm_metric, OutputTypeEnum, StepType, delete_metric
from galileo.scorers import Scorers

METRIC_NAME = "Dietary requirements adherence"

if len(Scorers().list(name=METRIC_NAME)) > 0:
    print(f"Metric '{METRIC_NAME}' already exists. Deleting it to re-create.")
    delete_metric(name=METRIC_NAME)

# Create the metric
metric = create_custom_llm_metric(
    name=METRIC_NAME,
    user_prompt=custom_metric_prompt,
    node_level=StepType.trace,
    cot_enabled=True,
    model_name=MODEL,
    num_judges=3,
    description="""
This metric determines if the response from the recipe bot adheres to the
specified dietary requirements of the user.
""",
    output_type=OutputTypeEnum.BOOLEAN,
)

print(f"Custom metric created. You can view it at {os.environ.get('GALILEO_CONSOLE_URL', 'https://app.galileo.ai/').removesuffix('/')}/metrics/{metric.scorer_id}")
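
This notebook does not describe how Galileo combines the three judge results into a consensus. A simple majority vote is one plausible reading, and it is the assumption behind this sketch. The function `majority_vote` is hypothetical and is not part of the Galileo SDK.

```python
from collections import Counter

def majority_vote(judgments):
    """Return True only if more judges said True than False (ties resolve to False)."""
    counts = Counter(judgments)
    return counts[True] > counts[False]

# Three hypothetical judge outputs for a single trace
print(majority_vote([True, True, False]))
```

Using an odd number of judges, as with `num_judges=3`, avoids ties for a boolean metric under this scheme.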

## Step 4: Test and refine

The next step is to test the new custom metric on the dev sets, then use the results to refine the prompt. This testing runs an experiment against the pass and fail dev datasets, and uses the results to calculate the True Positive Rate (TPR) and the True Negative Rate (TNR). The TPR is the percentage of rows from the pass dev dataset that pass the metric, and the TNR is the percentage of rows from the fail dev dataset that fail the metric.
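
To make the two definitions concrete, here is a minimal sketch that computes both rates from a list of human labels and a list of judge results. The function name `tpr_tnr` and the sample data are hypothetical. In the notebook itself these rates come from the experiment's aggregate metrics instead.

```python
def tpr_tnr(labels, predictions):
    """Compute (TPR, TNR) from human labels and boolean judge predictions.

    labels: "PASS"/"FAIL" strings (human ground truth)
    predictions: booleans from the judge (True means the response adheres)
    """
    pass_preds = [p for label, p in zip(labels, predictions) if label == "PASS"]
    fail_preds = [p for label, p in zip(labels, predictions) if label == "FAIL"]
    tpr = sum(pass_preds) / len(pass_preds)               # PASS rows judged True
    tnr = sum(not p for p in fail_preds) / len(fail_preds)  # FAIL rows judged False
    return tpr, tnr

# Hypothetical labels and judge outputs for six traces
labels = ["PASS", "PASS", "PASS", "PASS", "FAIL", "FAIL"]
predictions = [True, True, True, False, False, True]
print(tpr_tnr(labels, predictions))
```

The sketch assumes each label group is non-empty. An empty group would raise `ZeroDivisionError`.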

### Test the metric

Experiments can be run against a prompt or a custom function (such as one that calls an agent). In this case the outputs are already defined in our source data, so we need to define a custom function that takes the dataset row, and looks up then returns the output. It will also create the relevant LLM span to simulate the LLM returning the response. We can then use this in the experiments.

In [None]:
from galileo import galileo_context

def mock_llm_function(input: str) -> str:
    # Get the logger to log an LLM span
    logger = galileo_context.get_logger_instance()

    # Look up the response for the given input from the labelled traces
    row = [trace for trace in labelled_traces if trace["query"] == input][0]
    
    # Log the LLM span
    logger.add_llm_span(input=input, output=row["response"], model=MODEL, name="Recipe generation")

    # Return the response
    return row["response"]
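
The list comprehension above scans every trace on each call and raises an `IndexError` if the query is missing. An alternative, sketched here with hypothetical sample data, is to build a dictionary index once and look responses up by query. The name `build_response_index` is an illustration, not part of the notebook's code.

```python
def build_response_index(traces):
    """Map each query string to its recorded response for constant-time lookup."""
    return {trace["query"]: trace["response"] for trace in traces}

# Hypothetical traces for illustration; the real ones come from labelled_traces
sample_traces = [
    {"query": "vegan pasta please", "response": "Try pasta with nutritional yeast."},
    {"query": "keto dinner ideas", "response": "Cauliflower rice with salmon."},
]
index = build_response_index(sample_traces)
print(index.get("keto dinner ideas"))
print(index.get("unknown query"))  # None rather than an IndexError
```

One difference to be aware of: if the same query appears more than once, the dictionary keeps the last response, whereas the list lookup above returns the first.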
    

Next we can define unique names for the two experiments.

In [None]:
current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

TPR_DEV_EXPERIMENT_NAME = f"Homework 3 TPR Experiment (dev) - {current_time}"
TNR_DEV_EXPERIMENT_NAME = f"Homework 3 TNR Experiment (dev) - {current_time}"

And define a function to run the experiment, then wait for the results to be calculated.

In [None]:
import time
from galileo.experiments import get_experiment, run_experiment

def run_dietary_adherence_experiment(experiment_name, dataset):
    # Run the experiment
    experiment_response = run_experiment(
        experiment_name=experiment_name,
        dataset=dataset,
        function=mock_llm_function,
        metrics=[METRIC_NAME],
        project=PROJECT_NAME,  
    )

    # Poll until we have the metrics calculated - waiting 5 seconds between polls
    experiment = get_experiment(
        project_id=experiment_response["experiment"].project_id,
        experiment_name=experiment_response["experiment"].name,
    )
    while (
        experiment.aggregate_metrics is None
        or f"average_{METRIC_NAME}" not in experiment.aggregate_metrics
    ):
        # If the metrics have not been calculated yet, sleep for 5 seconds before polling again
        time.sleep(5)

        # Reload the experiment to see if we have the metrics
        experiment = get_experiment(
            project_id=experiment_response["experiment"].project_id,
            experiment_name=experiment_response["experiment"].name,
        )
    
    return experiment, experiment_response['link']
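
The polling loop above waits indefinitely if the metric never appears. A generic helper with a timeout is sketched below. The name `poll_until` and its parameters are hypothetical, and the stand-in `fake_fetch` replaces the real `get_experiment` call so the example runs on its own.

```python
import time

def poll_until(fetch, is_ready, interval=5.0, timeout=600.0):
    """Call fetch() until is_ready(result) is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in for get_experiment: becomes "ready" on the third call
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    return calls["n"]

print(poll_until(fake_fetch, lambda n: n >= 3, interval=0))
```

In the notebook, `fetch` would wrap the `get_experiment` call and `is_ready` would check that the aggregate metric key is present.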

We'll start by running the pass dev dataset to get the TPR.

In [None]:
experiment, link = run_dietary_adherence_experiment(
    experiment_name=TPR_DEV_EXPERIMENT_NAME,
    dataset=passed_dev_dataset,
)

true_positive_rate = experiment.aggregate_metrics[f"average_{METRIC_NAME}"]

print(f"True Positive Rate (TPR) on passed dev set: {true_positive_rate:.2%}")
print(f"You can view the experiment at {link}")

Next we can do the same with the fail dev set to get the TNR. As the metric measures adherence to dietary requirements, the TNR is the fraction of rows that fail the metric, so it is calculated as 1 minus the score.

In [None]:
experiment, link = run_dietary_adherence_experiment(
    experiment_name=TNR_DEV_EXPERIMENT_NAME,
    dataset=failed_dev_dataset,
)

# This metric measures what passes the dietary requirements check, so the true negative rate is 1 - the metric value
true_negative_rate = 1.0 - experiment.aggregate_metrics[f"average_{METRIC_NAME}"]

print(f"True Negative Rate (TNR) on failed dev set: {true_negative_rate:.2%}")
print(f"You can view the experiment at {link}")

### Refine the prompt

Ideally we want the true positive and true negative rates to be as close to 100% as possible. Refine the `custom_metric_prompt`, then re-run the code to recreate the metric.

Once recreated, re-run the experiments to get the updated TPR and TNR. Keep refining until you have a prompt you are happy with.

## Step 5: Evaluate new traces

Now we have a working metric, we can run it on the test set to get a better idea of how well it works. We repeat the same experiments with the test datasets, and should now see good scores for the TPR and TNR.

In [None]:
TPR_TEST_EXPERIMENT_NAME = f"Homework 3 TPR Experiment (test) - {current_time}"
TNR_TEST_EXPERIMENT_NAME = f"Homework 3 TNR Experiment (test) - {current_time}"

tpr_experiment, tpr_link = run_dietary_adherence_experiment(
    experiment_name=TPR_TEST_EXPERIMENT_NAME,
    dataset=passed_test_dataset,
)

tnr_experiment, tnr_link = run_dietary_adherence_experiment(
    experiment_name=TNR_TEST_EXPERIMENT_NAME,
    dataset=failed_test_dataset,
)

true_positive_rate = tpr_experiment.aggregate_metrics[f"average_{METRIC_NAME}"]
true_negative_rate = 1.0 - tnr_experiment.aggregate_metrics[f"average_{METRIC_NAME}"]

print(f"True Positive Rate (TPR) on passed test set: {true_positive_rate:.2%}")
print(f"True Negative Rate (TNR) on failed test set: {true_negative_rate:.2%}")
print(f"You can view the TPR experiment at {tpr_link}")
print(f"You can view the TNR experiment at {tnr_link}")