# Homework Assignment 4: Recipe Bot Retrieval Evaluation

This notebook shows you how to run the fourth homework example using Galileo. This homework involves building and evaluating a BM25 retrieval system for recipes.

## Configuration

To run this notebook, you need a Galileo account and an LLM integration, which the experiment uses to generate responses.

1. If you don't have a Galileo account, head to [app.galileo.ai/sign-up](https://app.galileo.ai/sign-up) and sign up for a free account.
1. Once you have signed up, configure an LLM integration. Head to the [integrations page](https://app.galileo.ai/settings/integrations) and configure your integration of choice. The notebook assumes you are using OpenAI, but notes what to change if you are using a different LLM.
1. Create a Galileo API key from the [API keys page](https://app.galileo.ai/settings/api-keys).
1. This folder contains an example `.env` file called `.env.example`. Copy this file to `.env`, and set the value of `GALILEO_API_KEY` to the API key you just created.
1. If you are using a custom Galileo deployment inside your organization, set the `GALILEO_CONSOLE_URL` environment variable to your console URL. If you are using [app.galileo.ai](https://app.galileo.ai), such as with the free tier, you can leave this commented out.
1. This code uses OpenAI to generate some values. Update the `OPENAI_API_KEY` value in the `.env` file with your OpenAI API key. If you are using another LLM, you will need to update the code to reflect this.


In [17]:
# Install the galileo, python-dotenv, litellm, and rank-bm25 packages into the current Jupyter kernel
%pip install "galileo[openai]" python-dotenv litellm rank-bm25


## Access to the Recipe Chatbot code

This homework uses classes from the [Recipe Chatbot](https://github.com/ai-evals-course/recipe-chatbot) codebase. Clone this code and add it to the module path so we can import modules.

In [None]:
import os

# Clone the repo
if not os.path.exists("./recipe-chatbot"):
    !git clone https://github.com/ai-evals-course/recipe-chatbot.git

# Set the sys path to include the cloned repo
import sys
sys.path.append("./recipe-chatbot")

# Test we can import the relevant modules
from backend.query_rewrite_agent import QueryRewriteAgent
from backend.retrieval import create_retriever, retrieve_bm25

## Environment setup

To use Galileo, we need to load the API key from the `.env` file.

In [3]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check that the GALILEO_API_KEY environment variable is set
if not os.getenv("GALILEO_API_KEY"):
    raise ValueError("GALILEO_API_KEY environment variable is not set. Please set it in your .env file.")

Next we need to ensure there is a Galileo project set up.

In [5]:
from galileo.projects import create_project, get_project

PROJECT_NAME = "AI Evals Course - Homework 4"
project = get_project(name=PROJECT_NAME)
if project is None:
    project = create_project(name=PROJECT_NAME)

print(f"Using project: {project.name} (ID: {project.id})")

Using project: AI Evals Course - Homework 4 (ID: a4a314c7-5a9e-4d6b-9d4c-9e9adceba739)


## Step 1: Look at your data

We'll start by loading the datasets from GitHub and exploring a few rows.

First we load the recipes.

In [9]:
import json
from urllib.request import urlopen

# Load the source data from the GitHub repository
processed_recipes_source_path = "https://raw.githubusercontent.com/ai-evals-course/recipe-chatbot/refs/heads/main/homeworks/hw4/reference_files/processed_recipes.json"

with urlopen(processed_recipes_source_path) as resp:
    recipes = json.load(resp)

print(f"Loaded {len(recipes)} recipes")

Loaded 200 recipes


We can print the first one to see the data.

In [10]:
# Look at one recipe
recipe = recipes[0]
print(f"Name: {recipe['name']}")
print(f"Cooking time: {recipe['minutes']} minutes")
print(f"Ingredients: {recipe['ingredients'][:5]}...")  # First 5
print(f"Steps: {len(recipe['steps'])} steps")

Name: 5 cheese crab lasagna with roasted garlic and vegetables
Cooking time: 245 minutes
Ingredients: ['garlic', 'extra virgin olive oil', 'dry white wine', 'fresh asparagus', 'cooking spray']...
Steps: 108 steps


Next we load the synthetic queries.

In [11]:
synthetic_queries_source_path = "https://raw.githubusercontent.com/ai-evals-course/recipe-chatbot/refs/heads/main/homeworks/hw4/reference_files/synthetic_queries.jsonl"

with urlopen(synthetic_queries_source_path) as resp:
    lines = (ln.decode("utf-8") for ln in resp)
    queries = [json.loads(line) for line in lines]

print(f"Loaded {len(queries)} synthetic queries")

Loaded 200 synthetic queries


Again we can print the first one to see the data.

In [12]:
# Look at one query
q = queries[0]
print(f"Query: {q['query']}")
print(f"\nSource recipe: {q['source_recipe_name']}")
print(f"Source recipe ID: {q['source_recipe_id']}")
print(f"\nSalient fact (what makes this query answerable):\n{q['salient_fact'][:300]}...")

Query: What temperature should I set my oven to and how long do I need to bake this sweet, yeast-based bread for it to turn out fluffy and perfectly cooked?

Source recipe: amish friendship bread
Source recipe ID: 246125

Salient fact (what makes this query answerable):
1. **Appliance Settings**: The recipe specifies to "preheat oven to 325°F," which is a precise temperature setting necessary for baking the Amish friendship bread.

2. **Timing Specifics**: The recipe indicates a baking time of "1 hour," which is crucial for ensuring the bread is cooked properly and...
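
Because the evaluation treats `source_recipe_id` as the ground-truth answer for each query, it can help to confirm that every query points at a recipe that is actually in the corpus. The following is a minimal sketch using made-up sample data. `find_unmatched_queries` is a hypothetical helper, not part of the Recipe Chatbot codebase, and it assumes the recipe `id` field and the query `source_recipe_id` field use the same type.

```python
from typing import Dict, List


def find_unmatched_queries(recipes: List[Dict], queries: List[Dict]) -> List[Dict]:
    """Return the queries whose source_recipe_id matches no recipe id."""
    known_ids = {r["id"] for r in recipes}
    return [q for q in queries if q["source_recipe_id"] not in known_ids]


# Made-up sample data, for illustration only
sample_recipes = [{"id": 1, "name": "toast"}, {"id": 2, "name": "soup"}]
sample_queries = [
    {"query": "how long do I toast the bread?", "source_recipe_id": 1},
    {"query": "how hot should the oven be for pie?", "source_recipe_id": 99},
]

unmatched = find_unmatched_queries(sample_recipes, sample_queries)
print(f"{len(unmatched)} of {len(sample_queries)} queries have no matching recipe")
```

In the notebook, the same check could be run as `find_unmatched_queries(recipes, queries)`. An empty result means every query can, in principle, be answered by the retriever.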


## Step 2: Build BM25 Retriever

Now we can build the BM25 retriever in the same way as the original homework.

In [19]:
from typing import Dict

def recipe_to_text(recipe: Dict) -> str:
    """Combine recipe fields into searchable text."""
    parts = [
        recipe['name'],
        ' '.join(recipe.get('ingredients', [])),
        ' '.join(recipe.get('steps', [])),
        ' '.join(recipe.get('tags', []))
    ]
    return ' '.join(parts).lower()

# Create corpus
corpus_texts = [recipe_to_text(r) for r in recipes]
print(f"Example text (first 300 chars):\n{corpus_texts[0][:300]}...")

Example text (first 300 chars):
5 cheese crab lasagna with roasted garlic and vegetables garlic extra virgin olive oil dry white wine fresh asparagus cooking spray garlic salt salt & freshly ground black pepper red bell peppers fresh basil dry lasagna noodles roma tomatoes dried oregano parmesan-romano cheese mix butter sweet onio...


In [20]:
from rank_bm25 import BM25Okapi

# Simple tokenization (split on whitespace)
tokenized_corpus = [text.split() for text in corpus_texts]

# Build BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print(f"BM25 index built with {len(tokenized_corpus)} documents")

BM25 index built with 200 documents
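
To make the scoring less of a black box, here is a minimal pure-Python sketch of a textbook BM25 formula. It is not the `rank_bm25` implementation: the IDF variant and the `k1` and `b` values below are assumptions chosen for illustration, and `BM25Okapi` may handle IDF differently, so the numbers are not expected to match. `bm25_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    docs: List[List[str]],
    k1: float = 1.5,  # term-frequency saturation (assumed value)
    b: float = 0.75,  # document-length normalization strength (assumed value)
) -> List[float]:
    """Score every tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Non-negative IDF variant: rare terms contribute more
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Longer-than-average documents are penalized through b
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, for illustration only
toy_docs = [
    "crispy chicken wings".split(),
    "chicken soup with noodles".split(),
    "chocolate cake".split(),
]
toy_scores = bm25_scores("crispy chicken".split(), toy_docs)
for doc, score in zip(toy_docs, toy_scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The first document matches both query terms and should rank highest; the last matches none and scores zero.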


In [22]:
from typing import List, Tuple

def retrieve(query: str, top_k: int = 5) -> List[Tuple[int, float, str]]:
    """Retrieve top-k recipes for a query.
    
    Returns: List of (recipe_id, score, recipe_name)
    """
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    
    # Get top-k indices
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    
    results = []
    for idx in top_indices:
        results.append((recipes[idx]['id'], scores[idx], recipes[idx]['name']))
    return results
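
One thing to keep in mind with whitespace tokenization is that punctuation stays attached to words, so `chicken,` and `chicken` are treated as different terms. The sketch below shows one possible alternative, a regex tokenizer that keeps only runs of ASCII letters and digits. It is an illustration rather than what the original homework uses, and `tokenize` is a hypothetical helper. If you try it, apply the same tokenizer to both the corpus and the query.

```python
import re
from typing import List

# Runs of lowercase ASCII letters and digits; everything else is a separator
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return _TOKEN_RE.findall(text.lower())


sample = "Crispy chicken, air-fried!"
print(sample.lower().split())  # ['crispy', 'chicken,', 'air-fried!']
print(tokenize(sample))        # ['crispy', 'chicken', 'air', 'fried']
```

Note that this tokenizer drops non-ASCII characters and splits hyphenated words, which may or may not be what you want for recipe text.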

In [23]:
# Test it
test_query = "air fryer chicken crispy"
results = retrieve(test_query, top_k=5)

print(f"Query: {test_query}\n")
print("Top 5 results:")
for i, (recipe_id, score, name) in enumerate(results, 1):
    print(f"  {i}. {name} (score: {score:.2f})")

Query: air fryer chicken crispy

Top 5 results:
  1. 7 layer elote dip (score: 6.75)
  2. amazingly juicy grilled lemon chicken (score: 6.19)
  3. algerian chicken preserved lemon bourek (score: 5.78)
  4. a grape picker s lunch sausages and lentils with thyme and wine (score: 5.57)
  5. alton s french onion soup attacked by sandi (score: 5.54)
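
Since each synthetic query records the recipe it was generated from, retrieval quality can be scored directly by checking whether that recipe appears in the top-k results. The following is a minimal sketch of Recall@k and mean reciprocal rank (MRR, truncated at k). `evaluate_retrieval` and `stub_retrieve` are hypothetical names, the sample queries are made up, and the code assumes the retriever returns `(recipe_id, score, name)` tuples with IDs of the same type as `source_recipe_id`.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[int, float, str]  # (recipe_id, score, recipe_name)


def evaluate_retrieval(
    queries: List[Dict],
    retrieve_fn: Callable[[str, int], List[Result]],
    k: int = 5,
) -> Dict[str, float]:
    """Compute Recall@k and MRR, assuming one relevant recipe per query."""
    hits = 0
    reciprocal_ranks = []
    for q in queries:
        results = retrieve_fn(q["query"], k)
        retrieved_ids = [recipe_id for recipe_id, _score, _name in results]
        target = q["source_recipe_id"]
        if target in retrieved_ids:
            hits += 1
            # Ranks are 1-based: a hit in first position contributes 1.0
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(queries)
    return {"recall_at_k": hits / n, "mrr": sum(reciprocal_ranks) / n}


def stub_retrieve(query: str, top_k: int) -> List[Result]:
    """Stand-in retriever that always returns the same ranking."""
    ranking = [(10, 3.0, "recipe a"), (20, 2.0, "recipe b"), (30, 1.0, "recipe c")]
    return ranking[:top_k]


# Made-up queries: targets at rank 1, rank 2, and not retrieved at all
sample_queries = [
    {"query": "q1", "source_recipe_id": 10},
    {"query": "q2", "source_recipe_id": 20},
    {"query": "q3", "source_recipe_id": 99},
]
metrics = evaluate_retrieval(sample_queries, stub_retrieve, k=5)
print(metrics)
```

In the notebook, calling `evaluate_retrieval(queries, retrieve, k=5)` would score the BM25 retriever built above on the synthetic queries.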


## Step 2 (additional): Build a simple chatbot app that uses the retriever and logs to Galileo

Evals are most meaningful when they measure part of a real AI interaction. We could score the `retrieve` function directly in code, but that wouldn't simulate measuring a real-world interaction. Instead, let's build a chat function that passes the output of `retrieve` to an LLM, simulating a RAG system.

In [None]:
def process_query(query: str) -> str:
    # Get the top results from the BM25 retriever
    results = retrieve(query, top_k=5)
    print(f"Query: {query}\n")
    print("Top 5 results:")
    for i, (recipe_id, score, name) in enumerate(results, 1):
        print(f"  {i}. {name} (score: {score:.2f})")