# Week 1: Building Your RAG Evaluation Foundation

Evaluating RAG systems is challenging, especially when you're just starting out. This notebook introduces a practical approach to assessment using synthetic data generation - a crucial first step before diving into more complex metrics.

## Why This Matters

Traditional RAG evaluation focuses on generated content quality, but this approach has significant drawbacks:

| Aspect | Content Generation | Retrieval Metrics |
|--------|-------------------|-------------------|
| Speed | 1-10s per test | 10-800ms per test |
| Cost | $100s per run | Negligible |
| Objectivity | Subjective | Quantitative |
| Iteration Speed | Hours | Minutes |
| Scale | Limited | Automated |

Instead, we'll focus on retrieval metrics that are:
- Fast to compute (milliseconds vs seconds)
- Objective and reproducible
- Easy to automate
- Cost-effective at scale

## What You'll Learn

Through this hands-on tutorial, you'll learn to build a comprehensive evaluation framework:

1. **Synthetic Data Generation**
   - Create diverse, realistic test questions
   - Generate comprehensive datasets without real user data
   - Learn techniques for systematic question generation

2. **Maintain Query Diversity**
   - Identify different types of questions to test
   - Cover various query patterns and edge cases
   - Build representative test scenarios

3. **Evaluation Setup**
   - Establish measurement foundations
   - Set up automated testing pipelines
   - Create reproducible evaluation workflows

By the end of this notebook, you'll understand what retrieval metrics are, how to generate synthetic questions to benchmark a retrieval system, and how to use those questions to evaluate it.





## Evaluating Retrieval

Before looking at our case study, let's first understand some of the metrics that we'll be using to evaluate our retrieval system. These metrics will form the basis of our evaluation framework throughout this course.

### Key Retrieval Metrics

**Precision** measures how many of our retrieved items are actually relevant:

$$ \text{Precision} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Retrieved Items}} $$ 

For example, if your system retrieves 10 documents but only 5 are relevant, that's 50% precision. Low precision indicates your system is wasting resources processing irrelevant content.

**Recall** measures how many of the total relevant items we managed to find:

$$ \text{Recall} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Relevant Items}} $$ 

If there are 20 relevant documents in your database but you only retrieve 10 of them, that's 50% recall. Low recall suggests you're missing important information.

In practice, we often measure these metrics at specific cutoff points (like top-5 or top-10 results), denoted as Precision@K or Recall@K. This reflects real-world usage where users typically only look at the first few results.
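To make the cutoff idea concrete, here is a minimal sketch of Precision@K and Recall@K for a single query. The function names and the toy ids are illustrative assumptions, not part of any library used in this course.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Toy data (made up for illustration)
retrieved = ["a", "b", "c", "d", "e"]  # ranked results, best first
relevant = {"a", "c", "f", "g"}        # ground-truth relevant ids

print(precision_at_k(retrieved, relevant, k=5))  # 2 of the 5 retrieved ids are relevant
print(recall_at_k(retrieved, relevant, k=5))     # 2 of the 4 relevant ids were found
```

Note that this sketch divides precision by the number of items actually returned, which matches the formula above; some implementations divide by `k` even when fewer than `k` items come back.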

Consider an example where we're trying to build a Text-2-SQL application. This is an application where we take in a user query and output a SQL query which can be used to retrieve the relevant information as seen below.

```
Text : Hey could you help me find the top 5 most popular items in our store?
Query: SELECT item_name, COUNT(*) as popularity FROM items GROUP BY item_name ORDER BY popularity DESC LIMIT 5
```

This looks like a pure generation task, but the generated SQL can be greatly improved by including relevant few-shot examples in the prompt. Finding those examples is a retrieval problem, very similar to how we might look for relevant text chunks in a classic question-answering RAG application.

By framing this as a retrieval task, we can measure the precision and recall of our retrieval system before we even start looking at the generated SQL queries. This has two main benefits:

1. When we do evaluate the generated SQL queries, we can identify edge cases early on and add them to our list of snippets. We can then verify that these few-shot examples are retrieved when we encounter these questions, which helps generate better SQL.
2. Different companies have unique business logic or calculation methods. Being able to retrieve the relevant snippets when these specific measurements are required is crucial.


## Case Study: Bird-Bench

For this case study, we'll be using the Bird-Bench dataset. This is a large Text-2-SQL dataset that pairs natural-language questions with their corresponding SQL queries.

We'll be using the dev split of this dataset, which provides roughly 1,500 SQL snippets spanning about 95 different tables.

We've cleaned the dataset ahead of time and uploaded it to `567-labs/bird-rag`. Each example in our dataset contains three fields:

- `id` : This is a unique identifier for each query
- `query` : This is a sample SQL query 
- `difficulty` : This is a label that indicates how difficult the query is to generate. It can be either `simple`, `moderate` or `challenging`. 

For this case study, we'll only be using the `challenging` queries so that we can generate more difficult questions. This allows us to test our retrieval system under more demanding conditions, ensuring that it performs well even with complex queries.

With that in mind, let's take a look at our dataset.

In [1]:
import datasets
from rich import print

dataset = datasets.load_dataset("567-labs/bird-rag")["train"]

print(dataset[0])
for item in dataset:
    if item["difficulty"] == "challenging":
        print(item["query"])
        break



Let's analyze a sample SQL query to understand what kind of synthetic questions we can generate:

> SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1

This query:
1. Computes the ratio of free meal count to enrollment for each row (the share of students receiving free meals)
2. Filters to Alameda County only
3. Sorts by that ratio in descending order and returns the single highest value

Some relevant natural language questions could be:

- "What school in Alameda County has the highest proportion of students on free meal programs?"
- "Which school has the highest free meal participation rate in Alameda County?"

By generating a dataset of similar questions, we can evaluate how well our retrieval system matches user queries to the appropriate SQL snippets.
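As a rough sketch of what that evaluation could look like, the example below scores a toy retriever on a handful of (question, expected snippet id) pairs. Everything here is an assumption made for illustration: the snippet store, the questions, and the word-overlap `retrieve` function stand in for the real dataset and retrieval system.

```python
import re

# Hypothetical snippet store: chunk_id -> SQL text (toy data, not from Bird-Bench)
snippets = {
    "q1": "SELECT name FROM schools WHERE county = 'Alameda' ORDER BY free_meal_rate DESC LIMIT 1",
    "q2": "SELECT COUNT(*) FROM orders WHERE status = 'shipped'",
    "q3": "SELECT AVG(salary) FROM employees GROUP BY department",
}

# Synthetic questions, each labeled with the snippet it was generated from
labeled_questions = [
    ("Which county school has the highest free meal rate?", "q1"),
    ("How many orders have shipped status?", "q2"),
    ("What is the average salary in each department?", "q3"),
    ("Who earns the most?", "q3"),  # shares no words with any snippet
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, k: int) -> list[str]:
    """Toy retriever: rank snippets by the number of words shared with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(
        snippets,
        key=lambda chunk_id: len(q_tokens & tokenize(snippets[chunk_id])),
        reverse=True,
    )
    return ranked[:k]


def mean_recall_at_k(pairs: list[tuple[str, str]], k: int) -> float:
    """Share of questions whose expected snippet appears in the top-k results."""
    hits = sum(1 for question, chunk_id in pairs if chunk_id in retrieve(question, k))
    return hits / len(pairs)


print(mean_recall_at_k(labeled_questions, k=1))
print(mean_recall_at_k(labeled_questions, k=3))
```

Because each synthetic question has exactly one expected snippet, recall for a single question is either 0 or 1, and the average is simply the fraction of questions whose snippet was found.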

## Generating Synthetic Questions

Now let's start generating our synthetic questions. We're going to begin by defining some Pydantic models that represent the format of the data that we're working with.

We do this for two reasons:

1. It helps us to be explicit about the data we're working with 
2. We can use these models with the `instructor` library to obtain structured outputs from our LLM calls


In [1]:
from pydantic import BaseModel


# This represents a single item from our dataset (here, one SQL snippet)
class Chunk(BaseModel):
    chunk_id: str
    text: str


# This is the synthetic question that we want our model to generate
class Question(BaseModel):
    chain_of_thought: str
    question: str


# This represents a single question-chunk pair that we'll be using for our evaluation later on
class ChunkEval(BaseModel):
    chunk_id: str
    question: str
    chunk: str



We're using `instructor` because it handles prompt templating with `jinja` for us and provides validated structured outputs. 

All we need to do is to define a Pydantic model that represents a desired output and the library will handle the rest. 

Remember that we want to generate a question that should either be answerable by the data returned by the SQL snippet directly or with some small tweaks.



In [8]:
import openai
import instructor
from asyncio import Semaphore
from tqdm.asyncio import tqdm_asyncio  # not aliased, so the stdlib `asyncio` name stays free
from tenacity import retry, stop_after_attempt, wait_fixed
from rich import print

client = instructor.from_openai(openai.OpenAI())

sql_snippet = """\
SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` 
FROM frpm 
WHERE `County Name` = 'Alameda' 
ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC 
LIMIT 1"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": """
        Generate a hypothetical question that can be answered using the following SQL snippet. 

        SQL Snippet:
        {{ snippet }}

        Rules
        - If there are specific values in the snippet, do not use them directly in the question if possible. 
        - The question should be at most 2 sentences long
        - if necessary, consider making the question more challenging using the following constraint - If there's a time period mentioned in the snippet, modify it slightly (Eg. if the snippet is looking at the entire year, change it to 6 months or 1.5 years)
        - The question must be answerable using the SQL snippet or at most with a small tweak
        """,
        }
    ],
    response_model=Question,
    context={
        "snippet": sql_snippet
    },  # This is the context that we're passing to the model
)

print(resp.question)

> 'What is the highest ratio of free meal counts to total enrollments in K-12 schools for a specific county over a recent semester?'


This is a question for which the SQL snippet is highly relevant. To answer it, we would only need to make two changes to the snippet:

1. add in a new time filter of a recent semester
2. change the county to a variable

### The Diversity Problem

We cannot reuse a single fixed prompt and expect a diverse set of questions. We therefore need to introduce slight variations in the prompt so that the generated questions differ in wording, intent and content. This diversity is crucial for identifying blind spots in our retrieval system.

In the example below, we keep the same base prompt but add a randomly chosen constraint to each request. This pushes the model to generate a different question each time, giving us a more diverse set of questions. The key is to introduce several distinct sources of variation across generations.
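The sampling step on its own can be sketched without any LLM call. The template text, the shortened constraint strings, and the `build_prompts` helper below are illustrative assumptions; the seeded random generator is an optional addition that makes the sampled constraints reproducible between runs.

```python
import random

# Simplified stand-in for the real prompt (hypothetical wording)
PROMPT_TEMPLATE = (
    "Generate a hypothetical question that can be answered using the "
    "following SQL snippet.\n\n"
    "SQL Snippet:\n{snippet}\n\n"
    "Constraint: {constraint}"
)

# Shortened versions of the kinds of constraints used in this notebook
constraints = [
    "Modify any time period mentioned in the snippet slightly",
    "Add in some irrelevant context",
    "Change the value of the filter",
]


def build_prompts(snippet: str, num_samples: int, seed: int) -> list[str]:
    """Build one prompt per sample, each with a randomly chosen constraint."""
    rng = random.Random(seed)  # seeded so the same constraints are picked on every run
    return [
        PROMPT_TEMPLATE.format(snippet=snippet, constraint=rng.choice(constraints))
        for _ in range(num_samples)
    ]


for prompt in build_prompts("SELECT COUNT(*) FROM orders", num_samples=2, seed=0):
    print(prompt)
    print("---")
```

Adding a new source of variation then only requires appending another string to `constraints`.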

### Scaling Up Our Questions

With those points in mind, let's scale our question generation up.

We'll do so by generating a question for each SQL snippet marked as challenging. Since this will be a large number of requests, we're going to be doing so asynchronously with the `asyncio` library.

Additionally, to stay within our rate limits, we'll use a semaphore to limit the number of concurrent requests.
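To see what the semaphore does in isolation, here is a small sketch that replaces the LLM call with a short sleep and records how many tasks run at the same time. The `fake_request` and `run_all` names, and the shared `state` dictionary, are made up for this illustration.

```python
import asyncio


async def fake_request(i: int, sem: asyncio.Semaphore, state: dict) -> int:
    """Stand-in for an API call; tracks how many calls are in flight."""
    async with sem:  # waits here once `limit` tasks are already inside
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        state["active"] -= 1
        return i * 2


async def run_all(n: int, limit: int) -> tuple[list[int], int]:
    """Launch n fake requests, allowing at most `limit` to run concurrently."""
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fake_request(i, sem, state) for i in range(n)))
    return results, state["peak"]


# In a notebook, use `results, peak = await run_all(n=20, limit=5)` instead,
# because an event loop is already running there.
results, peak = asyncio.run(run_all(n=20, limit=5))
print(f"{len(results)} requests completed, peak concurrency: {peak}")
```

All 20 tasks are created up front, but the semaphore ensures that no more than five are inside the guarded block at any moment.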

We're also making sure that we have a good diversity of questions by randomly selecting a constraint from a set of constraints to make the question more challenging.

In [41]:
import random
from tqdm.asyncio import tqdm_asyncio
import asyncio

# Define Instructor Client
client = instructor.from_openai(openai.AsyncOpenAI())

# Define some constraints to make the question more challenging
constraints = [
    "If there's a time period mentioned in the snippet, modify it slightly (Eg. if the snippet is looking at the entire year, change it to 6 months or 1.5 years)",
    "Add in some irrelevant context (Eg. Add information about the weather, a random event or a backstory that isn't mentioned in the snippet)",
    "Changing the value of the filter (Eg. if the snippet is looking at the results in Canada, change the question to ask about another country or city instead)",
]


@retry(stop=stop_after_attempt(3), wait=wait_fixed(10))
async def generate_questions(chunk: Chunk, sem: Semaphore) -> ChunkEval:
    async with sem:
        coro = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "user",
                    "content": """
                Generate a hypothetical question that can be answered using the following SQL snippet. 

                SQL Snippet:
                {{ snippet }}

                Rules
                - If there are specific values in the snippet, do not use them directly in the question if possible. 
                - The question should be at most 2 sentences long
                - if necessary, consider making the question more challenging using the following constraint of {{ constraint }}
                - The question must be answerable using the SQL snippet or at most with a small tweak
                """,
                }
            ],
            response_model=Question,
            context={"snippet": chunk.text, "constraint": random.choice(constraints)},
        )
        resp = await asyncio.wait_for(coro, timeout=30)

        return ChunkEval(
            chunk_id=chunk.chunk_id,
            question=resp.question,
            chunk=chunk.text,
        )


sem = Semaphore(10)
dataset = [
    item
    for item in datasets.load_dataset("567-labs/bird-rag")["train"]
    if item["difficulty"] == "challenging"
]
dataset = [Chunk(chunk_id=item["id"], text=item["query"]) for item in dataset]

coros = []

num_samples = 2
for chunk in dataset:
    for _ in range(num_samples):
        coros.append(generate_questions(chunk, sem))

questions: list[ChunkEval] = await tqdm_asyncio.gather(*coros)

100%|██████████| 290/290 [01:02<00:00,  4.62it/s]


Now that we've generated our questions, let's take a look at what they look like.

In [44]:
from rich import print


for i in range(2):
    print(
        f"""
    Question: {questions[i].question}

    SQL Snippet: {questions[i].chunk}
    """
    )

If we look at both of the generated questions, they're essentially asking about the same thing: schools that are locally funded and have an above-average enrollment difference. However, the questions differ slightly in wording and intent. The first has a time frame of 1 year, while the second looks at schools in France specifically. This is a small change, but it's enough to create diversity in our questions.

We can scale this up further by adding more constraints and generating more questions. This is crucial in uncovering blind spots in our retrieval system.
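One simple way to sanity-check diversity as the question set grows is to flag pairs of questions that share almost all of their words. The sketch below uses Jaccard similarity over word sets; the helper names, the example questions, and the `0.8` threshold are arbitrary assumptions, and word overlap is only a crude proxy for semantic similarity.

```python
import re
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def near_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return every pair of questions whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(questions, 2)
        if jaccard(a, b) >= threshold
    ]


# Made-up questions for illustration
questions = [
    "Which schools had the highest enrollment last year?",
    "Which schools had the highest enrollment last year",
    "How many locally funded schools are there in France?",
]

for a, b in near_duplicates(questions):
    print(f"Near-duplicate pair:\n  {a}\n  {b}")
```

If many pairs generated from the same snippet are flagged, that is a hint that the constraints are not adding enough variation.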

## Saving our Questions

Pydantic Evals (the `pydantic_evals` package from the Pydantic AI project) makes it easy for us to save and load datasets. We'll take advantage of that to save our questions to a `.yaml` file.

In [16]:
from pydantic_evals import Case, Dataset

cases = [
    Case(
        name=f"question_{index}",
        inputs=question.question,
        expected_output=[question.chunk_id],
        metadata={"chunk": question.chunk},  # SQL text of the source snippet
    )
    for index, question in enumerate(questions)
]

dataset = Dataset(cases=cases)
dataset.to_file("questions.yaml")

## Conclusion

In this notebook, we covered key metrics like precision and recall, and demonstrated how to generate diverse synthetic questions to benchmark our retrieval system using a Text-2-SQL retrieval system as our example.

In the next notebook, we'll be using these same questions to benchmark different retrieval strategies. This will be followed by Notebook 3 where we'll learn to validate our improvements using bootstrapping and confidence intervals.

Looking ahead to Week 2, we'll leverage these concepts to fine-tune our retrieval models. Using both Cohere's managed re-ranker and the open-source BGE embedding model, we'll see how fine-tuning can improve our recall and MRR metrics on domain-specific queries. 

It's important to note that while synthetic data is valuable for development, it should eventually be augmented with real user queries in production.