---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Topic**: Evaluations

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission from the professor.

---

## Welcome!

In our previous lecture, we built a full RAG pipeline. That's great: you can just build things! But how do you know if they're any good? How do you know if one approach is better than another? How do you improve your system systematically?

The answer is **evaluations**: the systematic measurement of how well your AI system performs. Without evals, you're flying blind. Major AI labs and companies maintain extensive evaluation harnesses for their systems.

---

# 1. Hugging Face

> 📚 **TERM: Hugging Face**  
> An open platform for sharing ML models, datasets, and applications. Think of it as GitHub for AI: anyone can upload models or datasets, and you can download and use them with a few lines of code.

Hugging Face hosts:
- **Models**: pre-trained models for text, images, audio, etc.
- **Datasets**: curated datasets for training and evaluation
- **Spaces**: interactive demos and apps

We'll use the `datasets` library to load **nfcorpus**, a biomedical information retrieval dataset from the [BEIR benchmark](https://github.com/beir-cellar/beir). It contains:
- A **corpus** of 3,633 biomedical documents (titles + abstracts)
- **Queries**: 3,237 plain-English health/nutrition queries
- **Relevance judgments** (qrels): 12,334 human-labeled query-to-document mappings in the test split

Tips:
- Browse datasets at [huggingface.co/datasets](https://huggingface.co/datasets)
- The `datasets` library handles downloading, caching, and format conversion
- Datasets have **configurations** (sub-datasets) and **splits** (train/test/etc.)

In [None]:
from datasets import load_dataset

# Load the nfcorpus dataset from Hugging Face
# It has separate "corpus" and "queries" configurations (like sub-datasets)
corpus = load_dataset("BeIR/nfcorpus", "corpus", split="corpus")
queries = load_dataset("BeIR/nfcorpus", "queries", split="queries")

# Relevance judgments (qrels) are stored in a separate dataset
# We use the "test" split: these are the queries we'll evaluate on
qrels = load_dataset("BeIR/nfcorpus-qrels", split="test")

print(f"Documents: {len(corpus)}")
print(f"Queries:   {len(queries)}")
print(f"Relevance judgments: {len(qrels)}")

In [None]:
import pandas as pd

corpus_df = corpus.to_pandas()
queries_df = queries.to_pandas()
qrels_df = qrels.to_pandas()

# Remove some duplicate documents (same title+text, different IDs)
n_before = len(corpus_df)
corpus_df = corpus_df.drop_duplicates(subset=["title", "text"], keep="first")
print(f"Deduplicated corpus: {n_before} -> {len(corpus_df)} documents")

# Inspect the column structure of each DataFrame
print(f"\nCorpus columns:  {list(corpus_df.columns)}")
print(f"Queries columns: {list(queries_df.columns)}")
print(f"Qrels columns:   {list(qrels_df.columns)}")

# Show sample documents: each has an ID, title, and text (abstract)
print("\n--- Sample documents ---")
for _, row in corpus_df.head(3).iterrows():
    print(f"ID:    {row['_id']}")
    print(f"Title: {row['title']}")
    print(f"Text:  {row['text'][:200]}...")  # truncate long abstracts for display
    print()

# Show sample queries: plain-English health/nutrition questions
print("--- Sample queries ---")
for _, row in queries_df.head(5).iterrows():
    print(f"ID:    {row['_id']}")
    print(f"Text:  {row['text']}")
    print()

# Show sample relevance judgments: each row links a query to a document with a score
print("\n--- Sample relevance judgments ---")
print(qrels_df.head(10).to_string(index=False))

---

# 2. Explore the Data

Before building anything, **look at your data**.

We have three DataFrames:
- `corpus_df`: up to 3,633 documents (fewer after deduplication) with `_id`, `title`, and `text`
- `queries_df`: 3,237 queries with `_id`, `title`, and `text`
- `qrels_df`: 12,334 relevance judgments linking queries to documents with a `score` (1 = relevant, 2 = highly relevant)

This dataset has one especially valuable feature: it comes with **human-labeled ground truth**, so we know which documents were judged relevant to which queries. This is the gold standard for evaluating retrieval.
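
To make the qrels structure concrete, here is a minimal sketch that groups rows shaped like `qrels_df` into a per-query lookup. The query and document IDs are invented for illustration and are not taken from nfcorpus.

```python
from collections import defaultdict

# Toy qrels rows in the same shape as qrels_df: (query-id, corpus-id, score).
# All IDs below are hypothetical.
toy_qrels = [
    ("q1", "doc-a", 2),
    ("q1", "doc-b", 1),
    ("q2", "doc-b", 1),
    ("q2", "doc-c", 1),
    ("q2", "doc-d", 2),
]

# Group into {query_id: {doc_id: score}} so relevance lookups are fast.
relevance = defaultdict(dict)
for query_id, doc_id, score in toy_qrels:
    relevance[query_id][doc_id] = score

for query_id, docs in relevance.items():
    highly = sorted(d for d, s in docs.items() if s == 2)
    print(f"{query_id}: {len(docs)} relevant docs, highly relevant: {highly}")
```

Note that the same document (`doc-b` here) can be relevant to more than one query.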

In [None]:
# --- Document statistics ---
print("Document text length (characters):")
print(corpus_df["text"].str.len().describe().round(0))
print()

# Relevance scores: 1 = relevant, 2 = highly relevant
print("Relevance score distribution (1 = relevant, 2 = highly relevant):")
print(qrels_df["score"].value_counts().sort_index().to_string())
print()

# Not all queries have ground truth: only 323 of 3,237 have qrels in the test split
n_queries_with_qrels = qrels_df["query-id"].nunique()
print(f"Queries with relevance judgments: {n_queries_with_qrels} (out of {len(queries_df)})")

# How many relevant docs per query? This affects recall interpretation later.
# If a query has 50 relevant docs but we only retrieve 20, recall can't exceed 0.4
per_query = qrels_df.groupby("query-id")["corpus-id"].count()
print("\nRelevant docs per query:")
print(per_query.describe().round(1))

# --- Example: look at one query and all its relevant documents ---
# Build a quick lookup by doc ID for the full corpus (before dedup)
corpus_lookup = corpus.to_pandas().set_index("_id")

sample_qid = qrels_df["query-id"].iloc[0]
sample_rels = qrels_df[qrels_df["query-id"] == sample_qid]
q_row = queries_df[queries_df["_id"] == sample_qid].iloc[0]
print(f"\nExample query [{sample_qid}]: {q_row['text']}")
print(f"  Has {len(sample_rels)} relevant documents:")
for _, r in sample_rels.iterrows():
    doc = corpus_lookup.loc[r["corpus-id"]]
    label = "highly relevant" if r["score"] == 2 else "relevant"
    print(f"    ({label}) {doc['title'][:80]}")

---

# 3. Set Up Search with LanceDB

> 📚 **TERM: Vector Database**  
> A database optimized for storing and searching over embeddings (vectors). Instead of exact keyword matching, vector databases find items that are *semantically similar* to a query.
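
As a rough illustration of what "semantically similar" means here, the sketch below ranks three documents by cosine similarity to a query. The 3-dimensional vectors are made up for the example; real embedding models produce vectors with hundreds or thousands of dimensions, and a vector database uses indexing to avoid comparing against every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" keyed by a short document label.
toy_docs = {
    "heart health": [0.9, 0.1, 0.0],
    "cooking pasta": [0.0, 0.2, 0.9],
    "cardio exercise": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

# Rank documents from most to least similar to the query.
ranked = sorted(
    toy_docs,
    key=lambda name: cosine_similarity(toy_query, toy_docs[name]),
    reverse=True,
)
print(ranked)
```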

> 📚 **TERM: LanceDB**  
> An open-source, embedded vector database. It runs locally (no server, no account needed), handles embeddings automatically, and supports vector search, lexical search, and hybrid search.

LanceDB uses Pydantic-style models to define your table schema:
- `SourceField()`: tells LanceDB which column to embed
- `VectorField()`: tells LanceDB where to store the embedding vector

LanceDB can use OpenAI's embedding API automatically through its **registry**. We pick an embedding model, define our schema, and LanceDB handles the rest.

LanceDB supports three search modes:
- `"vector"`: embedding-based semantic search (like what you built with cosine similarity)
- `"fts"`: **full-text search**, i.e. lexical search (like the BM25 you built from scratch)
- `"hybrid"`: combines both vector and lexical search
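
To build intuition for how two rankings can be combined, here is a minimal sketch of reciprocal rank fusion (RRF), one common fusion technique. This is a generic illustration, not LanceDB's internal code; how LanceDB actually merges results depends on its version and the reranker you configure. The document IDs are hypothetical and `k=60` is a conventional smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked near the top of any list get a larger boost.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_ranking = ["d1", "d2", "d3"]  # hypothetical vector search results
fts_ranking = ["d3", "d1", "d4"]     # hypothetical full-text search results

fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) end up ahead of documents that appear in only one.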

We'll concatenate each document's title and text into a single `content` field for both embedding and search.

In [None]:
from dotenv import load_dotenv

load_dotenv()

import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

# Set up the embedding function: LanceDB will call OpenAI's API automatically
func = get_registry().get("openai").create(name="text-embedding-3-small")

# Define the table schema
# - SourceField() marks the column to embed
# - VectorField() stores the resulting vector
class Document(LanceModel):
    doc_id: str
    title: str
    content: str = func.SourceField()  # this column gets embedded automatically
    vector: Vector(func.ndims()) = func.VectorField()  # embedding stored here


# Create a local LanceDB database and table
db = lancedb.connect("../temp/lancedb")
table = db.create_table("nfcorpus", schema=Document, mode="overwrite")

# Combine title + text into a single content field for richer embeddings
data = [
    {"doc_id": row["_id"], "title": row["title"], "content": row["title"] + "\n" + row["text"]}
    for _, row in corpus_df.iterrows()
]

# Add documents in batches
batch_size = 500
for i in range(0, len(data), batch_size):
    table.add(data[i : i + batch_size])
    print(f"  Embedded {min(i + batch_size, len(data))}/{len(data)} documents")

# Build a full-text search (FTS) index for lexical/BM25 search
table.create_fts_index("content", replace=True)

print(f"\n{table.count_rows()} documents indexed")

We will now test the three search modes on a sample query.

In [None]:
# Test all three search modes on a sample query
query = "How to prevent heart disease"

print(f"Query: {query}\n")
print("=" * 80)

for mode in ["vector", "fts", "hybrid"]:
    results = table.search(query, query_type=mode).limit(3).to_list()
    print(f"\n--- {mode} search ---")
    for i, r in enumerate(results):
        print(f"  [{i + 1}] {r['title'][:80]}")

---

# 4. Retrieval Metrics

Now that we have a search system, how do we measure how well it's working? We need **retrieval metrics**.

> 📚 **TERM: Precision@k**  
> Of the items you retrieved in the top *k*, what fraction are actually relevant?
> $$\text{Precision@k} = \frac{\text{\# relevant items in top-k}}{k}$$

> 📚 **TERM: Recall@k**  
> Out of all the relevant items that exist, what fraction did you find in the top *k* results?
> $$\text{Recall@k} = \frac{\text{\# relevant items in top-k}}{\text{total \# relevant items}}$$

As k increases:
- recall goes up or stays flat -- a longer list can only contain more of the relevant items
- precision tends to drop -- you pull in more irrelevant items
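
The two definitions and the trade-off above can be checked on a tiny hand-made example. This is a minimal sketch with invented document IDs; it assumes the retrieved list contains no duplicate IDs.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k, hits / len(relevant_ids)


retrieved = ["d1", "d7", "d3", "d9", "d4"]  # ranked search results (hypothetical)
relevant = {"d1", "d3", "d4", "d8"}         # ground truth: 4 relevant docs

for k in [1, 3, 5]:
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Here recall climbs from 0.25 to 0.75 as k grows, while precision falls from 1.00 to 0.60. Recall never reaches 1.0 because `d8` is not retrieved at all.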


Nfcorpus comes with **human relevance judgments** (qrels). For each test query, we know exactly which documents are relevant. This allows us to compute clean, reliable metrics.

In [None]:
# Build a lookup: for each query, the set of relevant document IDs
qrels_by_query = qrels_df.groupby("query-id")["corpus-id"].apply(set).to_dict()

# Only evaluate queries that have ground truth relevance judgments
test_query_ids = list(qrels_by_query.keys())
test_queries = queries_df[queries_df["_id"].isin(test_query_ids)]
print(f"Evaluating on {len(test_queries)} queries with ground truth\n")

# We'll compute metrics at multiple values of k
k_values = [1, 3, 5, 10, 20]
max_k = max(k_values)  # retrieve once at the largest k, then slice for smaller k

# Store results in tidy format: one row per (query, metric, k) combination
tidy_rows = []

for _, row in test_queries.iterrows():
    query_id = row["_id"]
    query_text = row["text"]
    relevant_ids = qrels_by_query[query_id]

    # Retrieve top-k documents using vector search
    retrieved = table.search(query_text, query_type="vector").limit(max_k).to_list()
    retrieved_ids = [r["doc_id"] for r in retrieved]

    for k in k_values:
        # Get the set of doc IDs in the top-k results
        ids_at_k = set(retrieved_ids[:k])
        # Count how many of the top-k are actually relevant
        n_relevant_at_k = len(ids_at_k & relevant_ids)

        # Precision@k = fraction of retrieved docs that are relevant
        precision = n_relevant_at_k / k
        # Recall@k = fraction of all relevant docs that we found
        recall = n_relevant_at_k / len(relevant_ids)

        tidy_rows.append({"metric": "precision", "k": k, "search_type": "vector", "score": precision, "query_id": query_id})
        tidy_rows.append({"metric": "recall", "k": k, "search_type": "vector", "score": recall, "query_id": query_id})

eval_df = pd.DataFrame(tidy_rows)

# Show average precision and recall at each k
print("Retrieval evaluation (vector search, ground truth qrels):\n")
print(eval_df.groupby(["search_type", "metric", "k"])["score"].mean().round(4).to_string())

---

# 5. Synthetic Question Generation

Nfcorpus comes with human relevance judgments, but most real-world datasets don't. When you build a RAG system over your company's docs -- or a new AI system more generally -- you will often have no ground truth.

One workaround for this problem is to **generate synthetic data**. The idea is simple:
1. Pick a document from your corpus
2. Ask an LLM to generate a question that this document can answer
3. Now you have a (question, document_id) pair with **known ground truth**

Because we know exactly which document the question came from, we can compute retrieval metrics: did the search system return the source document in its top-k?
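
With exactly one relevant document per question, the metrics simplify: recall@k becomes a hit rate (found or not), and precision@k can never exceed 1/k. The sketch below shows this on three made-up retrieval runs; all IDs are hypothetical.

```python
def hit_at_k(retrieved_ids, source_id, k):
    """Return 1.0 if the source document appears in the top-k results, else 0.0."""
    return 1.0 if source_id in retrieved_ids[:k] else 0.0


# (source doc ID, ranked retrieved IDs) for three synthetic questions
toy_runs = [
    ("d1", ["d1", "d5", "d9"]),  # source found at rank 1
    ("d2", ["d7", "d8", "d2"]),  # source found at rank 3
    ("d3", ["d4", "d6", "d5"]),  # source not retrieved
]

results = {}
for k in [1, 3]:
    hits = [hit_at_k(ids, src, k) for src, ids in toy_runs]
    recall = sum(hits) / len(hits)  # with one relevant doc, recall@k is the hit rate
    precision = recall / k          # mean of hit/k over questions, so at most 1/k
    results[k] = {"recall": recall, "precision": precision}
    print(f"k={k}: recall={recall:.2f}, precision={precision:.2f}")
```

Because of the 1/k cap, precision from a single-source evaluation is not directly comparable to precision computed against qrels with many relevant documents per query.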

**Diversifying our synthetic data**: If you use the same prompt every time, you'll get repetitive questions. One trick is to randomly add prompt "constraints" to force variety.
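
The constraint trick can be sketched without calling any LLM. The snippet below only builds prompts; the constraint wording, the `build_prompt` helper, and the document title are all hypothetical stand-ins. Seeding the random generator is an optional extra that makes the sampled constraints reproducible across runs.

```python
import random

# Hypothetical constraints, shortened for the example.
toy_constraints = [
    "Answerable in one word or a short phrase",
    "Framed as something a patient might ask",
    "About a specific number or finding",
]


def build_prompt(title, rng):
    """Attach one randomly chosen constraint to a base prompt."""
    constraint = rng.choice(toy_constraints)
    return f"Write a question about: {title}\nExtra rule: {constraint}"


rng = random.Random(0)  # seeded so the same constraints are drawn on every run
prompts = [build_prompt("Vitamin D and bone health", rng) for _ in range(5)]

for prompt in prompts:
    print(prompt.splitlines()[-1])
```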

In [None]:
import litellm
import asyncio
import random
import textwrap
from pydantic import BaseModel, Field


# 'chain_of_thought' makes the LLM reason before generating the question
class SyntheticQuestion(BaseModel):
    chain_of_thought: str = Field(description="Step-by-step reasoning about what makes a good question for this document")
    question: str = Field(description="A natural, specific question that can be answered using the document")
    answer: str = Field(description="The answer to the question")


# Random "constraints": one is appended to each prompt to diversify the questions
constraints = [
    "The question should be answerable in one word or a short phrase",
    "The question should require synthesizing multiple facts from the document",
    "Frame the question as something a patient might ask their doctor",
    "Ask about a specific number, date, or finding mentioned in the document",
]


async def generate_question(doc_id: str, title: str, text: str) -> dict:
    """Generate a synthetic question for a single document using an LLM."""
    constraint = random.choice(constraints)
    response = await litellm.acompletion(
        model="gpt-5.1",
        messages=[
            {
                "role": "user",
                "content": textwrap.dedent(f"""
                
                I will give you a document from BEIR's nfcorpus -- a dataset of biomedical documents. Please generate a question that can be answered using the following document.
                
                Title: {title}
                Text: {text}
                
                Rules:
                - Your question should be natural, specific, and concise
                - Your question should not assume that someone is reading the document, but rather that they are asking a general biomedical question
                - Your question must be answerable using the document that I gave you
                - {constraint}
                - Do not reference "the document" or "the study" in your question
                """
                ),
            }
        ],
        response_format=SyntheticQuestion,
    )
    
    # Parse the JSON response into our Pydantic model
    result = SyntheticQuestion.model_validate_json(response.choices[0].message.content)
    return {"doc_id": doc_id, "question": result.question, "answer": result.answer}


# Sample 80 documents to generate questions for
sample_docs = corpus_df.sample(n=80, random_state=42)

# Generate all questions concurrently using asyncio.gather
tasks = [generate_question(row["_id"], row["title"], row["text"]) for _, row in sample_docs.iterrows()]
synthetic_results = await asyncio.gather(*tasks)

synthetic_df = pd.DataFrame(synthetic_results)
print(f"Generated {len(synthetic_df)} synthetic questions\n")

# Show some examples with their source documents
for _, row in synthetic_df.head(5).iterrows():
    doc = corpus_df[corpus_df["_id"] == row["doc_id"]].iloc[0]
    print(f"Q: {row['question']}")
    print(f"   Source: {doc['title'][:80]}")
    print(f"   Answer: {row['answer']}")
    print()

In [None]:
# Evaluate retrieval on synthetic questions
# Key difference from qrels: each synthetic question has exactly 1 relevant doc (its source)
syn_rows = []

for _, row in synthetic_df.iterrows():
    source_id = row["doc_id"]
    question = row["question"]

    # Search for the synthetic question
    retrieved = table.search(question, query_type="vector").limit(max_k).to_list()
    retrieved_ids = [r["doc_id"] for r in retrieved]

    for k in k_values:
        ids_at_k = retrieved_ids[:k]

        # Binary relevance: did we find the source document in top-k?
        found = source_id in ids_at_k
        precision = (1.0 if found else 0.0) / k  # at most 1 relevant doc
        recall = 1.0 if found else 0.0  # found it or didn't

        syn_rows.append({"metric": "precision", "k": k, "search_type": "vector", "score": precision, "question": question})
        syn_rows.append({"metric": "recall", "k": k, "search_type": "vector", "score": recall, "question": question})

syn_eval_df = pd.DataFrame(syn_rows)

# Compare with qrels results above: synthetic questions are typically "easier" for retrieval
# because the LLM generates questions using the document's own language
print("Retrieval evaluation (synthetic questions, vector search):\n")
print(syn_eval_df.groupby(["search_type", "metric", "k"])["score"].mean().round(4).to_string())

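**Synthetic vs. human-labeled evaluation**: Notice the difference: synthetic questions typically give much higher recall than the human qrels. (Precision@k is not directly comparable, because each synthetic question has exactly one relevant document, which caps precision@k at 1/k.) Why the gap? One reason is that the LLM generates questions using the document's own vocabulary, making them easier to retrieve via embedding similarity. Another is that finding a single source document is an easier target than finding dozens of judged documents. Real user queries are messier and more diverse. This is an important caveat: **synthetic evals can overestimate your system's real-world performance**. They're great when you have no ground truth at all, but treat the numbers as an optimistic estimate.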

Having said that, we can make synthetic questions much more difficult.

**Question: How would you do that?**

---

# 6. Experiment and Improve

Let's use our ground truth qrels to compare different search strategies. We'll run vector, full-text, and hybrid search and organize results in **tidy data format**, where each row is one observation.
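
The appeal of tidy rows is that any comparison becomes a simple group-and-average. Here is a minimal standard-library sketch of that aggregation on four invented rows; the notebook itself does the same thing with pandas `groupby`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy rows: one observation per (query, metric, k, search_type).
toy_rows = [
    {"metric": "precision", "k": 1, "search_type": "vector", "score": 1.0, "query_id": "q1"},
    {"metric": "precision", "k": 1, "search_type": "vector", "score": 0.0, "query_id": "q2"},
    {"metric": "precision", "k": 1, "search_type": "fts", "score": 1.0, "query_id": "q1"},
    {"metric": "precision", "k": 1, "search_type": "fts", "score": 1.0, "query_id": "q2"},
]

# Collect scores per (search_type, metric, k) group.
groups = defaultdict(list)
for row in toy_rows:
    groups[(row["search_type"], row["metric"], row["k"])].append(row["score"])

# Average within each group.
summary = {key: mean(scores) for key, scores in groups.items()}
for key, value in sorted(summary.items()):
    print(key, value)
```

Adding a new search strategy or metric just means appending more rows; the aggregation code does not change.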

**A note on recall**: as we saw in Section 2, queries in nfcorpus have a **median of 16 relevant documents** (and a mean of 38). Since we only retrieve up to k=20, queries with more than 20 relevant documents can never reach full recall, so average recall values will be low. This is expected and not a problem with our search system. In practice, **precision tells us about the quality of our top results** (are the docs we return actually useful?), while recall tells us how much of the total relevant information we're capturing. Both matter, but for RAG, where we feed a handful of documents to an LLM, precision is usually more important.
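
The ceiling on recall is simple arithmetic: even a perfect ranking can return at most k relevant documents. This small sketch computes the best achievable recall@k for a few hypothetical relevant-document counts.

```python
def max_recall_at_k(n_relevant, k):
    """Best possible recall@k when a query has n_relevant relevant documents."""
    return min(k, n_relevant) / n_relevant


for n_relevant in [5, 16, 38, 100]:
    ceiling = max_recall_at_k(n_relevant, k=20)
    print(f"{n_relevant:>3} relevant docs -> max recall@20 = {ceiling:.2f}")
```

A query with 16 relevant documents could in principle reach recall 1.0 at k=20, while a query with 100 relevant documents is capped at 0.20 no matter how good the search is.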

In [None]:
import matplotlib.pyplot as plt

# Compare all three search strategies on the same 323 test queries
search_types = ["vector", "fts", "hybrid"]
experiment_rows = []

for search_type in search_types:
    print(f"Evaluating {search_type} search...")
    for _, row in test_queries.iterrows():
        query_id = row["_id"]
        query_text = row["text"]
        relevant_ids = qrels_by_query[query_id]

        # Retrieve top-k documents using this search strategy
        retrieved = table.search(query_text, query_type=search_type).limit(max_k).to_list()
        retrieved_ids = [r["doc_id"] for r in retrieved]

        # Compute precision and recall at each k
        for k in k_values:
            ids_at_k = set(retrieved_ids[:k])
            n_relevant_at_k = len(ids_at_k & relevant_ids)

            precision = n_relevant_at_k / k
            recall = n_relevant_at_k / len(relevant_ids)

            experiment_rows.append(
                {"metric": "precision", "k": k, "search_type": search_type, "score": precision, "query_id": query_id}
            )
            experiment_rows.append(
                {"metric": "recall", "k": k, "search_type": search_type, "score": recall, "query_id": query_id}
            )

# Tidy data format: each row is one observation (metric, k, search_type, score, query_id)
experiment_df = pd.DataFrame(experiment_rows)

# Show average scores across all queries
print("\nResults:\n")
print(experiment_df.groupby(["search_type", "metric", "k"])["score"].mean().round(4).to_string())

In [None]:
# Plot precision@k and recall@k curves side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: precision@k (how precise are the top results?)
for search_type in search_types:
    # Use a fresh name so we don't shadow the `data` list built when indexing
    subset = experiment_df[(experiment_df["metric"] == "precision") & (experiment_df["search_type"] == search_type)]
    means = subset.groupby("k")["score"].mean()
    ax1.plot(means.index, means.values, marker="o", label=search_type)

ax1.set_xlabel("k")
ax1.set_ylabel("Precision@k")
ax1.set_title("Precision@k by Search Type")
ax1.legend()
ax1.grid(True)
ax1.set_ylim(0, 1.05)

# Right plot: recall@k (what fraction of relevant docs did we find?)
# Recall is low because queries have many relevant docs (median 16, mean 38)
# but we only retrieve up to k=20
for search_type in search_types:
    subset = experiment_df[(experiment_df["metric"] == "recall") & (experiment_df["search_type"] == search_type)]
    means = subset.groupby("k")["score"].mean()
    ax2.plot(means.index, means.values, marker="o", label=search_type)

ax2.set_xlabel("k")
ax2.set_ylabel("Recall@k")
ax2.set_title("Recall@k by Search Type")
ax2.legend()
ax2.grid(True)
ax2.set_ylim(0)

plt.tight_layout()
plt.show()

---

# Summary

| What we learned | Key takeaway |
|---|---|
| **Hugging Face** | A one-stop ecosystem for datasets and models. The `datasets` library makes loading and caching easy. |
| **LanceDB** | An embedded vector database: no server, no account. Define a Pydantic schema and it handles embeddings + search for you. |
| **Precision & Recall** | Precision@k measures the quality of your top results. Recall@k measures how much relevant information you're capturing. Both matter. |
| **Synthetic questions** | When you don't have ground truth, generate test questions with LLMs. Useful but can overestimate real-world performance. |
| **Experiments** | Compare strategies (vector, lexical, hybrid) on the same queries with the same metrics. Let the data tell you what works. |

**The big picture**: Evaluations turn "I think this works" into "I measured this and it works." Every time you change your retrieval strategy, embedding model, chunking approach, or prompt, run your evals and compare. That's how you improve systematically.