
# Day 3 – Exercise 1: Evaluation Dataset and Retrieval Metrics

## 🎯 Objectives

By completing this exercise, you will:

- **Construct** a domain‑specific evaluation dataset consisting of queries and ground‑truth document references.
- **Implement** standard retrieval metrics: Precision@k, Recall@k, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (nDCG).
- **Evaluate** retrieval performance on the dataset using a simple vector store (e.g., FAISS/Chroma equivalent) for top‑k values of 3, 5, and 10.
- **Analyze** the results to understand how retrieval effectiveness varies with different k values and how to improve your system.

## 📚 Background & Plan

Evaluation is critical for improving RAG systems. An evaluation dataset typically includes real user queries and their correct answers (or relevant documents). You then measure how well your retrieval component finds these documents. In this notebook:

1. We'll create a synthetic dataset for a customer‑support domain with **20 queries** and **5 documents**. In this dataset each query has exactly one relevant document ID as ground truth, although the `ground_truth` field is a list and can hold several.
2. We'll build a simple retrieval system using **TF‑IDF vectors** and a cosine similarity search (as a stand‑in for FAISS/Chroma). In production, you would use **FAISS** or **Chroma** for scalable vector search.
3. We'll implement functions to compute **Precision@k**, **Recall@k**, **MRR**, and **nDCG**.
4. We'll compute these metrics for **k=3, 5, 10** and analyze the results.

Let's get started!



## Stage A: Construct the Evaluation Dataset

We'll define a small corpus of customer‑support documents and a set of 20 queries. For each query, we specify the IDs of the relevant documents (`ground_truth`). In a real scenario, you'd source these from historical interactions and label them manually.


In [1]:

import pandas as pd

# Define a corpus of documents (id, title, content)
docs = [
    {"id": "D1", "title": "Password Reset Policy", "content": "To reset your password, click on the 'Forgot Password' link and follow the instructions sent to your email."},
    {"id": "D2", "title": "Refund Policy", "content": "Customers can request a refund within 30 days of purchase. Refunds are processed within 5 business days."},
    {"id": "D3", "title": "Subscription Billing", "content": "Subscriptions are billed on the first of each month. To update billing information, visit the account settings page."},
    {"id": "D4", "title": "Troubleshooting App Crashes", "content": "If your app crashes on startup, try clearing the cache or reinstalling the app. Contact support if the issue persists."},
    {"id": "D5", "title": "API Integration Guide", "content": "Our API supports REST endpoints for creating and retrieving resources. Authentication is handled via API keys."},
]

doc_df = pd.DataFrame(docs)

# Define queries with ground-truth relevant document IDs
queries = [
    {"query": "How do I reset my account password?", "ground_truth": ["D1"]},
    {"query": "What's your refund timeline?", "ground_truth": ["D2"]},
    {"query": "When is my subscription billed?", "ground_truth": ["D3"]},
    {"query": "App keeps crashing on launch, any fix?", "ground_truth": ["D4"]},
    {"query": "How can I authenticate API requests?", "ground_truth": ["D5"]},
    {"query": "Where can I find the password reset option?", "ground_truth": ["D1"]},
    {"query": "Explain the refund policy", "ground_truth": ["D2"]},
    {"query": "My subscription charges are due when?", "ground_truth": ["D3"]},
    {"query": "App crash troubleshooting steps", "ground_truth": ["D4"]},
    {"query": "Steps for integrating your API", "ground_truth": ["D5"]},
    {"query": "Can I get my money back after purchase?", "ground_truth": ["D2"]},
    {"query": "Update payment information", "ground_truth": ["D3"]},
    {"query": "API key authentication method", "ground_truth": ["D5"]},
    {"query": "Reinstalling the app didn’t help", "ground_truth": ["D4"]},
    {"query": "Set a new password for my account", "ground_truth": ["D1"]},
    {"query": "Monthly billing date", "ground_truth": ["D3"]},
    {"query": "Your policy on giving refunds", "ground_truth": ["D2"]},
    {"query": "Application crash fix guide", "ground_truth": ["D4"]},
    {"query": "Password change process", "ground_truth": ["D1"]},
    {"query": "How to use REST API endpoints?", "ground_truth": ["D5"]},
]

query_df = pd.DataFrame(queries)

print("Corpus:")
doc_df


Corpus:


|   | id | title | content |
|---|----|-------|---------|
| 0 | D1 | Password Reset Policy | To reset your password, click on the 'Forgot P... |
| 1 | D2 | Refund Policy | Customers can request a refund within 30 days ... |
| 2 | D3 | Subscription Billing | Subscriptions are billed on the first of each ... |
| 3 | D4 | Troubleshooting App Crashes | If your app crashes on startup, try clearing t... |
| 4 | D5 | API Integration Guide | Our API supports REST endpoints for creating a... |


In [3]:

print("Queries (first 5 shown):")
query_df.head()


Queries (first 5 shown):


|   | query | ground_truth |
|---|-------|--------------|
| 0 | How do I reset my account password? | [D1] |
| 1 | What's your refund timeline? | [D2] |
| 2 | When is my subscription billed? | [D3] |
| 3 | App keeps crashing on launch, any fix? | [D4] |
| 4 | How can I authenticate API requests? | [D5] |


Our synthetic dataset includes short documents and varied phrasings of user queries. Each query has one correct document in the ground truth.
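Before running any retrieval, it can help to sanity-check the labels. The sketch below is a hypothetical helper (`validate_ground_truth` is not part of this notebook) that assumes the same `id`, `query`, and `ground_truth` field names used above. It reports queries whose labels are empty or point at unknown document IDs. The toy data is illustrative only, not the full corpus.

```python
def validate_ground_truth(docs, queries):
    """Return (query_text, problem) tuples; an empty list means the labels look consistent."""
    known_ids = {doc["id"] for doc in docs}
    problems = []
    for q in queries:
        labels = q.get("ground_truth", [])
        if not labels:
            problems.append((q["query"], "no ground-truth labels"))
            continue
        unknown = [doc_id for doc_id in labels if doc_id not in known_ids]
        if unknown:
            problems.append((q["query"], f"unknown document IDs: {unknown}"))
    return problems


# Toy data (assumed for illustration): one valid label, one typo, one empty label
toy_docs = [{"id": "D1"}, {"id": "D2"}]
toy_queries = [
    {"query": "How do I reset my account password?", "ground_truth": ["D1"]},
    {"query": "Explain the refund policy", "ground_truth": ["D9"]},
    {"query": "Unlabeled query", "ground_truth": []},
]

problems = validate_ground_truth(toy_docs, toy_queries)
for query_text, problem in problems:
    print(f"{query_text!r}: {problem}")
```

Running a check like this on the real `docs` and `queries` lists should return an empty list. Any entry it reports would otherwise silently lower every metric for that query.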


## Stage B: Build a Simple Vector Store and Retrieval

To evaluate retrieval, we'll embed both documents and queries using TF‑IDF vectors and then perform **cosine similarity search**. While FAISS or Chroma would provide more efficient vector search on larger datasets, TF‑IDF is sufficient for demonstration and needs no dependencies beyond scikit‑learn.


In [4]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

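# Fit the vectorizer on documents and queries together so both share one vocabulary.
# Note: this lets the evaluation queries influence the IDF weights; on a real
# benchmark you would typically fit on the documents only.
corpus_texts = doc_df["content"].tolist() + query_df["query"].tolist()
vectorizer = TfidfVectorizer().fit(corpus_texts)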

# Vectorize documents
doc_vectors = vectorizer.transform(doc_df["content"]).toarray()

# Vectorize queries
query_vectors = vectorizer.transform(query_df["query"]).toarray()

# Precompute similarity matrix (queries x documents)
sim_matrix = cosine_similarity(query_vectors, doc_vectors)

# Function to retrieve top-k document IDs for a query index
def retrieve_top_k(query_idx: int, k: int = 5):
    similarities = sim_matrix[query_idx]
    # Get indices of top k documents sorted by similarity score
    top_indices = similarities.argsort()[::-1][:k]
    # Map indices to document IDs
    top_doc_ids = doc_df.iloc[top_indices]["id"].tolist()
    return top_doc_ids

# Example retrieval for first two queries
for i in range(2):
    print(query_df.loc[i, "query"], "->", retrieve_top_k(i, k=3))


How do I reset my account password? -> ['D1', 'D3', 'D5']
What's your refund timeline? -> ['D1', 'D2', 'D4']


This function uses cosine similarity of TF‑IDF vectors to produce top‑k document IDs. In a real RAG system, you'd use a dense embedding model with FAISS or Chroma for higher quality retrieval.
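To make the ranking step concrete, here is a minimal pure-Python sketch of cosine-similarity top-k retrieval. The vectors are made-up term weights, not real TF-IDF output, and `cosine` and `top_k` are illustrative names rather than notebook functions. Ties are broken by insertion order, which keeps the results deterministic.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k):
    """Return up to k document IDs ordered by descending similarity to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    # sorted() is stable, so documents with equal scores keep their insertion order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ranked[:k]]


# Assumed toy vocabulary: [password, refund, billing]
doc_vecs = {
    "D1": [1.0, 0.0, 0.0],
    "D2": [0.0, 1.0, 0.0],
    "D3": [0.0, 0.2, 1.0],
}
query_vec = [0.0, 0.9, 0.1]  # a query that is mostly about refunds

print(top_k(query_vec, doc_vecs, k=2))
```

Asking for more results than there are documents simply returns the whole collection in ranked order, which is also how `retrieve_top_k` behaves for k=10 on a five-document corpus.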


## Stage C: Implement Retrieval Metrics

We will implement the following metrics:

- **Precision@k (P@k):** Fraction of retrieved documents in the top k that are relevant.
- **Recall@k (R@k):** Fraction of relevant documents that appear in the top k.
- **Mean Reciprocal Rank (MRR):** Average reciprocal rank of the first relevant document.
- **nDCG@k:** Normalized Discounted Cumulative Gain, which gives higher weight to correctly ranked documents near the top.

Each metric is computed across all queries and averaged.
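
For reference, the definitions above can be written out for a single query as follows. This is a sketch in notation introduced here, not taken from the notebook: $\mathrm{rel}_i \in \{0, 1\}$ marks whether the document at rank $i$ is relevant, $G$ is the set of relevant documents for the query, and $Q$ is the set of queries. Relevance is assumed to be binary, which matches the ground-truth labels from Stage A.

$$
\begin{aligned}
\mathrm{P@}k &= \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}_i \\
\mathrm{R@}k &= \frac{1}{|G|} \sum_{i=1}^{k} \mathrm{rel}_i \\
\mathrm{RR} &= \begin{cases} 1 / \min\{\, i : \mathrm{rel}_i = 1 \,\} & \text{if a relevant document is retrieved} \\ 0 & \text{otherwise} \end{cases} \\
\mathrm{MRR} &= \frac{1}{|Q|} \sum_{q \in Q} \mathrm{RR}_q \\
\mathrm{DCG@}k &= \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)} \\
\mathrm{nDCG@}k &= \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{IDCG@}k = \sum_{i=1}^{\min(k,\, |G|)} \frac{1}{\log_2(i + 1)}
\end{aligned}
$$

P@k, R@k, and nDCG@k are averaged over $Q$ in the same way as the reciprocal rank.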


In [5]:

import numpy as np

# Metric functions
def precision_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents (the denominator is always k)."""
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    return hits / k

def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved: list, relevant: list) -> float:
    """1 / rank of the first relevant document in `retrieved`, or 0.0 if none is present."""
    for idx, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / idx
    return 0.0

def dcg_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Discounted cumulative gain with binary relevance: a hit at rank i adds 1 / log2(i + 1)."""
    score = 0.0
    for i, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            score += 1 / np.log2(i + 1)
    return score

def ndcg_at_k(retrieved: list, relevant: list, k: int) -> float:
    """DCG divided by the ideal DCG, where all relevant documents occupy the top ranks."""
    dcg = dcg_at_k(retrieved, relevant, k)
    # Ideal DCG is when all relevant items are ranked at the top
    ideal_retrieval = relevant[:]
    ideal_dcg = dcg_at_k(ideal_retrieval, relevant, min(k, len(relevant)))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Function to compute metrics over all queries
def compute_metrics(k: int) -> dict:
    precisions, recalls, rr_list, ndcgs = [], [], [], []
    for idx in range(len(query_df)):
        retrieved = retrieve_top_k(idx, k)
        relevant = query_df.loc[idx, "ground_truth"]
        precisions.append(precision_at_k(retrieved, relevant, k))
        recalls.append(recall_at_k(retrieved, relevant, k))
        rr_list.append(reciprocal_rank(retrieved, relevant))
        ndcgs.append(ndcg_at_k(retrieved, relevant, k))
    return {
        "Precision@k": np.mean(precisions),
        "Recall@k": np.mean(recalls),
        "MRR": np.mean(rr_list),
        "nDCG@k": np.mean(ndcgs),
    }

# Test metrics for k=3
metrics_k3 = compute_metrics(3)
metrics_k3


{'Precision@k': np.float64(0.3166666666666666),
 'Recall@k': np.float64(0.95),
 'MRR': np.float64(0.8416666666666668),
 'nDCG@k': np.float64(0.8696394630357187)}

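The metrics indicate how often the relevant document appears in the top k results and how high it is ranked. Because each query has exactly one relevant document, Precision@3 cannot exceed 1/3, so 0.317 is close to the ceiling rather than a sign of poor retrieval. High MRR and nDCG values show that relevant documents are usually near the top of the retrieved list.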
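
As a worked example for a single query, the sketch below recomputes each metric by hand for the ranking shown earlier for "What's your refund timeline?" (`['D1', 'D2', 'D4']`, ground truth `D2`). It is standalone and uses only the standard library. The variable names are illustrative and it does not call the notebook's functions.

```python
import math

retrieved = ["D1", "D2", "D4"]  # ranked list returned for one query
relevant = {"D2"}               # ground truth for that query
k = 3

# 1 where the document at that rank is relevant, else 0 -> [0, 1, 0]
hits = [1 if doc_id in relevant else 0 for doc_id in retrieved[:k]]

precision = sum(hits) / k
recall = sum(hits) / len(relevant)

# Reciprocal rank: 1 / position of the first hit (positions start at 1)
first_hit = next((i for i, h in enumerate(hits, start=1) if h), None)
rr = 1.0 / first_hit if first_hit else 0.0

# DCG with binary gains; the ideal ranking puts the single relevant doc at rank 1
dcg = sum(h / math.log2(i + 1) for i, h in enumerate(hits, start=1))
idcg = sum(1 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
ndcg = dcg / idcg

print(f"P@3={precision:.3f} R@3={recall:.3f} RR={rr:.3f} nDCG@3={ndcg:.3f}")
# prints: P@3=0.333 R@3=1.000 RR=0.500 nDCG@3=0.631
```

The relevant document sits at rank 2, so recall is perfect while the reciprocal rank and nDCG are penalized for the miss at rank 1.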


## Stage D: Evaluate Retrieval for k=3, 5, 10

We'll compute the metrics for three values of k to see how retrieval performance changes as we ask for more documents. Typically, precision decreases with larger k, while recall increases. MRR and nDCG highlight ranking quality. Because the corpus contains only 5 documents, k=10 returns the same five documents as k=5.


In [6]:

ks = [3, 5, 10]
results = {}
for k in ks:
    results[k] = compute_metrics(k)

# Display results in a DataFrame
results_df = pd.DataFrame(results).T
results_df


| k  | Precision@k | Recall@k | MRR      | nDCG@k   |
|----|-------------|----------|----------|----------|
| 3  | 0.316667    | 0.95     | 0.841667 | 0.869639 |
| 5  | 0.2         | 1.0      | 0.854167 | 0.891173 |
| 10 | 0.1         | 1.0      | 0.854167 | 0.891173 |



### Analysis

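- **Precision vs. Recall:** As expected, `Precision@k` decreases as k increases because we retrieve more documents (including non‑relevant ones), while `Recall@k` rises from 0.95 to 1.0. The corpus holds only 5 documents, so k=5 already returns every document and a recall of 1.0 is guaranteed. At k=10 the same five documents come back and only the precision denominator changes.
- **MRR:** Reflects how far down the ranking the first relevant document appears. It rises slightly from k=3 to k=5 (0.842 to 0.854) because a relevant document ranked below position 3 now contributes its reciprocal rank instead of 0, and it is unchanged at k=10. In this implementation the reciprocal rank is computed over the truncated top‑k list, so it can only stay the same or increase as k grows.
- **nDCG:** Also rises slightly from k=3 to k=5 (0.870 to 0.891) for the same reason, then stays flat at k=10.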

These metrics help you tune your retriever (e.g., by adjusting vector representations, using better embeddings, or combining retrieval methods) and decide how many documents to pass to the reader/LLM.
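
One quick consistency check on the table: with exactly one relevant document per query, each query scores either 0 or 1 hit, so mean Precision@k should equal mean Recall@k divided by k. The sketch below verifies this against the reported numbers. `precision_from_recall` is a hypothetical helper introduced for this check, and the values are copied from the results table.

```python
# Values copied from the results table: k -> (Precision@k, Recall@k)
reported = {
    3: (0.316667, 0.95),
    5: (0.2, 1.0),
    10: (0.1, 1.0),
}


def precision_from_recall(recall, k, relevant_per_query=1):
    """With a fixed number of relevant docs per query, mean P@k = mean R@k * n_rel / k."""
    return recall * relevant_per_query / k


for k, (precision, recall) in reported.items():
    expected = precision_from_recall(recall, k)
    print(f"k={k}: reported P@k={precision:.6f}, R@k/k={expected:.6f}")
    # Tolerance allows for the table's six-decimal rounding
    assert abs(precision - expected) < 1e-5
```

Because the two columns carry the same information in this setup, Recall@k, MRR, and nDCG are the more informative signals here. Precision@k becomes independently useful once queries have differing numbers of relevant documents.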



## ✅ Conclusion & Next Steps

In this exercise you:

- Built a small evaluation dataset of 20 queries and 5 documents with ground‑truth labels.
- Created a simple vector store using TF‑IDF and cosine similarity to simulate document retrieval.
- Implemented common retrieval metrics (Precision@k, Recall@k, MRR, nDCG) and computed them for different values of k.
- Analyzed how retrieval performance changes with k and how these metrics inform system tuning.

**Next Steps:** For a production system, replace the TF‑IDF vectorizer with a dense embedding model (e.g., OpenAI or sentence transformers) and use a scalable vector index or store such as FAISS or Chroma. Continue to expand your evaluation dataset with real queries and refine your retriever based on the metrics.


In [7]:

# Quick Install (if needed)
# !pip install pandas==2.2.0 scikit-learn==1.4.0
