# Evaluating RAG Pipeline - Recall

In this notebook, we calculate **recall** at several top-k cutoffs. Here, `k` is the number of contexts retrieved from the RAG system, and Recall@k is the proportion of relevant documents that appear among the top `k` retrieved results. We report it at k = 1, 3, 5, and 10.

By default, `top_k` is set to 10 so that Recall@1, Recall@3, Recall@5, and Recall@10 can all be computed. You can change the `top_k` variable to retrieve more or fewer documents; metrics whose cutoff exceeds `top_k` are not reported.
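
To make the definition concrete, here is a minimal sketch of Recall@k on made-up data. The function name `recall_at_k` and the identifiers are hypothetical and are not part of this notebook's pipeline; the sketch assumes recall is computed over distinct relevant items.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of distinct relevant items found among the first k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k_set = set(retrieved[:k])
    return len(top_k_set & relevant_set) / len(relevant_set)


# Hypothetical identifiers, for illustration only
retrieved = ["doc_a", "doc_x", "doc_b", "doc_y", "doc_z"]
relevant = ["doc_a", "doc_b"]

scores = {k: recall_at_k(retrieved, relevant, k) for k in (1, 3, 5, 10)}
print(scores)  # {1: 0.5, 3: 1.0, 5: 1.0, 10: 1.0}
```

Only one of the two relevant items is ranked first, so Recall@1 is 0.5; both appear within the top 3, so every larger cutoff reaches 1.0.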

**Prerequisites**: This notebook assumes you have already completed [`evaluation_01_ragas`](./evaluation_01_ragas.ipynb), your `rag-server` and `ingestor-server` are running, and the financebench data has been ingested. Since the evaluation data was processed in the previous notebook, we can directly use the retrieval endpoints.


In [None]:
# Installing required Python packages
! pip install pandas tqdm requests

# Installing required Python packages for gt_page_mapper.py
! pip install pypdf2


In [None]:
import os
import json
import requests
from tqdm import tqdm
import pandas as pd

## 1. Map Contexts to Documents

### Prerequisites
Before proceeding with this notebook, ensure that you have completed the following:

1. **Completed the [`evaluation_01_ragas`](./evaluation_01_ragas.ipynb) notebook**: That notebook downloads the `financebench` dataset.
2. **RAG server is running**: `rag-server` should be up and running.
3. **Data is ingested**: The financebench PDF documents should be ingested into your RAG system with collection name `financebench`.

### Ground Truth Context Mapping

In this notebook, we will use the `financebench` dataset to evaluate recall metrics. We'll use the `gt_page_mapper.py` script to map ground truth contexts to their corresponding document pages in the dataset. 

The script processes various document formats (PDF, TXT) and determines the most relevant pages using either explicit page numbers or text similarity. In our case, the financebench dataset contains only PDFs.

**Note**: If you are using a different dataset, you will likely need to update the `gt_page_mapper.py` script according to your ground truth data formatting.

The script will generate a new JSON file (`gt_file_pages-financebench.json`) that contains the ground truth context metadata (document name and page number) for each question.
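
The sketch below illustrates the layout this notebook assumes when it reads that file: a list with one entry per question, each holding a `contexts` list whose items carry `filename` and `page` (a string, possibly empty). This is inferred from how the file is consumed later in the notebook; the actual output of `gt_page_mapper.py` may contain additional fields. The sample values and the helper `summarize_ground_truth` are hypothetical.

```python
import json

# Hypothetical sample mirroring the fields this notebook reads later
# ("contexts", "filename", "page"); the values are made up.
sample_json = """
[
  {"contexts": [{"filename": "EXAMPLE_2022_10K.pdf", "page": "41"},
                {"filename": "EXAMPLE_2022_10K.pdf", "page": "43"}]},
  {"contexts": [{"filename": "OTHER_2021_10K.pdf", "page": ""}]}
]
"""


def summarize_ground_truth(entries):
    """Return (number of entries, number of entries whose first context has a page)."""
    with_pages = sum(
        1 for entry in entries
        if entry["contexts"] and str(entry["contexts"][0].get("page", "")).strip()
    )
    return len(entries), with_pages


entries = json.loads(sample_json)
total, with_pages = summarize_ground_truth(entries)
print(f"{total} entries, {with_pages} with page numbers")
```

A quick check like this on the generated file can show whether page-level recall will be available before running the full evaluation.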

In [None]:
!python3 gt_page_mapper.py --dataset financebench

## 2. Define Variables


In [None]:
# local data
DATASET_BASE_DIR ="../data/financebench"
GT_FILENAME_PAGE_DATA = "./gt_file_pages-financebench.json"
EVAL_DATA = "data/financebench_open_source.jsonl"

# endpoints
IPADDRESS = "rag-server" if os.environ.get("AI_WORKBENCH", "false") == "true" else "localhost"  # Replace this with the correct IP address
RAG_SERVER_PORT = "8081"
BASE_URL = f"http://{IPADDRESS}:{RAG_SERVER_PORT}"  # Replace with your server URL

collection_name = "financebench"
dataset_name = "financebench"

error_count = 0
top_k = 10 # change this to retrieve more/less documents

TIMEOUT = 180


## 3. Define all methods

In [None]:
def get_eval_data(dataset_path):
    """Load the evaluation JSONL file and keep only the fields used in this notebook."""
    data = []

    # Read the JSONL file, one JSON record per line
    with open(dataset_path, 'r', encoding='utf-8') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)  # parse one JSON record into a dictionary
            filtered_entry = {
                "id": entry["financebench_id"],
                "question": entry["question"],
                "answer": entry["answer"],
                "context": entry["evidence"],
            }
            data.append(filtered_entry)

    print("    - Loaded Evaluation data")
    print("    - Number of data points", len(data))
    return data

In [None]:
def get_retrieved_documents(query, BASE_URL, collection_name, top_k):
    data_search = {
        "query": query,
        "top_k": top_k,
        "collection_name": collection_name
    }
        
    try:
        docs = []
        # Use BASE_URL as-is if it already includes a scheme; otherwise default to http://
        if BASE_URL.startswith(('https://', 'http://')):
            url_search = f"{BASE_URL}/v1/search"
        else:
            url_search = f"http://{BASE_URL}/v1/search"
        # Note: verify=False disables TLS certificate verification
        with requests.post(url_search, json=data_search, timeout=TIMEOUT, verify=False) as req:
            req.raise_for_status()
            context_doc = req.json()
            for doc in context_doc.get("results", []):
                if doc.get("document_name"):
                    # Build a "<document_name>_<page_number>" identifier for page-level matching
                    page_number = doc.get("metadata").get("content_metadata").get("page_number")
                    docs.append(f"{doc.get('document_name')}_{page_number}")
        return docs
    except Exception as e:
        print(f"Failed to get response from /search endpoint of rag-server. Error details: "
              f"{e}. Refer to rag-server logs for details.")
        return []
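
To show the identifier format without a running server, the sketch below applies the same parsing to a mock response. The response contains only the fields read by `get_retrieved_documents` (`results`, `document_name`, `metadata.content_metadata.page_number`); the values are made up, and the helper `extract_doc_ids` is hypothetical. Unlike the function above, this sketch tolerates missing metadata instead of raising.

```python
def extract_doc_ids(response_json):
    """Turn a search response into '<document_name>_<page_number>' identifiers."""
    doc_ids = []
    for doc in response_json.get("results", []):
        name = doc.get("document_name")
        if not name:
            continue  # skip results without a document name
        content_meta = (doc.get("metadata") or {}).get("content_metadata") or {}
        doc_ids.append(f"{name}_{content_meta.get('page_number')}")
    return doc_ids


# Hypothetical response; only the fields read above are included
mock_response = {
    "results": [
        {"document_name": "EXAMPLE_2022_10K.pdf",
         "metadata": {"content_metadata": {"page_number": 41}}},
        {"document_name": "", "metadata": {}},
        {"document_name": "OTHER_2021_10K.pdf", "metadata": {}},
    ]
}

doc_ids = extract_doc_ids(mock_response)
print(doc_ids)  # ['EXAMPLE_2022_10K.pdf_41', 'OTHER_2021_10K.pdf_None']
```

The order of the identifiers follows the order of `results`, which is what makes the later top-k slicing meaningful.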

In [None]:
def create_eval_dict(eval_dataset, BASE_URL, collection_name, top_k):
    """Create a list of evaluation records with the documents retrieved for each question."""
    global error_count  # module-level counter defined in the variables cell
    eval_data = []

    for d in eval_dataset:
        try:
            retrieved_docs = get_retrieved_documents(d.get('question'), BASE_URL, collection_name, top_k)
        except Exception as e:
            print(f"Error processing question {d.get('question')}: {e}")
            error_count += 1
            # Keep an empty placeholder so indices stay aligned with the ground truth file
            retrieved_docs = []
        eval_data.append({
            'id': d.get('id'),
            'question': d.get('question'),
            'retrieved_docs': retrieved_docs,
        })

    return eval_data


In [None]:
def get_recall_metric(dataset_name, eval_data, granularity):
    """
    Calculate recall metrics for RAG (Retrieval-Augmented Generation) evaluation at multiple top-k levels.
    
    This function evaluates how well the retrieval system performs by measuring the proportion 
    of relevant documents that are successfully retrieved at different top-k cutoffs (1, 3, 5, 10).
    It supports both page-level and document-level granularity to accommodate different evaluation needs.
    
    The recall metric is calculated as:
        Recall@k = (Number of relevant documents in top-k retrieved) / (Total number of relevant documents)
    """
    # Load ground truth data containing relevant documents and pages for each query
    gt_files_pages = {}
    
    # Validate ground truth file exists
    if not os.path.exists(GT_FILENAME_PAGE_DATA):
        print(f"Error: Ground truth file '{GT_FILENAME_PAGE_DATA}' not found. Skipping recall calculation.")
        return None, None, None, None, None
    
    # Load ground truth data
    try:
        with open(GT_FILENAME_PAGE_DATA, 'r', encoding='utf-8') as f:
            gt_files_pages = json.load(f)
    except (json.JSONDecodeError, IOError) as e:
        print(f"Error loading ground truth file: {e}. Skipping recall calculation.")
        return None, None, None, None, None

    # Validate ground truth data is not empty
    if not gt_files_pages:
        print(f"Error: Ground truth file contains no data for dataset '{dataset_name}'. Skipping recall calculation.")
        return None, None, None, None, None

    # Initialize recall metric storage for different top-k values
    recall_metrics = {
        1: [],   # Recall@1 scores
        3: [],   # Recall@3 scores  
        5: [],   # Recall@5 scores
        10: []   # Recall@10 scores
    }
    
    # Default value in case no sample is processed (avoids an unbound variable at return)
    page_number_available = False

    # Process each evaluation sample to calculate recall metrics
    for idx, sample in enumerate(eval_data):
        # Validate that ground truth exists for this sample index
        if idx >= len(gt_files_pages):
            print(f"Warning: No ground truth found for sample index {idx}, skipping...")
            continue

        # Skip failed samples and samples with no retrieved documents.
        # Note: skipped samples are excluded from the averages rather than counted as 0.
        if not sample or not sample.get("retrieved_docs"):
            question = sample.get('question', 'Unknown') if sample else 'Unknown'
            print(f"Warning: No retrieved documents for question: '{question}'")
            continue

        # Extract top-k retrieved documents for different recall levels
        retrieved_docs_by_k = {
            1: sample["retrieved_docs"][:1],
            3: sample["retrieved_docs"][:3], 
            5: sample["retrieved_docs"][:5],
            10: sample["retrieved_docs"][:10]
        }

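        # Determine evaluation granularity and extract relevant documents from ground truth
        gt_contexts = gt_files_pages[idx]["contexts"]
        # Page numbers count as available only if the first context has a non-empty "page" value
        page_number_available = bool(gt_contexts) and bool(gt_contexts[0].get("page", "").strip())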
        
        if page_number_available and granularity == "page":
            # Page-level evaluation: include page numbers in document identifiers
            relevant_docs = [f"{context['filename']}_{context['page']}" for context in gt_contexts]
        else:
            # Document-level evaluation: strip page numbers for comparison
            # This handles both cases: when page numbers are unavailable OR when granularity="document"
            relevant_docs = [context["filename"] for context in gt_contexts]
            
            # Strip page numbers from retrieved documents to match granularity
            for k in retrieved_docs_by_k:
                retrieved_docs_by_k[k] = [doc.rsplit("_", 1)[0] for doc in retrieved_docs_by_k[k]]
        
        # Calculate recall@k for each k value using set intersection
        relevant_set = set(relevant_docs)
        for k in recall_metrics:
            retrieved_set = set(retrieved_docs_by_k[k])

            # Count distinct relevant documents that were successfully retrieved
            num_relevant_retrieved = len(retrieved_set & relevant_set)

            # Divide by the number of distinct relevant documents so that duplicate
            # ground-truth entries (e.g. two contexts from the same file) do not deflate recall
            recall_score = num_relevant_retrieved / len(relevant_set) if relevant_set else 0.0
            recall_metrics[k].append(recall_score)

    # Return page availability flag and recall metrics for all k values
    return (page_number_available, 
            recall_metrics[1], 
            recall_metrics[3], 
            recall_metrics[5], 
            recall_metrics[10])
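
The difference between the two granularities can be sketched on a single question. The identifiers below are made up and the helper `to_document_level` is hypothetical; it mirrors the `rsplit("_", 1)` step above, which removes only the last `_<page>` suffix so that underscores inside file names are preserved. The sketch assumes recall is computed over distinct relevant items.

```python
def to_document_level(doc_ids):
    """Drop the trailing '_<page>' suffix from '<document_name>_<page>' identifiers."""
    return [doc_id.rsplit("_", 1)[0] for doc_id in doc_ids]


# Hypothetical identifiers for one question
retrieved = ["EXAMPLE_2022_10K.pdf_40", "EXAMPLE_2022_10K.pdf_41", "OTHER_2021_10K.pdf_7"]
relevant_pages = ["EXAMPLE_2022_10K.pdf_41", "EXAMPLE_2022_10K.pdf_43"]

# Page level: identifiers must match exactly, including the page number
page_hits = set(retrieved) & set(relevant_pages)
page_recall = len(page_hits) / len(set(relevant_pages))

# Document level: compare file names only
relevant_docs = set(to_document_level(relevant_pages))
doc_hits = set(to_document_level(retrieved)) & relevant_docs
doc_recall = len(doc_hits) / len(relevant_docs)

print(page_recall, doc_recall)  # 0.5 1.0
```

Page-level recall is the stricter of the two: retrieving the right file but the wrong page counts as a miss there, while it counts as a hit at document level.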

In [None]:

def evaluate_result(dataset_name, eval_data, all_results, BASE_URL, collection_name, top_k):
    eval_dict = create_eval_dict(eval_data, BASE_URL, collection_name, top_k)
    # Calculate page-level recall metrics (includes page numbers in matching)
    page_number_available, page_level_recall_1, page_level_recall_3, page_level_recall_5, page_level_recall_10 = get_recall_metric(dataset_name, eval_dict, "page")
    
    # Calculate document-level recall metrics (ignores page numbers in matching)
    _, document_level_recall_1, document_level_recall_3, document_level_recall_5, document_level_recall_10 = get_recall_metric(dataset_name, eval_dict, "document")

    # Add recall@1 metrics if top_k supports it
    if int(top_k) >= 1:
        if page_number_available and page_level_recall_1:
            all_results["page_level_recall_1"] = page_level_recall_1
        if document_level_recall_1:
            all_results["document_level_recall_1"] = document_level_recall_1
    
    # Add recall@3 metrics if top_k supports it
    if int(top_k) >= 3:
        if page_number_available and page_level_recall_3:
            all_results["page_level_recall_3"] = page_level_recall_3
        if document_level_recall_3:
            all_results["document_level_recall_3"] = document_level_recall_3
    
    # Add recall@5 metrics if top_k supports it        
    if int(top_k) >= 5:
        if page_number_available and page_level_recall_5:
            all_results["page_level_recall_5"] = page_level_recall_5
        if document_level_recall_5:
            all_results["document_level_recall_5"] = document_level_recall_5
    
    # Add recall@10 metrics if top_k supports it
    if int(top_k) >= 10:
        if page_number_available and page_level_recall_10:
            all_results["page_level_recall_10"] = page_level_recall_10
        if document_level_recall_10:
            all_results["document_level_recall_10"] = document_level_recall_10

    return pd.DataFrame(all_results)

## 4. Run Evaluation

Finally, let's kick off our evaluation pipeline for Recall and print out the metrics. 

In [None]:
# Initialize results storage
all_results = {}

eval_data_path = os.path.join(DATASET_BASE_DIR, EVAL_DATA)
eval_data = get_eval_data(dataset_path=eval_data_path)

all_result = evaluate_result(dataset_name, eval_data, all_results, BASE_URL, collection_name, top_k)

In [None]:
print("- Recall Metrics (document-level)")
if int(top_k) >= 1 and 'document_level_recall_1' in all_result:
    print(f"     -Recall@1:                      {all_result['document_level_recall_1'].mean()}")
if int(top_k) >= 3 and 'document_level_recall_3' in all_result:
    print(f"     -Recall@3:                      {all_result['document_level_recall_3'].mean()}")
# Recall@5 and Recall@10 are reported only when top_k is large enough
if int(top_k) >= 5 and 'document_level_recall_5' in all_result:
    print(f"     -Recall@5:                      {all_result['document_level_recall_5'].mean()}")
if int(top_k) >= 10 and 'document_level_recall_10' in all_result:
    print(f"     -Recall@10:                     {all_result['document_level_recall_10'].mean()}")
# Page-level metrics are reported only when the ground truth includes page numbers
if 'page_level_recall_1' in all_result:
    print("- Recall Metrics (page-level)")
    if int(top_k) >= 1 and 'page_level_recall_1' in all_result:
        print(f"     -Recall@1:                      {all_result['page_level_recall_1'].mean()}")
    if int(top_k) >= 3 and 'page_level_recall_3' in all_result:
        print(f"     -Recall@3:                      {all_result['page_level_recall_3'].mean()}")
    if int(top_k) >= 5 and 'page_level_recall_5' in all_result:
        print(f"     -Recall@5:                      {all_result['page_level_recall_5'].mean()}")
    if int(top_k) >= 10 and 'page_level_recall_10' in all_result:
        print(f"     -Recall@10:                     {all_result['page_level_recall_10'].mean()}")
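
As a compact alternative to the repeated `if` statements, the averaging step can be sketched with a loop over the cutoffs. This is a minimal sketch using only the standard library and made-up per-question scores; the helper `summarize_recall` and its input layout (a dict mapping each cutoff `k` to a list of scores) are assumptions, not part of the notebook's pipeline.

```python
from statistics import mean


def summarize_recall(per_question_scores, top_k):
    """Average per-question recall for each cutoff k that does not exceed top_k."""
    return {
        f"Recall@{k}": round(mean(scores), 4)
        for k, scores in sorted(per_question_scores.items())
        if k <= top_k and scores
    }


# Hypothetical per-question scores keyed by cutoff k
per_question_scores = {
    1: [0.0, 1.0, 0.5],
    3: [0.5, 1.0, 0.5],
    5: [1.0, 1.0, 0.5],
    10: [1.0, 1.0, 1.0],
}

summary = summarize_recall(per_question_scores, top_k=5)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With `top_k=5`, the Recall@10 entry is omitted, matching the notebook's rule that a cutoff is reported only when `top_k` is at least that large.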
