### Creating Evaluation Sets
The idea behind these evaluation sets is to test the performance of the retriever in the RAG system. The difficulty is that there is usually no ground truth against which to compare the retriever. This notebook explores one way of creating such evaluation sets.
The evaluation sets are created in the following steps:
1. The document chunker splits each document into smaller chunks.
2. The chunks are fed to the LLM, which generates evaluation sets containing questions, answers, a difficulty level, and the IDs of the chunks that each question and answer are based on. Here too, there is no ground truth against which to check the LLM that generates the evaluation set.
3. Since cross encoders capture deep relationships between the question and the chunk, we use them as a second signal to check the chunk labels produced by the LLM.
4. The objective is to maximise the overlap between the chunks labelled by the synthetic evaluation set generator and the chunks ranked highest by the cross encoder. The final ground truth is the overlap between the two. Another strategy is to combine the labels from the synthetic LLM eval generator with the top-ranked chunks from the cross encoder.
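
The two strategies in step 4 can be sketched in a few lines. This is only an illustration, assuming both sources return plain lists of chunk IDs; the helper name `combine_labels` and the example IDs are hypothetical and not part of the notebook.

```python
from typing import Dict, List


def combine_labels(llm_chunk_ids: List[str], cross_encoder_top_k: List[str]) -> Dict[str, List[str]]:
    """Return the overlap and the union of two chunk-ID lists, preserving order."""
    top_k = set(cross_encoder_top_k)
    # Strategy 1: keep only chunks that both the LLM and the cross encoder selected.
    overlap = [cid for cid in llm_chunk_ids if cid in top_k]
    # Strategy 2: keep every chunk that either source selected, LLM labels first.
    union = list(dict.fromkeys(llm_chunk_ids + cross_encoder_top_k))
    return {"overlap": overlap, "union": union}


# Hypothetical chunk IDs, for illustration only.
labels = combine_labels(
    ["doc_chunk_3", "doc_chunk_7"],
    ["doc_chunk_7", "doc_chunk_1", "doc_chunk_3"],
)
print(labels["overlap"])
print(labels["union"])
```

The overlap is the stricter label set, while the union trades precision for coverage.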



In [1]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai
from vertexai.preview.generative_models import (
    FunctionDeclaration,
    GenerativeModel,
    Tool,
    ToolConfig,
    Part,
    GenerationConfig,
)
import json
from pathlib import Path
import logging
from typing import Dict, List

PROJECT_ID = "104916006626"  # @param {type: "string", placeholder: "[your-project-id]" isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "xyz":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "australia-southeast1")

# Use the resolved PROJECT_ID rather than repeating the literal value
vertexai.init(project=PROJECT_ID, location=LOCATION)

In [2]:
import os
from google.oauth2 import service_account

# Path to your service account key file
# Raw string so that every backslash in the Windows path is taken literally
key_path = r"C:\Users\shres\Projects\RAG-case-study\keys\keyproject-401005-6e1cdcbb5996.json"

# Create credentials using the service account key file
credentials = service_account.Credentials.from_service_account_file(
    key_path
)

# Set the credentials for the current environment
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
# auth_request = transport.requests.Request()
# credentials.refresh(auth_request)

In [3]:
response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
            },
            "answer": {
                "type": "string",
            },
            "difficulty": {
                "type": "string",
                "enum": ["easy", "medium", "hard"],
            },
            "chunk_ids": {
                "type": "array",
                "items": {
                    "type": "string"
                },
                "description": "List of chunk IDs that the question and answer are based on. List exactly 10 chunk ids"
            }
        },
        "required": ["question", "answer", "difficulty", "chunk_ids"],
    },
}
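
Even with a response schema, it can be useful to check each generated record before relying on it. The following is a minimal sketch of such a check in plain Python, mirroring the fields of `response_schema`; the helper name `find_schema_problems` is hypothetical and not part of the Vertex AI SDK.

```python
from typing import Any, List

ALLOWED_DIFFICULTIES = ("easy", "medium", "hard")
REQUIRED_KEYS = ("question", "answer", "difficulty", "chunk_ids")


def find_schema_problems(record: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks valid."""
    if not isinstance(record, dict):
        return ["record is not an object"]
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
    for key in ("question", "answer"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} is not a string")
    if "difficulty" in record and record["difficulty"] not in ALLOWED_DIFFICULTIES:
        problems.append("difficulty is not one of easy/medium/hard")
    if "chunk_ids" in record:
        chunk_ids = record["chunk_ids"]
        if not (isinstance(chunk_ids, list) and all(isinstance(c, str) for c in chunk_ids)):
            problems.append("chunk_ids is not a list of strings")
    return problems


good = {"question": "q", "answer": "a", "difficulty": "easy", "chunk_ids": ["doc_chunk_0"]}
bad = {"question": "q", "difficulty": "extreme"}
print(find_schema_problems(good))
print(find_schema_problems(bad))
```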

In [21]:
from prompts import system_prompt_QA_eval_bot


def generate_questions(context, num_questions=10):
    """
    Generate a set of questions and answers from a given context.

    Args:
        context: The context to generate questions from.
        num_questions: The number of questions to generate.

    Returns:
        The raw JSON string returned by the model, containing a list of questions and answers.
    """
    model = GenerativeModel("gemini-1.5-pro-002")

    response = model.generate_content(
        system_prompt_QA_eval_bot.format(chunk_set=context, num_questions=num_questions),
        generation_config=GenerationConfig(
            response_mime_type="application/json", response_schema=response_schema
        ),
    )
    return response.text




### Creating a cross encoder to score the similarity between the eval question and the document chunks
The cross encoder outputs a score between 0 and 1 for each pair of question and document chunk.
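
The ranking step can be sketched independently of the model. In the example below, `score_pairs` stands in for the cross encoder's scoring call, and `toy_word_overlap` is a made-up scorer used only so that the sketch runs without downloading a model; neither name comes from the notebook.

```python
from typing import Callable, List, Sequence, Tuple


def rank_chunks(
    question: str,
    chunks: List[str],
    score_pairs: Callable[[List[List[str]]], Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Score every (question, chunk) pair and return the top_k (chunk index, score) tuples."""
    pairs = [[question, chunk] for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return [(index, float(score)) for index, score in ranked[:top_k]]


def toy_word_overlap(pairs: List[List[str]]) -> List[float]:
    """Stand-in scorer (an assumption): Jaccard overlap of lowercase words, in [0, 1]."""
    scores = []
    for left, right in pairs:
        a, b = set(left.lower().split()), set(right.lower().split())
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return scores


top = rank_chunks(
    "what is gdpr",
    ["gdpr is a regulation", "cats sleep a lot", "what is gdpr"],
    toy_word_overlap,
    top_k=2,
)
print(top)
```

With a loaded `CrossEncoder`, its `predict` method would presumably take the place of `toy_word_overlap`.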


In [3]:
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-distilroberta-base")
scores = model.predict([["My first", "sentence pair"], ["Second text", "pair"]])

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [4]:
scores

array([0.11382268, 0.11522377], dtype=float32)

In [5]:

from transformers import AutoTokenizer
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

class DocumentChunker:
    def __init__(self, base_dir: str = "processed_docs", model_id: str = "answerdotai/ModernBERT-base"):
        """
        Initialize the DocumentChunker with necessary components.
        
        Args:
            base_dir: Base directory containing markdown files
            model_id: Model ID for the tokenizer
        """
        # Setup logging
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)
        
        self.base_dir = Path(base_dir)
        self.model_id = model_id
        
        # Initialize components
        self._setup_components()
        
        # Store results
        self.document_chunks: Dict[str, List[str]] = {}

    def _setup_components(self) -> None:
        """Initialize tokenizer, chunker and document converter."""
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        self.chunker = HybridChunker(
            tokenizer=self.tokenizer,
            merge_peers=True,
        )
        self.doc_converter = DocumentConverter()
        
    def process_single_document(self, file_path: Path) -> List[str]:
        """
        Process a single markdown file and return its chunks.
        
        Args:
            file_path: Path to the markdown file
            
        Returns:
            List of chunks for the document
        """
        chunks = []
        
        try:
            # Convert markdown to docling document
            doc = self.doc_converter.convert(source=str(file_path)).document
            
            # Generate and store chunks in order
            for chunk in self.chunker.chunk(dl_doc=doc):
                chunks.append(self.chunker.serialize(chunk=chunk))
                
            self.logger.info(f"Successfully processed {file_path.name} - Generated {len(chunks)} chunks")
            
        except Exception as e:
            self.logger.error(f"Error processing {file_path.name}: {str(e)}")
        
        return chunks

    def process_directory(self) -> Dict[str, List[str]]:
        """
        Process all markdown files in the directory and its subdirectories.
        
        Returns:
            Dictionary mapping document names to their ordered chunks
        """
        # Find all markdown files
        md_files = list(self.base_dir.glob("**/*-with-image-refs.md"))
        
        if not md_files:
            self.logger.warning(f"No markdown files found in {self.base_dir}")
            return self.document_chunks
        
        self.logger.info(f"Found {len(md_files)} markdown files to process")
        
        # Process each file
        for md_file in md_files:
            self.logger.info(f"Processing {md_file.relative_to(self.base_dir)}")
            
            # Store chunks with document name as key
            doc_key = md_file.stem
            self.document_chunks[doc_key] = self.process_single_document(md_file)
        
        self.logger.info("Completed processing all documents")
        return self.document_chunks
    
    def get_document_statistics(self) -> None:
        """Print statistics about processed documents and their chunks."""
        if not self.document_chunks:
            print("No documents have been processed yet.")
            return
            
        print("\nDocument Processing Statistics:")
        print("-" * 30)
        for doc_name, chunks in self.document_chunks.items():
            print(f"\nDocument: {doc_name}")
            print(f"Number of chunks: {len(chunks)}")
            if chunks:
                avg_chunk_length = sum(len(self.tokenizer.tokenize(chunk)) 
                                     for chunk in chunks) / len(chunks)
                print(f"Average chunk length: {avg_chunk_length:.2f} tokens")





In [6]:
doc_chunker = DocumentChunker()

# Process all documents
document_chunks = doc_chunker.process_directory()

# Print statistics
doc_chunker.get_document_statistics()

INFO:__main__:Found 3 markdown files to process
INFO:__main__:Processing AI_ACT\AI_ACT-with-image-refs.md
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document AI_ACT-with-image-refs.md
INFO:docling.document_converter:Finished converting document AI_ACT-with-image-refs.md in 293.09 sec.
Token indices sequence length is longer than the specified maximum sequence length for this model (8230 > 8192). Running this sequence through the model will result in indexing errors
INFO:__main__:Successfully processed AI_ACT-with-image-refs.md - Generated 152 chunks
INFO:__main__:Processing Cybersecurity_California_Privacy\Cybersecurity_California_Privacy-with-image-refs.md
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document Cybersecurity_California_Privacy-with-image-refs.md
INFO:docling.document_converter:Finished converting document Cybersecurity_California_Pr


Document Processing Statistics:
------------------------------

Document: AI_ACT-with-image-refs
Number of chunks: 152
Average chunk length: 1133.82 tokens

Document: Cybersecurity_California_Privacy-with-image-refs
Number of chunks: 41
Average chunk length: 266.54 tokens

Document: GDPR-with-image-refs
Number of chunks: 122
Average chunk length: 938.01 tokens


In [11]:
import json

# Prepare list to store all chunks with their metadata
chunks_data = []

# Loop through the document_chunks dictionary
for doc_name, chunks in document_chunks.items():
    # Process each chunk in the document
    for i, chunk_content in enumerate(chunks):
        chunk_data = {
            "document_name": doc_name,
            "chunk_id": f"{doc_name}_chunk_{i}",
            "chunk_content": chunk_content
        }
        chunks_data.append(chunk_data)

# Save to JSON file
output_path = "document_chunks.json"
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(chunks_data, f, indent=2, ensure_ascii=False)

print(f"Saved {len(chunks_data)} chunks to {output_path}")

Saved 315 chunks to document_chunks.json
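
Because each chunk ID follows the pattern `{doc_name}_chunk_{i}`, the document name and the chunk position can be recovered from the ID alone. A small sketch, assuming that pattern holds for every ID; the helper name `split_chunk_id` is hypothetical.

```python
from typing import Tuple


def split_chunk_id(chunk_id: str) -> Tuple[str, int]:
    """Split '<doc_name>_chunk_<i>' into (doc_name, i)."""
    # rpartition splits on the last occurrence, so document names that
    # themselves contain '_chunk_' are still handled.
    doc_name, separator, index = chunk_id.rpartition("_chunk_")
    if not separator or not index.isdigit():
        raise ValueError(f"Unexpected chunk ID format: {chunk_id!r}")
    return doc_name, int(index)


print(split_chunk_id("AI_ACT-with-image-refs_chunk_3"))
```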


In [6]:
import json
# If chunks are already generated, start here
# Load and restructure the chunks data
with open("document_chunks.json", 'r', encoding='utf-8') as f:
    chunks_list = json.load(f)

# Convert the flat list structure back to document_chunks dictionary
document_chunks = {}
for chunk in chunks_list:
    doc_name = chunk['document_name']
    if doc_name not in document_chunks:
        document_chunks[doc_name] = []
    document_chunks[doc_name].append(chunk['chunk_content'])

print(f"Loaded chunks for {len(document_chunks)} documents")

Loaded chunks for 3 documents


In [18]:
# Verify that all chunks are loaded correctly
len(document_chunks['GDPR-with-image-refs'])

122

In [8]:
import json
import logging
from pathlib import Path
from typing import Dict, List


def format_document_chunks(chunks_data: List[dict]) -> Dict[str, str]:
    """
    Format chunks from JSON data into strings organized by document.
    
    Args:
        chunks_data: List of dictionaries containing chunk information from document_chunks.json
        
    Returns:
        Dictionary mapping document names to their formatted content string
    """
    formatted_docs = {}
    
    # Group chunks by document
    for chunk in chunks_data:
        doc_name = chunk['document_name']
        
        if doc_name not in formatted_docs:
            formatted_docs[doc_name] = f"{doc_name}:\n\n"
            
        formatted_docs[doc_name] += "----x----\n"
        formatted_docs[doc_name] += f"chunk_id: {chunk['chunk_id']}\n"
        formatted_docs[doc_name] += f"chunk_content: {chunk['chunk_content']}\n\n"
    
    return formatted_docs

# Load chunks from JSON
with open("document_chunks.json", 'r', encoding='utf-8') as f:
    chunks_data = json.load(f)

# Generate formatted documents
formatted_docs = format_document_chunks(chunks_data)



In [None]:
# Generate an eval set for one document to inspect the results
generate_questions(formatted_docs['AI_ACT-with-image-refs'], 10)


'[{"question": "What is the purpose of the AI Act?", "answer": "The purpose of the AI Act is to improve the functioning of the internal market by laying down a uniform legal framework for the development, placing on the market, putting into service, and use of artificial intelligence systems in the Union.", "difficulty": "medium", "chunk_ids": ["AI_ACT-with-image-refs_chunk_3"]}, {"question": "When did the European Parliament and Council adopt the AI Act?", "answer": "13 June 2024", "difficulty": "easy", "chunk_ids": ["AI_ACT-with-image-refs_chunk_1", "AI_ACT-with-image-refs_chunk_130"]}, {"question": "What does the AI Act lay down?", "answer": "The AI Act lays down harmonized rules for AI systems, prohibitions of certain AI practices, requirements for high-risk AI systems, transparency rules, rules for general-purpose AI models, rules on market monitoring and governance, and measures to support innovation.", "difficulty": "medium", "chunk_ids": ["AI_ACT-with-image-refs_chunk_12"]}, {"

In [22]:
# Generate eval sets for each document
eval_sets = {}
for doc_id, formatted_content in formatted_docs.items():
    print(f"Generating questions for {doc_id}...")
    eval_sets[doc_id] = generate_questions(formatted_content, num_questions=50)

Generating questions for AI_ACT-with-image-refs...
Generating questions for Cybersecurity_California_Privacy-with-image-refs...
Generating questions for GDPR-with-image-refs...


In [16]:
# # Let's first examine what we're getting
# print("Type of response:", type(eval_sets['AI_ACT-with-image-refs']))
# print("\nFirst 200 characters of response:")
# print(eval_sets['AI_ACT-with-image-refs'][:200])

# # Try parsing with error handling
# try:
#     parsed_response = json.loads(eval_sets['AI_ACT-with-image-refs'])
#     print("\nSuccessfully parsed JSON!")
# except json.JSONDecodeError as e:
#     print(f"\nJSON parsing error: {str(e)}")
#     # Print the problematic section of the string
#     error_position = e.pos
#     print("\nProblematic section:")
#     print(eval_sets['AI_ACT-with-image-refs'][error_position-50:error_position+50])

Type of response: <class 'str'>

First 200 characters of response:
[{"question": "What is the AI Act's main objective?", "answer": "The purpose of this Regulation is to improve the functioning of the internal market by laying down a uniform legal framework in particu

JSON parsing error: Unterminated string starting at: line 1 column 24610 (char 24609)

Problematic section:
EU database for high-risk AI systems?", "answer": "The EU database contains information on high-risk
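
The "Unterminated string" error suggests that the response was cut off partway through an object, possibly by an output length limit. One way to keep the complete records in such a case is to decode the array element by element and stop at the first incomplete one. This is a sketch using only the standard library; the helper name `salvage_json_array` is hypothetical, and it assumes the array holds objects.

```python
import json
from typing import Any, List


def salvage_json_array(text: str) -> List[Any]:
    """Return every complete element from a JSON array string that may be truncated."""
    decoder = json.JSONDecoder()
    items: List[Any] = []
    position = text.find("[")
    if position == -1:
        return items
    position += 1
    while position < len(text):
        # Skip whitespace and commas between elements.
        while position < len(text) and text[position] in " \t\r\n,":
            position += 1
        if position >= len(text) or text[position] == "]":
            break
        try:
            item, position = decoder.raw_decode(text, position)
        except json.JSONDecodeError:
            break  # The remaining text is an incomplete element.
        items.append(item)
    return items


truncated = '[{"question": "Q1", "answer": "A1"}, {"question": "Q2", "answer": "The EU database contains'
print(salvage_json_array(truncated))
```

Any records dropped this way would still need to be regenerated, for example by requesting fewer questions per call.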


In [17]:
# def clean_and_parse_eval_set(eval_set_str: str) -> list:
#     """Clean and parse the eval set string into a list of dictionaries."""
#     try:
#         # First attempt: direct parsing
#         return json.loads(eval_set_str)
#     except json.JSONDecodeError:
#         try:
#             # Second attempt: fix common issues
#             # Replace any unescaped quotes within the text
#             cleaned = eval_set_str.replace('\\"', '"')  # First unescape any escaped quotes
#             cleaned = cleaned.replace('"', '\\"')       # Then escape all quotes
#             cleaned = cleaned.replace('\\"{"', '{"')    # Fix the start of each object
#             cleaned = cleaned.replace('"}\\"', '"}')    # Fix the end of each object
#             # Ensure the string starts and ends with square brackets
#             if not cleaned.startswith('['):
#                 cleaned = '[' + cleaned
#             if not cleaned.endswith(']'):
#                 cleaned = cleaned + ']'
#             return json.loads(cleaned)
#         except json.JSONDecodeError as e:
#             print(f"Failed to parse JSON even after cleaning: {str(e)}")
#             return []

# # Try parsing the eval sets with the new function
# parsed_eval_sets = []

# for doc_id, eval_set in eval_sets.items():
#     print(f"Processing {doc_id}...")
#     questions = clean_and_parse_eval_set(eval_set)
    
#     # Add document information to each question
#     for question in questions:
#         question['document'] = doc_id
#         parsed_eval_sets.append(question)

# # Save to JSON file
# output_path = "evaluation_sets.json"
# with open(output_path, 'w', encoding='utf-8') as f:
#     json.dump(parsed_eval_sets, f, indent=2, ensure_ascii=False)

# print(f"Saved {len(parsed_eval_sets)} questions to {output_path}")

Processing AI_ACT-with-image-refs...
Failed to parse JSON even after cleaning: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
Processing Cybersecurity_California_Privacy-with-image-refs...
Processing GDPR-with-image-refs...
Saved 15 questions to evaluation_sets.json


In [23]:
import json

# Parse the eval sets and add document information
parsed_eval_sets = []

for doc_id, eval_set in eval_sets.items():
    # Convert string response to Python list of dictionaries
    questions = json.loads(eval_set)
    
    # Add document information to each question
    for question in questions:
        question['document'] = doc_id
        parsed_eval_sets.append(question)

# Save to JSON file
output_path = "evaluation_sets.json"
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(parsed_eval_sets, f, indent=2, ensure_ascii=False)

print(f"Saved {len(parsed_eval_sets)} questions to {output_path}")

Saved 28 questions to evaluation_sets.json
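
Before scoring, it may be worth checking that every chunk ID the LLM returned actually exists in `document_chunks.json`, since a generated ID could be misspelled or invented. A minimal sketch, assuming the record layouts used in this notebook; the helper name `find_unknown_chunk_ids` and the sample data are hypothetical.

```python
from typing import Dict, List


def find_unknown_chunk_ids(eval_set: List[dict], chunks_data: List[dict]) -> Dict[int, List[str]]:
    """Map question index -> chunk IDs that do not appear in chunks_data."""
    known_ids = {chunk["chunk_id"] for chunk in chunks_data}
    unknown: Dict[int, List[str]] = {}
    for index, question in enumerate(eval_set):
        missing = [cid for cid in question.get("chunk_ids", []) if cid not in known_ids]
        if missing:
            unknown[index] = missing
    return unknown


# Made-up sample data, for illustration only.
sample_chunks = [{"chunk_id": "doc_chunk_0"}, {"chunk_id": "doc_chunk_1"}]
sample_eval = [
    {"question": "q1", "chunk_ids": ["doc_chunk_0"]},
    {"question": "q2", "chunk_ids": ["doc_chunk_1", "doc_chunk_9"]},
]
print(find_unknown_chunk_ids(sample_eval, sample_chunks))
```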


In [24]:
# Load the eval sets when needed
with open("evaluation_sets.json", 'r', encoding='utf-8') as f:
    eval_set = json.load(f)

In [26]:
# Creating corpus and evaluating query, doc pair using cross-encoders
import numpy as np

from sentence_transformers.cross_encoder import CrossEncoder

# Pre-trained cross encoder
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")




Evaluating question 1/6482


TypeError: string indices must be integers

In [None]:
def evaluate_question(question: dict, chunks_data: list) -> dict:
    """
    Evaluate a single question against all document chunks and find top 10 chunks.
    
    Args:
        question: Dictionary containing question data
        chunks_data: List of all document chunks
        
    Returns:
        Dictionary containing evaluation results with simplified structure
    """
    # Create query-document pairs
    query = question['question']
    sentence_combinations = [[query, chunk['chunk_content']] for chunk in chunks_data]
    
    # Get cross-encoder scores
    scores = model.predict(sentence_combinations)
    
    # Create results list with scores and metadata
    results = []
    for chunk, score in zip(chunks_data, scores):
        results.append({
            'chunk_id': chunk['chunk_id'],
            'score': float(score)
        })
    
    # Sort results by score in descending order
    results.sort(key=lambda x: x['score'], reverse=True)
    
    # Get top 10 chunks from cross-encoder
    top_10_chunk_ids = [chunk['chunk_id'] for chunk in results[:10]]
    
    # Find overlapping chunks
    overlapping_chunks = [
        chunk_id for chunk_id in question['chunk_ids'] 
        if chunk_id in top_10_chunk_ids
    ]
    
    # Create final ground truth set:
    # 1. First add all overlapping chunks
    final_ground_truth = overlapping_chunks.copy()
    
    # 2. Add remaining chunks from top 10 until we have 10 total
    remaining_slots = 10 - len(final_ground_truth)
    if remaining_slots > 0:
        for chunk_id in top_10_chunk_ids:
            if chunk_id not in final_ground_truth:
                final_ground_truth.append(chunk_id)
                remaining_slots -= 1
                if remaining_slots == 0:
                    break
    
    return {
        'question': question['question'],
        'llm_chunk_labels': question['chunk_ids'],
        'cross_encoder_top_10': top_10_chunk_ids,
        'final_ground_truth': final_ground_truth,
        'total_overlap_chunks': len(overlapping_chunks)
    }

# Evaluate all questions
evaluation_results = []

for i, question in enumerate(eval_set):
    print(f"Evaluating question {i+1}/{len(eval_set)}")
    result = evaluate_question(question, chunks_data)
    evaluation_results.append(result)

# Save results
output_path = "cross_encoder_evaluations.json"
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(evaluation_results, f, indent=2, ensure_ascii=False)

print(f"\nSaved evaluation results to {output_path}")

# Print sample results for the first question
if evaluation_results:
    first_eval = evaluation_results[0]
    print("\nSample evaluation for first question:")
    print(json.dumps(first_eval, indent=2))

Evaluating question 1/28
Evaluating question 2/28
Evaluating question 3/28
Evaluating question 4/28
Evaluating question 5/28
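
Once the results are available, the agreement between the LLM labels and the cross encoder can be summarised across questions. The sketch below assumes records shaped like the dictionaries returned by `evaluate_question`; the helper name `summarise_overlap` and the sample records are hypothetical.

```python
from typing import Dict, List


def summarise_overlap(results: List[dict]) -> Dict[str, float]:
    """Aggregate overlap statistics over a list of evaluation results."""
    if not results:
        return {"questions": 0, "mean_overlap_chunks": 0.0,
                "mean_label_agreement": 0.0, "questions_without_overlap": 0}
    agreements = []
    for result in results:
        labels = result["llm_chunk_labels"]
        # Share of the LLM's labels that the cross encoder also ranked in its top 10.
        agreements.append(result["total_overlap_chunks"] / len(labels) if labels else 0.0)
    count = len(results)
    return {
        "questions": count,
        "mean_overlap_chunks": sum(r["total_overlap_chunks"] for r in results) / count,
        "mean_label_agreement": sum(agreements) / count,
        "questions_without_overlap": sum(1 for r in results if r["total_overlap_chunks"] == 0),
    }


# Made-up sample records, for illustration only.
sample_results = [
    {"llm_chunk_labels": ["a", "b"], "total_overlap_chunks": 1},
    {"llm_chunk_labels": ["c"], "total_overlap_chunks": 0},
]
print(summarise_overlap(sample_results))
```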


In [2]:
result

NameError: name 'result' is not defined