# RAG Evaluation Dataset Generation

## Overview

This notebook demonstrates how to generate high-quality RAG (Retrieval-Augmented Generation) evaluation datasets using the SDG Hub framework. It creates question-answer pairs with ground truth context that can be used to evaluate RAG systems.

## What This Notebook Does

This notebook will:

1. **Construct Input Dataset**: Show how to prepare documents with outlines for the RAG evaluation flow
2. **Generate RAG Evaluation Dataset**: Run the RAG Evaluation flow to create question-answer pairs with:
   - Topic extraction from documents
   - Conceptual question generation
   - Question evolution for better quality
   - Answer generation with grounding
   - Groundedness scoring and filtering
   - Ground truth context extraction
3. **Visualize Results**: Display sample generated responses
4. **Post-process for Evaluation**: Convert the output to evaluation-ready formats (e.g., for RAGAS)

## Prerequisites

- SDG Hub installed and configured
- Model endpoint configured via environment variables (see "Step 3: Configure Model" below)

```bash
git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
cd sdg_hub
pip install ".[examples]"
```

In [None]:
from datasets import Dataset
import pandas as pd
import json
import os

from sdg_hub import Flow, FlowRegistry

In [None]:
# Required so the flow's async execution can run inside the notebook's already-running event loop
import nest_asyncio

nest_asyncio.apply()

## Step 1: Prepare Input Dataset

The RAG Evaluation flow requires:
- **document**: The full text content of the document
- **document_outline**: A concise title or summary that represents the document

You can prepare this from various sources:
- PDF documents (extract text and create outlines)
- Text files
- Existing datasets
- Web content

Below are example functions to help construct the input dataset.


In [None]:
def prepare_dataset_from_text(text: str, document_outline: str, chunk_size: int = 3000, overlap: int = 500):
    """
    Prepare dataset from a single text document by chunking it.
    
    Args:
        text: Full document text
        document_outline: Title or summary of the document
        chunk_size: Maximum characters per chunk
        overlap: Overlap between chunks to maintain context (must be < chunk_size)
        
    Returns:
        Dataset with document and document_outline columns
    """
    # Validate parameters (check chunk_size first so the overlap message is meaningful)
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")
    
    if not 0 <= overlap < chunk_size:
        raise ValueError(
            f"overlap ({overlap}) must be non-negative and less than chunk_size ({chunk_size})"
        )
    
    # Simple chunking by character count with overlap
    chunks = []
    step_size = chunk_size - overlap
    
    for i in range(0, len(text), step_size):
        chunk = text[i:i + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    
    # Create dataset
    dataset = Dataset.from_dict({
        "document": chunks,
        "document_outline": [document_outline] * len(chunks)
    })
    
    print(f"Created {len(chunks)} chunks from document")
    return dataset


def prepare_dataset_from_pdf(pdf_path: str, document_outline: str, max_pages: int = None):
    """
    Prepare dataset from a PDF file.
    
    Args:
        pdf_path: Path to PDF file
        document_outline: Title or summary of the document
        max_pages: Maximum number of pages to process (None for all)
        
    Returns:
        Dataset with document and document_outline columns
    """
    try:
        from PyPDF2 import PdfReader
    except ImportError:
        raise ImportError("PyPDF2 is required. Install with: pip install PyPDF2")
    
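    reader = PdfReader(pdf_path)
    text = ""
    
    # Only None means "all pages"; an explicit limit is always honored
    pages_to_read = reader.pages[:max_pages] if max_pages is not None else reader.pages
    for page in pages_to_read:
        # Guard against pages with no extractable text (e.g. scanned images)
        text += (page.extract_text() or "") + "\n"
    
    return prepare_dataset_from_text(text, document_outline)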
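
To see how `chunk_size` and `overlap` interact, the sliding-window logic used above can be traced on a short string. This is a minimal standalone sketch: the helper name `chunk_text` is illustrative and not part of SDG Hub, and it returns a plain list rather than a `Dataset`.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between consecutive window starts
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [w for w in windows if w.strip()]  # drop whitespace-only windows


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each window starts `chunk_size - overlap` characters after the previous one, so with the defaults above (3000 and 500) a new chunk begins every 2500 characters. The last window can be shorter than `chunk_size`, as the final `'ij'` shows.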

### Example: Create Dataset from IBM Annual Report

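The example below uses the IBM 2024 Annual Report, which must be available locally as `ibm-annual-report-2024.pdf`. It extracts text from the first 20 pages and splits it into chunks using the default settings (3000 characters per chunk with a 500-character overlap).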

In [None]:
pdf_path = "ibm-annual-report-2024.pdf"

if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"PDF file not found: {pdf_path}")

input_dataset = prepare_dataset_from_pdf(pdf_path, "IBM 2024 Annual Report Summary", max_pages=20)
print(f"\nInput dataset columns: {input_dataset.column_names}")
print(f"Number of samples: {len(input_dataset)}")


## Step 2: Discover and Load the RAG Evaluation Flow


In [None]:
# Get the RAG Evaluation flow
flow_name = "RAG Evaluation Dataset Flow"
flow_path = FlowRegistry.get_flow_path(flow_name)

flow = Flow.from_yaml(flow_path)

## Step 3: Configure Model

Configure the model for the flow from environment variables. `INFERENCE_MODEL` is required; `URL` (the API base) and `API_KEY` are passed to the flow only if they are set.

**IMPORTANT:** Before running the cells below, make sure to set the following environment variables:

```bash
export INFERENCE_MODEL="your-model-name"
export URL="your-api-endpoint"
export API_KEY="your-api-key"
```
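
Before configuring the flow, it can help to fail early when a variable is missing. The following is a minimal sketch using only the standard library. The helper name `check_model_env` and the example endpoint URL are illustrative assumptions, and the split into required and optional variables mirrors the configuration cell in this step rather than any documented SDG Hub behavior.

```python
import os

REQUIRED_VARS = ("INFERENCE_MODEL",)
OPTIONAL_VARS = ("URL", "API_KEY")


def check_model_env(environ=None) -> dict:
    """Return the model settings found in the environment, raising if a required one is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing required environment variables: {', '.join(missing)}")
    # Unset or empty variables are reported as None so they are easy to spot
    return {name: environ.get(name) or None for name in REQUIRED_VARS + OPTIONAL_VARS}


# Example with an explicit mapping instead of the real environment (placeholder values)
settings = check_model_env({"INFERENCE_MODEL": "my-model", "URL": "http://localhost:8000/v1"})
print(settings)
```

Calling `check_model_env()` with no argument inspects the real process environment. Avoid printing the result in shared notebooks if `API_KEY` is set.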

In [None]:
def set_model_config(flow_object):
    """Configure the model for the flow based on environment variables."""
    model = os.getenv("INFERENCE_MODEL", "")
    api_base = os.getenv("URL", "")
    api_key = os.getenv("API_KEY", "")
    
    if not model:
        raise ValueError("INFERENCE_MODEL environment variable must be set")
    
    # Prepend "openai/" unless the name already starts with "openai/" or "ollama/"
    if not model.startswith(("openai/", "ollama/")):
        model = "openai/" + model
    
    print(f"Configuring model: {model}")
    
    flow_object.set_model_config(
        model=model,
        api_base=api_base if api_base else None,
        api_key=api_key if api_key else None,
    )
    
    return flow_object

# Configure the model
flow = set_model_config(flow)

## Step 4: Generate RAG Evaluation Dataset

Run the flow to generate question-answer pairs with ground truth context. The flow will:
1. Extract topics from documents
2. Generate conceptual questions
3. Evolve questions for better quality
4. Generate answers with grounding
5. Score groundedness and filter low-quality pairs
6. Extract ground truth context


In [None]:
# Get runtime parameters
max_concurrency = int(os.getenv("MAX_CONCURRENCY", "10"))

# Optional: Configure runtime parameters for specific blocks
runtime_params = {}

print("This may take several minutes depending on dataset size and model speed...\n")

# Generate the dataset
generated_data = flow.generate(
    input_dataset, 
    runtime_params=runtime_params, 
    max_concurrency=max_concurrency
)
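
The cell above converts `MAX_CONCURRENCY` with a bare `int(...)`, so a non-numeric value fails with a generic conversion error. A slightly more defensive reader might look like the sketch below. The helper name `read_max_concurrency` is hypothetical, and the lower bound of 1 is an assumption about what a sensible concurrency value is, not a documented requirement of the flow.

```python
import os


def read_max_concurrency(default: int = 10, environ=None) -> int:
    """Parse MAX_CONCURRENCY, using `default` when unset and rejecting invalid values."""
    environ = os.environ if environ is None else environ
    raw = environ.get("MAX_CONCURRENCY", "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"MAX_CONCURRENCY must be an integer, got {raw!r}") from None
    if value < 1:  # assumed lower bound
        raise ValueError(f"MAX_CONCURRENCY must be at least 1, got {value}")
    return value


print(read_max_concurrency(environ={"MAX_CONCURRENCY": "4"}))  # 4
print(read_max_concurrency(environ={}))                        # 10
```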

## Step 5: Visualize Generated Results

Let's examine some of the generated question-answer pairs to assess quality.


In [None]:
df = generated_data.to_pandas()

print(f"Total records: {len(df)}")
print("\nColumns:", list(df.columns))

print("\nSAMPLE GENERATED RECORDS")

sample_cols = ["topic", "question", "response", "ground_truth_context"]

for i, row in df.head(3).iterrows():
    print("\n")
    for col in sample_cols:
        if col in df:
            val = row[col]
            text = str(val)
            if len(text) > 200:
                text = text[:200] + "..."
            print(f"{col.title()}: {text}")

In [None]:
display_columns = ['question', 'response', 'ground_truth_context']

print("DETAILED VIEW (First Record)")

first = df.iloc[0]

for col in display_columns:
    if col in df and pd.notna(first[col]):
        print(f"\n{col.upper()}:")
        print(first[col], "\n")

## Step 6: Post-process for Evaluation

Convert the generated dataset to evaluation-ready formats. This prepares the data for use with evaluation frameworks like RAGAS.


In [None]:
from pathlib import Path

def prepare_for_ragas_evaluation(generated_df: pd.DataFrame, output_file: str = None):
    """
    Convert generated dataset to RAGAS evaluation format.
    
    RAGAS expects:
    - question: The question
    - answer: The generated answer
    - contexts: List of context strings (usually one)
    - ground_truth: The ground truth answer (can be same as answer or use ground_truth_context)
    
    Args:
        generated_df: DataFrame from flow generation
        output_file: Optional path to save JSONL file
        
    Returns:
        List of dictionaries in RAGAS format
    """
    ragas_data = []
    
    for _, row in generated_df.iterrows():
        question = row.get('question', '')
        answer = row.get('response', '')
        context = row.get('document', row.get('context', ''))
        ground_truth = row.get('ground_truth_context', answer)
        
        ragas_record = {
            "question": str(question),
            "answer": str(answer),
            "contexts": [str(context)] if context else [""],
            "ground_truth": str(ground_truth)
        }
        
        ragas_data.append(ragas_record)
    
    if output_file:
        output_file = Path(output_file)
        output_file.parent.mkdir(parents=True, exist_ok=True)

        # Write UTF-8 explicitly: ensure_ascii=False can emit non-ASCII characters
        with output_file.open("w", encoding="utf-8") as f:
            for record in ragas_data:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    return ragas_data

ragas_data = prepare_for_ragas_evaluation(df, output_file="rag_evaluation_dataset.jsonl")

print(f"\n✅ Prepared {len(ragas_data)} records for evaluation")
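
As a sanity check, the JSONL file can be read back and each record compared against the four fields written above. This is a minimal standard-library sketch: the helper name `validate_ragas_jsonl` is illustrative, and the demonstration uses a throwaway file with one hand-written record rather than real flow output.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_KEYS = {"question", "answer", "contexts", "ground_truth"}


def validate_ragas_jsonl(path) -> int:
    """Check that each JSONL line has the expected keys; return the number of records."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            if not isinstance(record["contexts"], list):
                raise ValueError(f"line {line_no}: 'contexts' must be a list")
            count += 1
    return count


# Demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_record = {"question": "q", "answer": "a", "contexts": ["c"], "ground_truth": "g"}
    demo_path.write_text(json.dumps(demo_record) + "\n", encoding="utf-8")
    n_valid = validate_ragas_jsonl(demo_path)
    print(n_valid)  # 1
```

To check the real output, call `validate_ragas_jsonl("rag_evaluation_dataset.jsonl")` after the cell above has run.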

In [None]:
# Save the full generated dataset
output_csv = "rag_evaluation_full_results.csv"
generated_data.to_csv(output_csv, index=False)
print(f"Saved full results to {output_csv}")

## Summary

🎉 You have successfully:

1. ✅ Prepared input dataset with documents and outlines
2. ✅ Generated RAG evaluation dataset with question-answer pairs
3. ✅ Visualized generated results
4. ✅ Post-processed data for evaluation frameworks

### Next Steps

- Use the generated `rag_evaluation_dataset.jsonl` file with RAGAS or other evaluation frameworks
- Analyze the quality of generated questions and answers
- Fine-tune the flow parameters or prompts if needed
- Scale up to larger datasets for comprehensive evaluation

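> **Note:**  
> The `answer` and `contexts` fields written in Step 6 come from the generation flow itself, not from a RAG system.
> In a real RAG system, the model-generated answer comes from retrieved context,
> so it will often differ from the ground truth. Replace these two fields with your system's outputs before evaluating it.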
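
One way to act on this note is to overwrite `answer` and `contexts` with the outputs of the system under test while keeping `question` and `ground_truth` unchanged. The sketch below assumes a hypothetical callable that takes a question and returns an answer plus the retrieved passages. The names `attach_rag_outputs` and `fake_rag` are illustrative stand-ins, not part of SDG Hub or RAGAS.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: question -> (answer, retrieved_contexts)
RagFn = Callable[[str], Tuple[str, List[str]]]


def attach_rag_outputs(records: List[Dict], rag_fn: RagFn) -> List[Dict]:
    """Return new records whose answer/contexts come from the RAG system under test."""
    evaluated = []
    for record in records:
        answer, contexts = rag_fn(record["question"])
        evaluated.append({
            "question": record["question"],
            "answer": answer,
            "contexts": list(contexts),
            "ground_truth": record["ground_truth"],  # reference stays unchanged
        })
    return evaluated


def fake_rag(question: str) -> Tuple[str, List[str]]:
    """Stand-in for a real retrieval and generation pipeline."""
    return f"stub answer to: {question}", ["stub retrieved passage"]


sample = [{"question": "What is X?", "answer": "old", "contexts": ["doc"], "ground_truth": "X is Y."}]
evaluated_sample = attach_rag_outputs(sample, fake_rag)
print(evaluated_sample[0]["answer"])  # stub answer to: What is X?
```

The input records are not modified, so the original flow-generated answers remain available for comparison.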

### Example: Using with RAGAS

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import load_dataset
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
import os

# Load the prepared dataset
dataset = load_dataset("json", data_files="rag_evaluation_dataset.jsonl", split="train")

llm = ChatOpenAI(
    model=os.getenv("INFERENCE_MODEL", ""),
    temperature=0,
    base_url=os.getenv("URL", ""),
    api_key=os.getenv("API_KEY", "")
)

embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1.5",
    model_kwargs={'device': 'cpu', "trust_remote_code": True},
    encode_kwargs={'normalize_embeddings': True}
)

# Run RAGAS evaluation
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=llm,
    embeddings=embeddings
)

print(f"\n{results}")
```