# RAG Evaluation - Fully Custom Test Set Generation

**Approach:** Pure Custom Generation (No RAGAS)

**Pipeline:**
1. Spatial division (4 regions) + Recursive chunking
2. Custom question generation (all types)
3. Multi-turn conversation generation
4. Automatic classification
5. Quality filtering
6. Coverage analysis & export

**Question Types Supported:**
- chatbot_style, direct_factual, procedural, scenario, analytical, compliance
- descriptive, multi_hop, comparative, conditional
- Conversation types: single_turn and multi_turn (with follow-up questions)

**Goal:** Full control over question quality, types, and coverage

## 1. Setup & Imports

In [None]:
# Install required packages (run once)
#!pip install langchain langchain-openai langchain-text-splitters langchain-community pandas numpy matplotlib tqdm

In [1]:
import os
import json
import pandas as pd
import numpy as np
import re
from pathlib import Path
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field
from tqdm.notebook import tqdm

# LangChain
from langchain_openai import AzureChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document

# Visualization
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

print("✅ All imports successful")

✅ All imports successful


## 2. Configuration

In [2]:
@dataclass
class CustomConfig:
    """Fully Custom Generation Configuration"""
    
    # Azure OpenAI Settings
    azure_endpoint: str = "https://your-endpoint.openai.azure.com/"
    azure_api_key: str = "your-api-key"
    azure_api_version: str = "2024-02-01"
    azure_deployment_gpt4: str = "gpt-4"
    
    # Coverage Strategy
    num_spatial_regions: int = 4
    questions_per_document: int = 20
    questions_per_region: int = 5
    
    # Chunking
    chunk_size: int = 1500
    chunk_overlap: int = 300
    
    # Question Type Distribution
    question_type_distribution: Dict[str, float] = field(default_factory=lambda: {
        "chatbot_style": 0.10,
        "direct_factual": 0.10,
        "procedural": 0.15,
        "scenario": 0.20,
        "analytical": 0.10,
        "compliance": 0.05,
        "descriptive": 0.10,
        "multi_hop": 0.10,
        "comparative": 0.05,
        "conditional": 0.05,
    })
    
    # Conversation Type Distribution
    single_turn_ratio: float = 0.60
    multi_turn_ratio: float = 0.40
    max_turns_per_conversation: int = 3
    
    # Domain
    domain_name: str = "Banking and Financial Services (BIS)"
    domain_context: str = "BIS Meeting Services, regulatory compliance, operational procedures"
    
    # LLM Parameters
    temperature: float = 0.7
    max_tokens: int = 3000
    
    # Quality
    min_quality_score: float = 7.0

config = CustomConfig()

print("✅ Configuration loaded")
print(f"   Questions per document: {config.questions_per_document}")
print(f"   Multi-turn ratio: {config.multi_turn_ratio * 100}%")
print(f"   Question types: {len(config.question_type_distribution)}")

✅ Configuration loaded
   Questions per document: 20
   Multi-turn ratio: 40.0%
   Question types: 10
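
The question type distribution and the single-turn/multi-turn split only behave as intended when each set of ratios sums to 1. A minimal sanity check could look like the sketch below. `validate_ratios` is a hypothetical helper, not part of the notebook, and the dictionary is copied from the configuration above.

```python
import math
from typing import Dict


def validate_ratios(distribution: Dict[str, float], tolerance: float = 1e-9) -> float:
    """Return the sum of the ratios; raise ValueError if any is negative or the sum is not 1."""
    negatives = [name for name, ratio in distribution.items() if ratio < 0]
    if negatives:
        raise ValueError(f"Negative ratios: {negatives}")
    total = sum(distribution.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"Ratios sum to {total:.4f}, expected 1.0")
    return total


# Values copied from CustomConfig above
question_type_distribution = {
    "chatbot_style": 0.10,
    "direct_factual": 0.10,
    "procedural": 0.15,
    "scenario": 0.20,
    "analytical": 0.10,
    "compliance": 0.05,
    "descriptive": 0.10,
    "multi_hop": 0.10,
    "comparative": 0.05,
    "conditional": 0.05,
}

validate_ratios(question_type_distribution)
validate_ratios({"single_turn": 0.60, "multi_turn": 0.40})
```

Running a check like this right after building the config would surface a mistyped ratio before any LLM calls are made.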


In [4]:
# Load Azure credentials from environment variables (these override the placeholder defaults above)

config.azure_endpoint = os.getenv("AZURE_OAI_ENDPOINT")
config.azure_api_key = os.getenv("AZURE_OAI_API_KEY")
config.azure_deployment_gpt4 = os.getenv("AZURE_OAI_DEPLOYMENT")
config.azure_api_version = os.getenv("AZURE_OAI_API_VERSION")


print("✅ Credentials configured")

✅ Credentials configured
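
`os.getenv` returns `None` for a variable that is not set, so the cell above reports success even when a credential is missing. One way to fail early is sketched below. `read_required_env` is a hypothetical helper, and the values in the usage example are placeholders rather than real credentials.

```python
import os
from typing import Dict, Iterable, Mapping

# Variable names taken from the cell above
REQUIRED_VARS = (
    "AZURE_OAI_ENDPOINT",
    "AZURE_OAI_API_KEY",
    "AZURE_OAI_DEPLOYMENT",
    "AZURE_OAI_API_VERSION",
)


def read_required_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the requested variables, raising if any is unset or blank."""
    values = {name: env.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise EnvironmentError("Missing environment variables: " + ", ".join(missing))
    return values


# Usage with an explicit mapping (placeholder values for illustration only)
example_env = {
    "AZURE_OAI_ENDPOINT": "https://your-endpoint.openai.azure.com/",
    "AZURE_OAI_API_KEY": "your-api-key",
    "AZURE_OAI_DEPLOYMENT": "gpt-4",
    "AZURE_OAI_API_VERSION": "2024-02-01",
}
credentials = read_required_env(REQUIRED_VARS, example_env)
```

In the notebook, calling `read_required_env(REQUIRED_VARS)` without the second argument would read from the real environment.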


## 3. Initialize LLM

In [5]:
llm = AzureChatOpenAI(
    azure_endpoint=config.azure_endpoint,
    api_key=config.azure_api_key,
    api_version=config.azure_api_version,
    deployment_name=config.azure_deployment_gpt4,
    temperature=config.temperature,
    max_tokens=config.max_tokens,
)

# Test
test_response = llm.invoke("Say 'Ready'")
print(f"✅ LLM Test: {test_response.content}")

✅ LLM Test: Ready


## 4. Document Loading & Chunking

In [7]:
DOCUMENTS_DIR = "./documents"
Path(DOCUMENTS_DIR).mkdir(parents=True, exist_ok=True)


from tqdm import tqdm  # Use standard tqdm

def load_documents(documents_dir: str) -> List[Document]:
    documents = []
    doc_dir = Path(documents_dir)
    
    # Load PDFs
    pdf_files = list(doc_dir.glob("*.pdf"))
    for pdf_file in tqdm(pdf_files, desc="Loading PDFs"):
        try:
            loader = PyPDFLoader(str(pdf_file))
            docs = loader.load()
            for doc in docs:
                doc.metadata["source_file"] = pdf_file.name
                doc.metadata["file_type"] = "pdf"
            documents.extend(docs)
        except Exception as e:
            print(f"Error loading {pdf_file.name}: {e}")
    
    return documents

# Load documents
documents = load_documents(DOCUMENTS_DIR)

print(f"\n✅ Total documents loaded: {len(documents)}")

if len(documents) == 0:
    print("⚠️  No documents found!")
    print(f"   Please add PDF files to: {DOCUMENTS_DIR}")
else:
    # Group by source file
    files = {}
    for doc in documents:
        source = doc.metadata.get("source_file", "unknown")
        files[source] = files.get(source, 0) + 1
    
    print("\nDocuments by file:")
    for file, count in files.items():
        print(f"  - {file}: {count} pages/sections")

Loading PDFs: 100%|██████████| 1/1 [00:00<00:00,  1.13it/s]


✅ Total documents loaded: 26

Documents by file:
  - othp33.pdf: 26 pages/sections





In [8]:
def chunk_documents(documents: List[Document], config: CustomConfig) -> Dict[str, List[Document]]:
    """Chunk documents by source file"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=config.chunk_size,
        chunk_overlap=config.chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    
    # Group by source
    docs_by_file = {}
    for doc in documents:
        source = doc.metadata.get("source_file", "unknown")
        if source not in docs_by_file:
            docs_by_file[source] = []
        docs_by_file[source].append(doc)
    
    # Chunk each file
    chunked_docs = {}
    for source, docs in docs_by_file.items():
        combined_text = "\n\n".join([d.page_content for d in docs])
        combined_doc = Document(page_content=combined_text, metadata={"source_file": source})
        chunks = text_splitter.split_documents([combined_doc])
        
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_index"] = i
            chunk.metadata["total_chunks"] = len(chunks)
        
        chunked_docs[source] = chunks
        print(f"  ✓ {source}: {len(chunks)} chunks")
    
    return chunked_docs

if documents:
    chunked_docs_by_file = chunk_documents(documents, config)
    print(f"\n✅ Chunking complete")
else:
    chunked_docs_by_file = {}
    print("⚠️  No documents to chunk")

  ✓ othp33.pdf: 69 chunks

✅ Chunking complete
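
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below uses a plain fixed-width sliding window. This is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` prefers to break on the configured separators, so its chunks will differ. `sliding_window_chunks` is a hypothetical name used only for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

With the notebook's settings (1500 and 300), each new window in this simplified model would advance by 1200 characters.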


In [9]:
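def sample_chunks_spatial(chunks: List[Document], n_samples: int, num_regions: int) -> List[Document]:
    """Sample chunks evenly from contiguous spatial regions of a document"""
    total = len(chunks)
    if total == 0 or n_samples <= 0 or num_regions <= 0:
        return []
    
    region_size = total // num_regions
    # At least one sample per region; avoids division by zero when n_samples < num_regions
    samples_per_region = max(1, n_samples // num_regions)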
    
    sampled = []
    for region_idx in range(num_regions):
        start = region_idx * region_size
        end = start + region_size if region_idx < num_regions - 1 else total
        region_chunks = chunks[start:end]
        
        if len(region_chunks) > 0:
            step = max(1, len(region_chunks) // samples_per_region)
            selected = region_chunks[::step][:samples_per_region]
            for chunk in selected:
                chunk.metadata["region_id"] = region_idx
            sampled.extend(selected)
    
    return sampled[:n_samples]

# Sample chunks from each document
sampled_chunks_by_file = {}
for source, chunks in chunked_docs_by_file.items():
    sampled = sample_chunks_spatial(chunks, config.questions_per_document, config.num_spatial_regions)
    sampled_chunks_by_file[source] = sampled
    print(f"{source}: Sampled {len(sampled)} chunks")

print(f"\n✅ Sampling complete")

othp33.pdf: Sampled 20 chunks

✅ Sampling complete


In [10]:
sampled_chunks_by_file

{'othp33.pdf': [Document(metadata={'source_file': 'othp33.pdf', 'chunk_index': 0, 'total_chunks': 69, 'region_id': 0}, page_content='European Central Bank\nBank of Japan\nSveriges Riksbank\nSwiss National Bank\nBank of England\nBoard of Governors Federal Reserve System\nBank for International Settlements\nBank of Canada\nCentral bank digital currencies: \nfoundational principles and core features\nin a series of collaborations \nfrom a group of central banks \nReport no 1 \n\n  \n \n  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nThis publication is available on the BIS website (www.bis.org).  \n \n \n \n© Bank for International Settlements 2020. All rights reserved. Brief excerpts may be reproduced or \ntranslated provided the source is stated.  \n \nISBN: 978-92-9259-427-5 (online)'),
  Document(metadata={'source_file': 'othp33.pdf', 'chunk_index': 3, 'total_chunks': 69, 'region_id': 0}, page_content='4 CBDC design and technology ..........

## 5. Question Generation Prompts (All Types)

In [11]:
# Master Question Generation Prompt
QUESTION_GENERATION_PROMPT = """
You are a professional test question generator for {domain_name}.

DOCUMENT CONTEXT:
{context}

QUESTION TYPE: {question_type}

QUESTION TYPE DEFINITIONS:

1. **chatbot_style**: Conversational, help-seeking questions
   Example: "Can you help me understand the meeting room booking policy?"
   Example: "I need assistance with the escalation procedure. What should I do?"

2. **direct_factual**: Direct, specific factual questions
   Example: "What is the escalation timeframe for Priority 1 incidents?"
   Example: "How many approval stages are required for external meetings?"

3. **procedural**: Step-by-step process questions
   Example: "What are the steps to book a meeting room for external participants?"
   Example: "Describe the complete workflow for compliance violation reporting."

4. **scenario**: Realistic business situation questions
   Example: "A client requests an urgent meeting room for tomorrow at 8 AM, but all rooms are booked. The client is a senior regulatory official. What should the coordinator do?"

5. **analytical**: Analysis, evaluation, or comparison questions
   Example: "Analyze the differences between Priority 1 and Priority 2 incident response protocols."
   Example: "What factors determine whether a meeting requires executive approval?"

6. **compliance**: Regulatory or policy questions
   Example: "According to MiFID II requirements, what documentation must be retained for investment advisory meetings?"

7. **descriptive**: Questions requiring detailed descriptions
   Example: "Describe the roles and responsibilities of the compliance officer in the escalation process."

8. **multi_hop**: Questions requiring multiple pieces of information
   Example: "If a Priority 1 incident occurs during non-business hours and the primary contact is unavailable, what is the backup procedure and who has the authority to approve emergency measures?"

9. **comparative**: Questions comparing two or more items
   Example: "Compare the booking procedures for internal meetings versus external meetings with regulatory participants."

10. **conditional**: Questions with if-then scenarios
    Example: "If a meeting room booking is cancelled less than 24 hours in advance, what are the consequences and alternative options?"

QUALITY REQUIREMENTS:
- Be SPECIFIC: Reference sections, timeframes, roles, procedures
- Use PROFESSIONAL language appropriate for the type
- Include CONTEXT: Who, what, when, where relevant to type
- Make ANSWERABLE from the document context provided
- Avoid GENERIC questions that could apply to any document

OUTPUT FORMAT (JSON):
{{
    "question": "Your specific question here",
    "answer": "Complete, detailed answer based on the context",
    "question_type": "{question_type}",
    "complexity": "easy|medium|hard",
    "references": ["Section X", "Page Y"] or []
}}

Generate ONE {question_type} question now:
"""

print("✅ Question generation prompt defined")

✅ Question generation prompt defined


In [12]:
# Follow-up Question Generation Prompt
FOLLOWUP_GENERATION_PROMPT = """
You are generating a FOLLOW-UP question for a multi-turn conversation.

DOCUMENT CONTEXT:
{context}

INITIAL QUESTION:
{initial_question}

INITIAL ANSWER:
{initial_answer}

FOLLOW-UP REQUIREMENTS:
1. **Build on the initial question** - Reference or extend it naturally
2. **Dig deeper** - Ask for more detail, exceptions, edge cases, or consequences
3. **Maintain context** - Assume the initial answer is known
4. **Be conversational** - Natural progression of the conversation

FOLLOW-UP TYPES:
- **Clarification**: "In the procedure you mentioned, what happens if X?"
- **Edge case**: "What if the standard approach doesn't apply because Y?"
- **Deeper detail**: "For the documentation requirement, how long must records be retained?"
- **Consequence**: "After following that procedure, what are the next steps?"
- **Exception**: "Are there any circumstances where this rule can be waived?"

OUTPUT FORMAT (JSON):
{{
    "followup_question": "Your follow-up question here",
    "followup_answer": "Complete answer to the follow-up",
    "followup_type": "clarification|edge_case|deeper_detail|consequence|exception"
}}

Generate ONE follow-up question now:
"""

print("✅ Follow-up generation prompt defined")

✅ Follow-up generation prompt defined


## 6. Question Generation Functions

In [13]:
def generate_question(chunk: Document, question_type: str, config: CustomConfig) -> Optional[Dict]:
    """Generate a single question of specified type"""
    prompt = QUESTION_GENERATION_PROMPT.format(
        context=chunk.page_content[:2500],
        question_type=question_type,
        domain_name=config.domain_name
    )
    
    try:
        response = llm.invoke(prompt)
        content = response.content.strip()
        
        # Strip an optional Markdown code fence (``` or ```json) around the JSON
        content = re.sub(r'^```(?:json)?\s*', '', content)
        content = re.sub(r'\s*```$', '', content)
        content = content.strip()
        
        result = json.loads(content)
        result['source_file'] = chunk.metadata.get('source_file', 'unknown')
        result['chunk_index'] = chunk.metadata.get('chunk_index', 0)
        result['region_id'] = chunk.metadata.get('region_id', 0)
        return result
    except Exception as e:
        print(f"  ⚠️  Error generating {question_type}: {str(e)[:100]}")
        return None

def generate_followup(initial_qa: Dict, chunk: Document) -> Optional[Dict]:
    """Generate a follow-up question"""
    prompt = FOLLOWUP_GENERATION_PROMPT.format(
        context=chunk.page_content[:2500],
        initial_question=initial_qa['question'],
        initial_answer=initial_qa['answer']
    )
    
    try:
        response = llm.invoke(prompt)
        content = response.content.strip()
        content = re.sub(r'^```(?:json)?\s*', '', content)
        content = re.sub(r'\s*```$', '', content)
        content = content.strip()
        result = json.loads(content)
        return result
    except Exception as e:
        print(f"  ⚠️  Error generating follow-up: {str(e)[:100]}")
        return None

print("✅ Generation functions defined")

✅ Generation functions defined
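
Both functions assume the model returns either bare JSON or JSON wrapped in a single code fence. Models sometimes add a sentence before or after the JSON, which makes `json.loads` fail and drops the question. A more tolerant parser might look like the sketch below. `extract_json_object` is a hypothetical helper, and the sample responses are made up for illustration.

```python
import json
import re
from typing import Any, Dict, Optional

# Matches a fenced block, with or without a "json" language tag
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in an LLM response, or None."""
    # Candidate 1..n: contents of fenced blocks
    candidates = FENCE_RE.findall(text)
    # Last candidate: everything between the first "{" and the last "}"
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


fenced = 'Sure, here it is:\n```json\n{"question": "Q?", "answer": "A"}\n```'
chatty = 'Result: {"question": "Q?", "answer": {"text": "A"}} Hope this helps.'
print(extract_json_object(fenced))
print(extract_json_object(chatty))
print(extract_json_object("no json here"))
```

Returning `None` instead of raising keeps the same contract as `generate_question`, so the caller's `if qa:` check would still apply.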


## 7. Generate Test Set

In [15]:
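# TEST MODE
TEST_MODE = True
# Note: the generation loop pairs one question with each sampled chunk, so the
# effective count is capped at the number of sampled chunks per document.
TEST_QUESTIONS_PER_DOC = 50

if TEST_MODE:
    print(f"⚠️  TEST MODE: Generating only {TEST_QUESTIONS_PER_DOC} questions per document")
    print("   Set TEST_MODE = False for full generation\n")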
    questions_to_generate = TEST_QUESTIONS_PER_DOC
else:
    questions_to_generate = config.questions_per_document

def get_question_type_sequence(n_questions: int, distribution: Dict[str, float]) -> List[str]:
    """Create question type sequence based on distribution"""
    sequence = []
    for q_type, ratio in distribution.items():
        count = max(1, int(n_questions * ratio))
        sequence.extend([q_type] * count)
    
    import random
    random.shuffle(sequence)
    return sequence[:n_questions]

print("✅ Ready to generate questions")

⚠️  TEST MODE: Generating only 50 questions per document
   Set TEST_MODE = False for full generation

✅ Ready to generate questions
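
Because `get_question_type_sequence` truncates each `n_questions * ratio` with `int`, the per-type counts can add up to fewer than `n_questions` (for 50 questions and the configured ratios they add up to 48). One alternative, swapped in here purely as an illustration, is largest-remainder allocation: take the floor of each share, then hand the leftover slots to the types with the largest fractional parts. `allocate_counts` is a hypothetical helper and assumes the ratios sum to 1.

```python
import math
from typing import Dict


def allocate_counts(n_questions: int, distribution: Dict[str, float]) -> Dict[str, int]:
    """Largest-remainder allocation: counts sum to n_questions when ratios sum to 1."""
    exact = {name: n_questions * ratio for name, ratio in distribution.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = n_questions - sum(counts.values())
    # Types with the largest fractional part receive the remaining slots first
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:max(0, leftover)]:
        counts[name] += 1
    return counts


# Ratios copied from CustomConfig
distribution = {
    "chatbot_style": 0.10, "direct_factual": 0.10, "procedural": 0.15,
    "scenario": 0.20, "analytical": 0.10, "compliance": 0.05,
    "descriptive": 0.10, "multi_hop": 0.10, "comparative": 0.05,
    "conditional": 0.05,
}

counts_50 = allocate_counts(50, distribution)
counts_20 = allocate_counts(20, distribution)
print(sum(counts_50.values()), sum(counts_20.values()))
```

Unlike the `max(1, ...)` rule in the notebook, this sketch can assign zero questions to a very rare type when `n_questions` is small, which may or may not be desirable.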


In [16]:
# Generate questions for all documents
all_questions = []

for source_file, sampled_chunks in sampled_chunks_by_file.items():
    print(f"\n{'='*70}")
    print(f"Generating questions for: {source_file}")
    print(f"{'='*70}")
    
    # Get question type sequence
    question_types = get_question_type_sequence(questions_to_generate, config.question_type_distribution)
    
    print(f"\nQuestion type breakdown:")
    for q_type in set(question_types):
        count = question_types.count(q_type)
        print(f"  - {q_type}: {count}")
    
    # Decide which questions get follow-ups (multi-turn)
    n_multiturn = int(len(question_types) * config.multi_turn_ratio)
    multiturn_indices = np.random.choice(len(question_types), n_multiturn, replace=False)
    
    doc_questions = []
    
    for idx, (chunk, q_type) in enumerate(tqdm(list(zip(sampled_chunks[:questions_to_generate], question_types)), 
                                                desc="Generating")):
        # Generate initial question
        qa = generate_question(chunk, q_type, config)
        
        if qa:
            qa['conversation_type'] = 'multi_turn' if idx in multiturn_indices else 'single_turn'
            qa['turn_number'] = 1
            qa['parent_question_id'] = None
            qa['has_followup'] = False
            qa['followup_questions'] = []
            
            doc_questions.append(qa)
            
            # Generate follow-ups if multi-turn
            if idx in multiturn_indices:
                n_turns = np.random.randint(1, config.max_turns_per_conversation)
                followups = []
                
                for turn in range(n_turns):
                    followup = generate_followup(qa, chunk)
                    if followup:
                        followups.append(followup)
                
                if followups:
                    qa['has_followup'] = True
                    qa['followup_questions'] = followups
    
    all_questions.extend(doc_questions)
    
    print(f"\n✅ Generated {len(doc_questions)} questions")
    print(f"   - Single-turn: {sum(1 for q in doc_questions if q['conversation_type'] == 'single_turn')}")
    print(f"   - Multi-turn: {sum(1 for q in doc_questions if q['conversation_type'] == 'multi_turn')}")
    print(f"   - Total with follow-ups: {sum(len(q['followup_questions']) for q in doc_questions)}")

print(f"\n{'='*70}")
print(f"✅ TOTAL QUESTIONS GENERATED: {len(all_questions)}")
print(f"   - Multi-turn conversations: {sum(1 for q in all_questions if q['has_followup'])}")
print(f"   - Total follow-up questions: {sum(len(q['followup_questions']) for q in all_questions)}")
print(f"{'='*70}")


Generating questions for: othp33.pdf

Question type breakdown:
  - chatbot_style: 5
  - conditional: 2
  - scenario: 10
  - descriptive: 5
  - analytical: 5
  - compliance: 2
  - multi_hop: 5
  - direct_factual: 5
  - comparative: 2
  - procedural: 7


Generating: 100%|██████████| 20/20 [01:38<00:00,  4.93s/it]


✅ Generated 20 questions
   - Single-turn: 9
   - Multi-turn: 11
   - Total with follow-ups: 17

✅ TOTAL QUESTIONS GENERATED: 20
   - Multi-turn conversations: 11
   - Total follow-up questions: 17
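
In the loop above, the multi-turn indices are drawn over the full type sequence (48 entries in test mode), while only 20 chunk and type pairs are actually processed. The number of multi-turn questions among those 20 therefore varies from run to run (11 here, against a 40% target of 8). A sketch of drawing the indices over the processed pairs instead is shown below. `pick_multiturn_indices` is a hypothetical helper, and it uses the standard `random` module rather than NumPy.

```python
import random
from typing import Optional, Set


def pick_multiturn_indices(
    n_chunks: int,
    n_types: int,
    multi_turn_ratio: float,
    seed: Optional[int] = None,
) -> Set[int]:
    """Choose which of the processed (chunk, type) pairs get follow-up questions."""
    n_pairs = min(n_chunks, n_types)  # zip() stops at the shorter sequence
    n_multiturn = int(n_pairs * multi_turn_ratio)
    rng = random.Random(seed)  # seedable for reproducible test sets
    return set(rng.sample(range(n_pairs), n_multiturn))


multiturn = pick_multiturn_indices(n_chunks=20, n_types=48, multi_turn_ratio=0.40, seed=0)
print(len(multiturn))
```

Returning a `set` also makes the `idx in multiturn_indices` membership test in the loop a constant-time lookup.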





## 8. Review Sample Questions

In [22]:
len(all_questions)

20

In [20]:
# Display sample questions
print("\n📋 SAMPLE GENERATED QUESTIONS\n")
print("="*80)

for i, qa in enumerate(all_questions, 1):
    print(f"\n🔷 QUESTION {i}")
    print(f"   Type: {qa.get('question_type', 'unknown').upper()}")
    print(f"   Conversation: {qa.get('conversation_type', 'unknown').upper()}")
    print(f"   Source: {qa.get('source_file', 'unknown')}")
    print(f"   Region: {qa.get('region_id', 0) + 1}")
    
    print(f"\n   Q: {qa['question']}")
    answer_preview = qa['answer'][:150] + "..." if len(qa['answer']) > 150 else qa['answer']
    print(f"   A: {answer_preview}")
    print(f"   Complexity: {qa.get('complexity', 'N/A')}")
    
    if qa.get('has_followup'):
        print(f"\n   📎 FOLLOW-UP QUESTIONS: {len(qa['followup_questions'])}")
        for j, followup in enumerate(qa['followup_questions'], 1):
            print(f"      {j}. {followup['followup_question']}")
            print(f"         Type: {followup.get('followup_type', 'unknown')}")
    
    print("\n" + "-"*80)

print("\n💡 Review the questions. If satisfied, set TEST_MODE = False for full generation.")


📋 SAMPLE GENERATED QUESTIONS


🔷 QUESTION 1
   Type: SCENARIO
   Conversation: SINGLE_TURN
   Source: othp33.pdf
   Region: 1

   Q: A policy analyst at the Bank of England is tasked with preparing a briefing on cross-border collaboration for central bank digital currencies (CBDCs). The analyst needs to reference both the foundational principles and the core features of CBDCs as agreed upon by the group of central banks mentioned in Report no 1. What steps should the analyst take to ensure that the briefing accurately reflects the collaborative approach outlined by the European Central Bank, Bank of Japan, Sveriges Riksbank, Swiss National Bank, Board of Governors Federal Reserve System, Bank of Canada, and Bank for International Settlements?
   A: The analyst should first access the publication 'Central bank digital currencies: foundational principles and core features, Report no 1,' which is av...
   Complexity: medium

---------------------------------------------------------

## 9. Convert to DataFrame with Complete Classification

In [19]:
# Flatten to rows with full classification
flat_questions = []

for qa in all_questions:
    # Initial question
    question_id = f"Q{len(flat_questions) + 1}"
    
    flat_qa = {
        'question_id': question_id,
        'source_file': qa.get('source_file'),
        'region_id': qa.get('region_id', 0) + 1,
        'chunk_index': qa.get('chunk_index', 0),
        
        # Question content
        'question': qa['question'],
        'answer': qa['answer'],
        
        # Classification columns
        'conversation_type': qa.get('conversation_type', 'single_turn'),
        'turn_number': 1,
        'parent_question_id': None,
        'is_followup': False,
        
        'question_style': qa.get('question_type', 'unknown'),
        'question_type': qa.get('question_type', 'unknown'),
        'complexity_level': qa.get('complexity', 'medium'),
        
        'has_context': qa.get('conversation_type') == 'multi_turn',
        'generation_method': 'custom',
        
        'references': ', '.join(qa.get('references', [])) if qa.get('references') else '',
    }
    flat_questions.append(flat_qa)
    
    parent_id = question_id
    
    # Follow-up questions
    for turn_num, followup in enumerate(qa.get('followup_questions', []), 2):
        followup_id = f"Q{len(flat_questions) + 1}"
        
        followup_qa = {
            'question_id': followup_id,
            'source_file': qa.get('source_file'),
            'region_id': qa.get('region_id', 0) + 1,
            'chunk_index': qa.get('chunk_index', 0),
            
            'question': followup['followup_question'],
            'answer': followup['followup_answer'],
            
            'conversation_type': 'multi_turn',
            'turn_number': turn_num,
            'parent_question_id': parent_id,
            'is_followup': True,
            
            'question_style': followup.get('followup_type', 'followup'),
            'question_type': f"followup_{followup.get('followup_type', 'unknown')}",
            'complexity_level': 'medium',
            
            'has_context': True,
            'generation_method': 'custom',
            
            'references': '',
        }
        flat_questions.append(followup_qa)

df = pd.DataFrame(flat_questions)

print(f"✅ DataFrame created with {len(df)} total questions")
print(f"\nColumns: {', '.join(df.columns.tolist())}")
print(f"\nBreakdown:")
print(f"   - Initial questions: {len(df[df['is_followup'] == False])}")
print(f"   - Follow-up questions: {len(df[df['is_followup'] == True])}")
print(f"   - Single-turn: {len(df[df['conversation_type'] == 'single_turn'])}")
print(f"   - Multi-turn: {len(df[df['conversation_type'] == 'multi_turn'])}")

# Display sample
print("\nSample DataFrame:")
display(df[['question_id', 'question_type', 'conversation_type', 'turn_number', 'parent_question_id']].head(15))

✅ DataFrame created with 37 total questions

Columns: question_id, source_file, region_id, chunk_index, question, answer, conversation_type, turn_number, parent_question_id, is_followup, question_style, question_type, complexity_level, has_context, generation_method, references

Breakdown:
   - Initial questions: 20
   - Follow-up questions: 17
   - Single-turn: 9
   - Multi-turn: 28

Sample DataFrame:


| | question_id | question_type | conversation_type | turn_number | parent_question_id |
|---|---|---|---|---|---|
| 0 | Q1 | scenario | single_turn | 1 | |
| 1 | Q2 | analytical | single_turn | 1 | |
| 2 | Q3 | procedural | multi_turn | 1 | |
| 3 | Q4 | followup_deeper_detail | multi_turn | 2 | Q3 |
| 4 | Q5 | procedural | multi_turn | 1 | |
| 5 | Q6 | followup_consequence | multi_turn | 2 | Q5 |
| 6 | Q7 | conditional | single_turn | 1 | |
| 7 | Q8 | analytical | multi_turn | 1 | |
| 8 | Q9 | followup_exception | multi_turn | 2 | Q8 |
| 9 | Q10 | followup_exception | multi_turn | 3 | Q8 |
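
The flat layout links every follow-up to its initial question through `parent_question_id`, so conversations can be reassembled when needed. The sketch below does this on plain dictionaries shaped like the entries of `flat_questions`. `build_chains` is a hypothetical helper, and the rows are a subset of the sample table above.

```python
from typing import Any, Dict, List


def build_chains(rows: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each initial question ID to its follow-up IDs, ordered by turn number."""
    chains: Dict[str, List[str]] = {}
    for row in sorted(rows, key=lambda r: r["turn_number"]):
        parent = row.get("parent_question_id")
        if parent is None:
            chains.setdefault(row["question_id"], [])  # initial question, no follow-ups yet
        else:
            chains.setdefault(parent, []).append(row["question_id"])
    return chains


# Rows Q7 to Q10 from the sample DataFrame
rows = [
    {"question_id": "Q7", "turn_number": 1, "parent_question_id": None},
    {"question_id": "Q8", "turn_number": 1, "parent_question_id": None},
    {"question_id": "Q9", "turn_number": 2, "parent_question_id": "Q8"},
    {"question_id": "Q10", "turn_number": 3, "parent_question_id": "Q8"},
]
chains = build_chains(rows)
print(chains)  # {'Q7': [], 'Q8': ['Q9', 'Q10']}
```

When reading the exported CSV back with pandas, missing parents would typically come back as `NaN` rather than `None`, so the check on `parent` would need adjusting in that case.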


## 10. Coverage Analysis

In [None]:
# Analyze coverage
print("\n📊 COVERAGE ANALYSIS\n")
print("="*70)

# Spatial coverage
print("\n1. Spatial Coverage (by region):")
region_coverage = df.groupby(['source_file', 'region_id']).size().unstack(fill_value=0)
print(region_coverage)

# Question type distribution
print("\n2. Question Type Distribution:")
type_dist = df['question_type'].value_counts()
print(type_dist)

# Conversation type
print("\n3. Conversation Type:")
conv_dist = df['conversation_type'].value_counts()
print(conv_dist)
print(f"   Multi-turn ratio: {conv_dist.get('multi_turn', 0) / len(df) * 100:.1f}%")

# Complexity distribution
print("\n4. Complexity Distribution:")
complexity_dist = df['complexity_level'].value_counts()
print(complexity_dist)

print("\n" + "="*70)
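
The raw counts printed above do not show how far the generated set drifts from the configured targets. A small comparison, sketched below, reports the gap between the observed share and the target share for each question type. `distribution_gap` is a hypothetical helper, the targets are copied from `CustomConfig`, and the observed counts are the initial-question counts reported in the exported summary.

```python
from collections import Counter
from typing import Dict, Iterable


def distribution_gap(observed_types: Iterable[str], target: Dict[str, float]) -> Dict[str, float]:
    """Observed share minus target share per type; labels not in target are ignored."""
    counts = Counter(observed_types)
    total = sum(counts[name] for name in target)
    if total == 0:
        return {name: -ratio for name, ratio in target.items()}
    return {name: counts[name] / total - ratio for name, ratio in target.items()}


target = {
    "chatbot_style": 0.10, "direct_factual": 0.10, "procedural": 0.15,
    "scenario": 0.20, "analytical": 0.10, "compliance": 0.05,
    "descriptive": 0.10, "multi_hop": 0.10, "comparative": 0.05,
    "conditional": 0.05,
}
observed_counts = {
    "scenario": 4, "procedural": 3, "multi_hop": 3, "analytical": 2,
    "conditional": 2, "direct_factual": 2, "compliance": 1,
    "descriptive": 1, "comparative": 1, "chatbot_style": 1,
}
# Expand counts into one label per question, plus a follow-up label that is ignored
observed = [name for name, n in observed_counts.items() for _ in range(n)]
observed.append("followup_exception")

gap = distribution_gap(observed, target)
for name, delta in sorted(gap.items(), key=lambda item: item[1]):
    print(f"{name:15s} {delta:+.2f}")
```

Negative values mark under-represented types, which could guide adjustments to `question_type_distribution` before a full run.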

## 11. Export Test Set

In [24]:
OUTPUT_DIR = "./outputs"
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Export full test set with all classification columns
full_file = f"{OUTPUT_DIR}/testset_full_classified.csv"
df.to_csv(full_file, index=False)
print(f"✅ Exported: {full_file}")

# Export simple format (for RAG evaluation)
simple_df = df[['question', 'answer', 'source_file']].copy()
simple_df.columns = ['question', 'ground_truth', 'source']
simple_file = f"{OUTPUT_DIR}/testset_simple.csv"
simple_df.to_csv(simple_file, index=False)
print(f"✅ Exported: {simple_file}")

# Export conversation chains
multiturn_df = df[df['conversation_type'] == 'multi_turn'].copy()
if len(multiturn_df) > 0:
    chains_file = f"{OUTPUT_DIR}/conversation_chains.csv"
    multiturn_df.to_csv(chains_file, index=False)
    print(f"✅ Exported: {chains_file}")
    print(f"   {len(multiturn_df)} multi-turn questions")

# Export summary
summary = {
    'total_questions': len(df),
    'initial_questions': len(df[df['is_followup'] == False]),
    'followup_questions': len(df[df['is_followup'] == True]),
    'single_turn': len(df[df['conversation_type'] == 'single_turn']),
    'multi_turn': len(df[df['conversation_type'] == 'multi_turn']),
    'question_types': df['question_type'].value_counts().to_dict(),
    'questions_per_document': df.groupby('source_file').size().to_dict(),
    'region_coverage': df.groupby('region_id').size().to_dict(),
}

summary_file = f"{OUTPUT_DIR}/testset_summary.json"
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)
print(f"✅ Exported: {summary_file}")

print(f"\n{'='*70}")
print("📊 SUMMARY:")
print(json.dumps(summary, indent=2))
print(f"{'='*70}")

✅ Exported: ./outputs/testset_full_classified.csv
✅ Exported: ./outputs/testset_simple.csv
✅ Exported: ./outputs/conversation_chains.csv
   28 multi-turn questions
✅ Exported: ./outputs/testset_summary.json

📊 SUMMARY:
{
  "total_questions": 37,
  "initial_questions": 20,
  "followup_questions": 17,
  "single_turn": 9,
  "multi_turn": 28,
  "question_types": {
    "followup_exception": 9,
    "followup_deeper_detail": 6,
    "scenario": 4,
    "procedural": 3,
    "multi_hop": 3,
    "analytical": 2,
    "conditional": 2,
    "direct_factual": 2,
    "followup_consequence": 1,
    "compliance": 1,
    "descriptive": 1,
    "comparative": 1,
    "followup_clarification": 1,
    "chatbot_style": 1
  },
  "questions_per_document": {
    "othp33.pdf": 37
  },
  "region_coverage": {
    "1": 7,
    "2": 12,
    "3": 6,
    "4": 12
  }
}


## 12. Visualization

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Question type distribution
type_counts = df['question_type'].value_counts().head(10)
axes[0, 0].barh(range(len(type_counts)), type_counts.values)
axes[0, 0].set_yticks(range(len(type_counts)))
axes[0, 0].set_yticklabels(type_counts.index)
axes[0, 0].set_xlabel('Count')
axes[0, 0].set_title('Question Type Distribution')
axes[0, 0].invert_yaxis()

# 2. Conversation type
conv_counts = df['conversation_type'].value_counts()
axes[0, 1].pie(conv_counts.values, labels=conv_counts.index, autopct='%1.1f%%')
axes[0, 1].set_title('Single-turn vs Multi-turn')

# 3. Spatial coverage
region_counts = df['region_id'].value_counts().sort_index()
axes[1, 0].bar(region_counts.index, region_counts.values)
axes[1, 0].set_xlabel('Region')
axes[1, 0].set_ylabel('Questions')
axes[1, 0].set_title('Spatial Coverage by Region')
axes[1, 0].set_xticks(region_counts.index)

# 4. Complexity distribution
complexity_counts = df['complexity_level'].value_counts()
axes[1, 1].bar(range(len(complexity_counts)), complexity_counts.values)
axes[1, 1].set_xticks(range(len(complexity_counts)))
axes[1, 1].set_xticklabels(complexity_counts.index)
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Complexity Distribution')

plt.tight_layout()
viz_file = f"{OUTPUT_DIR}/testset_analysis.png"
plt.savefig(viz_file, dpi=300, bbox_inches='tight')
plt.show()

print(f"✅ Visualization saved: {viz_file}")

## ✅ Complete!

### 📁 Generated Files:
1. `testset_full_classified.csv` - Complete test set with all classification columns
2. `testset_simple.csv` - Simple format (question, ground_truth, source)
3. `conversation_chains.csv` - Multi-turn conversations only
4. `testset_summary.json` - Statistics and metadata
5. `testset_analysis.png` - Visualizations

### 🎯 Features Included:
- ✅ All 10 configured question types
- ✅ Multi-turn conversations with follow-ups
- ✅ Complete classification columns
- ✅ Spatial coverage across 4 regions per document
- ✅ Question generation grounded in document context
- ✅ No RAGAS dependency

### 🔄 Next Steps:
1. Review sample questions above
2. If satisfied, set `TEST_MODE = False` and run again
3. Adjust `question_type_distribution` in config if needed
4. Use generated test set for RAG evaluation