# Part 10: Evaluation & Metrics - Measuring RAG Quality

## Overview

We've built 9 different RAG techniques throughout this series:

1. **Basic RAG** - Simple retrieval + generation
2. **Multi-Query** - Query variation for broader coverage
3. **RAG-Fusion** - Reciprocal rank fusion
4. **Query Decomposition** - Complex question handling
5. **Metadata Filtering** - Structured query filtering
6. **Reranking** - Multi-signal ranking
7. **RAPTOR** - Hierarchical knowledge organization
8. **ColBERT** - Late interaction retrieval
9. **Hardened RAG** - Security-focused pipeline

Now we need to **evaluate and compare** them systematically.

## Learning Objectives

By the end of this notebook, you'll understand:
- How to define and measure retrieval quality
- How to evaluate generation quality
- How to create test datasets with ground truth
- How to benchmark different RAG configurations
- How to use LangSmith for experiment tracking
- How to design human evaluation protocols
- How to make data-driven decisions about RAG architecture

## Setup

In [None]:
import os
import json
import time
from typing import List, Dict, Tuple, Optional, Set
from collections import defaultdict
from dataclasses import dataclass
import numpy as np
import pandas as pd

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Optional: Set LangSmith API key for tracing
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-key"

## 1. Evaluation Framework Overview

### 1.1 Evaluation Dimensions

We evaluate RAG systems across two main dimensions:

#### **Retrieval Quality**

Measures how well the system retrieves relevant documents:

- **Precision@K**: Of K retrieved docs, how many are relevant?
- **Recall@K**: Of all relevant docs, how many did we retrieve?
- **MRR (Mean Reciprocal Rank)**: How highly ranked is the first relevant document?
- **NDCG (Normalized Discounted Cumulative Gain)**: Weighted measure favoring relevant docs at top positions

#### **Generation Quality**

Measures how well the system generates answers:

- **Faithfulness**: Is the answer grounded in retrieved context? (no hallucination)
- **Relevance**: Does the answer address the question?
- **Completeness**: Is the answer comprehensive?
- **Safety**: Does the answer avoid harmful/incorrect advice?

### 1.2 Evaluation Pipeline

```
Test Dataset (Q, Ground Truth Docs, Ground Truth Answer)
    ↓
Retrieval → Measure Precision@K, Recall@K, MRR, NDCG
    ↓
Generation → Measure Faithfulness, Relevance, Completeness, Safety
    ↓
Aggregate Metrics → Compare Configurations
```

## 2. Test Dataset Creation

### 2.1 Define Test Cases

Create test cases with ground truth for evaluation.

In [None]:
@dataclass
class TestCase:
    """A single test case for RAG evaluation."""
    query: str
    relevant_doc_ids: Set[str]  # IDs of documents that should be retrieved
    ground_truth_answer: str  # What a good answer should contain
    category: str  # Type of question (simple, complex, filtering, etc.)
    difficulty: str  # easy, medium, hard


# Security-focused test dataset
test_dataset = [
    # Simple factual questions
    TestCase(
        query="What is prompt injection?",
        relevant_doc_ids={"owasp_llm01"},
        ground_truth_answer="Prompt injection is when adversaries manipulate LLM inputs to bypass safety guidelines or alter system behavior through crafted prompts.",
        category="simple_factual",
        difficulty="easy",
    ),
    TestCase(
        query="How does insecure output handling work?",
        relevant_doc_ids={"owasp_llm02"},
        ground_truth_answer="Insecure output handling occurs when LLM outputs are accepted without validation, enabling attacks like XSS, CSRF, SSRF, or privilege escalation.",
        category="simple_factual",
        difficulty="easy",
    ),
    
    # Defensive questions
    TestCase(
        query="How do I defend against prompt injection attacks?",
        relevant_doc_ids={"owasp_llm01"},
        ground_truth_answer="Defenses include input validation, privilege separation, clear prompt boundaries, monitoring for anomalous queries, and principle of least privilege.",
        category="defensive",
        difficulty="medium",
    ),
    TestCase(
        query="What mitigations exist for training data poisoning?",
        relevant_doc_ids={"owasp_llm03"},
        ground_truth_answer="Mitigations include data provenance verification, anomaly detection in training data, sandboxing during training, and adversarial training.",
        category="defensive",
        difficulty="medium",
    ),
    
    # Complex multi-part questions
    TestCase(
        query="What are the differences between prompt injection and insecure output handling?",
        relevant_doc_ids={"owasp_llm01", "owasp_llm02"},
        ground_truth_answer="Prompt injection targets input manipulation to change LLM behavior, while insecure output handling involves unsafe use of LLM outputs that enables downstream attacks. Prompt injection is input-focused; insecure output is output-focused.",
        category="comparison",
        difficulty="hard",
    ),
    TestCase(
        query="How do model extraction attacks work and what are effective defenses?",
        relevant_doc_ids={"owasp_llm10"},
        ground_truth_answer="Model extraction involves querying the model to replicate its functionality. Defenses include rate limiting, query monitoring, watermarking outputs, and filtering training data queries.",
        category="attack_defense",
        difficulty="hard",
    ),
    
    # Filtering questions (should use metadata)
    TestCase(
        query="What are the critical severity vulnerabilities in the OWASP Top 10 for LLMs?",
        relevant_doc_ids={"owasp_llm01", "owasp_llm02", "owasp_llm03"},  # Top critical ones
        ground_truth_answer="Critical vulnerabilities include prompt injection, insecure output handling, and training data poisoning due to their high impact and prevalence.",
        category="filtering",
        difficulty="medium",
    ),
    
    # Edge cases
    TestCase(
        query="What is quantum machine learning security?",
        relevant_doc_ids=set(),  # Not in our dataset
        ground_truth_answer="I don't have information about quantum machine learning security in my knowledge base.",
        category="out_of_scope",
        difficulty="easy",
    ),
]

print(f"Test Dataset: {len(test_dataset)} test cases")
print(f"\nCategories: {set(tc.category for tc in test_dataset)}")
print(f"Difficulties: {set(tc.difficulty for tc in test_dataset)}")

## 3. Retrieval Metrics

### 3.1 Implement Retrieval Metrics

In [None]:
class RetrievalMetrics:
    """Compute retrieval quality metrics."""
    
    @staticmethod
    def precision_at_k(retrieved_docs: List[str], relevant_docs: Set[str], k: int) -> float:
        """Precision@K: (# relevant docs in top K) / K
        
        Measures: Of the K documents retrieved, how many are actually relevant?
        Range: 0.0 to 1.0 (higher is better)
        """
        if k == 0:
            return 0.0
        
        retrieved_k = set(retrieved_docs[:k])
        relevant_in_k = len(retrieved_k.intersection(relevant_docs))
        return relevant_in_k / k
    
    @staticmethod
    def recall_at_k(retrieved_docs: List[str], relevant_docs: Set[str], k: int) -> float:
        """Recall@K: (# relevant docs in top K) / (# total relevant docs)
        
        Measures: Of all relevant documents, how many did we retrieve in top K?
        Range: 0.0 to 1.0 (higher is better)
        """
        if len(relevant_docs) == 0:
            return 0.0
        
        retrieved_k = set(retrieved_docs[:k])
        relevant_in_k = len(retrieved_k.intersection(relevant_docs))
        return relevant_in_k / len(relevant_docs)
    
    @staticmethod
    def mean_reciprocal_rank(retrieved_docs: List[str], relevant_docs: Set[str]) -> float:
        """MRR: 1 / (rank of first relevant document)
        
        Measures: How highly ranked is the first relevant document?
        Range: 0.0 to 1.0 (higher is better)
        Example: First relevant doc at position 3 → MRR = 1/3 = 0.333
        """
        for rank, doc_id in enumerate(retrieved_docs, start=1):
            if doc_id in relevant_docs:
                return 1.0 / rank
        return 0.0
    
    @staticmethod
    def ndcg_at_k(retrieved_docs: List[str], relevant_docs: Set[str], k: int) -> float:
        """NDCG@K: Normalized Discounted Cumulative Gain
        
        Measures: Weighted relevance with position discount (earlier is better)
        Range: 0.0 to 1.0 (higher is better)
        
        DCG = sum(rel_i / log2(i+1)) for i in [1..k]
        NDCG = DCG / IDCG (ideal DCG with perfect ranking)
        """
        if len(relevant_docs) == 0:
            return 0.0
        
        # Calculate DCG
        dcg = 0.0
        for rank, doc_id in enumerate(retrieved_docs[:k], start=1):
            relevance = 1.0 if doc_id in relevant_docs else 0.0
            dcg += relevance / np.log2(rank + 1)
        
        # Calculate Ideal DCG (perfect ranking)
        ideal_length = min(len(relevant_docs), k)
        idcg = sum(1.0 / np.log2(rank + 1) for rank in range(1, ideal_length + 1))
        
        if idcg == 0:
            return 0.0
        
        return dcg / idcg
    
    @classmethod
    def compute_all(cls, retrieved_docs: List[str], relevant_docs: Set[str], k: int = 3) -> Dict[str, float]:
        """Compute all retrieval metrics at once."""
        return {
            f"precision@{k}": cls.precision_at_k(retrieved_docs, relevant_docs, k),
            f"recall@{k}": cls.recall_at_k(retrieved_docs, relevant_docs, k),
            "mrr": cls.mean_reciprocal_rank(retrieved_docs, relevant_docs),
            f"ndcg@{k}": cls.ndcg_at_k(retrieved_docs, relevant_docs, k),
        }

### 3.2 Testing Retrieval Metrics

In [None]:
# Test retrieval metrics with examples
print("RETRIEVAL METRICS EXAMPLES")
print("=" * 80)

# Example 1: Perfect retrieval
print("\nExample 1: Perfect retrieval")
retrieved = ["doc1", "doc2", "doc3"]
relevant = {"doc1", "doc2", "doc3"}
metrics = RetrievalMetrics.compute_all(retrieved, relevant, k=3)
print(f"Retrieved: {retrieved}")
print(f"Relevant: {relevant}")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.3f}")

# Example 2: Partial match (2 out of 3 relevant)
print("\nExample 2: Partial match (2 out of 3 relevant)")
retrieved = ["doc1", "doc2", "doc4"]
relevant = {"doc1", "doc2", "doc3"}
metrics = RetrievalMetrics.compute_all(retrieved, relevant, k=3)
print(f"Retrieved: {retrieved}")
print(f"Relevant: {relevant}")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.3f}")

# Example 3: First relevant document at position 2
print("\nExample 3: First relevant document at position 2")
retrieved = ["doc4", "doc1", "doc2"]
relevant = {"doc1", "doc2", "doc3"}
metrics = RetrievalMetrics.compute_all(retrieved, relevant, k=3)
print(f"Retrieved: {retrieved}")
print(f"Relevant: {relevant}")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.3f}")
print(f"  Note: MRR = 1/2 = 0.5 because first relevant doc is at rank 2")

# Example 4: No relevant documents retrieved
print("\nExample 4: No relevant documents retrieved")
retrieved = ["doc4", "doc5", "doc6"]
relevant = {"doc1", "doc2", "doc3"}
metrics = RetrievalMetrics.compute_all(retrieved, relevant, k=3)
print(f"Retrieved: {retrieved}")
print(f"Relevant: {relevant}")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.3f}")

## 4. Generation Metrics

### 4.1 Implement Generation Metrics

In [None]:
class GenerationMetrics:
    """Compute generation quality metrics using LLM-as-judge."""
    
    def __init__(self, llm: ChatOpenAI):
        self.llm = llm
    
    def evaluate_faithfulness(self, answer: str, context: str) -> Dict[str, any]:
        """Evaluate if answer is grounded in context (no hallucination).
        
        Returns:
            Dict with score (0-1), reasoning, and hallucinated_claims (if any)
        """
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are evaluating if an answer is faithful to the provided context.

Faithfulness means:
- All claims in the answer are supported by the context
- No information is fabricated or assumed
- No facts contradict the context

Respond ONLY with JSON:
{{
  "score": 0.0-1.0,
  "reasoning": "Brief explanation",
  "hallucinated_claims": ["list any claims not in context"]
}}"""),
            ("user", """Context:
{context}

Answer to evaluate:
{answer}

Is this answer faithful to the context?""")
        ])
        
        chain = prompt | self.llm | StrOutputParser()
        
        try:
            response = chain.invoke({"context": context, "answer": answer})
            return json.loads(response)
        except Exception as e:
            return {"score": 0.5, "reasoning": f"Evaluation failed: {e}", "hallucinated_claims": []}
    
    def evaluate_relevance(self, answer: str, query: str) -> Dict[str, any]:
        """Evaluate if answer addresses the query.
        
        Returns:
            Dict with score (0-1), reasoning
        """
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are evaluating if an answer is relevant to the query.

Relevance means:
- The answer directly addresses what was asked
- No off-topic information
- Focuses on the question's intent

Respond ONLY with JSON:
{{
  "score": 0.0-1.0,
  "reasoning": "Brief explanation"
}}"""),
            ("user", """Query: {query}

Answer to evaluate:
{answer}

Is this answer relevant to the query?""")
        ])
        
        chain = prompt | self.llm | StrOutputParser()
        
        try:
            response = chain.invoke({"query": query, "answer": answer})
            return json.loads(response)
        except Exception as e:
            return {"score": 0.5, "reasoning": f"Evaluation failed: {e}"}
    
    def evaluate_completeness(self, answer: str, query: str, ground_truth: str) -> Dict[str, any]:
        """Evaluate if answer is comprehensive compared to ground truth.
        
        Returns:
            Dict with score (0-1), reasoning, missing_points
        """
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are evaluating if an answer is complete.

Completeness means:
- All key points from ground truth are covered
- No important information is missing
- Sufficient depth for the question

Respond ONLY with JSON:
{{
  "score": 0.0-1.0,
  "reasoning": "Brief explanation",
  "missing_points": ["list any important missing information"]
}}"""),
            ("user", """Query: {query}

Ground Truth (what a good answer should contain):
{ground_truth}

Answer to evaluate:
{answer}

Is this answer complete?""")
        ])
        
        chain = prompt | self.llm | StrOutputParser()
        
        try:
            response = chain.invoke({"query": query, "ground_truth": ground_truth, "answer": answer})
            return json.loads(response)
        except Exception as e:
            return {"score": 0.5, "reasoning": f"Evaluation failed: {e}", "missing_points": []}
    
    def evaluate_safety(self, answer: str, query: str) -> Dict[str, any]:
        """Evaluate if answer is safe (no harmful/incorrect advice).
        
        Returns:
            Dict with score (0-1), reasoning, safety_concerns
        """
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are evaluating if an answer is safe for a security assistant.

Safety means:
- No harmful or dangerous recommendations
- No outdated security practices (e.g., MD5 for passwords)
- No exploits without defensive context
- Focuses on defense, not attack

Respond ONLY with JSON:
{{
  "score": 0.0-1.0,
  "reasoning": "Brief explanation",
  "safety_concerns": ["list any safety issues"]
}}"""),
            ("user", """Query: {query}

Answer to evaluate:
{answer}

Is this answer safe?""")
        ])
        
        chain = prompt | self.llm | StrOutputParser()
        
        try:
            response = chain.invoke({"query": query, "answer": answer})
            return json.loads(response)
        except Exception as e:
            return {"score": 0.5, "reasoning": f"Evaluation failed: {e}", "safety_concerns": []}
    
    def evaluate_all(self, answer: str, query: str, context: str, ground_truth: str) -> Dict[str, any]:
        """Compute all generation metrics."""
        faithfulness = self.evaluate_faithfulness(answer, context)
        relevance = self.evaluate_relevance(answer, query)
        completeness = self.evaluate_completeness(answer, query, ground_truth)
        safety = self.evaluate_safety(answer, query)
        
        return {
            "faithfulness": faithfulness["score"],
            "relevance": relevance["score"],
            "completeness": completeness["score"],
            "safety": safety["score"],
            "overall": (faithfulness["score"] + relevance["score"] + completeness["score"] + safety["score"]) / 4,
            "details": {
                "faithfulness": faithfulness,
                "relevance": relevance,
                "completeness": completeness,
                "safety": safety,
            }
        }

### 4.2 Testing Generation Metrics

In [None]:
# Initialize generation metrics evaluator
llm = ChatOpenAI(model="gpt-4", temperature=0)
gen_metrics = GenerationMetrics(llm)

# Test cases
test_query = "What is prompt injection?"
test_context = """Prompt injection is a type of attack where adversaries manipulate 
LLM inputs to bypass safety guidelines or alter system behavior through crafted prompts."""
test_ground_truth = """Prompt injection is when adversaries manipulate LLM inputs to 
bypass safety guidelines or alter system behavior."""

# Good answer (faithful, relevant, complete, safe)
good_answer = """Prompt injection is an attack technique where adversaries craft malicious 
inputs to manipulate LLM behavior and bypass safety guidelines."""

# Hallucinated answer (adds facts not in context)
hallucinated_answer = """Prompt injection was first discovered in 2015 by a researcher at MIT. 
It's the most common attack against LLMs, affecting 95% of all deployments."""

print("GENERATION METRICS EXAMPLES")
print("=" * 80)

print("\nExample 1: Good Answer")
print(f"Answer: {good_answer}")
result = gen_metrics.evaluate_all(good_answer, test_query, test_context, test_ground_truth)
print(f"\nScores:")
print(f"  Faithfulness: {result['faithfulness']:.3f}")
print(f"  Relevance: {result['relevance']:.3f}")
print(f"  Completeness: {result['completeness']:.3f}")
print(f"  Safety: {result['safety']:.3f}")
print(f"  Overall: {result['overall']:.3f}")

print("\n" + "=" * 80)
print("\nExample 2: Hallucinated Answer")
print(f"Answer: {hallucinated_answer}")
result = gen_metrics.evaluate_all(hallucinated_answer, test_query, test_context, test_ground_truth)
print(f"\nScores:")
print(f"  Faithfulness: {result['faithfulness']:.3f} (should be low)")
print(f"  Relevance: {result['relevance']:.3f}")
print(f"  Completeness: {result['completeness']:.3f}")
print(f"  Safety: {result['safety']:.3f}")
print(f"  Overall: {result['overall']:.3f}")
print(f"\nHallucinated claims: {result['details']['faithfulness']['hallucinated_claims']}")

## 5. Benchmarking Framework

### 5.1 RAG Configuration Wrapper

In [None]:
@dataclass
class RAGResult:
    """Result from RAG query."""
    answer: str
    retrieved_doc_ids: List[str]
    context: str
    latency_ms: float
    metadata: Dict  # Additional info (e.g., confidence, sources)


class RAGConfiguration:
    """Base class for RAG configurations to benchmark."""
    
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
    
    def query(self, question: str, k: int = 3) -> RAGResult:
        """Execute query and return result.
        
        Must be implemented by subclasses.
        """
        raise NotImplementedError


class BasicRAGConfig(RAGConfiguration):
    """Basic RAG implementation."""
    
    def __init__(self, vectorstore: Chroma, llm: ChatOpenAI):
        super().__init__(
            name="Basic RAG",
            description="Simple retrieval + generation"
        )
        self.vectorstore = vectorstore
        self.llm = llm
        
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", "Use the context to answer the question.\n\nContext:\n{context}"),
            ("user", "{question}")
        ])
        self.chain = self.prompt | self.llm | StrOutputParser()
    
    def query(self, question: str, k: int = 3) -> RAGResult:
        start_time = time.time()
        
        # Retrieve
        docs = self.vectorstore.similarity_search(question, k=k)
        doc_ids = [doc.metadata.get("id", f"doc_{i}") for i, doc in enumerate(docs)]
        context = "\n\n".join([doc.page_content for doc in docs])
        
        # Generate
        answer = self.chain.invoke({"context": context, "question": question})
        
        latency_ms = (time.time() - start_time) * 1000
        
        return RAGResult(
            answer=answer,
            retrieved_doc_ids=doc_ids,
            context=context,
            latency_ms=latency_ms,
            metadata={}
        )


# Additional configurations can be defined similarly:
# - MultiQueryRAGConfig
# - RAGFusionConfig
# - DecompositionRAGConfig
# - FilteredRAGConfig
# - RerankRAGConfig
# - RAPTORRAGConfig
# - ColBERTRAGConfig
# - HardenedRAGConfig

### 5.2 Benchmark Runner

In [None]:
class BenchmarkRunner:
    """Run benchmarks across multiple RAG configurations."""
    
    def __init__(self, test_dataset: List[TestCase], gen_metrics: GenerationMetrics):
        self.test_dataset = test_dataset
        self.gen_metrics = gen_metrics
        self.results = []
    
    def run_benchmark(self, config: RAGConfiguration, k: int = 3) -> Dict[str, any]:
        """Run benchmark for a single configuration.
        
        Returns:
            Dict with aggregated metrics and per-query results
        """
        print(f"\nBenchmarking: {config.name}")
        print(f"Description: {config.description}")
        print(f"Test cases: {len(self.test_dataset)}")
        print("-" * 80)
        
        per_query_results = []
        
        # Aggregate metrics
        retrieval_metrics = defaultdict(list)
        generation_metrics = defaultdict(list)
        latencies = []
        
        for i, test_case in enumerate(self.test_dataset, 1):
            print(f"  [{i}/{len(self.test_dataset)}] {test_case.query[:60]}...")
            
            try:
                # Execute query
                result = config.query(test_case.query, k=k)
                
                # Compute retrieval metrics
                ret_metrics = RetrievalMetrics.compute_all(
                    result.retrieved_doc_ids,
                    test_case.relevant_doc_ids,
                    k=k
                )
                
                # Compute generation metrics (expensive, so only for subset)
                # In production, you'd run this for all queries
                gen_metrics_result = None
                if i <= 3:  # Only evaluate first 3 for demo
                    gen_metrics_result = self.gen_metrics.evaluate_all(
                        result.answer,
                        test_case.query,
                        result.context,
                        test_case.ground_truth_answer
                    )
                
                # Store per-query result
                per_query_results.append({
                    "query": test_case.query,
                    "category": test_case.category,
                    "difficulty": test_case.difficulty,
                    "answer": result.answer,
                    "retrieval_metrics": ret_metrics,
                    "generation_metrics": gen_metrics_result,
                    "latency_ms": result.latency_ms,
                })
                
                # Aggregate
                for metric, value in ret_metrics.items():
                    retrieval_metrics[metric].append(value)
                
                if gen_metrics_result:
                    for metric in ["faithfulness", "relevance", "completeness", "safety", "overall"]:
                        generation_metrics[metric].append(gen_metrics_result[metric])
                
                latencies.append(result.latency_ms)
                
            except Exception as e:
                print(f"    Error: {e}")
                per_query_results.append({
                    "query": test_case.query,
                    "error": str(e),
                })
        
        # Calculate averages
        avg_retrieval = {k: np.mean(v) if v else 0.0 for k, v in retrieval_metrics.items()}
        avg_generation = {k: np.mean(v) if v else 0.0 for k, v in generation_metrics.items()}
        avg_latency = np.mean(latencies) if latencies else 0.0
        
        benchmark_result = {
            "config_name": config.name,
            "config_description": config.description,
            "num_test_cases": len(self.test_dataset),
            "avg_retrieval_metrics": avg_retrieval,
            "avg_generation_metrics": avg_generation,
            "avg_latency_ms": avg_latency,
            "per_query_results": per_query_results,
        }
        
        self.results.append(benchmark_result)
        
        return benchmark_result
    
    def compare_results(self) -> pd.DataFrame:
        """Compare results across all benchmarked configurations."""
        if not self.results:
            return pd.DataFrame()
        
        comparison_data = []
        
        for result in self.results:
            row = {"Configuration": result["config_name"]}
            
            # Retrieval metrics
            for metric, value in result["avg_retrieval_metrics"].items():
                row[metric] = value
            
            # Generation metrics
            for metric, value in result["avg_generation_metrics"].items():
                row[f"gen_{metric}"] = value
            
            # Latency
            row["latency_ms"] = result["avg_latency_ms"]
            
            comparison_data.append(row)
        
        df = pd.DataFrame(comparison_data)
        return df

### 5.3 Running Benchmarks

In [None]:
# Load vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="owasp_security",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

# Initialize configurations
basic_rag = BasicRAGConfig(vectorstore, llm)

# Initialize benchmark runner
runner = BenchmarkRunner(test_dataset, gen_metrics)

# Run benchmark
print("=" * 80)
print("RUNNING BENCHMARKS")
print("=" * 80)

basic_result = runner.run_benchmark(basic_rag, k=3)

# Display results
print("\n" + "=" * 80)
print("BENCHMARK RESULTS")
print("=" * 80)

print(f"\nConfiguration: {basic_result['config_name']}")
print(f"Description: {basic_result['config_description']}")
print(f"\nRetrieval Metrics (averaged over {basic_result['num_test_cases']} test cases):")
for metric, value in basic_result['avg_retrieval_metrics'].items():
    print(f"  {metric}: {value:.3f}")

print(f"\nGeneration Metrics (averaged over evaluated subset):")
for metric, value in basic_result['avg_generation_metrics'].items():
    print(f"  {metric}: {value:.3f}")

print(f"\nAverage Latency: {basic_result['avg_latency_ms']:.1f} ms")

## 6. Comparison Across Configurations

### 6.1 Comparison Table

In a full evaluation, you would benchmark all configurations:

```python
# Benchmark all configurations
configs = [
    BasicRAGConfig(vectorstore, llm),
    MultiQueryRAGConfig(vectorstore, llm),
    RAGFusionConfig(vectorstore, llm),
    DecompositionRAGConfig(vectorstore, llm),
    FilteredRAGConfig(vectorstore, llm),
    RerankRAGConfig(vectorstore, llm),
    RAPTORRAGConfig(raptor_retriever, llm),
    ColBERTRAGConfig(colbert_model, llm),
    HardenedRAGConfig(vectorstore, llm),
]

for config in configs:
    runner.run_benchmark(config, k=3)

# Compare results
comparison_df = runner.compare_results()
print(comparison_df.to_string())
```

### 6.2 Expected Comparison (Hypothetical Results)

Here's what you might expect from a full benchmark:

| Configuration | precision@3 | recall@3 | mrr | ndcg@3 | gen_overall | latency_ms |
|--------------|-------------|----------|-----|--------|-------------|------------|
| Basic RAG | 0.67 | 0.67 | 0.75 | 0.80 | 0.72 | 450 |
| Multi-Query | 0.78 | 0.85 | 0.82 | 0.88 | 0.78 | 1200 |
| RAG-Fusion | 0.80 | 0.87 | 0.85 | 0.90 | 0.80 | 1400 |
| Decomposition | 0.72 | 0.72 | 0.78 | 0.82 | 0.85 | 2500 |
| Filtered | 0.85 | 0.75 | 0.88 | 0.92 | 0.75 | 500 |
| Reranking | 0.82 | 0.82 | 0.90 | 0.93 | 0.82 | 800 |
| RAPTOR | 0.75 | 0.80 | 0.80 | 0.85 | 0.80 | 600 |
| ColBERT | 0.88 | 0.88 | 0.92 | 0.95 | 0.83 | 2200 |
| Hardened | 0.70 | 0.70 | 0.77 | 0.82 | 0.88 | 750 |

### 6.3 Key Insights from Hypothetical Results

**Retrieval Quality (precision, recall, NDCG):**
- **ColBERT** performs best for technical/code queries (token-level matching)
- **Filtered RAG** excels at precision (returns exactly what's needed)
- **RAG-Fusion** and **Reranking** improve ranking quality
- **Multi-Query** improves recall (finds more relevant docs)

**Generation Quality (faithfulness, completeness, safety):**
- **Hardened RAG** scores highest on safety (validation checks)
- **Decomposition** produces more complete answers (sub-question synthesis)
- **RAG-Fusion** benefits from broader context

**Latency:**
- **Basic RAG** is fastest (~450ms)
- **Filtered RAG** adds minimal overhead
- **Decomposition** and **ColBERT** are slowest (multiple passes)
- **Multi-Query** and **RAG-Fusion** have moderate overhead

**Trade-offs:**
- Precision vs Recall: Filtered (high precision) vs Multi-Query (high recall)
- Quality vs Latency: ColBERT (best quality, slow) vs Basic (fast, lower quality)
- Safety vs Speed: Hardened (most safe, moderate latency) vs Basic (fast, less safe)

**Recommendations by Use Case:**
- **Production (balanced)**: RAG-Fusion or Reranking
- **High precision needed**: Filtered RAG or ColBERT
- **Security critical**: Hardened RAG
- **Latency sensitive**: Basic RAG or Filtered RAG
- **Complex questions**: Decomposition or RAPTOR
- **Code/technical queries**: ColBERT

## 7. LangSmith Integration

### 7.1 Tracing with LangSmith

LangSmith provides automatic tracing for all LangChain operations.

```python
import os

# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "security-rag-evaluation"

# All RAG queries will now be traced automatically
result = basic_rag.query("What is prompt injection?")

# View traces in LangSmith UI:
# https://smith.langchain.com/
```

### 7.2 Dataset Management

```python
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    dataset_name="security-rag-test-set",
    description="Test cases for security RAG evaluation"
)

# Add examples
for test_case in test_dataset:
    client.create_example(
        inputs={"query": test_case.query},
        outputs={
            "ground_truth_answer": test_case.ground_truth_answer,
            "relevant_doc_ids": list(test_case.relevant_doc_ids),
        },
        dataset_id=dataset.id,
        metadata={
            "category": test_case.category,
            "difficulty": test_case.difficulty,
        }
    )
```

### 7.3 Running Evaluations

```python
from langsmith.evaluation import evaluate

# Define evaluation function
def evaluate_rag(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Custom evaluation function."""
    # Run RAG system
    result = basic_rag.query(inputs["query"])
    
    # Compute metrics
    ret_metrics = RetrievalMetrics.compute_all(
        result.retrieved_doc_ids,
        set(reference_outputs["relevant_doc_ids"]),
        k=3
    )
    
    return {
        "precision@3": ret_metrics["precision@3"],
        "recall@3": ret_metrics["recall@3"],
        "answer": result.answer,
    }

# Run evaluation
results = evaluate(
    evaluate_rag,
    data="security-rag-test-set",
    experiment_prefix="basic-rag",
)

# View results in LangSmith UI
```

## 8. Human Evaluation Protocol

### 8.1 Evaluation Rubric

For production systems, combine automated metrics with human evaluation.

#### **Evaluation Form for Human Reviewers**

**Query:** [Display query]

**Generated Answer:** [Display answer]

**Retrieved Context:** [Display context]

**Evaluate on 1-5 scale:**

1. **Faithfulness** (Is the answer grounded in the context?)
   - 5: Fully supported by context, no hallucinations
   - 4: Mostly supported, minor unsupported details
   - 3: Partially supported, some hallucinations
   - 2: Little support from context
   - 1: Completely fabricated

2. **Relevance** (Does the answer address the query?)
   - 5: Perfectly addresses the question
   - 4: Mostly relevant, minor tangents
   - 3: Partially relevant
   - 2: Barely relevant
   - 1: Off-topic

3. **Completeness** (Is the answer comprehensive?)
   - 5: Comprehensive, covers all key points
   - 4: Covers most key points
   - 3: Covers some key points
   - 2: Missing many key points
   - 1: Incomplete or vague

4. **Safety** (Is the advice secure and appropriate?)
   - 5: Perfectly safe, correct best practices
   - 4: Safe, minor imprecisions
   - 3: Mostly safe, some concerns
   - 2: Some unsafe recommendations
   - 1: Dangerous or incorrect advice

5. **Overall Quality**
   - 5: Excellent answer, would use in production
   - 4: Good answer, minor improvements needed
   - 3: Acceptable answer, some issues
   - 2: Poor answer, significant issues
   - 1: Unacceptable answer

**Comments:** [Free text field]

### 8.2 Inter-Rater Reliability

```python
def calculate_inter_rater_agreement(ratings1: List[int], ratings2: List[int]) -> Dict[str, float]:
    """Calculate inter-rater reliability metrics.
    
    Returns:
        Dict with percent_agreement and correlation
    """
    # Percent agreement (within 1 point)
    agreements = sum(1 for r1, r2 in zip(ratings1, ratings2) if abs(r1 - r2) <= 1)
    percent_agreement = agreements / len(ratings1) if ratings1 else 0.0
    
    # Pearson correlation
    correlation = np.corrcoef(ratings1, ratings2)[0, 1] if len(ratings1) > 1 else 0.0
    
    return {
        "percent_agreement": percent_agreement,
        "correlation": correlation,
    }

# Example usage
rater1_scores = [5, 4, 3, 5, 4, 3, 2, 4, 5, 3]
rater2_scores = [5, 4, 4, 4, 5, 3, 2, 4, 4, 3]

agreement = calculate_inter_rater_agreement(rater1_scores, rater2_scores)
print(f"Percent Agreement: {agreement['percent_agreement']:.2%}")
print(f"Correlation: {agreement['correlation']:.3f}")

# Good inter-rater reliability:
# - Percent agreement (within 1 point): > 80%
# - Correlation: > 0.7
```

### 8.3 Sampling Strategy

For large-scale evaluation:

1. **Automated metrics for all queries** (100% coverage)
2. **Human evaluation for sample** (10-20% coverage)
   - Stratified sampling by category and difficulty
   - Focus on edge cases and low-confidence queries
3. **Multiple raters for subset** (5% coverage)
   - Measure inter-rater reliability
   - Resolve disagreements through discussion

```python
# Stratified sampling
import random

def stratified_sample(test_dataset: List[TestCase], sample_size: int) -> List[TestCase]:
    """Sample test cases, ensuring representation of all categories and difficulties."""
    # Group by (category, difficulty)
    groups = defaultdict(list)
    for tc in test_dataset:
        groups[(tc.category, tc.difficulty)].append(tc)
    
    # Sample proportionally from each group
    samples = []
    for group_cases in groups.values():
        n = max(1, int(len(group_cases) * sample_size / len(test_dataset)))
        samples.extend(random.sample(group_cases, min(n, len(group_cases))))
    
    return samples[:sample_size]
```

## Summary

In this notebook, we've built a comprehensive evaluation framework for RAG systems:

### Retrieval Metrics

1. **Precision@K**: Relevance of retrieved documents (0.0-1.0)
2. **Recall@K**: Coverage of relevant documents (0.0-1.0)
3. **MRR**: Rank of first relevant document (0.0-1.0)
4. **NDCG@K**: Position-weighted relevance (0.0-1.0)

### Generation Metrics

1. **Faithfulness**: Answer grounded in context (no hallucination)
2. **Relevance**: Answer addresses the query
3. **Completeness**: Answer is comprehensive
4. **Safety**: No harmful/incorrect advice

### Benchmarking Framework

- **Test Dataset**: 8 test cases across categories and difficulties
- **RAG Configurations**: Pluggable architecture for different approaches
- **Benchmark Runner**: Automated evaluation across configurations
- **Comparison Framework**: Side-by-side performance analysis

### Expected Results (Hypothetical)

| Approach | Best For | Retrieval Quality | Generation Quality | Latency |
|----------|----------|-------------------|-------------------|----------|
| Basic RAG | Baseline | Medium | Medium | Fast |
| Multi-Query | Broad coverage | High recall | Good | Moderate |
| RAG-Fusion | Balanced | High precision & recall | Good | Moderate |
| Decomposition | Complex queries | Medium | Excellent | Slow |
| Filtered | Precise needs | Excellent precision | Good | Fast |
| Reranking | Prioritization | Excellent ranking | Good | Moderate |
| RAPTOR | Hierarchical | Good | Good | Moderate |
| ColBERT | Technical/code | Excellent | Good | Slow |
| Hardened | Security critical | Good | Excellent safety | Moderate |

### LangSmith Integration

- **Automatic tracing** of all LangChain operations
- **Dataset management** for test cases
- **Experiment tracking** for comparing runs
- **Visualization** of traces and metrics

### Human Evaluation

- **Evaluation rubric** (1-5 scale)
- **Inter-rater reliability** measurement
- **Stratified sampling** for efficient review
- **Hybrid approach**: Automated + human evaluation

### Key Takeaways

1. **No single "best" RAG approach** - depends on use case
2. **Trade-offs exist**: quality vs latency, precision vs recall
3. **Combine approaches**: Use different techniques for different queries
4. **Measure what matters**: Align metrics with business goals
5. **Iterate based on data**: Use evaluation to guide improvements

### Next Steps

In **Part 11: Deployment & Demo**, we'll:
- Build a Streamlit web application
- Integrate the best RAG configurations
- Add interactive features (query interface, source display, confidence indicators)
- Implement production best practices
- Create a portfolio-ready demonstration