# Retrieval-Augmented Generation (RAG) Evaluations

This notebook demonstrates how to evaluate Retrieval-Augmented Generation (RAG) systems using the Azure AI Evaluation SDK (`azure-ai-evaluation`).

## What is RAG?

A RAG system aims to generate the most relevant answer to a user's query while staying consistent with the grounding documents. At a high level:
1. The user's query triggers a search over the corpus of grounding documents.
2. The retrieved documents provide grounding context for the AI model.
3. The AI model generates a response based on that context.
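
The three steps above can be sketched with a toy pipeline. This is only an illustration: `CORPUS`, `tokenize`, `retrieve`, `build_context`, and `generate` are hypothetical names, word overlap stands in for a real search index, and `generate` is a placeholder for the model call.

```python
# Toy corpus of grounding documents (illustrative only).
CORPUS = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie was born on November 7, 1867.",
    "The Eiffel Tower is located in Paris.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Step 1: rank documents by word overlap with the query (stand-in for a search index)."""
    query_words = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_context(documents: list[str]) -> str:
    """Step 2: join the retrieved documents into one grounding context string."""
    return " ".join(documents)


def generate(query: str, context: str) -> str:
    """Step 3: placeholder for the model call; a real system would prompt an LLM here."""
    return f"Answer to '{query}' based on: {context}"


query = "Where was Marie Curie born?"
documents = retrieve(query, CORPUS)
context = build_context(documents)
response = generate(query, context)
print(response)
```

Each evaluator in this notebook targets one of these hand-offs: query to retrieved documents, context to response, or query to response.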

## RAG Evaluation Aspects

RAG evaluations address three critical aspects:

1. **Relevance of Retrieval Results to Query**
   - **Document Retrieval**: Use when you have ground truth labels (qrels) for accurate measurements
   - **Retrieval**: Use when you only have retrieved context without labels

2. **Consistency of Generated Response with Grounding Documents**
   - **Groundedness**: Customizable LLM-judge prompt for groundedness definition
   - **Groundedness Pro**: Straightforward definition powered by Azure AI Content Safety

3. **Relevance of Final Response to Query**
   - **Relevance**: Use when you don't have ground truth
   - **Response Completeness**: Use when you have ground truth and want to ensure no critical information is missed

## Key Concepts

- **Groundedness** = Precision aspect (shouldn't contain content outside grounding context)
- **Response Completeness** = Recall aspect (shouldn't miss critical information compared to ground truth)
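
To make the precision/recall analogy concrete, here is a minimal sketch that treats a response as a set of facts. This is an assumption made for illustration: the actual evaluators use an LLM judge (or Azure AI Content Safety) and return Likert scores or labels, not set arithmetic. The helper names `precision` and `recall` are hypothetical.

```python
def precision(response_facts: set[str], context_facts: set[str]) -> float:
    """Fraction of response facts that also appear in the grounding context."""
    if not response_facts:
        return 1.0  # convention for this sketch: an empty response adds nothing ungrounded
    return len(response_facts & context_facts) / len(response_facts)


def recall(response_facts: set[str], ground_truth_facts: set[str]) -> float:
    """Fraction of ground-truth facts that the response covers."""
    if not ground_truth_facts:
        return 1.0  # convention for this sketch: nothing was expected, so nothing is missing
    return len(response_facts & ground_truth_facts) / len(ground_truth_facts)


context_facts = {"born in Warsaw", "born on November 7, 1867"}
ground_truth_facts = {"born in Warsaw", "born on November 7, 1867"}

# Grounded but incomplete: everything it says is supported, yet one fact is missing.
incomplete = {"born in Warsaw"}
# Complete but ungrounded: covers the ground truth and adds an unsupported claim.
ungrounded = {"born in Warsaw", "born on November 7, 1867", "born in Paris"}

print(precision(incomplete, context_facts), recall(incomplete, ground_truth_facts))
print(precision(ungrounded, context_facts), recall(ungrounded, ground_truth_facts))
```

The first response would score well on groundedness but poorly on completeness; the second shows the opposite pattern, which is why the two evaluators complement each other.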

## Table of Contents

1. [Environment Setup](#environment-setup)
2. [Model Configuration](#model-configuration)
3. [Retrieval Evaluators](#retrieval-evaluators)
   - 3.1: Retrieval Evaluator
   - 3.2: Document Retrieval Evaluator
4. [Groundedness Evaluators](#groundedness-evaluators)
   - 4.1: Groundedness Evaluator
   - 4.2: Groundedness Pro Evaluator
5. [Response Quality Evaluators](#response-quality-evaluators)
   - 5.1: Relevance Evaluator
   - 5.2: Response Completeness Evaluator
6. [Complete RAG Evaluation Example](#complete-rag-evaluation-example)
7. [Summary and Best Practices](#summary-and-best-practices)

## Environment Setup

Load environment variables and import necessary libraries.

In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# Verify that the kernel can now locate the Azure CLI
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

In [None]:
import sys
from pathlib import Path
from dotenv import load_dotenv

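# Add the parent directory's "utils" folder to the import path.
# __file__ is undefined in a notebook kernel, so fall back to the parent of the working directory.
parent_dir = Path(__file__).resolve().parent.parent if "__file__" in globals() else Path.cwd().parent
sys.path.insert(0, str(parent_dir / "utils"))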

# Load environment variables
agent_ops_dir = Path.cwd().parent if Path.cwd().name == "05_evaluation" else Path.cwd()
env_path = agent_ops_dir / ".env"
load_dotenv(env_path)

print("✅ Environment loaded successfully")

## Model Configuration

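Configure the LLM-judge model for AI-assisted evaluators. The Retrieval, Groundedness, Relevance, and Response Completeness evaluators use this configuration. Groundedness Pro calls Azure AI Content Safety instead, and Document Retrieval computes its metrics directly from labels, so neither needs a judge model.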

### Supported Models

| Evaluators | Reasoning Models (o-series) | Non-reasoning Models (GPT-4/4o) | Enable Reasoning |
|------------|----------------------------|--------------------------------|------------------|
| Retrieval, Groundedness, Relevance, Response Completeness | ✅ Supported | ✅ Supported | `is_reasoning_model=True` |
| Groundedness Pro | ❌ Not Supported | ✅ Supported | N/A (uses Azure AI Content Safety) |

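**Recommendation**: For complex evaluations, use a model with strong reasoning capability, such as `gpt-4.1-mini`, for a balance of performance and cost. Set `is_reasoning_model=True` only when the judge deployment is a reasoning model (o-series in the table above).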

In [None]:
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_GPT_4o"],
    api_key=os.environ["AZURE_OPENAI_API_KEY_GPT_4o"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION_GPT_4o"],
    azure_deployment=os.environ["AZURE_OPENAI_MODEl_GPT_4o"],
)

print("✅ Model configuration created successfully")
print(f"   Endpoint: {model_config['azure_endpoint']}")
print(f"   Deployment: {model_config['azure_deployment']}")

## Retrieval Evaluators

Retrieval quality is upstream in RAG and critical to final response quality. Poor retrieval results in poor final responses.

### When to Use Which Evaluator?

- **Retrieval**: Textual quality measurement without ground truth (LLM-based)
- **Document Retrieval**: Classical IR metrics (NDCG, XDCG, Fidelity) with ground truth labels

### 3.1: Retrieval Evaluator

**Purpose**: Measures the textual quality of retrieval results with an LLM judge, without requiring ground truth.

**Key Features**:
- No ground truth required (unlike Document Retrieval)
- Evaluates relevance of context chunks to query
- Assesses if most relevant chunks are at the top
- Context chunks encoded as strings

**Output**: Likert scale score (1-5, higher is better)
- Score >= threshold → pass
- Score < threshold → fail
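
The pass/fail rule above amounts to a single comparison. The sketch below spells it out; `likert_to_result` is a hypothetical helper written for illustration and is not part of the SDK, which reports the result itself in keys such as `retrieval_result`.

```python
def likert_to_result(score: int, threshold: int = 3) -> str:
    """Map a 1-5 Likert score to 'pass' or 'fail' using an inclusive threshold."""
    if not 1 <= score <= 5:
        raise ValueError(f"Likert score must be between 1 and 5, got {score}")
    return "pass" if score >= threshold else "fail"


for score in range(1, 6):
    print(score, likert_to_result(score, threshold=3))
```

Because the comparison is inclusive, a score equal to the threshold passes; raising the threshold to 4 would turn a score of 3 into a fail.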

**Use Case**: Quick quality check of retrieval without needing labeled ground truth.

In [None]:
from azure.ai.evaluation import RetrievalEvaluator
import json

retrieval = RetrievalEvaluator(model_config=model_config, threshold=3)

result = retrieval(
    query="Where was Marie Curie born?",
    context="Background: 1. Marie Curie was born in Warsaw. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist. ",
)

print("=" * 80)
print("RETRIEVAL EVALUATION RESULT")
print("=" * 80)
print(json.dumps(result, indent=2))
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
print(f"Score: {result.get('retrieval', 'N/A')}/5")
print(f"Result: {result.get('retrieval_result', 'N/A')}")
print(f"\nReason: {result.get('retrieval_reason', 'N/A')}")

### 3.2: Document Retrieval Evaluator

**Purpose**: Measures retrieval quality using classical IR metrics with ground truth labels (qrels).

**Computed Metrics**:

| Metric | Category | Description |
|--------|----------|-------------|
| **Fidelity** | Search Fidelity | Good documents returned / Total known good documents |
| **NDCG** | Search NDCG | Quality of rankings vs ideal order |
| **XDCG** | Search XDCG | Quality of the top-k results, regardless of how other documents in the index score |
| **Max Relevance N** | Search Max Relevance | Maximum relevance in top-k chunks |
| **Holes** | Search Label Sanity | Number of retrieved documents with missing query relevance judgments |

**Use Case**: Parameter sweep optimization - test various search parameters (algorithms, top_k, chunk sizes) to find optimal RAG configuration.

**Requirements**:
- Ground truth labels (query relevance judgments)
- Label score min/max range
- Retrieved documents with relevance scores
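
To build intuition for two of these metrics, the sketch below computes Holes and NDCG@3 by hand for the sample data used in this section. It assumes the textbook NDCG formulation (linear gain, log2 discount) and treats unlabeled documents as relevance 0; the SDK's exact gain and discount functions may differ, so the numbers need not match its output. The helper names `count_holes`, `dcg`, and `ndcg_at_k` are hypothetical.

```python
import math

# Ground truth labels (document_id -> query_relevance_label) and the ranked retrieval result.
ground_truth_labels = {"1": 4, "2": 2, "3": 3, "4": 1, "5": 0}
retrieved_ids = ["2", "6", "3", "5", "7"]


def count_holes(retrieved: list[str], labels: dict[str, int]) -> int:
    """Count retrieved documents that have no relevance judgment."""
    return sum(1 for doc_id in retrieved if doc_id not in labels)


def dcg(relevances: list[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(retrieved: list[str], labels: dict[str, int], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal top-k ordering."""
    gains = [labels.get(doc_id, 0) for doc_id in retrieved[:k]]  # unlabeled -> 0 (assumption)
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


holes = count_holes(retrieved_ids, ground_truth_labels)
ndcg3 = ndcg_at_k(retrieved_ids, ground_truth_labels, k=3)
print(f"Holes: {holes}")
print(f"NDCG@3 (textbook formulation): {ndcg3:.4f}")
```

Documents "6" and "7" were retrieved but never judged, so they count as holes; a high hole count suggests the labels should be extended before trusting the ranking metrics.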

In [None]:
from azure.ai.evaluation import DocumentRetrievalEvaluator

# Ground truth: Query relevance labels from human or LLM judges
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
    {"document_id": "3", "query_relevance_label": 3},
    {"document_id": "4", "query_relevance_label": 1},
    {"document_id": "5", "query_relevance_label": 0},
]

# Label score range
ground_truth_label_min = 0
ground_truth_label_max = 4

# Retrieved documents from search system
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "6", "relevance_score": 35.8},
    {"document_id": "3", "relevance_score": 29.2},
    {"document_id": "5", "relevance_score": 25.4},
    {"document_id": "7", "relevance_score": 18.8},
]

document_retrieval_evaluator = DocumentRetrievalEvaluator(
    ground_truth_label_min=ground_truth_label_min,
    ground_truth_label_max=ground_truth_label_max,
    # Optional: Override thresholds for pass/fail
    ndcg_threshold=0.5,
    xdcg_threshold=50.0,
    fidelity_threshold=0.5,
    top1_relevance_threshold=50.0,
    top3_max_relevance_threshold=50.0,
)

result = document_retrieval_evaluator(
    retrieval_ground_truth=retrieval_ground_truth,
    retrieved_documents=retrieved_documents
)

print("=" * 80)
print("DOCUMENT RETRIEVAL EVALUATION RESULT")
print("=" * 80)
print(json.dumps(result, indent=2, default=str))

print("\n" + "=" * 80)
print("KEY METRICS SUMMARY")
print("=" * 80)
def fmt(value):
    """Format numeric metrics to four decimals; pass non-numeric values (e.g. 'N/A') through."""
    return f"{value:.4f}" if isinstance(value, (int, float)) else str(value)

# Formatting 'N/A' with ':.4f' would raise ValueError, so numeric formatting goes through fmt().
print(f"NDCG@3: {fmt(result.get('ndcg@3', 'N/A'))} ({result.get('ndcg@3_result', 'N/A')})")
print(f"XDCG@3: {fmt(result.get('xdcg@3', 'N/A'))} ({result.get('xdcg@3_result', 'N/A')})")
print(f"Fidelity: {fmt(result.get('fidelity', 'N/A'))} ({result.get('fidelity_result', 'N/A')})")
print(f"Top-1 Relevance: {result.get('top1_relevance', 'N/A')} ({result.get('top1_relevance_result', 'N/A')})")
print(f"Top-3 Max Relevance: {result.get('top3_max_relevance', 'N/A')} ({result.get('top3_max_relevance_result', 'N/A')})")
print(f"Holes: {result.get('holes', 'N/A')} (lower is better)")

## Groundedness Evaluators

Groundedness evaluates how well the generated response aligns with grounding context, ensuring the model doesn't fabricate content.

**Two Options**:
- **Groundedness**: Customizable LLM-judge with open-source prompt
- **Groundedness Pro**: Azure AI Content Safety-powered, straightforward definition

### 4.1: Groundedness Evaluator

**Purpose**: Measures how well the generated response aligns with given context (grounding source).

**Key Features**:
- Customizable LLM-judge prompt
- Captures **precision** aspect (doesn't fabricate beyond context)
- Complementary to Response Completeness (recall aspect)

**Output**: Likert scale score (1-5, higher is better)
- Lower score = irrelevant to query or fabricated content
- Higher score = well-grounded in context

**Use Case**: Ensure AI doesn't hallucinate or add information not present in grounding documents.

In [None]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config, threshold=3)

result = groundedness(
    query="Was Marie Curie born in Paris?",
    context="Background: 1. Marie Curie was born on November 7, 1867. 2. Marie Curie was born in Warsaw.",
    response="No, Marie Curie was born in Warsaw."
)

print("=" * 80)
print("GROUNDEDNESS EVALUATION RESULT")
print("=" * 80)
print(json.dumps(result, indent=2))
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
print(f"Score: {result.get('groundedness', 'N/A')}/5")
print(f"Result: {result.get('groundedness_result', 'N/A')}")
print(f"\nReason: {result.get('groundedness_reason', 'N/A')}")

### 4.2: Groundedness Pro Evaluator

**Purpose**: Detects if generated text is consistent with given context, powered by Azure AI Content Safety.

**Key Features**:
- Binary label output (True/False)
- Straightforward definition
- Avoids speculation or fabrication
- Enterprise-grade safety checks

**Output**: Boolean score
- `True` = All content grounded in context
- `False` = Contains ungrounded content

**Requirements**: Azure AI Project credentials (not just OpenAI model config)

**Use Case**: Production RAG systems requiring strict groundedness validation.

In [None]:
from azure.ai.evaluation import GroundednessProEvaluator
from azure.identity import DefaultAzureCredential

# Use Azure AI Project endpoint
azure_ai_project = os.environ["AZURE_AI_PROJECT_ENDPOINT"]

groundedness_pro = GroundednessProEvaluator(
    azure_ai_project=azure_ai_project, 
    credential=DefaultAzureCredential()
)

result = groundedness_pro(
    query="Was Marie Curie born in Paris?",
    context="Background: 1. Marie Curie was born on November 7, 1867. 2. Marie Curie was born in Warsaw.",
    response="No, Marie Curie was born in Warsaw."
)

print("=" * 80)
print("GROUNDEDNESS PRO EVALUATION RESULT")
print("=" * 80)
print(json.dumps(result, indent=2))
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
print(f"Label: {result.get('groundedness_pro_label', 'N/A')}")
print(f"Reason: {result.get('groundedness_pro_reason', 'N/A')}")

## Response Quality Evaluators

Evaluate the final response quality in relation to the query and expected output.

### 5.1: Relevance Evaluator

**Purpose**: Measures how effectively a response addresses a query (without ground truth).

**Key Features**:
- Assesses accuracy, completeness, and direct relevance
- No ground truth required
- Evaluates final response quality

**Output**: Likert scale score (1-5, higher is better)
- Higher score = better relevance to query

**Use Case**: Ensure AI generates relevant responses even when you don't have expected answers.

In [None]:
from azure.ai.evaluation import RelevanceEvaluator

relevance = RelevanceEvaluator(model_config=model_config, threshold=3)

result = relevance(
    query="Was Marie Curie born in Paris?",
    response="No, Marie Curie was born in Warsaw."
)

print("=" * 80)
print("RELEVANCE EVALUATION RESULT")
print("=" * 80)
print(json.dumps(result, indent=2))
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
print(f"Score: {result.get('relevance', 'N/A')}/5")
print(f"Result: {result.get('relevance_result', 'N/A')}")
print(f"\nReason: {result.get('relevance_reason', 'N/A')}")

### 5.2: Response Completeness Evaluator

**Purpose**: Measures if response captures all critical information from ground truth (recall aspect).

**Key Features**:
- Requires ground truth expected response
- Captures **recall** aspect (completeness)
- Complementary to Groundedness (precision aspect)
- Detects missing critical information

**Output**: Likert scale score (1-5, higher is better)
- Lower score = missing critical information
- Higher score = complete coverage of expected content

**Use Case**: Ensure AI responses don't miss important information when you have ground truth.

In [None]:
from azure.ai.evaluation import ResponseCompletenessEvaluator

response_completeness = ResponseCompletenessEvaluator(
    model_config=model_config, threshold=3)

result = response_completeness(
    response="Based on the retrieved documents, the shareholder meeting discussed the operational efficiency of the company and financing options.",
    ground_truth="The shareholder meeting discussed the compensation package of the company CEO."
)

print("=" * 80)
print("RESPONSE COMPLETENESS EVALUATION RESULT")
print("=" * 80)
print(json.dumps(result, indent=2))
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
print(f"Score: {result.get('response_completeness', 'N/A')}/5")
print(f"Result: {result.get('response_completeness_result', 'N/A')}")
print(f"\nReason: {result.get('response_completeness_reason', 'N/A')}")

## Complete RAG Evaluation Example

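This example runs a comprehensive RAG evaluation on a single scenario. It reuses the `retrieval`, `groundedness`, `relevance`, and `response_completeness` evaluators created in the earlier cells, so run those cells first.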

In [None]:
# Sample RAG scenario
query = "What are the main causes of climate change?"

context = """Climate change is primarily caused by human activities that increase greenhouse gas emissions. 
The main causes include: 1) Burning fossil fuels for energy and transportation, which releases carbon dioxide. 
2) Deforestation, which reduces carbon absorption by trees. 3) Industrial processes and agriculture, 
which emit methane and other greenhouse gases. 4) Changes in land use that affect carbon storage."""

response = """The main causes of climate change are burning fossil fuels, deforestation, and industrial activities. 
These human activities increase greenhouse gas emissions like carbon dioxide and methane in the atmosphere."""

ground_truth = """Climate change is mainly caused by burning fossil fuels, deforestation, industrial processes, 
and agriculture, which all increase greenhouse gas emissions."""

print("=" * 80)
print("COMPREHENSIVE RAG EVALUATION")
print("=" * 80)
print(f"\nQuery: {query}")
print(f"\nContext (first 100 chars): {context[:100]}...")
print(f"\nResponse: {response}")
print(f"\nGround Truth: {ground_truth}")

# Evaluate retrieval quality
print("\n" + "=" * 80)
print("1. RETRIEVAL QUALITY")
print("=" * 80)
retrieval_result = retrieval(query=query, context=context)
print(f"Retrieval Score: {retrieval_result.get('retrieval', 'N/A')}/5 ({retrieval_result.get('retrieval_result', 'N/A')})")
print(f"Reason: {retrieval_result.get('retrieval_reason', 'N/A')[:150]}...")

# Evaluate groundedness
print("\n" + "=" * 80)
print("2. GROUNDEDNESS (Precision)")
print("=" * 80)
groundedness_result = groundedness(query=query, context=context, response=response)
print(f"Groundedness Score: {groundedness_result.get('groundedness', 'N/A')}/5 ({groundedness_result.get('groundedness_result', 'N/A')})")
print(f"Reason: {groundedness_result.get('groundedness_reason', 'N/A')[:150]}...")

# Evaluate relevance
print("\n" + "=" * 80)
print("3. RELEVANCE")
print("=" * 80)
relevance_result = relevance(query=query, response=response)
print(f"Relevance Score: {relevance_result.get('relevance', 'N/A')}/5 ({relevance_result.get('relevance_result', 'N/A')})")
print(f"Reason: {relevance_result.get('relevance_reason', 'N/A')[:150]}...")

# Evaluate response completeness
print("\n" + "=" * 80)
print("4. RESPONSE COMPLETENESS (Recall)")
print("=" * 80)
completeness_result = response_completeness(response=response, ground_truth=ground_truth)
print(f"Completeness Score: {completeness_result.get('response_completeness', 'N/A')}/5 ({completeness_result.get('response_completeness_result', 'N/A')})")
print(f"Reason: {completeness_result.get('response_completeness_reason', 'N/A')[:150]}...")

# Overall assessment
print("\n" + "=" * 80)
print("OVERALL RAG QUALITY ASSESSMENT")
print("=" * 80)
all_passed = all([
    retrieval_result.get('retrieval_result') == 'pass',
    groundedness_result.get('groundedness_result') == 'pass',
    relevance_result.get('relevance_result') == 'pass',
    completeness_result.get('response_completeness_result') == 'pass'
])

if all_passed:
    print("✅ All evaluations PASSED - RAG system performing well!")
else:
    print("⚠️  Some evaluations FAILED - Review individual metrics for improvements")
    
print("\nMetric Summary:")
print(f"  Retrieval: {retrieval_result.get('retrieval_result', 'N/A')}")
print(f"  Groundedness: {groundedness_result.get('groundedness_result', 'N/A')}")
print(f"  Relevance: {relevance_result.get('relevance_result', 'N/A')}")
print(f"  Completeness: {completeness_result.get('response_completeness_result', 'N/A')}")

## Summary and Best Practices

### Evaluation Strategy Decision Tree

```
┌─────────────────────────────────────────┐
│ Evaluating RAG System                   │
└────────────┬────────────────────────────┘
             │
             ├─► Retrieval Quality?
             │   ├─► Have ground truth labels? → Document Retrieval
             │   └─► No labels? → Retrieval
             │
             ├─► Response Consistency?
             │   ├─► Need customization? → Groundedness
             │   └─► Need enterprise safety? → Groundedness Pro
             │
             └─► Response Quality?
                 ├─► Have ground truth? → Response Completeness
                 └─► No ground truth? → Relevance
```
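
The decision tree can also be written as a small helper. This is a sketch: `recommend_evaluators` is a hypothetical function, and it collapses the "customization vs. enterprise safety" branch into a single boolean, which is a simplification of the choice shown above.

```python
def recommend_evaluators(
    has_retrieval_labels: bool,
    needs_custom_prompt: bool,
    has_response_ground_truth: bool,
) -> list[str]:
    """Pick one evaluator per RAG aspect, following the decision tree."""
    retrieval = "Document Retrieval" if has_retrieval_labels else "Retrieval"
    consistency = "Groundedness" if needs_custom_prompt else "Groundedness Pro"
    quality = "Response Completeness" if has_response_ground_truth else "Relevance"
    return [retrieval, consistency, quality]


# No labels or ground truth, default groundedness definition.
print(recommend_evaluators(False, False, False))
# Labeled development set with a customized groundedness prompt.
print(recommend_evaluators(True, True, True))
```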

### Key Takeaways

1. **Comprehensive Evaluation**: Use multiple evaluators together for complete RAG assessment
   - Retrieval quality (upstream)
   - Groundedness (precision)
   - Response completeness (recall)
   - Relevance (final quality)

2. **Ground Truth Trade-offs**:
   - **With Ground Truth**: Document Retrieval, Response Completeness (more accurate)
   - **Without Ground Truth**: Retrieval, Relevance (more flexible)

3. **Precision vs Recall**:
   - **Groundedness** = Precision (doesn't add false info)
   - **Response Completeness** = Recall (doesn't miss critical info)

4. **Model Selection**:
   - Use models with strong reasoning capability (o-series reasoning models, or GPT-4.1-mini) for complex evaluations
   - Set `is_reasoning_model=True` only when the judge is an o-series reasoning model
   - Balance performance and cost

5. **Threshold Tuning**:
   - Default threshold = 3 (fair)
   - Adjust based on quality requirements
   - Higher threshold = stricter quality standards

6. **Parameter Sweep for Optimization**:
   - Use Document Retrieval metrics (NDCG, XDCG, Fidelity)
   - Test different: search algorithms, top_k values, chunk sizes
   - Find optimal configuration for your use case
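
A parameter sweep can be sketched as a loop over candidate settings that scores each one against the labels. The example below sweeps only `top_k` over a fixed ranking, using the sample data from the Document Retrieval section. `recall_at_k` is a hypothetical, simplified recall-style stand-in (documents with a label of at least 1 count as "good"); it is not the SDK's Fidelity metric. A real sweep would rerun the search for each configuration and score it with `DocumentRetrievalEvaluator`.

```python
ground_truth_labels = {"1": 4, "2": 2, "3": 3, "4": 1, "5": 0}
ranked_ids = ["2", "6", "3", "5", "7"]


def recall_at_k(ranked: list[str], labels: dict[str, int], k: int, min_label: int = 1) -> float:
    """Fraction of known-good documents (label >= min_label) found in the top-k results."""
    good = {doc_id for doc_id, label in labels.items() if label >= min_label}
    if not good:
        return 0.0
    return len(good & set(ranked[:k])) / len(good)


# Score each candidate top_k value, then keep the smallest one with the best score.
sweep = {k: recall_at_k(ranked_ids, ground_truth_labels, k) for k in (1, 3, 5)}
best_k = max(sweep, key=sweep.get)  # max() returns the first maximal key in insertion order

for k, score in sweep.items():
    print(f"top_k={k}: recall={score:.2f}")
print(f"Selected top_k: {best_k}")
```

Here increasing `top_k` from 3 to 5 adds no further good documents, so the smaller value is preferred; the same pattern extends to sweeping search algorithms or chunk sizes.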

### Common Patterns

| Scenario | Recommended Evaluators |
|----------|------------------------|
| Production RAG without ground truth | Retrieval + Groundedness Pro + Relevance |
| RAG development with ground truth | Document Retrieval + Groundedness + Response Completeness |
| Parameter sweep optimization | Document Retrieval (multiple configurations) |
| Quick quality check | Retrieval + Groundedness + Relevance |

### Additional Resources

- [Azure AI Evaluation Documentation](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-sdk)
- [RAG Evaluation Best Practices](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-rag)
- [Azure AI Foundry Studio](https://ai.azure.com)