# Local Agent Evaluation with Azure AI Evaluation SDK

## Introduction

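This notebook demonstrates how to perform **local evaluation** of Azure AI agents using the Azure AI Evaluation SDK. In local evaluation, the evaluation loop runs on your development machine and the results stay on local disk, which gives fast feedback during iterative agent development. The evaluators are LLM-judged, so each one still calls your Azure OpenAI deployment.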

### What is Local Agent Evaluation?

Local evaluation allows you to:
- **Test agent responses** against quality metrics (Relevance, Coherence, Fluency)
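- **Run evaluations locally** without uploading a dataset to a cloud evaluation service (the LLM-judged evaluators still send each query and response to your Azure OpenAI deployment)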
- **Iterate quickly** with immediate feedback on prompt/configuration changes
- **Analyze results** programmatically with full control over the evaluation pipeline
- **Save evaluation history** for tracking improvements over time

### Evaluation Workflow

```
1. Define Test Prompts
   ↓
2. Collect Agent Responses (via Azure AI Agent API)
   ↓
3. Initialize Evaluators (Relevance, Coherence, Fluency)
   ↓
4. Run Evaluations Locally
   ↓
5. Analyze Results & Calculate Metrics
   ↓
6. Save Results for Historical Tracking
   ↓
7. Iterate on Agent Configuration
```

### Quality Metrics Explained

**Relevance Evaluator:**
- **Purpose**: Measures whether the response appropriately addresses the query
- **Input**: Query + Response
- **Output**: Score 1-5 (1=irrelevant, 5=highly relevant)
- **Use Case**: Ensure agent stays on topic and answers the question

**Coherence Evaluator:**
- **Purpose**: Measures logical flow and consistency of the response
- **Input**: Query + Response
- **Output**: Score 1-5 (1=incoherent, 5=highly coherent)
- **Use Case**: Detect rambling, contradictions, or disjointed reasoning

**Fluency Evaluator:**
- **Purpose**: Measures language quality, grammar, and readability
- **Input**: Response (no query is needed to judge language quality)
- **Output**: Score 1-5 (1=poor language, 5=excellent fluency)
- **Use Case**: Ensure professional, well-written responses

### When to Use Local vs. Cloud Evaluation

**Use Local Evaluation (This Notebook) When:**
- Rapid prototyping and iterative development
- Small test sets (< 50 queries)
- Quick feedback loops during development
- Debugging specific agent behaviors
- Keeping datasets and results on local disk (an Azure OpenAI connection is still required for the evaluators)

**Use Cloud Evaluation (`05_cloud_based_evaluation.ipynb`) When:**
- Large datasets (100+ samples)
- Team collaboration and shared results
- Production quality gates in CI/CD pipelines
- Centralized governance and audit trails
- Historical comparison across multiple runs

### Prerequisites

- Azure AI Project with deployed agent
- Azure OpenAI deployment (GPT-4/GPT-4o for evaluation)
- Azure credentials configured (`az login`)
- Environment variables set in `.env` file
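
A quick pre-flight check can catch missing settings before any client is created. The sketch below is a minimal, hypothetical helper (`find_missing_settings` is not part of any SDK); the variable names are the ones listed under "Required Environment Variables" in section 1.2.

```python
import os
from typing import Iterable, List, Mapping

# Names taken from the "Required Environment Variables" list in section 1.2.
REQUIRED_SETTINGS = (
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_OPENAI_API_KEY_GPT_4o",
    "AZURE_OPENAI_ENDPOINT_GPT_4o",
)


def find_missing_settings(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the required names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Run it after `load_dotenv` so that values from the `.env` file are visible in `os.environ`.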

## Table of Contents

1. [Part 1: Environment Setup](#part-1-environment-setup)
   - 1.1: Install Dependencies
   - 1.2: Configure Environment
2. [Part 2: Define Test Prompts](#part-2-define-test-prompts)
3. [Part 3: Collect Agent Responses](#part-3-collect-agent-responses)
4. [Part 4: Initialize Evaluators](#part-4-initialize-evaluators)
5. [Part 5: Run Evaluations](#part-5-run-evaluations)
6. [Part 6: Analyze Results](#part-6-analyze-results)
7. [Part 7: Save Evaluation History](#part-7-save-evaluation-history)
8. [Summary and Best Practices](#summary-and-best-practices)

---

## Part 1: Environment Setup

### 1.1: Install Dependencies

Install the Azure AI Evaluation SDK for local evaluation.

In [None]:
%pip install azure-ai-evaluation==1.13.5 -qU

In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# Verify that the kernel can now find the Azure CLI
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

### 1.2: Configure Environment

Set up Azure AI Project client and load configuration from environment variables.

**Required Environment Variables:**
- `AZURE_AI_PROJECT_ENDPOINT`: Your Azure AI Foundry project endpoint
- `AZURE_OPENAI_API_KEY_GPT_4o`: API key for the evaluation model
- `AZURE_OPENAI_ENDPOINT_GPT_4o`: Azure OpenAI endpoint
- `AZURE_OPENAI_MODEl_GPT_4o`: Deployment name (e.g., gpt-4o-mini)
- `TARGET_AGENT_ID`: The agent ID to evaluate (optional, can be set in code)

**Authentication:**
- Use `az login` for DefaultAzureCredential
- Ensure you have access to the Azure AI Project and OpenAI resources

In [None]:
import json
import logging
import os
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List

from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential

from azure.ai.projects import AIProjectClient
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator

# Add parent directory to path for agent_utils import
# __file__ is defined when this code runs as a script, but not inside a notebook kernel
parent_dir = Path(__file__).parent.parent if "__file__" in globals() else Path.cwd().parent
sys.path.insert(0, str(parent_dir / "utils"))

from agent_utils import AgentManager

# Load environment variables from parent directory
agent_ops_dir = Path.cwd().parent if Path.cwd().name == "05_evaluation" else Path.cwd()
env_path = agent_ops_dir / ".env"
load_dotenv(env_path)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("agent_eval")

# Suppress verbose Azure SDK and HTTP logging
logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.WARNING)
logging.getLogger("azure.identity").setLevel(logging.WARNING)
logging.getLogger("azure.cosmos._cosmos_http_logging_policy").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING) 
logging.getLogger("openai").setLevel(logging.WARNING)  

# Initialize Azure AI Project Client with endpoint
endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
if not endpoint:
    raise ValueError("Set AZURE_AI_PROJECT_ENDPOINT in .env file")

credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=endpoint, credential=credential)
agent_manager = AgentManager(project_client)
logger.info("✅ Connected to Azure AI project")

# Get Azure OpenAI configuration from .env
model_api_key = os.getenv("AZURE_OPENAI_API_KEY_GPT_4o")
if not model_api_key:
    raise ValueError("Set AZURE_OPENAI_API_KEY_GPT_4o in .env file")

model_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT_GPT_4o")
if not model_endpoint:
    raise ValueError("Set AZURE_OPENAI_ENDPOINT_GPT_4o in .env file")

deployment_name = os.getenv("AZURE_OPENAI_MODEl_GPT_4o", "gpt-4o")

logger.info(f"✅ Evaluation will use deployment '{deployment_name}' at endpoint '{model_endpoint}'")

---

## Part 2: Define Test Prompts

Create a set of test prompts that represent typical user queries for your agent.

**Best Practices:**
- Include diverse question types (factual, explanatory, comparative)
- Cover core agent capabilities
- Add edge cases and challenging queries
- Start with 3-10 prompts for quick iteration
- Expand to 20-50 prompts for comprehensive evaluation
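
One way to keep a growing prompt set organized is to group prompts by question type and flatten them just before the run. This is a minimal sketch: the category names and the edge-case prompt are hypothetical, and `flatten_prompts` is an illustrative helper, not an SDK function.

```python
from typing import Dict, List

# Hypothetical categories; replace the prompts with queries for your own agent.
PROMPTS_BY_CATEGORY: Dict[str, List[str]] = {
    "factual": ["Summarize the Azure AI Foundry service in two sentences."],
    "explanatory": [
        "Explain how prompt caching can improve latency for frequently repeated questions."
    ],
    "edge_case": ["Answer in exactly one word: what is an agent?"],
}


def flatten_prompts(groups: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn the grouped prompts into rows that remember their category."""
    rows = []
    for category, prompts in groups.items():
        for prompt in prompts:
            if prompt.strip():  # skip blank placeholders
                rows.append({"category": category, "query": prompt})
    return rows


test_rows = flatten_prompts(PROMPTS_BY_CATEGORY)
print(f"{len(test_rows)} prompts across {len(PROMPTS_BY_CATEGORY)} categories")
```

Keeping the category on each row makes it possible to average scores per question type later, which helps when only one kind of query regresses.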

---

## Part 3: Collect Agent Responses

Run the target agent on each test prompt and collect responses for evaluation.

**Process:**
1. Create a new thread for each query (isolated context)
2. Send prompt to the agent
3. Capture the response text
4. Store query-response pairs for evaluation
5. Clean up threads after completion

In [None]:
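# Prefer the TARGET_AGENT_ID environment variable; the literal is only a fallback
TARGET_AGENT_ID = os.getenv("TARGET_AGENT_ID", "asst_3pPWPYFexU3fEwbYB3VDWO1N")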

PROMPTS: List[str] = [
    "Summarize the Azure AI Foundry service in two sentences.",
    "List three responsible AI considerations when deploying an agent to production.",
    "Explain how prompt caching can improve latency for frequently repeated questions.",
]

evaluation_rows: List[Dict[str, Any]] = []

for prompt in PROMPTS:
    thread = agent_manager.create_thread()
    try:
        response_text = agent_manager.run_agent_simple(
            thread_id=thread.id,
            agent_id=TARGET_AGENT_ID,
            user_message=prompt,
            verbose=False,
        )
        if not response_text:
            logger.warning("Agent returned an empty response for prompt: %s", prompt)
            continue
        evaluation_rows.append({"query": prompt, "response": response_text})
        logger.info("Captured response for prompt: %s", prompt)
    except Exception as exc:
        logger.exception("Unable to capture response for prompt '%s': %s", prompt, exc)
    finally:
        agent_manager.delete_thread(thread.id, silent=True)

if not evaluation_rows:
    raise RuntimeError("No agent responses captured; aborting evaluation.")

logger.info("Collected %d agent responses for evaluation.", len(evaluation_rows))

In [None]:
from IPython.display import Markdown, display

# Display as Markdown table with full text (no truncation)
markdown_output = "# Collected Agent Responses\n\n"
markdown_output += "| # | Query | Response |\n"
markdown_output += "|---|-------|----------|\n"

for i, row in enumerate(evaluation_rows, 1):
    query = row['query'].replace('|', '\\|').replace('\n', '<br>')
    response = row['response'].replace('|', '\\|').replace('\n', '<br>')
    markdown_output += f"| {i} | {query} | {response} |\n"

display(Markdown(markdown_output))

### 3.1: Save Dataset to JSONL

Persist the collected responses to JSONL format for reproducibility and historical tracking.

**JSONL Format:**
- One JSON object per line
- Each object contains `query` and `response` fields
- Timestamped filename for versioning

In [None]:
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
# Use absolute path relative to notebook location
notebook_dir = Path.cwd() if Path.cwd().name == "05_evaluation" else Path.cwd() / "05_evaluation"
dataset_dir = notebook_dir / "data"
dataset_dir.mkdir(parents=True, exist_ok=True)
dataset_path = dataset_dir / f"eval_{timestamp}.jsonl"

with dataset_path.open("w", encoding="utf-8") as handle:
    for row in evaluation_rows:
        handle.write(json.dumps(row, ensure_ascii=True) + "\n")

logger.info(f"✅ Wrote evaluation dataset to {dataset_path}")
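
To rerun the evaluators on a saved dataset without calling the agent again, the file can be read back line by line. The round trip below is a self-contained sketch that uses a temporary directory and made-up rows; `write_jsonl` and `read_jsonl` are illustrative helpers.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_jsonl(path: Path, rows: List[Dict[str, str]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, str]]:
    """Parse every non-blank line as a JSON object."""
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up rows; the second response contains a newline on purpose.
sample_rows = [
    {"query": "What is JSONL?", "response": "One JSON object per line."},
    {"query": "Why use it?", "response": "It streams well and\ndiffs cleanly."},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "eval_sample.jsonl"
    write_jsonl(sample_path, sample_rows)
    line_count = len(sample_path.read_text(encoding="utf-8").splitlines())
    loaded_rows = read_jsonl(sample_path)

print(line_count, loaded_rows == sample_rows)
```

`json.dumps` escapes newlines inside strings, so a multi-line response still occupies a single line in the file.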

---

## Part 4: Initialize Evaluators

Configure the quality evaluators with the Azure OpenAI model for LLM-judged metrics.

**Evaluator Configuration:**
- All three evaluators use the same model configuration
- GPT-4 or GPT-4o recommended for best evaluation quality
- API key and endpoint must match your Azure OpenAI deployment

In [None]:
import pandas as pd
from azure.ai.evaluation import evaluate

# Configure model for evaluators
model_config = {
    "azure_endpoint": model_endpoint,
    "api_key": model_api_key,
    "azure_deployment": deployment_name,
    "api_version": "2024-08-01-preview"
}

# Initialize evaluators
relevance_eval = RelevanceEvaluator(model_config)
coherence_eval = CoherenceEvaluator(model_config)
fluency_eval = FluencyEvaluator(model_config)

logger.info("✅ Evaluators initialized")

---

## Part 5: Run Evaluations

Execute all three evaluators on each collected response.

**Evaluation Process:**
- Relevance and Coherence receive the query and response; Fluency scores the response alone
- LLM judges the quality based on specific criteria
- Returns a score (typically 1-5 scale)
- Results include both score and reasoning (if available)

**Error Handling:**
- Wrap evaluator calls in try-except for robustness
- Log evaluation progress for debugging
- Continue evaluation even if individual samples fail

In [None]:
# Run evaluation on each response
results = []

for row in evaluation_rows:
    query = row["query"]
    response = row["response"]
    
    logger.info(f"Evaluating: {query[:50]}...")
    
    try:
        # Run evaluators
        relevance_score = relevance_eval(query=query, response=response)
        coherence_score = coherence_eval(query=query, response=response)
        # Fluency is judged on the response text alone, so no query is passed
        fluency_score = fluency_eval(response=response)
        
        results.append({
            "query": query,
            "response": response,
            "relevance": relevance_score.get("relevance", relevance_score),
            "coherence": coherence_score.get("coherence", coherence_score),
            "fluency": fluency_score.get("fluency", fluency_score)
        })
    except Exception as e:
        logger.error(f"Error evaluating query '{query[:50]}...': {str(e)}")
        results.append({
            "query": query,
            "response": response,
            "relevance": None,
            "coherence": None,
            "fluency": None,
            "error": str(e)
        })
    
logger.info(f"✅ Evaluation completed for {len(results)} responses")
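
In the loop above, a failure in any one evaluator discards all three scores for that row. A possible refinement is to isolate each call so the remaining metrics survive. The sketch below assumes only that an evaluator is a callable returning a dictionary; `safe_score` and the two stand-in evaluators are hypothetical and are not SDK classes.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def safe_score(
    evaluator: Callable[..., Dict[str, Any]], key: str, **inputs: Any
) -> Tuple[Optional[float], Optional[str]]:
    """Call one evaluator; return (score, None) or (None, error message)."""
    try:
        result = evaluator(**inputs)
        return float(result[key]), None
    except Exception as exc:  # broad on purpose: one failure must not stop the run
        return None, f"{type(exc).__name__}: {exc}"


# Stand-in evaluators for demonstration only.
def fake_relevance(*, query: str, response: str) -> Dict[str, Any]:
    return {"relevance": 4.0}


def failing_coherence(*, query: str, response: str) -> Dict[str, Any]:
    raise RuntimeError("rate limit exceeded")


relevance, relevance_error = safe_score(fake_relevance, "relevance", query="q", response="r")
coherence, coherence_error = safe_score(failing_coherence, "coherence", query="q", response="r")
print(relevance, relevance_error)
print(coherence, coherence_error)
```

With the real evaluators, a call would look like `safe_score(relevance_eval, "relevance", query=query, response=response)`, with the error string stored next to the score.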

In [None]:
# Display results
df = pd.DataFrame(results)
display(df)

# Calculate averages (exclude None values from errors)
avg_relevance = df["relevance"].dropna().mean()
avg_coherence = df["coherence"].dropna().mean()
avg_fluency = df["fluency"].dropna().mean()

logger.info(f"\n📊 Average Scores:")
logger.info(f"  Relevance: {avg_relevance:.2f}")
logger.info(f"  Coherence: {avg_coherence:.2f}")
logger.info(f"  Fluency: {avg_fluency:.2f}")

---

## Part 6: Analyze Results

Review evaluation metrics and identify areas for improvement.

**Analysis Checklist:**
- ✅ Review aggregate metrics (mean, min, max)
- ✅ Identify low-scoring responses (< 3.0)
- ✅ Check for consistency across metrics
- ✅ Spot patterns in failing queries
- ✅ Compare with quality thresholds

In [None]:
# Detailed analysis
print("="*60)
print("EVALUATION ANALYSIS")
print("="*60)

# Statistics by metric
for metric in ["relevance", "coherence", "fluency"]:
    scores = df[metric].dropna()
    print(f"\n{metric.upper()} Statistics:")
    print(f"  Mean: {scores.mean():.2f}")
    print(f"  Median: {scores.median():.2f}")
    print(f"  Min: {scores.min():.2f}")
    print(f"  Max: {scores.max():.2f}")
    print(f"  Std Dev: {scores.std():.2f}")

# Identify low-scoring responses
print("\n" + "="*60)
print("LOW-SCORING RESPONSES (< 3.0)")
print("="*60)

low_threshold = 3.0
low_scores = df[(df["relevance"] < low_threshold) | 
                 (df["coherence"] < low_threshold) | 
                 (df["fluency"] < low_threshold)]

if len(low_scores) > 0:
    for idx, row in low_scores.iterrows():
        print(f"\nQuery: {row['query']}")
        print(f"  Relevance: {row['relevance']:.2f}")
        print(f"  Coherence: {row['coherence']:.2f}")
        print(f"  Fluency: {row['fluency']:.2f}")
else:
    print("✅ All responses meet the quality threshold!")

# Quality assessment
print("\n" + "="*60)
print("QUALITY ASSESSMENT")
print("="*60)

def assess_quality(score):
    if score >= 4.5:
        return "Excellent"
    elif score >= 4.0:
        return "Good"
    elif score >= 3.5:
        return "Acceptable"
    else:
        return "Needs Improvement"

print(f"Overall Relevance: {assess_quality(avg_relevance)} ({avg_relevance:.2f})")
print(f"Overall Coherence: {assess_quality(avg_coherence)} ({avg_coherence:.2f})")
print(f"Overall Fluency: {assess_quality(avg_fluency)} ({avg_fluency:.2f})")

---

## Part 7: Save Evaluation History

Persist evaluation results for historical tracking and comparison.

**Saved Data:**
- Timestamp and agent metadata
- Individual query-response-scores
- Aggregate metrics
- Model configuration used for evaluation

In [None]:
# Save results to JSON
results_path = dataset_dir / f"eval_results_{timestamp}.json"
with results_path.open("w", encoding="utf-8") as f:
    json.dump({
        "timestamp": timestamp,
        "agent_id": TARGET_AGENT_ID,
        "model": deployment_name,
        "results": results,
        "averages": {
            "relevance": float(avg_relevance),
            "coherence": float(avg_coherence),
            "fluency": float(avg_fluency)
        }
    }, f, indent=2)

logger.info(f"✅ Saved results to {results_path}")
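
Once two or more result files exist, their `averages` objects can be compared to spot regressions. The following is a minimal sketch with made-up numbers; `find_regressions` and the 0.25 tolerance are assumptions chosen for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.25
) -> List[str]:
    """List metrics whose average dropped by more than `tolerance`."""
    regressions = []
    for metric, old_value in baseline.items():
        new_value = current.get(metric)
        if new_value is not None and old_value - new_value > tolerance:
            regressions.append(f"{metric}: {old_value:.2f} -> {new_value:.2f}")
    return regressions


# Made-up averages, shaped like the "averages" object in the saved JSON.
baseline_averages = {"relevance": 4.5, "coherence": 4.0, "fluency": 4.0}
current_averages = {"relevance": 4.0, "coherence": 4.0, "fluency": 4.25}

print(find_regressions(baseline_averages, current_averages))
# prints ['relevance: 4.50 -> 4.00']
```

A tolerance is useful because LLM-judged scores vary slightly between runs even when the agent has not changed.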

---

## Summary and Best Practices

### Key Takeaways

1. **Local Evaluation Benefits**: Fast feedback loops, no cloud evaluation run to set up, full control
2. **Quality Metrics**: Relevance, Coherence, Fluency provide comprehensive assessment
3. **Iterative Development**: Run evaluations frequently during development
4. **Historical Tracking**: Save results with timestamps for trend analysis
5. **Quick Debugging**: Identify problematic queries and iterate on prompts

### Best Practices

#### 1. Test Prompt Design
- ✅ Start with 3-10 representative queries
- ✅ Include diverse question types (factual, explanatory, comparative)
- ✅ Add edge cases and challenging scenarios
- ✅ Cover all core agent capabilities
- ✅ Expand to 20-50 prompts for comprehensive coverage

#### 2. Evaluation Frequency
- ✅ Run after every significant prompt change
- ✅ Evaluate before deploying to staging/production
- ✅ Create baseline evaluations for comparison
- ✅ Track metrics over time to detect regressions
- ✅ Automate with CI/CD for continuous monitoring

#### 3. Model Configuration for Evaluation
- ✅ Use GPT-4 or GPT-4o for best evaluation quality
- ✅ Ensure sufficient API quota for evaluation workload
- ✅ Use consistent model version across runs
- ✅ Test evaluators on known good/bad responses first

#### 4. Results Analysis
- ✅ Review individual scores, not just averages
- ✅ Investigate low-scoring responses (< 3.0)
- ✅ Look for patterns in failing queries
- ✅ Compare metrics across different agent versions
- ✅ Set quality thresholds based on use case requirements

#### 5. Error Handling
- ✅ Wrap evaluator calls in try-except blocks
- ✅ Log evaluation progress for debugging
- ✅ Continue evaluation even if individual samples fail
- ✅ Store error information for troubleshooting
- ✅ Monitor API rate limits and quota usage

### Quality Thresholds (Recommended)

| Metric | Excellent | Good | Acceptable | Needs Improvement |
|--------|-----------|------|------------|-------------------|
| Relevance | ≥ 4.5 | 4.0 to < 4.5 | 3.5 to < 4.0 | < 3.5 |
| Coherence | ≥ 4.5 | 4.0 to < 4.5 | 3.5 to < 4.0 | < 3.5 |
| Fluency | ≥ 4.5 | 4.0 to < 4.5 | 3.5 to < 4.0 | < 3.5 |

### Common Issues and Solutions

| Issue | Possible Cause | Solution |
|-------|---------------|----------|
| Low Relevance | Agent off-topic or misunderstands query | Improve system prompt clarity |
| Low Coherence | Rambling or contradictory responses | Add output structure guidelines |
| Low Fluency | Grammar or formatting issues | Review prompt examples, adjust temperature |
| All scores low | Evaluation model misconfigured | Verify model_config parameters |
| Evaluation errors | API rate limits or quota | Add retry logic, check quota |
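
For the rate-limit row in the table above, retry logic can be as small as a wrapper with exponential backoff. This is a sketch under simplifying assumptions: `call_with_retry` is a hypothetical helper, and it retries on every exception, whereas a production version would retry only on throttling or transient errors.

```python
import random
import time
from typing import Any, Callable


def call_with_retry(
    func: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `func`, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
```

An evaluator call can then be wrapped as `call_with_retry(lambda: relevance_eval(query=query, response=response))`. The `sleep` parameter exists so that tests can replace the real delay.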

### Expanding Your Evaluation

**Add More Evaluators:**
```python
from azure.ai.evaluation import (
    GroundednessEvaluator,  # For RAG scenarios
    F1ScoreEvaluator,        # Token-overlap F1 against a ground-truth answer
    SimilarityEvaluator      # For semantic similarity
)
```

**Add Context for RAG Evaluation:**
```python
# Include retrieved context in your evaluation rows
evaluation_rows.append({
    "query": prompt,
    "context": retrieved_context,  # Add retrieved documents
    "response": response_text
})

# Use GroundednessEvaluator
groundedness_eval = GroundednessEvaluator(model_config)
score = groundedness_eval(context=retrieved_context, response=response_text)
```

**Compare Agent Versions:**
```python
# Evaluate multiple agent versions with same prompts
agents_to_compare = [
    {"id": "agent_v1", "name": "Baseline"},
    {"id": "agent_v2", "name": "Improved Prompt"},
    {"id": "agent_v3", "name": "With Tools"}
]

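for agent in agents_to_compare:
    # Rerun the Part 3 collection loop and the Part 5 evaluators with
    # agent_id=agent["id"], then store the averages under agent["name"]
    # so the versions can be compared side by side.
    print(f"Evaluating {agent['name']} ({agent['id']})")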

### Integration with CI/CD

**Example GitHub Actions Workflow:**
```yaml
name: Agent Quality Gate
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Agent Evaluation
        run: python evaluate_agent.py
      - name: Check Quality Thresholds
        run: |
          # `-lt` compares integers only, so let jq do the floating-point comparison
          if ! jq -e '.averages.relevance >= 4.0' results.json > /dev/null; then
            echo "Quality gate failed: Relevance below threshold"
            exit 1
          fi
```
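
The threshold check can also live in Python next to the evaluation script, which avoids shell arithmetic and covers all three metrics. The sketch below is one possible shape, assuming a results file with the `averages` object written in Part 7; `THRESHOLDS`, `failed_metrics`, and `run_gate` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical minimums; tune them to your own quality bar.
THRESHOLDS: Dict[str, float] = {"relevance": 4.0, "coherence": 4.0, "fluency": 4.0}


def failed_metrics(averages: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return a message for every metric that is missing, NaN, or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = averages.get(metric)
        # `value != value` is true only for NaN (an average over zero scores)
        if value is None or value != value or value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return failures


def run_gate(results_file: Path) -> int:
    """Return a process exit code: 0 when every threshold is met, 1 otherwise."""
    averages = json.loads(results_file.read_text(encoding="utf-8"))["averages"]
    failures = failed_metrics(averages, THRESHOLDS)
    for failure in failures:
        print(f"Quality gate failed: {failure}")
    return 1 if failures else 0
```

Ending the script with `raise SystemExit(run_gate(Path("results.json")))` makes the workflow step fail whenever a metric falls short.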

### Next Steps

1. **Expand Test Coverage**: Add more diverse queries covering edge cases
2. **Add Groundedness**: For RAG scenarios, evaluate context fidelity
3. **Track Over Time**: Create dashboard comparing evaluations across versions
4. **Automate**: Integrate into CI/CD for continuous quality monitoring
5. **Custom Evaluators**: Build domain-specific metrics for specialized use cases
6. **Cloud Evaluation**: Use `05_cloud_based_evaluation.ipynb` for large-scale testing
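
As a starting point for step 5, a code-based custom evaluator can be an ordinary callable class that returns a dictionary of metric values, which is the same shape the built-in evaluators return in this notebook. The class below is a hypothetical example: its name, word limits, and output keys are assumptions, and it runs on its own without any SDK import.

```python
from typing import Any, Dict


class AnswerLengthEvaluator:
    """Code-based evaluator sketch: flags responses outside a word-count range."""

    def __init__(self, min_words: int = 5, max_words: int = 150) -> None:
        self.min_words = min_words
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs: Any) -> Dict[str, Any]:
        # Extra keyword arguments (such as query) are accepted and ignored.
        word_count = len(response.split())
        return {
            "answer_length": word_count,
            "answer_length_ok": self.min_words <= word_count <= self.max_words,
        }


length_eval = AnswerLengthEvaluator(min_words=3, max_words=10)
print(length_eval(response="Azure AI Foundry hosts agents."))
# prints {'answer_length': 5, 'answer_length_ok': True}
```

Because it needs no model call, a check like this is cheap enough to run on every response alongside the LLM-judged metrics.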

### Additional Resources

- [Azure AI Evaluation SDK Documentation](https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk)
- [Built-in Evaluators Reference](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics)
- [Custom Evaluators Guide](https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#custom-evaluators)
- [Agent Evaluation Best Practices](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-approach-gen-ai)

### Related Notebooks

- `02_simulator_eval.ipynb`: Agent conversation testing with multiple scenarios
- `03_rag_evaluation.ipynb`: RAG-specific evaluators (Retrieval, Groundedness, etc.)
- `05_cloud_based_evaluation.ipynb`: Cloud-based evaluation for large datasets