# Agent Conversation Testing and Evaluation

This notebook demonstrates how to test Azure AI agents with synthetic conversations and evaluate their responses using quality metrics.

## What is Agent Conversation Testing?

Agent conversation testing validates that your AI agent produces high-quality responses across different user scenarios. The process involves:
1. **Test Scenario Definition**: Create realistic user queries representing different personas and use cases
2. **Agent Interaction**: Execute conversations with the agent to gather responses
3. **Quality Evaluation**: Assess response quality using AI-assisted evaluators

## Evaluation Metrics

This notebook evaluates agent responses using three key quality metrics:

1. **Relevance**: How well the response addresses the user's query
   - Measures accuracy, completeness, and directness
   - Likert scale 1-5 (higher is better)

2. **Coherence**: Logical flow and consistency of the response
   - Assesses if ideas connect naturally
   - Likert scale 1-5 (higher is better)

3. **Fluency**: Language quality and readability
   - Evaluates grammar, naturalness, and clarity
   - Likert scale 1-5 (higher is better)

## Workflow Overview

```
Test Scenarios → Agent Conversations → Extract Q&A Pairs → Evaluate Quality → Analyze Results
```

## Table of Contents

1. [Part 1: Environment Setup](#part-1-environment-setup)
2. [Part 2: Model Configuration](#part-2-model-configuration)
3. [Part 3: Define Test Scenarios](#part-3-define-test-scenarios)
4. [Part 4: Define Agent Callback](#part-4-define-agent-callback)
5. [Part 5: Run Agent Conversations](#part-5-run-agent-conversations)
   - 5.1: View Conversation Results
6. [Part 6: Initialize Evaluators](#part-6-initialize-evaluators)
7. [Part 7: Evaluate Conversations](#part-7-evaluate-conversations)
8. [Part 8: Analyze Results](#part-8-analyze-results)
   - 8.1: Save Evaluation Results
9. [Summary and Best Practices](#summary-and-best-practices)

In [None]:
%pip install azure-ai-evaluation wikipedia -qU

In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# You can then verify with shutil.which again
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

## Part 1: Environment Setup

Load required libraries and configure Azure AI project connection.

**Prerequisites**:
- `.env` file in `17_agent_ops/` directory with:
  - `AZURE_AI_PROJECT_ENDPOINT` - Azure AI Foundry project endpoint
  - `AZURE_OPENAI_API_KEY_GPT_4o` - Azure OpenAI API key
  - `AZURE_OPENAI_ENDPOINT_GPT_4o` - Azure OpenAI endpoint
  - `AZURE_OPENAI_MODEl_GPT_4o` - Deployment name (defaults to gpt-4o)
  - `TARGET_AGENT_ID` - The agent ID to evaluate
- Authenticate via `az login` for DefaultAzureCredential
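
Before running the setup cell, it can help to check which of these variables are missing. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` and `demo_env` are hypothetical names, and the required list assumes that only the three variables the setup code raises on are mandatory (the deployment name and `TARGET_AGENT_ID` have fallbacks in the code).

```python
import os
from typing import List, Mapping, Optional

# Variables the setup cell refuses to run without. The deployment name and
# TARGET_AGENT_ID fall back to defaults in the code, so they are not listed.
REQUIRED_VARS = [
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_OPENAI_API_KEY_GPT_4o",
    "AZURE_OPENAI_ENDPOINT_GPT_4o",
]


def missing_env_vars(names: List[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    source = os.environ if env is None else env
    return [name for name in names if not source.get(name)]


# Demo against a hypothetical, partially filled environment.
demo_env = {
    "AZURE_AI_PROJECT_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_API_KEY_GPT_4o": "",  # present but empty counts as missing
}
print(missing_env_vars(REQUIRED_VARS, demo_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` after `load_dotenv(...)` reports every gap at once instead of failing on the first one.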

In [None]:
import json
import asyncio
import logging
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.simulator import Simulator
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator
from azure.ai.projects import AIProjectClient
import wikipedia
import pandas as pd
from IPython.display import display

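# Add parent directory to path for agent_utils import.
# `__file__` is undefined inside a notebook kernel, so fall back to the parent of the working directory.
parent_dir = Path(__file__).resolve().parent.parent if "__file__" in globals() else Path.cwd().parent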
sys.path.insert(0, str(parent_dir / "utils"))

from agent_utils import AgentManager

# Load environment variables from parent directory
agent_ops_dir = Path.cwd().parent if Path.cwd().name == "05_evaluation" else Path.cwd()
env_path = agent_ops_dir / ".env"
load_dotenv(env_path)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("simulator_eval")

# Suppress verbose Azure SDK HTTP logging
logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.WARNING)
logging.getLogger("azure.identity").setLevel(logging.WARNING)
logging.getLogger("azure.cosmos._cosmos_http_logging_policy").setLevel(logging.WARNING)

# Initialize Azure AI Project Client with endpoint
endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
if not endpoint:
    raise ValueError("Set AZURE_AI_PROJECT_ENDPOINT in .env file")

credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=endpoint, credential=credential)
agent_manager = AgentManager(project_client)
logger.info("✅ Connected to Azure AI project")

# Get Azure OpenAI configuration from .env
model_api_key = os.getenv("AZURE_OPENAI_API_KEY_GPT_4o")
if not model_api_key:
    raise ValueError("Set AZURE_OPENAI_API_KEY_GPT_4o in .env file")

model_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT_GPT_4o")
if not model_endpoint:
    raise ValueError("Set AZURE_OPENAI_ENDPOINT_GPT_4o in .env file")

deployment_name = os.getenv("AZURE_OPENAI_MODEl_GPT_4o", "gpt-4o")

# Agent to evaluate
TARGET_AGENT_ID = os.getenv("TARGET_AGENT_ID", "asst_3pPWPYFexU3fEwbYB3VDWO1N")

# Setup data directory
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
notebook_dir = Path.cwd() if Path.cwd().name == "05_evaluation" else Path.cwd() / "05_evaluation"
dataset_dir = notebook_dir / "data"
dataset_dir.mkdir(parents=True, exist_ok=True)

logger.info(f"✅ Evaluation will use deployment '{deployment_name}' at endpoint '{model_endpoint}'")
logger.info(f"🎯 Target agent: {TARGET_AGENT_ID}")

## Part 2: Model Configuration

Configure the LLM model for evaluators and set up test context.

**Model Configuration**:
- Uses Azure OpenAI for AI-assisted quality evaluators
- Supports both API key and Azure AD authentication
- Configurable deployment and API version

In [None]:
# Configure model for evaluators
evaluator_model_config = {
    "azure_endpoint": model_endpoint,
    "api_key": model_api_key,
    "azure_deployment": deployment_name,
    "api_version": "2024-08-01-preview"
}

# Use Wikipedia to get contextual information for test scenarios
wiki_search_term = "Azure AI"
wiki_title = wikipedia.search(wiki_search_term)[0]
wiki_page = wikipedia.page(wiki_title)
text = wiki_page.summary[:1000]

logger.info("✅ Model configuration ready")
logger.info(f"📚 Test context from Wikipedia: {wiki_title}")
logger.info(f"📝 Context preview (first 200 chars): {text[:200]}...")

## Part 3: Define Test Scenarios

Create realistic test scenarios representing different user personas and use cases.

**Test Scenario Design**:
- Each scenario represents a distinct user persona
- Queries should reflect real-world use cases
- Include variety: beginners, experts, educators, practitioners
- Consider different information needs and goals
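
A small check on the scenario list can catch problems before any agent call is made. This sketch is an illustration only: `validate_scenarios` is a hypothetical helper, and it assumes each scenario is a dict with `scenario`, `persona`, and `query` keys as in the cell that follows. Duplicate scenario names are flagged because the analysis step groups results by that name.

```python
from typing import Dict, List

REQUIRED_KEYS = ("scenario", "persona", "query")


def validate_scenarios(scenarios: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means the scenarios look usable."""
    problems: List[str] = []
    seen_names = set()
    for index, item in enumerate(scenarios):
        # Every scenario needs a non-blank value for each required key.
        for key in REQUIRED_KEYS:
            if not str(item.get(key, "")).strip():
                problems.append(f"scenario {index}: missing or empty '{key}'")
        # Names must be unique so per-scenario statistics stay separate.
        name = item.get("scenario")
        if name in seen_names:
            problems.append(f"scenario {index}: duplicate name '{name}'")
        elif name:
            seen_names.add(name)
    return problems


demo = [
    {"scenario": "Student learning basics", "persona": "Beginner student", "query": "What is X?"},
    {"scenario": "Student learning basics", "persona": "", "query": "Explain X again."},
]
for problem in validate_scenarios(demo):
    print(problem)
```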

In [None]:
# Define test scenarios with different user personas
test_scenarios = [
    {
        "scenario": "Student learning basics",
        "persona": "Beginner student",
        "query": f"Can you explain what {wiki_search_term} is in simple terms?"
    },
    {
        "scenario": "Teacher preparing lesson",
        "persona": "Educator",
        "query": f"What are the key concepts of {wiki_search_term} I should teach to my students?"
    },
    {
        "scenario": "Technical overview",
        "persona": "Technical professional",
        "query": f"What are the main technical components of {wiki_search_term}?"
    },
    {
        "scenario": "Practical application",
        "persona": "Developer/Practitioner",
        "query": f"How can I use {wiki_search_term} in a real-world project?"
    }
]

logger.info(f"✅ Defined {len(test_scenarios)} test scenarios")
for i, scenario in enumerate(test_scenarios, 1):
    logger.info(f"  {i}. {scenario['scenario']} ({scenario['persona']})")

## Part 4: Define Agent Callback

Create a callback function to interact with the Azure AI agent.

**Callback Function Purpose**:
- Abstracts agent interaction logic
- Handles thread creation and cleanup
- Manages errors gracefully
- Returns responses in expected format
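
The contract can be exercised offline with a stub before pointing it at a real agent. The sketch below is an assumption-laden illustration: `fake_agent` and `stub_callback` are hypothetical names, the stub replaces the real agent call, and `asyncio.to_thread` (Python 3.9+) is one way to keep a blocking call from stalling the event loop. It is not how the notebook's own callback is written.

```python
import asyncio
from typing import Any, Dict, List, Optional


def fake_agent(query: str) -> str:
    """Hypothetical stand-in for a real, blocking agent call."""
    if not query.strip():
        raise ValueError("empty query")
    return f"Stub answer to: {query}"


async def stub_callback(
    messages: Dict[str, List[Dict[str, Any]]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Same shape as the agent callback: take the last user message, append a reply."""
    query = messages["messages"][-1]["content"]
    try:
        # Run the blocking call in a worker thread so other tasks can proceed.
        content = await asyncio.to_thread(fake_agent, query)
    except Exception as exc:
        # Errors become an assistant message instead of aborting the run.
        content = f"Error: {exc}"
    reply = {"role": "assistant", "content": content, "context": context or {}}
    return {
        "messages": messages["messages"] + [reply],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }
```

In a notebook, `await stub_callback({"messages": [{"role": "user", "content": "hi"}]})` returns the two-message conversation; in a plain script the same call can be wrapped in `asyncio.run(...)`.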

In [None]:
# Define callback function to interact with the agent
async def agent_callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    """
    Callback function that the simulator will use to interact with your agent.
    This simulates how a user would interact with the agent.
    """
    messages_list = messages["messages"]
    # Get the last message from the user
    latest_message = messages_list[-1]
    query = latest_message["content"]

    # Create a new thread for this conversation
    thread = agent_manager.create_thread()

    try:
        # Call the agent
        response_text = agent_manager.run_agent_simple(
            thread_id=thread.id,
            agent_id=TARGET_AGENT_ID,
            user_message=query,
            verbose=False,
        )

        # Format the response to follow the OpenAI chat protocol format
        formatted_response = {
            "content": response_text,
            "role": "assistant",
            "context": context or {},
        }
        messages["messages"].append(formatted_response)

        return {
            "messages": messages["messages"],
            "stream": stream,
            "session_state": session_state,
            "context": context
        }
    except Exception as e:
        logger.error(f"Error in agent callback: {e}")
        # Return error response
        error_response = {
            "content": f"Error: {str(e)}",
            "role": "assistant",
            "context": context or {},
        }
        messages["messages"].append(error_response)
        return {
            "messages": messages["messages"],
            "stream": stream,
            "session_state": session_state,
            "context": context
        }
    finally:
        agent_manager.delete_thread(thread.id, silent=True)

logger.info("✅ Agent callback function defined")

## Part 5: Run Agent Conversations

Execute conversations with the agent for each test scenario.

**Process**:
1. Iterate through test scenarios
2. Create message format for each query
3. Call agent callback to get response
4. Store conversation results
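
The steps above process one scenario at a time. As a possible variant, not used in this notebook, the same steps can run concurrently with a cap on in-flight calls. This is a sketch under assumptions: `run_scenarios` and `demo_callback` are hypothetical names, and concurrency only helps when the callback actually yields to the event loop (for example by awaiting network I/O or a worker thread).

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Callback = Callable[..., Awaitable[Dict[str, Any]]]


async def run_scenarios(
    scenarios: List[Dict[str, str]],
    callback: Callback,
    max_concurrency: int = 2,
) -> List[Dict[str, Any]]:
    """Run every scenario through `callback`, at most `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(test: Dict[str, str]) -> Dict[str, Any]:
        async with semaphore:
            # Steps 2 and 3: build the message payload and call the agent.
            messages = {"messages": [{"role": "user", "content": test["query"]}]}
            response = await callback(messages=messages)
        # Step 4: keep the conversation together with its scenario metadata.
        return {
            "scenario": test["scenario"],
            "persona": test.get("persona", "Unknown"),
            "messages": response["messages"],
        }

    # gather() returns results in the same order as the input scenarios.
    return list(await asyncio.gather(*(run_one(test) for test in scenarios)))


async def demo_callback(messages: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Hypothetical agent that upper-cases the query."""
    history = messages["messages"]
    reply = {"role": "assistant", "content": history[-1]["content"].upper()}
    return {"messages": history + [reply]}
```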

In [None]:
# Run agent conversations for all test scenarios
async def run_agent_conversations():
    """
    Execute conversations with the agent using predefined test scenarios.
    Returns conversation results with scenario metadata.
    """
    conversation_results = []

    for test in test_scenarios:
        logger.info(f"📝 Testing: {test['scenario']}")

        # Create conversation format expected by callback
        messages = {
            "messages": [
                {"role": "user", "content": test["query"]}
            ]
        }

        # Call the agent
        response = await agent_callback(messages=messages)

        # Store the result with metadata
        conversation_results.append({
            "scenario": test["scenario"],
            "persona": test["persona"],
            "messages": response["messages"]
        })
        
        logger.info(f"  ✅ Completed: {test['scenario']}")

    return conversation_results

# Execute conversations
logger.info("=" * 80)
logger.info("🚀 Starting agent conversation testing...")
logger.info("=" * 80)
conversation_outputs = await run_agent_conversations()
logger.info(f"\n✅ Completed {len(conversation_outputs)} conversations")

### 5.1: View Conversation Results

Display and save the generated conversations for inspection.

In [None]:
# Display conversation outputs
print("\n" + "=" * 80)
print("CONVERSATION RESULTS")
print("=" * 80)

for conv in conversation_outputs:
    print(f"\n📌 Scenario: {conv['scenario']} ({conv['persona']})")
    print("-" * 80)
    for msg in conv["messages"]:
        role_icon = "👤" if msg["role"] == "user" else "🤖"
        print(f"{role_icon} {msg['role'].upper()}: {msg['content'][:200]}...")
    print()

# Save conversation outputs
conversation_output_path = dataset_dir / f"conversations_{timestamp}.json"
with conversation_output_path.open("w", encoding="utf-8") as f:
    json.dump(conversation_outputs, f, indent=2, ensure_ascii=False)

logger.info(f"✅ Saved conversations to {conversation_output_path}")

## Part 6: Initialize Evaluators

Set up AI-assisted evaluators to assess response quality.

**Evaluator Types**:
- **RelevanceEvaluator**: Measures how well the response addresses the query
- **CoherenceEvaluator**: Assesses logical flow and consistency
- **FluencyEvaluator**: Evaluates language quality and readability

All evaluators use an LLM-judge approach with Likert scale scoring (1-5).
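
Each evaluator returns a dictionary, and the evaluation cell later in this notebook looks for the score under both a `gpt_`-prefixed key and a plain key. A small helper can make that lookup explicit and reject values outside the 1-5 range. This is a sketch only: `extract_score` is a hypothetical name, and the two key spellings are taken from this notebook's own code rather than from a guaranteed SDK contract.

```python
from typing import Any, Mapping, Optional


def extract_score(result: Mapping[str, Any], metric: str) -> Optional[float]:
    """Return the 1-5 score for `metric`, or None if it is absent or out of range."""
    # Try the prefixed key first, then the plain one, mirroring the notebook.
    for key in (f"gpt_{metric}", metric):
        value = result.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and 1 <= value <= 5:
            return float(value)
    return None


# Hypothetical evaluator outputs used only for illustration.
print(extract_score({"gpt_relevance": 4}, "relevance"))
print(extract_score({"coherence": 3.0}, "coherence"))
print(extract_score({}, "fluency"))
```

Returning `None` rather than `0` keeps a missing score distinguishable from a genuinely low one.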

In [None]:
# Initialize quality evaluators
relevance_eval = RelevanceEvaluator(evaluator_model_config)
coherence_eval = CoherenceEvaluator(evaluator_model_config)
fluency_eval = FluencyEvaluator(evaluator_model_config)

logger.info("=" * 80)
logger.info("✅ Evaluators initialized successfully")
logger.info("=" * 80)
logger.info("  • Relevance Evaluator - Measures query-response alignment")
logger.info("  • Coherence Evaluator - Assesses logical consistency")
logger.info("  • Fluency Evaluator - Evaluates language quality")

## Run Simulator

Generate synthetic conversations by directly calling the agent with test queries.

In [None]:
# Run direct agent conversations (alternative to Simulator for simpler testing)
async def run_direct_conversations():
    """
    Directly call the agent with predefined queries instead of using the Simulator.
    This is simpler and more reliable for basic testing scenarios.
    """
    # Define test scenarios
    test_queries = [
        {
            "scenario": "Student learning basics",
            "query": f"Can you explain what {wiki_search_term} is in simple terms?"
        },
        {
            "scenario": "Teacher preparing lesson",
            "query": f"What are the key concepts of {wiki_search_term} I should teach to my students?"
        },
        {
            "scenario": "Technical overview",
            "query": f"What are the main technical components of {wiki_search_term}?"
        },
        {
            "scenario": "Practical application",
            "query": f"How can I use {wiki_search_term} in a real-world project?"
        }
    ]

    results = []

    for test in test_queries:
        logger.info(f"Testing scenario: {test['scenario']}")

        # Create conversation format expected by callback
        messages = {
            "messages": [
                {"role": "user", "content": test["query"]}
            ]
        }

        # Call the agent
        response = await agent_callback(messages=messages)

        # Store the result
        results.append({
            "scenario": test["scenario"],
            "messages": response["messages"]
        })

    return results

# Run the conversations
logger.info("🚀 Starting agent conversations...")
simulator_outputs = await run_direct_conversations()
logger.info(f"✅ Completed {len(simulator_outputs)} conversations")

## Display Simulator Outputs

View and save the generated conversations.

In [None]:
# Display simulator outputs
print("=" * 80)
print("SIMULATOR OUTPUTS")
print("=" * 80)
print(json.dumps(simulator_outputs, indent=2))

# Save simulator outputs
simulator_output_path = dataset_dir / f"simulator_outputs_{timestamp}.json"
with simulator_output_path.open("w", encoding="utf-8") as f:
    json.dump(simulator_outputs, f, indent=2, ensure_ascii=False)

logger.info(f"✅ Saved simulator outputs to {simulator_output_path}")

## Initialize Evaluators

Set up evaluators to assess the quality of simulated conversations.

In [None]:
# Configure model for evaluators
evaluator_model_config = {
    "azure_endpoint": model_endpoint,
    "api_key": model_api_key,
    "azure_deployment": deployment_name,
    "api_version": "2024-08-01-preview"
}

# Initialize evaluators
relevance_eval = RelevanceEvaluator(evaluator_model_config)
coherence_eval = CoherenceEvaluator(evaluator_model_config)
fluency_eval = FluencyEvaluator(evaluator_model_config)

logger.info("✅ Evaluators initialized")

## Part 7: Evaluate Conversations

Extract query-response pairs and evaluate them using quality metrics.

**Evaluation Process**:
1. Extract Q&A pairs from conversations
2. Run each evaluator (Relevance, Coherence, Fluency)
3. Handle errors gracefully
4. Store results with scenario metadata
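
Step 1 can be checked in isolation on a toy conversation. The sketch below is illustrative: `extract_pairs` is a hypothetical function name, and the conversation shape (a `messages` list of `role`/`content` dicts plus optional `scenario` and `persona`) is assumed from the earlier cells.

```python
from typing import Any, Dict, List


def extract_pairs(conversations: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Collect every user message that is immediately followed by an assistant reply."""
    pairs: List[Dict[str, str]] = []
    for conversation in conversations:
        messages = conversation.get("messages", [])
        # Walk adjacent message pairs: (m0, m1), (m1, m2), ...
        for current, following in zip(messages, messages[1:]):
            if current["role"] == "user" and following["role"] == "assistant":
                pairs.append({
                    "scenario": conversation.get("scenario", "Unknown"),
                    "persona": conversation.get("persona", "Unknown"),
                    "query": current["content"],
                    "response": following["content"],
                })
    return pairs


# Toy two-turn conversation; the contents are made up for illustration.
toy = [{
    "scenario": "Demo",
    "persona": "Tester",
    "messages": [
        {"role": "user", "content": "Q1"},
        {"role": "assistant", "content": "A1"},
        {"role": "user", "content": "Q2"},
        {"role": "assistant", "content": "A2"},
    ],
}]
for pair in extract_pairs(toy):
    print(pair["query"], "->", pair["response"])
```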

In [None]:
# Extract query-response pairs from conversations
evaluation_data = []

for conversation in conversation_outputs:
    if "messages" in conversation:
        messages = conversation["messages"]
        # Extract user queries and assistant responses
        for i in range(len(messages) - 1):
            if messages[i]["role"] == "user" and messages[i + 1]["role"] == "assistant":
                evaluation_data.append({
                    "scenario": conversation.get("scenario", f"Conversation {len(evaluation_data) + 1}"),
                    "persona": conversation.get("persona", "Unknown"),
                    "query": messages[i]["content"],
                    "response": messages[i + 1]["content"]
                })

logger.info(f"\n📊 Extracted {len(evaluation_data)} query-response pairs for evaluation")

# Run evaluation on all conversations
logger.info("\n" + "=" * 80)
logger.info("🔍 Starting quality evaluation...")
logger.info("=" * 80)

evaluation_results = []

for row in evaluation_data:
    scenario = row["scenario"]
    persona = row["persona"]
    query = row["query"]
    response = row["response"]

    logger.info(f"\n📝 Evaluating: {scenario}")
    logger.info(f"   Query: {query[:80]}...")

    # Run evaluators with error handling
    try:
        relevance_score = relevance_eval(query=query, response=response)
        coherence_score = coherence_eval(query=query, response=response)
        # Fluency is judged on the response text alone, so no query is passed
        fluency_score = fluency_eval(response=response)

        result = {
            "scenario": scenario,
            "persona": persona,
            "query": query,
            "response": response[:200] + "..." if len(response) > 200 else response,
            "relevance": relevance_score.get("gpt_relevance", relevance_score.get("relevance", 0)),
            "coherence": coherence_score.get("gpt_coherence", coherence_score.get("coherence", 0)),
            "fluency": fluency_score.get("gpt_fluency", fluency_score.get("fluency", 0))
        }
        
        evaluation_results.append(result)
        logger.info(f"   ✅ Scores - Relevance: {result['relevance']:.1f}, Coherence: {result['coherence']:.1f}, Fluency: {result['fluency']:.1f}")

    except Exception as e:
        logger.error(f"   ❌ Error: {str(e)}")
        evaluation_results.append({
            "scenario": scenario,
            "persona": persona,
            "query": query,
            "response": response[:200] + "..." if len(response) > 200 else response,
            "relevance": 0,
            "coherence": 0,
            "fluency": 0,
            "error": str(e)
        })

logger.info(f"\n✅ Evaluation completed for {len(evaluation_results)} conversations")

## Part 8: Analyze Results

Visualize and analyze evaluation results to identify patterns and insights.
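
One point to watch when averaging: rows that failed evaluation are stored with scores of `0` and an `error` field, so including them pulls every mean down. The sketch below shows one way to exclude them, using only the standard library instead of pandas. `summarize` is a hypothetical helper, and the row layout is assumed to match the result dictionaries built in Part 7.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

METRICS = ("relevance", "coherence", "fluency")


def summarize(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Average scores over successfully evaluated rows and count the failures."""
    scored = [row for row in rows if "error" not in row]
    overall = {m: mean(row[m] for row in scored) for m in METRICS} if scored else {}
    # Per-scenario value: mean of the three metrics, averaged across that scenario's rows.
    by_scenario: Dict[str, List[float]] = defaultdict(list)
    for row in scored:
        by_scenario[row["scenario"]].append(mean(row[m] for m in METRICS))
    return {
        "overall": overall,
        "per_scenario": {name: mean(values) for name, values in by_scenario.items()},
        "failed": len(rows) - len(scored),
    }


# Made-up rows: one scored, one failed.
demo_rows = [
    {"scenario": "A", "relevance": 4, "coherence": 5, "fluency": 3},
    {"scenario": "B", "relevance": 0, "coherence": 0, "fluency": 0, "error": "timeout"},
]
print(summarize(demo_rows))
```

Reporting the `failed` count next to the averages makes it clear how many conversations the numbers actually cover.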

In [None]:
# Create DataFrame for analysis
results_df = pd.DataFrame(evaluation_results)

# Display results table
print("\n" + "=" * 80)
print("EVALUATION RESULTS SUMMARY")
print("=" * 80)
display(results_df[["scenario", "persona", "relevance", "coherence", "fluency"]])

# Calculate statistics
print("\n" + "=" * 80)
print("STATISTICAL SUMMARY")
print("=" * 80)

avg_relevance = results_df["relevance"].mean()
avg_coherence = results_df["coherence"].mean()
avg_fluency = results_df["fluency"].mean()

print(f"\n📊 Average Scores (out of 5.0):")
print(f"  Relevance:  {avg_relevance:.2f}")
print(f"  Coherence:  {avg_coherence:.2f}")
print(f"  Fluency:    {avg_fluency:.2f}")
print(f"  Overall:    {(avg_relevance + avg_coherence + avg_fluency) / 3:.2f}")

# Per-scenario analysis
print(f"\n📋 Per-Scenario Performance:")
scenario_stats = results_df.groupby("scenario")[["relevance", "coherence", "fluency"]].mean()
for scenario, scores in scenario_stats.iterrows():
    avg_score = scores.mean()
    status = "✅" if avg_score >= 4.0 else "⚠️" if avg_score >= 3.0 else "❌"
    print(f"  {status} {scenario}: {avg_score:.2f}")

# Quality assessment
print(f"\n🎯 Quality Assessment:")
overall_avg = (avg_relevance + avg_coherence + avg_fluency) / 3
if overall_avg >= 4.0:
    print("  ✅ EXCELLENT - Agent responses are high quality across all metrics")
elif overall_avg >= 3.5:
    print("  ✅ GOOD - Agent responses meet quality standards")
elif overall_avg >= 3.0:
    print("  ⚠️  FAIR - Agent responses need improvement in some areas")
else:
    print("  ❌ POOR - Agent responses require significant improvement")

### 8.1: Save Evaluation Results

Persist evaluation results for future reference and comparison.

In [None]:
# Save comprehensive evaluation results
results_path = dataset_dir / f"evaluation_results_{timestamp}.json"
with results_path.open("w", encoding="utf-8") as f:
    json.dump({
        "metadata": {
            "timestamp": timestamp,
            "agent_id": TARGET_AGENT_ID,
            "model": deployment_name,
            "test_context": wiki_search_term,
            "num_scenarios": len(test_scenarios),
            "num_evaluations": len(evaluation_results)
        },
        "test_scenarios": test_scenarios,
        "evaluation_results": evaluation_results,
        "summary_statistics": {
            "averages": {
                "relevance": float(avg_relevance),
                "coherence": float(avg_coherence),
                "fluency": float(avg_fluency),
                "overall": float((avg_relevance + avg_coherence + avg_fluency) / 3)
            },
            "per_scenario": scenario_stats.to_dict()
        }
    }, f, indent=2)

logger.info(f"\n✅ Saved evaluation results to {results_path}")

## Summary and Best Practices

### Key Takeaways

1. **Test Scenario Design**
   - Create diverse scenarios representing real user needs
   - Include different personas: beginners, experts, educators, practitioners
   - Cover various information types: definitions, concepts, technical details, applications

2. **Quality Metrics Understanding**
   - **Relevance (4.0+)**: Response directly addresses the query
   - **Coherence (4.0+)**: Ideas flow logically and connect naturally
   - **Fluency (4.0+)**: Language is clear, grammatical, and natural

3. **Evaluation Strategy**
   - Run evaluations consistently across all scenarios
   - Handle errors gracefully to avoid incomplete results
   - Analyze per-scenario performance to identify weak areas
   - Track results over time to measure improvements

4. **Agent Quality Standards**
   - **Excellent (4.0+)**: Production-ready, high-quality responses
   - **Good (3.5-4.0)**: Acceptable for most use cases
   - **Fair (3.0-3.5)**: Needs improvement before deployment
   - **Poor (<3.0)**: Requires significant agent refinement
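
The quality bands above can be written as a single function, which avoids repeating the thresholds wherever a label is needed. A minimal sketch, with `quality_label` as a hypothetical name; boundary values go to the higher band, matching the `>=` comparisons used in Part 8.

```python
def quality_label(score: float) -> str:
    """Map an average score on the 1-5 scale to the quality bands listed above."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    if score >= 4.0:
        return "Excellent"
    if score >= 3.5:
        return "Good"
    if score >= 3.0:
        return "Fair"
    return "Poor"


for value in (4.2, 3.5, 3.1, 2.0):
    print(value, quality_label(value))
```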

### Common Patterns

| Scenario Type | Expected Relevance | Expected Coherence | Expected Fluency |
|--------------|-------------------|-------------------|-----------------|
| Simple definitions | 4.5+ | 4.0+ | 4.5+ |
| Technical explanations | 4.0+ | 4.0+ | 4.0+ |
| Educational content | 4.5+ | 4.5+ | 4.5+ |
| Practical guidance | 4.0+ | 4.0+ | 4.0+ |
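
The table can also be used as an automated check. The sketch below encodes the rows as minimum scores and lists the metrics that fall short. `EXPECTED_MINIMUMS` and `below_expectation` are hypothetical names, and mapping a test scenario to one of the four scenario types is left to the caller.

```python
from typing import Dict, List

# Minimum expected scores, copied from the table above.
EXPECTED_MINIMUMS: Dict[str, Dict[str, float]] = {
    "Simple definitions": {"relevance": 4.5, "coherence": 4.0, "fluency": 4.5},
    "Technical explanations": {"relevance": 4.0, "coherence": 4.0, "fluency": 4.0},
    "Educational content": {"relevance": 4.5, "coherence": 4.5, "fluency": 4.5},
    "Practical guidance": {"relevance": 4.0, "coherence": 4.0, "fluency": 4.0},
}


def below_expectation(scenario_type: str, scores: Dict[str, float]) -> List[str]:
    """Return the metrics whose score is under the expected minimum for this type."""
    minimums = EXPECTED_MINIMUMS[scenario_type]
    # A metric missing from `scores` is treated as 0 and therefore reported.
    return [
        metric
        for metric, minimum in minimums.items()
        if scores.get(metric, 0.0) < minimum
    ]


print(below_expectation(
    "Educational content",
    {"relevance": 4.5, "coherence": 4.0, "fluency": 5.0},
))
```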

### Next Steps

- **Expand Test Coverage**: Add more scenarios and edge cases
- **Customize Evaluators**: Create domain-specific evaluation metrics
- **A/B Testing**: Compare different agent configurations
- **Continuous Monitoring**: Track quality metrics over time
- **User Feedback Integration**: Combine automated metrics with real user feedback
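
For A/B testing or tracking over time, two saved result files can be compared by their summary averages. This is a sketch under assumptions: `compare_runs` is a hypothetical helper, and both inputs are dicts with the `summary_statistics` → `averages` layout written in section 8.1 (for example, loaded with `json.load`).

```python
from typing import Any, Dict


def compare_runs(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> Dict[str, float]:
    """Return candidate-minus-baseline deltas for each metric both runs report."""
    base = baseline["summary_statistics"]["averages"]
    cand = candidate["summary_statistics"]["averages"]
    return {
        metric: round(cand[metric] - base[metric], 2)
        for metric in base
        if metric in cand
    }


# Made-up summaries standing in for two saved evaluation files.
run_a = {"summary_statistics": {"averages": {"relevance": 4.0, "coherence": 3.5}}}
run_b = {"summary_statistics": {"averages": {"relevance": 4.25, "coherence": 3.0}}}
print(compare_runs(run_a, run_b))
```

A positive delta means the candidate run scored higher on that metric. With only a handful of scenarios, small differences are likely to be noise from the LLM judge rather than a real change.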

### Additional Resources

- [Azure AI Evaluation Documentation](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-sdk)
- [Agent Quality Best Practices](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-agent)
- [Azure AI Foundry Studio](https://ai.azure.com)