# Cloud-Based Evaluation with Azure AI Projects

## Introduction

This notebook demonstrates how to perform **cloud-based evaluations** using Azure AI Project's evaluation service. Unlike local evaluations that run on your machine, cloud-based evaluations execute in Azure infrastructure, providing scalability, centralized storage, and team collaboration capabilities.

### What is Cloud-Based Evaluation?

Cloud-based evaluation allows you to:
- **Upload datasets** to Azure AI Project for centralized storage
- **Run evaluations asynchronously** in Azure's scalable infrastructure
- **Track evaluation jobs** and monitor progress programmatically
- **View detailed results** in Azure AI Foundry Studio with rich visualizations
- **Share results** with team members for collaborative analysis
- **Maintain audit trails** with governance tags and version control

### When to Use Cloud-Based Evaluation

**Use Cloud-Based Evaluation When:**
- Working with large datasets (100+ samples) that need distributed processing
- Collaborating with a team that needs shared access to evaluation results
- Building production pipelines with automated evaluation workflows
- Requiring centralized governance and compliance tracking
- Needing historical comparison of evaluation runs over time

**Use Local Evaluation When:**
- Rapid prototyping and iterative development
- Small datasets (< 50 samples) with quick feedback loops
- Debugging specific evaluator configurations
- Working offline or with sensitive data that cannot leave the local environment

### Key Concepts

1. **Dataset Management**: Datasets are versioned and stored in Azure AI Project, enabling reproducible evaluations
2. **Evaluator Configuration**: Define which metrics to compute using `EvaluatorConfiguration` with proper data mapping
3. **Asynchronous Execution**: Jobs run in the background; poll for status until completion
4. **Results Storage**: All metrics are stored centrally and can be accessed through the Studio UI or the SDK
5. **Governance**: Use tags to classify evaluations by environment, data sensitivity, and purpose

### Evaluation Workflow

```
1. Prepare Dataset (JSONL format)
   ↓
2. Upload to Azure AI Project
   ↓
3. Configure Evaluators (Quality, Safety, Agent)
   ↓
4. Create Evaluation Job
   ↓
5. Monitor Job Status (Polling)
   ↓
6. View Results in Studio / Download
   ↓
7. Analyze Metrics & Iterate
```

### Available Evaluators

**Quality Metrics:**
- **Coherence**: How well the response flows logically (1-5 scale)
- **Relevance**: Whether the response addresses the query appropriately (1-5 scale)
- **Fluency**: Language quality and readability (1-5 scale)
- **Groundedness**: Response fidelity to provided context (1-5 scale)

**Agent Metrics:**
- **Tool Call Accuracy**: Correctness of tool selection and arguments (0 or 1 per sample; averages across samples fall between 0 and 1)

**Safety Metrics** (configured separately):
- Content Safety categories (violence, sexual, self_harm, hate_unfairness)

### Prerequisites

- Azure AI Project with evaluation quota enabled
- Azure OpenAI deployment (GPT-4 or GPT-4o recommended for LLM-judged metrics)
- Dataset in JSONL format with required fields for your chosen evaluators
- Sufficient API quota for evaluation workload

## Table of Contents

1. [Part 1: Environment Setup](#part-1-environment-setup)
2. [Part 2: Dataset Preparation](#part-2-dataset-preparation)
3. [Part 3: Upload Dataset to Azure](#part-3-upload-dataset-to-azure)
4. [Part 4: Configure Evaluators](#part-4-configure-evaluators)
5. [Part 5: Create and Run Evaluation Job](#part-5-create-and-run-evaluation-job)
6. [Part 6: Monitor Job Status](#part-6-monitor-job-status)
7. [Part 7: View and Analyze Results](#part-7-view-and-analyze-results)
8. [Summary and Best Practices](#summary-and-best-practices)

---

## Part 1: Environment Setup

Configure the Azure AI Project client and verify connectivity.

**Required Environment Variables:**
- `AZURE_AI_PROJECT_ENDPOINT`: Your project endpoint URL
- `AZURE_OPENAI_ENDPOINT_GPT_4o`: Model endpoint for LLM-judged metrics
- `AZURE_OPENAI_API_KEY_GPT_4o`: API key for model access
- `AZURE_OPENAI_MODEl_GPT_4o`: Deployment name (e.g., gpt-4o-mini)
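
A quick pre-flight check can catch a missing variable before any Azure call fails with a less obvious error. The sketch below is a minimal example, not part of the SDK: `find_missing_vars` is a hypothetical helper, and the variable names are copied exactly as this notebook spells them. Run it after the `.env` file has been loaded.

```python
import os

# Names copied from the list above, spelled exactly as this notebook uses them.
REQUIRED_VARS = [
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT_GPT_4o",
    "AZURE_OPENAI_API_KEY_GPT_4o",
    "AZURE_OPENAI_MODEl_GPT_4o",
]


def find_missing_vars(names, env=None):
    """Return the names that are unset or blank in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```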

In [None]:
import os
import shutil

new_path_entry = "/opt/homebrew/bin"  # Replace with the directory you want to add
current_path = os.environ.get('PATH', '')

if new_path_entry not in current_path.split(os.pathsep):
    os.environ['PATH'] = new_path_entry + os.pathsep + current_path
    print(f"Updated PATH for this session: {os.environ['PATH']}")
else:
    print(f"PATH already contains {new_path_entry}: {current_path}")

# Verify that the kernel can now locate the Azure CLI
print(f"Location of 'az' found by kernel now: {shutil.which('az')}")

In [None]:
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add parent directory to path for agent_utils import
# __file__ is not defined inside a notebook kernel, so fall back to the working directory
parent_dir = Path(__file__).parent.parent if "__file__" in globals() else Path.cwd().parent
sys.path.insert(0, str(parent_dir / "01_agent"))

# Load environment variables from parent directory
agent_ops_dir = Path.cwd().parent if Path.cwd().name == "05_evaluation" else Path.cwd()
env_path = agent_ops_dir / ".env"
load_dotenv(env_path)

In [None]:
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient


endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
credential = DefaultAzureCredential()

project_client = AIProjectClient(endpoint=endpoint, credential=credential)

print(f"Connected to Azure AI Project: {endpoint}")

---

## Part 2: Dataset Preparation

Create a dataset in JSONL format for cloud-based evaluation.

**Dataset Requirements:**
- **Format**: JSONL (JSON Lines) - one JSON object per line
- **Fields**: Include all fields required by your chosen evaluators
- **Consistency**: Use the same field names across all samples
- **Validation**: Ensure valid JSON on each line

**Common Fields:**
- `query`: User's question or input
- `context`: Retrieved/provided context for the response
- `response`: Model's generated response
- `expected_tool_calls`: Ground truth for tool usage (agent scenarios)
- `actual_tool_calls`: Model's actual tool calls (agent scenarios)

In [None]:
import json

# Create sample evaluation data
evaluation_samples = [
    {
        "query": "What are the opening hours of the Space Needle in Seattle?",
        "context": "The Space Needle is open from 9:00 AM to 11:00 PM daily.",
        "response": "The Space Needle is open from 9:00 AM to 11:00 PM every day.",
        "expected_tool_calls": [],
        "actual_tool_calls": []
    },
    {
        "query": "What's the weather like in Seattle today?",
        "context": "Seattle typically has rainy weather in winter and mild summers.",
        "response": "The current weather in Seattle is rainy with a temperature of 14°C.",
        "expected_tool_calls": [
            {
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"}
            }
        ],
        "actual_tool_calls": [
            {
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"}
            }
        ]
    },
    {
        "query": "Can you recommend a good coffee shop in Seattle?",
        "context": "Seattle is famous for its coffee culture with many excellent cafes.",
        "response": "I recommend Pike Place Market Starbucks, the original Starbucks location, or local favorites like Espresso Vivace.",
        "expected_tool_calls": [],
        "actual_tool_calls": []
    },
    {
        "query": "How do I get to Pike Place Market from downtown Seattle?",
        "context": "Pike Place Market is located in downtown Seattle at 85 Pike St.",
        "response": "Pike Place Market is in downtown Seattle. You can walk there from most downtown locations, or take a bus to Pike Street.",
        "expected_tool_calls": [],
        "actual_tool_calls": []
    },
    {
        "query": "What temperature should I expect in Seattle in summer?",
        "context": "Seattle summers are mild with temperatures typically between 20-25°C.",
        "response": "In summer, Seattle typically has temperatures between 20-25°C (68-77°F), making it quite pleasant.",
        "expected_tool_calls": [],
        "actual_tool_calls": []
    }
]

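# Write to JSONL file (create the data/ directory first if it does not exist)
dataset_file = "data/evaluate_test_data.jsonl"
os.makedirs(os.path.dirname(dataset_file), exist_ok=True)
with open(dataset_file, 'w', encoding='utf-8') as f:
    for sample in evaluation_samples:
        f.write(json.dumps(sample) + '\n')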

print(f"Created dataset file: {dataset_file}")
print(f"Number of samples: {len(evaluation_samples)}")
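
The dataset requirements listed earlier (valid JSON on every line, the same fields in every sample) can be checked before uploading. The following is a minimal sketch using only the standard library; `validate_jsonl` is a hypothetical helper name, and the set of required fields depends on which evaluators you plan to run.

```python
import json


def validate_jsonl(path, required_fields):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            # Report any required field that this sample does not carry.
            missing = sorted(set(required_fields) - set(record))
            if missing:
                problems.append((line_number, f"missing fields: {missing}"))
    return problems
```

For the file written above, a call such as `validate_jsonl(dataset_file, ["query", "context", "response"])` should return an empty list.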

---

## Part 3: Upload Dataset to Azure

Upload the prepared dataset to Azure AI Project for centralized storage and evaluation.

**Dataset Versioning:**
- Use a consistent version scheme (e.g., "1.0", "2.0") for tracking changes
- Store multiple versions for comparing evaluation results over time
- Each version is immutable once uploaded
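
Because an uploaded version cannot be overwritten, a changed dataset needs a new version string. The sketch below shows one way to pick it, assuming the "N.0" style used in this notebook. `next_major_version` is a hypothetical helper, and the list of existing versions is supplied by the caller; how you obtain that list from the service is not shown here.

```python
def next_major_version(existing_versions):
    """Return the next "N.0" string after the highest major version seen.

    Entries whose leading component is not a number are ignored.
    """
    majors = []
    for version in existing_versions:
        head = version.split(".")[0]
        if head.isdigit():
            majors.append(int(head))
    return f"{max(majors, default=0) + 1}.0"


print(next_major_version(["1.0", "2.0", "3.0"]))  # prints 4.0
```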

In [None]:
# Upload dataset to Azure AI Project
# Each line in the JSONL file represents one evaluation sample

dataset_name = os.environ.get("DATASET_NAME", "seattle-assistant-eval-dataset")
dataset_version = os.environ.get("DATASET_VERSION", "3.0")


In [None]:
dataset = project_client.datasets.upload_file(
    name=dataset_name,
    version=dataset_version,
    file_path=dataset_file,
)

print("Dataset uploaded successfully!")
print(f"Dataset Name: {dataset.name}")
print(f"Dataset Version: {dataset.version}")
print(f"Dataset ID: {dataset.id}")

---

## Part 4: Configure Evaluators

Define which evaluators to run and how to map dataset fields to evaluator inputs.

### Available Evaluator Types

**Quality Evaluators (LLM-Judged):**
- **Coherence**: Logical flow and consistency (requires: query, response)
- **Relevance**: Response appropriateness (requires: query, context, response)
- **Fluency**: Language quality and readability (requires: query, response)
- **Groundedness**: Fidelity to context (requires: context, response)

**Agent Evaluators:**
- **Tool Call Accuracy**: Correctness of tool usage (requires: query, expected_tool_calls, actual_tool_calls)

**Safety Evaluators** (configured separately via Azure AI Content Safety)

### Data Mapping Syntax

Use `${data.<field_name>}` to map dataset fields to evaluator inputs:
```python
data_mapping={
    "query": "${data.query}",
    "response": "${data.response}",
    "context": "${data.context}"
}
```
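
A mistyped field name in a mapping is easy to miss, so it can help to generate mappings and check them against a dataset sample before submitting a job. This is a minimal sketch under the assumption that each evaluator input maps to a dataset column of the same name; `build_data_mapping` and `unresolved_columns` are hypothetical helpers, not SDK functions.

```python
import re

# Matches expressions of the form ${data.<field_name>} and captures the field name.
_PLACEHOLDER = re.compile(r"^\$\{data\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def build_data_mapping(fields):
    """Map each evaluator input to the dataset column of the same name."""
    return {field: f"${{data.{field}}}" for field in fields}


def unresolved_columns(data_mapping, sample):
    """Return mapping expressions that are malformed or reference a column missing from the sample."""
    unresolved = []
    for expression in data_mapping.values():
        match = _PLACEHOLDER.match(expression)
        if match is None or match.group(1) not in sample:
            unresolved.append(expression)
    return unresolved


mapping = build_data_mapping(["query", "response", "context"])
sample = {"query": "q", "response": "r"}  # this sample has no "context" column
print(unresolved_columns(mapping, sample))  # prints ['${data.context}']
```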

### Model Configuration

LLM-judged metrics require a capable model (GPT-4/GPT-4o) to act as the judge:
```python
model_config = {
    "azure_endpoint": "https://...",
    "azure_deployment": "gpt-4o-mini",
    "api_version": "2024-08-01-preview"
}
```

In [None]:
# Note: Evaluator configurations will be created using proper SDK models
# in the next step using EvaluatorConfiguration and EvaluatorIds

# Available evaluator IDs (from EvaluatorIds enum):
# - COHERENCE: Measures how well the response flows logically
# - RELEVANCE: Measures if the response is relevant to the query
# - FLUENCY: Measures language quality
# - GROUNDEDNESS: Measures if response is based on provided context
# - F1_SCORE: Measures overlap between response and ground truth
# - SIMILARITY: Measures semantic similarity
# - TOOL_CALL_ACCURACY: Evaluates tool call correctness (for agent scenarios)

print("Available Evaluators for Cloud-Based Evaluation:")
print("  Quality Metrics:")
print("    - Coherence (query + response)")
print("    - Relevance (query + context + response)")
print("    - Fluency (query + response)")
print("    - Groundedness (context + response)")
print("  Agent Metrics:")
print("    - Tool Call Accuracy (for agent tool usage)")
print("\nNote: Safety evaluators are configured separately through Azure AI Content Safety")

---

## Part 5: Create and Run Evaluation Job

Create an evaluation job that runs asynchronously in Azure's cloud infrastructure.

**Evaluation Configuration:**
- **Display Name**: Human-readable identifier for the evaluation
- **Description**: Purpose and scope of the evaluation
- **Data**: Reference to uploaded dataset by ID
- **Evaluators**: Dictionary of configured evaluators
- **Tags**: Metadata for governance and filtering

**Job Lifecycle:**
1. **Created**: Job submitted to Azure
2. **Running**: Evaluation in progress (distributed execution)
3. **Completed**: All metrics computed successfully
4. **Failed**: Error occurred during evaluation (check logs)
5. **Canceled**: Manually stopped by user

In [None]:
from azure.ai.projects.models import (
    Evaluation,
    EvaluatorConfiguration,
    EvaluatorIds,
    InputDataset
)

# Configure the model to use for LLM-judged metrics
# Note: Pass model_config as a plain dictionary with snake_case keys
# The evaluation service expects this format for proper model config validation
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT_GPT_4o"],
    "azure_deployment": os.environ["AZURE_OPENAI_MODEl_GPT_4o"],
    "api_version": "2024-08-01-preview",
}

# Create evaluator configurations using data_mapping (not column_mapping)
evaluators = {
    "coherence": EvaluatorConfiguration(
        id=EvaluatorIds.COHERENCE,
        data_mapping={
            "response": "${data.response}",
            "query": "${data.query}"
        },
        init_params={"model_config": model_config}
    ),
    "relevance": EvaluatorConfiguration(
        id=EvaluatorIds.RELEVANCE,
        data_mapping={
            "response": "${data.response}",
            "context": "${data.context}",
            "query": "${data.query}"
        },
        init_params={"model_config": model_config}
    ),
    "fluency": EvaluatorConfiguration(
        id=EvaluatorIds.FLUENCY,
        data_mapping={
            "response": "${data.response}",
            "query": "${data.query}"
        },
        init_params={"model_config": model_config}
    ),
    "groundedness": EvaluatorConfiguration(
        id=EvaluatorIds.GROUNDEDNESS,
        data_mapping={
            "response": "${data.response}",
            "context": "${data.context}"
        },
        init_params={"model_config": model_config}
    ),
    "tool_call_accuracy": EvaluatorConfiguration(
        id=EvaluatorIds.TOOL_CALL_ACCURACY,
        data_mapping={
            "query": "${data.query}",
            "expected_tool_calls": "${data.expected_tool_calls}",
            "actual_tool_calls": "${data.actual_tool_calls}"
        },
        init_params={"model_config": model_config}
    ),
}

# Create evaluation object with proper configuration
# The 'data' parameter requires an InputDataset object with the dataset ID
evaluation = Evaluation(
    display_name="seattle-assistant-quality-agent-eval",
    description="Cloud-based evaluation of Seattle assistant responses for quality and tool-call metrics",
    data=InputDataset(id=dataset.id),
    evaluators=evaluators,
    # Optional tags for governance and tracking
    tags={
        "environment": "development",
        "evaluator_type": "quality_agent",
        "assistant": "seattle_tourist",
        "data_classification": "test"
    }
)

# Create evaluation job
# Include model endpoint and API key in headers for LLM-judged metrics
evaluation_response = project_client.evaluations.create(
    evaluation,
    headers={
        "model-endpoint": os.environ["AZURE_OPENAI_ENDPOINT_GPT_4o"],
        "api-key": os.environ["AZURE_OPENAI_API_KEY_GPT_4o"],
    }
)

print("=" * 60)
print("Evaluation Job Created Successfully!")
print("=" * 60)
print(f"Evaluation Name: {evaluation_response.name}")
print(f"Status: {evaluation_response.status}")
print("=" * 60)

---

## Part 6: Monitor Job Status

Poll the evaluation job status until completion.

**Monitoring Considerations:**
- **Duration**: Depends on dataset size and number of evaluators
- **Polling Interval**: 5-10 seconds is reasonable for most jobs
- **Timeout**: Set a maximum wait time based on expected duration
- **Error Handling**: Check for failure status and retrieve error details

**Typical Duration Estimates:**
- 10 samples, 3 evaluators: ~30-60 seconds
- 50 samples, 5 evaluators: ~2-3 minutes
- 100+ samples, 5+ evaluators: ~5-10 minutes

In [None]:
import time

# Poll for job completion
max_attempts = 60  # Maximum 5 minutes (60 * 5 seconds)
attempt = 0

print("Polling evaluation job status...")
print("-" * 60)

while attempt < max_attempts:
    evaluation_response = project_client.evaluations.get(evaluation_response.name)
    status = evaluation_response.status
    
    print(f"[Attempt {attempt + 1}/{max_attempts}] Status: {status}")
    
    if status in ["Completed", "Failed", "Canceled"]:
        break
    
    time.sleep(5)  # Wait 5 seconds before next check
    attempt += 1

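print("-" * 60)
print(f"Final Status: {evaluation_response.status}")

# Distinguish a timeout from a finished job
if evaluation_response.status not in ["Completed", "Failed", "Canceled"]:
    print(f"Stopped polling after {max_attempts} attempts; the job may still be running.")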

if hasattr(evaluation_response, 'error') and evaluation_response.error:
    print(f"Error: {evaluation_response.error}")
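
The polling loop can also be written as a reusable helper with an explicit timeout. The sketch below is one possible shape, not an SDK feature: `poll_until_terminal` is a hypothetical function that takes any zero-argument callable returning the current status, and it assumes the same three terminal status strings checked in the loop above.

```python
import time

# Terminal statuses as checked in this notebook's polling loop.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def poll_until_terminal(get_status, interval_seconds=5.0, timeout_seconds=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal status or the timeout elapses.

    Returns a (status, timed_out) tuple. The sleep and clock parameters exist
    so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status, False
        if clock() >= deadline:
            return status, True
        sleep(interval_seconds)
```

With the client from this notebook, the callable could be `lambda: project_client.evaluations.get(evaluation_response.name).status`. Separating the status lookup from the loop keeps the timeout logic testable without a live job.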

---

## Part 7: View and Analyze Results

Retrieve evaluation results and access detailed metrics in Azure AI Foundry Studio.

**Results Structure:**
- **Summary Metrics**: Aggregate scores across all samples
- **Instance Results**: Per-sample detailed scores (downloadable as JSONL)
- **Studio URL**: Direct link to rich visualization dashboard

**Accessing Detailed Results:**
1. Use the Studio URL from the evaluation response
2. Navigate to the "Evaluations" tab in Azure AI Foundry
3. Download `instance_results.jsonl` for programmatic analysis
4. View per-sample scores, reasoning, and distributions

In [None]:
from pprint import pprint

# Get evaluation details
evaluation_response = project_client.evaluations.get(evaluation_response.name)

print("=" * 60)
print("EVALUATION RESULTS")
print("=" * 60)
print(f"Evaluation Name: {evaluation_response.name}")
print(f"Status: {evaluation_response.status}")
print(f"Display Name: {evaluation_response.display_name}")

# Extract Studio URL from properties
studio_url = None
if hasattr(evaluation_response, 'properties') and evaluation_response.properties:
    studio_url = evaluation_response.properties.get('AiStudioEvaluationUri')

if studio_url:
    print("\n" + "=" * 60)
    print("View detailed results in Azure AI Foundry Studio:")
    print(studio_url)
    print("=" * 60)
else:
    print("\nStudio URL not available.")

# Try to get metrics from properties or results
if hasattr(evaluation_response, 'properties') and evaluation_response.properties:
    # Check if there are any metric-related properties
    metric_keys = [k for k in evaluation_response.properties.keys() if 'metric' in k.lower() or 'score' in k.lower()]
    if metric_keys:
        print("\n" + "=" * 60)
        print("METRICS FROM PROPERTIES")
        print("=" * 60)
        for key in metric_keys:
            print(f"{key}: {evaluation_response.properties[key]}")

# If results are available in the evaluation object
if hasattr(evaluation_response, 'results') and evaluation_response.results:
    print("\n" + "=" * 60)
    print("SUMMARY METRICS")
    print("=" * 60)
    pprint(evaluation_response.results)

---

## Summary and Best Practices

### Key Takeaways

1. **Cloud-Based Advantages**: Scalability, team collaboration, centralized governance
2. **Dataset Versioning**: Track evaluation history with versioned datasets
3. **Multiple Evaluators**: Combine quality, safety, and agent metrics for comprehensive assessment
4. **Asynchronous Execution**: Jobs run in background; poll for completion
5. **Rich Visualization**: Azure AI Foundry Studio provides detailed analysis tools

### Best Practices

#### 1. Dataset Preparation
- ✅ Use consistent field names across all samples
- ✅ Validate JSONL format (one JSON object per line)
- ✅ Include all required fields for your chosen evaluators
- ✅ Start with small datasets (10-20 samples) for initial testing
- ✅ Add diverse scenarios to capture edge cases

#### 2. Evaluator Configuration
- ✅ Map dataset fields correctly using `${data.<field_name>}` syntax
- ✅ Choose evaluators appropriate for your use case (quality vs. safety vs. agent)
- ✅ Use GPT-4 or GPT-4o for LLM-judged metrics (better reasoning)
- ✅ Ensure sufficient API quota for evaluation workload
- ✅ Test evaluator configs with small datasets first

#### 3. Model Configuration
- ✅ Use `azure_endpoint`, `azure_deployment`, `api_version` format
- ✅ Pass model_config as dictionary with snake_case keys
- ✅ Verify model deployment is active and has quota
- ✅ Use same model version across evaluation runs for consistency

#### 4. Governance and Compliance
- ✅ Use tags to classify evaluations by:
  - Environment (development, staging, production)
  - Data classification (test, internal, customer)
  - Purpose (debugging, validation, benchmarking)
  - Team or project identifier
- ✅ Document evaluation purposes and expected outcomes
- ✅ Track data sensitivity levels for compliance
- ✅ Archive evaluation results for audit trails
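
Tag values are only useful for filtering if they are spelled consistently. One way to enforce that is to build the tag dictionary through a small function that rejects unknown values. This is a sketch: `build_tags` is a hypothetical helper, the allowed values are the ones listed above, and the `purpose` and `team` keys are illustrative names rather than keys the service requires.

```python
# Allowed values taken from the classification list above.
ALLOWED_TAG_VALUES = {
    "environment": {"development", "staging", "production"},
    "data_classification": {"test", "internal", "customer"},
    "purpose": {"debugging", "validation", "benchmarking"},
}


def build_tags(environment, data_classification, purpose, team=None):
    """Return a tags dict, raising ValueError for any value outside the allowed sets."""
    tags = {
        "environment": environment,
        "data_classification": data_classification,
        "purpose": purpose,
    }
    for key, value in tags.items():
        if value not in ALLOWED_TAG_VALUES[key]:
            allowed = sorted(ALLOWED_TAG_VALUES[key])
            raise ValueError(f"{key}={value!r} is not one of {allowed}")
    if team:
        tags["team"] = team  # free-form team or project identifier
    return tags


print(build_tags("development", "test", "validation"))
```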

#### 5. Results Analysis
- ✅ Review aggregate metrics in evaluation response
- ✅ Use Studio URL for detailed per-sample analysis
- ✅ Download `instance_results.jsonl` for programmatic analysis
- ✅ Track metrics over time to measure improvements
- ✅ Identify low-scoring samples for targeted debugging
- ✅ Use results to guide prompt engineering or model fine-tuning

### Common Evaluator Data Mappings Reference

| Evaluator | Required Fields | Mapping |
|-----------|----------------|---------|
| **Coherence** | query, response | `{"query": "${data.query}", "response": "${data.response}"}` |
| **Fluency** | query, response | `{"query": "${data.query}", "response": "${data.response}"}` |
| **Relevance** | query, context, response | `{"query": "${data.query}", "context": "${data.context}", "response": "${data.response}"}` |
| **Groundedness** | context, response | `{"context": "${data.context}", "response": "${data.response}"}` |
| **Tool Call Accuracy** | query, expected_tool_calls, actual_tool_calls | `{"query": "${data.query}", "expected_tool_calls": "${data.expected_tool_calls}", "actual_tool_calls": "${data.actual_tool_calls}"}` |

### Quality Thresholds (Suggested)

| Metric | Excellent | Good | Acceptable | Needs Improvement |
|--------|-----------|------|------------|-------------------|
| Coherence | 4.5-5.0 | 4.0-4.4 | 3.5-3.9 | < 3.5 |
| Relevance | 4.5-5.0 | 4.0-4.4 | 3.5-3.9 | < 3.5 |
| Fluency | 4.5-5.0 | 4.0-4.4 | 3.5-3.9 | < 3.5 |
| Groundedness | 4.5-5.0 | 4.0-4.4 | 3.5-3.9 | < 3.5 |
| Tool Call Accuracy | > 0.95 | 0.90-0.95 | 0.80-0.89 | < 0.80 |
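
The suggested thresholds can be applied in code, for example to label aggregate scores in a report. The sketch below encodes the table as lower bounds; `rate_score` is a hypothetical helper, and a score that falls between two printed ranges (such as 4.45) is assigned to the lower band.

```python
# (label, lower bound, inclusive) triples derived from the suggested thresholds table.
BANDS = {
    "quality": [("Excellent", 4.5, True), ("Good", 4.0, True), ("Acceptable", 3.5, True)],
    # The table lists "> 0.95" for Excellent, so that bound is exclusive.
    "tool_call_accuracy": [("Excellent", 0.95, False), ("Good", 0.90, True), ("Acceptable", 0.80, True)],
}


def rate_score(metric, score):
    """Return the threshold label for a score; metrics other than tool_call_accuracy use the 1-5 bands."""
    bands = BANDS.get(metric, BANDS["quality"])
    for label, lower_bound, inclusive in bands:
        if score > lower_bound or (inclusive and score == lower_bound):
            return label
    return "Needs Improvement"


print(rate_score("coherence", 4.2))            # prints Good
print(rate_score("tool_call_accuracy", 0.85))  # prints Acceptable
```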

### Troubleshooting Common Issues

| Issue | Possible Cause | Solution |
|-------|---------------|----------|
| Job fails immediately | Invalid data mapping | Verify field names match dataset |
| Low groundedness scores | Response hallucinates | Improve retrieval or add context constraints |
| Evaluation timeout | Large dataset | Split into smaller batches |
| Missing results | Job still running | Poll longer or check Studio UI |
| Tool call accuracy = 0 | Empty tool call fields | Ensure expected/actual_tool_calls are populated |

### Next Steps

1. **Iterate on Prompts**: Use low-scoring samples to refine system prompts
2. **Expand Dataset**: Add more diverse scenarios based on production use cases
3. **Automate Evaluations**: Integrate into CI/CD pipelines for continuous quality monitoring
4. **Compare Versions**: Run evaluations on different model versions to measure improvements
5. **Custom Evaluators**: Build domain-specific evaluators for specialized metrics

---

## Appendix: Understanding Evaluation Results Structure

### Instance Results Format

The `instance_results.jsonl` file contains per-sample detailed metrics. Each line is a JSON object with:

**Input Fields** (from your dataset):
```json
{
  "query": "What's the weather like in Seattle?",
  "context": "Seattle typically has rainy weather...",
  "response": "The current weather in Seattle is..."
}
```

**Evaluator Scores**:
```json
{
  "coherence.score": 4.5,
  "coherence.reason": "Response flows logically...",
  "relevance.score": 5.0,
  "relevance.reason": "Directly addresses the query...",
  "fluency.score": 4.0,
  "groundedness.score": 4.5,
  "tool_call_accuracy.score": 1.0
}
```

### Score Interpretation

| Metric | Scale | Interpretation |
|--------|-------|----------------|
| Coherence, Relevance, Fluency, Groundedness | 1-5 | 1=poor, 3=acceptable, 5=excellent |
| Tool Call Accuracy | 0-1 | 0=incorrect, 1=correct (binary) |

### Programmatic Analysis Example

After downloading `instance_results.jsonl`:

```python
import pandas as pd
import json

# Load results
results = []
with open('instance_results.jsonl', 'r') as f:
    for line in f:
        results.append(json.loads(line))

df = pd.DataFrame(results)

# Identify low-performing samples
low_relevance = df[df['relevance.score'] < 3.0]
print(f"Samples with low relevance: {len(low_relevance)}")

# Calculate aggregate metrics
print(f"Average coherence: {df['coherence.score'].mean():.2f}")
print(f"Average relevance: {df['relevance.score'].mean():.2f}")

# Check metric correlation
correlation = df[['coherence.score', 'relevance.score']].corr()
print(correlation)

# Find samples needing improvement
needs_improvement = df[
    (df['coherence.score'] < 3.5) | 
    (df['relevance.score'] < 3.5)
]
```
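
If pandas is not available, the same aggregates can be computed with the standard library. The sketch below averages every numeric field whose name ends in `.score`, assuming the `<evaluator>.score` naming shown in the sample above; the actual column names in a downloaded file may differ. `summarize_scores` is a hypothetical helper that accepts any iterable of JSONL lines, including an open file.

```python
import json
from statistics import mean


def summarize_scores(lines):
    """Average each numeric '<evaluator>.score' field across JSONL lines, rounded to 2 decimals."""
    values = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.endswith(".score") and is_number:
                values.setdefault(key, []).append(value)
    return {key: round(mean(scores), 2) for key, scores in sorted(values.items())}
```

A typical call would be `with open('instance_results.jsonl') as f: print(summarize_scores(f))`. Missing or null scores are skipped rather than counted as zero, so each average reflects only the samples that were actually scored.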

---

## Additional Resources

### Official Documentation
- [Azure AI Evaluation SDK Documentation](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-sdk)
- [Evaluator Types and Metrics Reference](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics)
- [Azure AI Foundry Studio](https://ai.azure.com)
- [Evaluation Results Analysis Guide](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-results)
- [Cloud-Based Evaluation Tutorial](https://learn.microsoft.com/azure/ai-studio/how-to/evaluate-cloud)

### Related Notebooks
- `03_rag_evaluation.ipynb`: Local RAG-specific evaluators
- `02_simulator_eval.ipynb`: Agent conversation testing
- `04_agent_evaluation.ipynb`: Agent-specific metrics

### Azure AI Project Setup
- [Create an Azure AI Project](https://learn.microsoft.com/azure/ai-studio/how-to/create-projects)
- [Configure Azure OpenAI](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource)
- [Manage Quotas and Limits](https://learn.microsoft.com/azure/ai-studio/how-to/quota)

### Code Samples
- [Azure AI Evaluation Samples on GitHub](https://github.com/Azure-Samples/azureai-samples)
- [Evaluation SDK Examples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation)