1. Introduction
 
The Azure AI Evaluation SDK lets you evaluate Generative AI applications quantitatively and qualitatively, both locally and at scale. It includes a variety of built-in evaluators you can use with your test data, and supports single-turn and multi-turn conversations as well as multi-modal data (e.g., images). This notebook is a short walkthrough of the SDK's core features.

2. Environment Setup
 
Make sure you have access to the necessary Azure OpenAI resources. This notebook reads AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY from the environment. Set them in your system or, for demonstration, load them in the notebook from a credentials.env file:

In [7]:
import os
import openai
from openai import AzureOpenAI
from dotenv import load_dotenv

# Load the Azure OpenAI credentials from a local .env file
load_dotenv("credentials.env")

True
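
Later cells read AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY with os.environ[...], which raises a KeyError if either is missing. A quick check up front can make that failure easier to diagnose. The sketch below is one way to do it; missing_env_vars is a hypothetical helper for illustration, not part of the SDK.

```python
import os

# Variable names taken from the Model Configuration cell in this notebook.
REQUIRED_VARS = ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dict as the second argument makes the helper easy to try out without touching the real environment.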

3. SDK Installation
 
Install the Azure AI Evaluation SDK:

In [9]:
!pip install azure-ai-evaluation  



In [10]:
pip install azure-ai-projects

Note: you may need to restart the kernel to use updated packages.


4. Model Configuration
 
AI-assisted evaluators (all except some safety evaluators) need a model configuration.
It specifies which GPT deployment will be used as the judge.

In [8]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration  
  
model_config = AzureOpenAIModelConfiguration(  
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  
    api_key=os.environ["AZURE_OPENAI_API_KEY"],  
    azure_deployment="gpt-4o",  
    api_version="2024-02-15-preview",
)  

5. Running Built-in Evaluators (Single Row)
 
Let's run an evaluator on a simple query-response pair using the RelevanceEvaluator:

In [4]:
from azure.ai.evaluation import RelevanceEvaluator  
  
query = "What is the capital of France?"  
response = "Paris."  
  
relevance_eval = RelevanceEvaluator(model_config)  
result = relevance_eval(query=query, response=response)  
print(result)  

{'relevance': 4.0, 'gpt_relevance': 4.0, 'relevance_reason': 'The RESPONSE fully and accurately answers the QUERY, providing all necessary information without additional insights or elaboration.', 'relevance_result': 'pass', 'relevance_threshold': 3}
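
The result is a plain dictionary, so it can be checked programmatically. The sketch below copies the values printed above and compares the score with its threshold. It assumes the '<metric>_threshold' key naming seen in that output and that a higher score is better, which may not hold for every evaluator. meets_threshold is a hypothetical helper, not an SDK function.

```python
# Values copied from the output above (the reason text is omitted for brevity).
result = {
    "relevance": 4.0,
    "gpt_relevance": 4.0,
    "relevance_result": "pass",
    "relevance_threshold": 3,
}


def meets_threshold(result, metric):
    """Compare a score with its threshold, assuming '<metric>_threshold' naming."""
    return result[metric] >= result[f"{metric}_threshold"]


print(meets_threshold(result, "relevance"))  # True, since 4.0 >= 3
```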


Supported Built-in Evaluators

General purpose: CoherenceEvaluator, FluencyEvaluator, QAEvaluator, etc.
Similarity: SimilarityEvaluator, F1ScoreEvaluator, BleuScoreEvaluator, etc.
RAG: GroundednessEvaluator, RetrievalEvaluator, etc.
Safety: ViolenceEvaluator, ContentSafetyEvaluator, etc.
See the full list in the Azure documentation.
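
To give a feel for what a similarity metric measures, the sketch below computes a token-overlap F1 score between a response and a ground truth. This is an illustration of the general technique only. It is not the SDK's F1ScoreEvaluator implementation, and the tokenization (lowercasing and whitespace splitting) is an assumption made for simplicity.

```python
from collections import Counter


def token_f1(response, ground_truth):
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not response_tokens or not truth_tokens:
        return 0.0
    # The multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris is the capital", "Paris"))
```

Here precision is 1/4 and recall is 1/1, so a verbose but correct response is penalized relative to an exact match.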

6. Batch Evaluation with .jsonl Dataset
 
Prepare your dataset as a .jsonl file (JSON Lines):

Example: data.jsonl

{"query": "What is the capital of France?", "response": "Paris."}  
{"query": "What atoms compose water?", "response": "Hydrogen and oxygen."}  
{"query": "What color is my shirt?", "response": "Blue."}  

You can now run evaluators over this dataset:

In [7]:
from azure.ai.evaluation import evaluate, GroundednessEvaluator  
  
groundedness_eval = GroundednessEvaluator(model_config)  
  
result = evaluate(  
    data="data.jsonl",  
    evaluators={"groundedness": groundedness_eval},  
    output_path="./eval_results.json"    # Output is optional  
)  
import json  
print(json.dumps(result['metrics'], indent=2))  

[2025-06-12 14:24:10 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_groundedness_20250612_142410_422891, log path: /home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250612_142410_422891/logs.txt


2025-06-12 14:24:10 +0000    2985 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-12 14:24:10 +0000    2985 execution.bulk     INFO     Finished 3 / 3 lines.
2025-06-12 14:24:10 +0000    2985 execution.bulk     INFO     Average execution time for completed lines: 0.0 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-06-12 14:24:10 +0000    2985 execution          ERROR    3/3 flow run failed, indexes: [1,2,0], exception of index 1: (UserError) GroundednessEvaluator: Either 'conversation' or individual inputs must be provided.

Run name: "azure_ai_evaluation_evaluators_groundedness_20250612_142410_422891"
Run status: "Completed"
Start time: "2025-06-12 14:24:10.429445+00:00"
Duration: "0:00:01.077503"
Output path: "/home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250612_142410_422891"


{
    "groundedness": {
        "status": "Completed with Errors",
        "duration":

Data Requirements
 

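Each line in the .jsonl file must be a valid JSON object.
Key names must match the evaluator's expected inputs (query, response, context, etc.). The run above failed on all three lines, most likely because data.jsonl has no context field, which GroundednessEvaluator needs alongside response.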
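
Problems like the failed run above can be caught before calling evaluate by checking the dataset against these two requirements. The sketch below is one way to do that; find_invalid_rows is a hypothetical helper, and the list of required keys for a given evaluator is something you need to look up, not something this code discovers.

```python
import json


def find_invalid_rows(path, required_keys):
    """Return (line_number, problem) pairs for rows that fail the checks."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in row]
            if missing:
                problems.append((line_number, "missing keys: " + ", ".join(missing)))
    return problems
```

For example, find_invalid_rows("data.jsonl", ["response", "context"]) would flag every row of the dataset above as missing context, assuming those are the inputs the groundedness evaluator expects.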

7. Evaluating Conversations
 
Conversation Example:

In [None]:
from azure.ai.evaluation import GroundednessEvaluator  
  
conversation = {  
    "messages": [  
        {"content": "Which tent is the most waterproof?", "role": "user"},  
        {  
            "content": "The Alpine Explorer Tent is the most waterproof",  
            "role": "assistant",  
            "context": "From our product list the Alpine Explorer Tent is the most waterproof.",  
        },  
        {"content": "How much does it cost?", "role": "user"},  
        {  
            "content": "The Alpine Explorer Tent is $120.",  
            "role": "assistant",  
            "context": None,  
        },  
    ]  
}  
  
groundedness_eval = GroundednessEvaluator(model_config)  
score = groundedness_eval(conversation=conversation)  
  
import json  
print(json.dumps(score, indent=2))  

JSONL Format for Conversations:

In [None]:
{"conversation": { "messages": [...] }}  
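
Each conversation occupies one line of the file, wrapped under a "conversation" key. The sketch below serializes a shortened version of the conversation from the earlier cell into that shape. conversation_to_jsonl_line is a hypothetical helper for illustration.

```python
import json

# A shortened version of the conversation from the cell above.
conversation = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "From our product list the Alpine Explorer Tent is the most waterproof.",
        },
    ]
}


def conversation_to_jsonl_line(conversation):
    """Wrap a conversation under the 'conversation' key and serialize it to one line."""
    return json.dumps({"conversation": conversation})


line = conversation_to_jsonl_line(conversation)
print(line)
```

Writing one such line per conversation, each followed by a newline, produces a file in the format shown above.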

8. Using Composite Evaluators
 
Composite evaluators group several metrics under one evaluator:

QA Evaluator Example (works on query-response pairs):

In [7]:
evaluation_data = [
    {
        "query": "Who invented the lightbulb?",
        "response": "Thomas Edison invented the first commercially successful incandescent light bulb.",
        "context": "In 1879, Thomas Edison created the first commercially successful incandescent light bulb."
    },
    # Add more entries as needed
]


In [8]:
import json

with open("evaluation_data.jsonl", "w") as f:
    for entry in evaluation_data:
        f.write(json.dumps(entry) + "\n")


In [None]:
from azure.ai.evaluation import evaluate, QAEvaluator

# Initialize your evaluator
qa_evaluator = QAEvaluator(model_config)

# Run the evaluation
result = evaluate(
    data="evaluation_data.jsonl",
    evaluators={"qa": qa_evaluator},
    evaluation_name="RAG Evaluation Demo"
)

print(result["metrics"])


In [3]:
!az login

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AZT3388RM to authenticate.

Retrieving tenants and subscriptions for the selection...

[Tenant and subscription selection]

No     Subscription name                  Subscription ID                       Tenant
-----  ---------------------------------  ------------------------------------  --------
[1] *  ME-MngEnvMCAP973053-yelizkilinc-1  00fd275a-dc44-46e0-81a6-ebc734ec11de  Contoso

The default is marked with an *; the default tenant is 'Contoso' and subscription is 'ME-MngEnvMCAP973053-yelizkilinc-1' (00fd275a-dc44-46e0-81a6-ebc734ec11de).

Select a subscription and tenant (Type a number or Enter for no changes): ^C

Groundedness Evaluator Example (results logged to an Azure AI project):

In [10]:
from azure.ai.evaluation import evaluate, GroundednessEvaluator

# Define your Azure AI project details (replace the placeholder values)
azure_ai_project = {
    "subscription_id": "XXX",
    "project_name": "tracing",
    "resource_group_name": "rg-admin-3919_ai",
}

# Initialize the evaluator
groundedness_evaluator = GroundednessEvaluator(model_config)

# Run the evaluation
result = evaluate(
    data="evaluation_data.jsonl",
    evaluators={"groundedness": groundedness_evaluator},
    evaluation_name="RAG Groundedness Evaluation",
    azure_ai_project=azure_ai_project
)

# Output the evaluation metrics and the link to Azure AI Foundry
print(result["metrics"])
print(f"View results in Azure AI Foundry: {result.get('studio_url')}")


[2025-06-12 16:38:57 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_groundedness_20250612_163857_394168, log path: /home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250612_163857_394168/logs.txt


2025-06-12 16:38:57 +0000    3055 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-12 16:39:00 +0000    3055 execution.bulk     INFO     Finished 1 / 1 lines.
2025-06-12 16:39:00 +0000    3055 execution.bulk     INFO     Average execution time for completed lines: 3.32 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_groundedness_20250612_163857_394168"
Run status: "Completed"
Start time: "2025-06-12 16:38:57.401317+00:00"
Duration: "0:00:04.073265"
Output path: "/home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250612_163857_394168"


{
    "groundedness": {
        "status": "Completed",
        "duration": "0:00:04.073265",
        "completed_lines": 1,
        "failed_lines": 0,
        "log_path": "/home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250612_163857_394168"
    }
}


{'groundedness.ground
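
The truncated output above suggests that metric keys are prefixed with the evaluator alias passed to evaluate (here "groundedness."). When several evaluators run together, grouping the metrics by that prefix can make them easier to read. The sketch below assumes an 'alias.metric' key format. The metric names and values are made up for illustration, and group_by_evaluator is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical metric names and values, for illustration only.
metrics = {
    "groundedness.example_metric": 4.0,
    "qa.example_metric": 0.5,
    "qa.another_metric": 1.0,
}


def group_by_evaluator(metrics):
    """Group 'alias.metric' keys into {alias: {metric: value}}."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        # Split on the first dot only, so metric names may contain dots.
        alias, _, name = key.partition(".")
        grouped[alias][name] = value
    return dict(grouped)


print(group_by_evaluator(metrics))
```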

9. Tracking Evaluations in Azure AI Project
 
You can log evaluation runs to your Azure AI project for easier tracking:

Example with several evaluators in one run:

In [None]:
pip install azure-ai-evaluation azure-identity


In [11]:
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RetrievalEvaluator, ViolenceEvaluator, BleuScoreEvaluator
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Define your Azure AI project details (replace the placeholder values)
azure_ai_project = {
    "subscription_id": "XXX",
    "project_name": "tracing",
    "resource_group_name": "rg-admin-3919_ai",
}

# Initialize evaluators
evaluators = {
    "groundedness": GroundednessEvaluator(model_config),
    "retrieval": RetrievalEvaluator(model_config),
    "violence": ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project),
    "bleu": BleuScoreEvaluator(threshold=0.5)
}

# Run the evaluation
result = evaluate(
    data="evaluation_data_new.jsonl",
    evaluators=evaluators,
    evaluation_name="RAG Comprehensive Evaluation",
    azure_ai_project=azure_ai_project
)

# Output the evaluation metrics and the link to Azure AI Foundry
print(result["metrics"])
print(f"View results in Azure AI Foundry: {result.get('studio_url')}")


[2025-06-12 16:39:14 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_groundedness_20250612_163914_690072, log path: /home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_groundedness_20250612_163914_690072/logs.txt
[2025-06-12 16:39:14 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_bleu_20250612_163914_691448, log path: /home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_bleu_20250612_163914_691448/logs.txt
[2025-06-12 16:39:14 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_violence_20250612_163914_690926, log path: /home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_violence_20250612_163914_690926/logs.txt
[2025-06-12 16:39:14 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_retrieval_20250612_163914_690432, log path: /

2025-06-12 16:39:14 +0000    3055 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-12 16:39:15 +0000    3055 execution.bulk     INFO     Finished 2 / 2 lines.
2025-06-12 16:39:15 +0000    3055 execution.bulk     INFO     Average execution time for completed lines: 0.36 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_bleu_20250612_163914_691448"
Run status: "Completed"
Start time: "2025-06-12 16:39:14.708369+00:00"
Duration: "0:00:01.193616"
Output path: "/home/azureuser/.promptflow/.runs/azure_ai_evaluation_evaluators_bleu_20250612_163914_691448"

2025-06-12 16:39:17 +0000    3055 execution.bulk     INFO     Finished 1 / 2 lines.
2025-06-12 16:39:17 +0000    3055 execution.bulk     INFO     Average execution time for completed lines: 2.69 seconds. Estimated time for incomplete lines: 2.69 seconds.
2025-06-12 16:39:17 +0000    3055 execution.bulk     INFO     Fini