## Azure OpenAI evaluators

The Azure OpenAI graders are a new set of evaluation graders in the Azure AI Foundry SDK for evaluating AI models and their outputs. The four graders (Label grader, String checker, Text similarity, and General grader) can be run locally or remotely, and each one assesses a different aspect of a model's output.

> https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/azure-openai-graders

In [1]:
import datetime
import os
import json
import sys

from azure.ai.evaluation import (
    AzureOpenAIGrader,
    AzureOpenAILabelGrader,
    AzureOpenAIModelConfiguration,
    AzureOpenAIStringCheckGrader,
    AzureOpenAITextSimilarityGrader,
    evaluate,
)
from dotenv import load_dotenv
from openai.types.graders import StringCheckGrader
from pprint import pprint

In [2]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [3]:
print(f"Today is {datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 26-Jun-2025 12:35:19


In [4]:
load_dotenv("azure.env")

endpoint = os.getenv("endpoint")
key = os.getenv("key")
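
`os.getenv` returns `None` when a variable is missing, so a typo in `azure.env` would only surface later, when the model configuration is first used. A minimal sketch of an early check, assuming the same variable names `endpoint` and `key` as above; `require_env` is a hypothetical helper, not part of any SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both "not set" and "set to an empty string".
        raise RuntimeError(f"Environment variable {name!r} is not set; check azure.env")
    return value
```

With this helper, `endpoint = require_env("endpoint")` and `key = require_env("key")` could replace the two `os.getenv` calls.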

In [5]:
model = "gpt-4.1"

In [6]:
model_config = AzureOpenAIModelConfiguration(azure_endpoint=endpoint,
                                             api_key=key,
                                             azure_deployment=model,
                                             api_version="2024-10-21")

## Label grader

AzureOpenAILabelGrader uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen.

In [7]:
json_file = "sample.jsonl"

with open(json_file, 'r', encoding='utf-8') as file:
    for line in file:
        data = json.loads(line)
        print(json.dumps(data, indent=5))

{
     "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
     "ground_truth": "Choosing an in-network provider helps you save money and ensures better, more personalized care. [Northwind_Health_Plus_Benefits_Details-3.pdf]",
     "response": "Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer better coverage, and support continuity of care, leading to more effective and personalized treatment. [Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]"
}
{
     "query": "What should you do when choosing an in-network provider for your health care needs?",
     "ground_truth": "Check with Northwind Health Plus to confirm the provider is in-network, as this helps reduce costs.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
     "response": "To choose an in-network provider, confirm they are part of your plan usin
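
The graders below refer to the `query`, `ground_truth`, and `response` columns of each row, so it can help to confirm that every line of the JSONL file carries all three before running an evaluation. A minimal sketch under that assumption; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced for illustration:

```python
import json

# Columns referenced by the graders in this notebook.
REQUIRED_COLUMNS = ("query", "ground_truth", "response")


def missing_columns(jsonl_text: str) -> list[tuple[int, list[str]]]:
    """Return (line_number, missing_column_names) for every incomplete row."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        missing = [name for name in REQUIRED_COLUMNS if name not in row]
        if missing:
            problems.append((number, missing))
    return problems
```

An empty list means every row is complete; for the file above the call would be `missing_columns(open(json_file, encoding="utf-8").read())`.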

In [8]:
#  Evaluation criteria: Determine if the response column contains texts that are "too short", "just right", or "too long" and pass if it is "just right"
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[{
        "content": "{{item.response}}",
        "role": "user"
    }, {
        "content":
        "Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.",
        "role": "user",
        "type": "message"
    }],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model=model,
    name="label",
)

label_grader_evaluation = evaluate(
    data=json_file,
    evaluators={"label": label_grader},
)

Class AzureOpenAILabelGrader: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AzureOpenAIGrader: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Method _begin_aoai_evaluation: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
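
The length rule in the prompt above is deterministic, so the labels returned by the model can be spot-checked against a plain character count. A minimal local sketch of that rule as the prompt states it (spaces count; under 500 is too short, over 600 is too long); `length_label` is a hypothetical helper and is not how the grader itself decides:

```python
def length_label(text: str) -> str:
    """Classify text by character count, spaces included, per the prompt's rule."""
    n = len(text)
    if n > 600:
        return "too long"
    if n < 500:
        return "too short"
    return "just right"  # 500 to 600 characters inclusive
```

Comparing `length_label(row["response"])` with the label the grader reports for the same row shows where the model's judgement and the literal rule disagree.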


In [9]:
label_grader

<azure.ai.evaluation._aoai.label_grader.AzureOpenAILabelGrader at 0x7f72d9a27ee0>

In [10]:
pprint(label_grader_evaluation, width=150)

{'metrics': {'label.pass_rate': 0.2},
 'rows': [{'ground_truth': 'Choosing an in-network provider helps you save money and ensures better, more personalized care. '
                           '[Northwind_Health_Plus_Benefits_Details-3.pdf]',
           'outputs.label.label_result': 'fail',
           'outputs.label.passed': False,
           'outputs.label.sample': {'error': None,
                                    'finish_reason': 'stop',
                                    'input': [{'content': 'The limitation of in-network providers is that they may not accept the amount of payment '
                                                          'offered by Northwind Health, which means you may be responsible for a greater portion of '
                                                          'the cost [Northwind_Standard_Benefits_Details.pdf]. Additionally, out-of-network '
                                                          'providers may not offer additional services or discoun

## String checker

AzureOpenAIStringCheckGrader compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. It is useful for flexible text validation and pattern matching.

In [11]:
# Evaluation criteria: Pass if the query column contains "What is"
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="contains what is",
    # "eq": equal, "ne": not equal, "like": contains, "ilike": case-insensitive contains
    operation="like",
    reference="What is",
)

string_grader_evaluation = evaluate(
    data=json_file,
    evaluators={"string": string_grader},
)

Class AzureOpenAIStringCheckGrader: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
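
The four operations named in the comment above can be written out locally to make their intended semantics concrete. This is a sketch that assumes `like` means case-sensitive substring containment and `ilike` its case-insensitive variant, as the comment describes; `string_check` is a hypothetical function, not the service implementation:

```python
def string_check(value: str, reference: str, operation: str) -> bool:
    """Apply one string-check operation to an input value and a reference."""
    if operation == "eq":
        return value == reference
    if operation == "ne":
        return value != reference
    if operation == "like":
        return reference in value  # case-sensitive containment
    if operation == "ilike":
        return reference.lower() in value.lower()  # case-insensitive containment
    raise ValueError(f"Unknown operation: {operation!r}")
```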


In [12]:
string_grader

<azure.ai.evaluation._aoai.string_check_grader.AzureOpenAIStringCheckGrader at 0x7f72d9a26b90>

In [13]:
pprint(string_grader_evaluation, width=150)

{'metrics': {'string.pass_rate': 0.4},
 'rows': [{'ground_truth': 'Choosing an in-network provider helps you save money and ensures better, more personalized care. '
                           '[Northwind_Health_Plus_Benefits_Details-3.pdf]',
           'outputs.string.passed': False,
           'outputs.string.sample': None,
           'outputs.string.score': 0.0,
           'outputs.string.string_result': 'fail',
           'query': 'What is the importance of choosing the right provider in getting the most value out of your health insurance plan?',
           'response': 'Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer '
                       'better coverage, and support continuity of care, leading to more effective and personalized treatment. '
                       '[Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]'},
          {'ground_truth': 'Check with Northwind Heal

## Text similarity

AzureOpenAITextSimilarityGrader evaluates how closely input text matches a reference value using similarity metrics such as fuzzy_match, BLEU, ROUGE, or METEOR. It is useful for assessing text quality or semantic closeness.

In [14]:
# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "fuzzy_match"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    # Supported evaluation metrics include: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine"
    evaluation_metric="fuzzy_match",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=json_file,
    evaluators={"similarity": sim_grader},
)

Class AzureOpenAITextSimilarityGrader: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
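
The pass/fail decision reduces to comparing a similarity score in [0, 1] with `pass_threshold`. The sketch below illustrates that comparison with the standard library's `difflib.SequenceMatcher` as a stand-in scorer. This is an assumption made for illustration: the algorithm behind the service's `fuzzy_match` metric is not described in this notebook, so the scores will generally differ from the ones reported by the grader. `similarity_passes` is a hypothetical helper:

```python
from difflib import SequenceMatcher


def similarity_passes(response: str, reference: str, pass_threshold: float = 0.5):
    """Return (score, passed) using difflib's ratio as a stand-in similarity score."""
    score = SequenceMatcher(None, response, reference).ratio()
    # A row passes when the score reaches the threshold (score >= threshold).
    return score, score >= pass_threshold
```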


In [15]:
pprint(sim_grader_evaluation, width=150)

{'metrics': {'similarity.pass_rate': 0.4},
 'rows': [{'ground_truth': 'Choosing an in-network provider helps you save money and ensures better, more personalized care. '
                           '[Northwind_Health_Plus_Benefits_Details-3.pdf]',
           'outputs.similarity.passed': True,
           'outputs.similarity.sample': None,
           'outputs.similarity.score': 0.6117136659436009,
           'outputs.similarity.similarity_result': 'pass',
           'query': 'What is the importance of choosing the right provider in getting the most value out of your health insurance plan?',
           'response': 'Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer '
                       'better coverage, and support continuity of care, leading to more effective and personalized treatment. '
                       '[Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]'},
          {'gro

In [16]:
# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "bleu"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    # Supported evaluation metrics include: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine"
    evaluation_metric="bleu",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=json_file,
    evaluators={"similarity": sim_grader},
)
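
BLEU is built on clipped n-gram precision, which tends to punish paraphrases because few longer word sequences survive rewording. The sketch below computes only that building block for a single n, on lowercased whitespace tokens. It is a simplified illustration, not the full BLEU score (no geometric mean over n, no brevity penalty) and not necessarily the variant the service uses; `ngram_precision` is a hypothetical helper:

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of candidate against reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total
```

Calling it with n from 1 to 4 on a response and its ground truth shows how quickly the overlap drops as n grows.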

In [17]:
pprint(sim_grader_evaluation, width=150)

{'metrics': {'similarity.pass_rate': 0.0},
 'rows': [{'ground_truth': 'Choosing an in-network provider helps you save money and ensures better, more personalized care. '
                           '[Northwind_Health_Plus_Benefits_Details-3.pdf]',
           'outputs.similarity.passed': False,
           'outputs.similarity.sample': None,
           'outputs.similarity.score': 0.01812756190015922,
           'outputs.similarity.similarity_result': 'fail',
           'query': 'What is the importance of choosing the right provider in getting the most value out of your health insurance plan?',
           'response': 'Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer '
                       'better coverage, and support continuity of care, leading to more effective and personalized treatment. '
                       '[Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]'},
          {'g

## General grader

Advanced users can import or define a custom grader and plug it into the AOAI general grader. This allows evaluations to target specific areas of interest that the existing AOAI graders do not cover. The following example imports the OpenAI StringCheckGrader and runs it as an AOAI general grader through the Foundry SDK.

In [18]:
# Define a string check grader config directly using the OAI SDK
# Evaluation criteria: Pass if query column contains "Northwind"

oai_string_check_grader = StringCheckGrader(input="{{item.query}}",
                                            name="contains Northwind",
                                            operation="like",
                                            reference="Northwind",
                                            type="string_check")

# Plug that into the general grader
general_grader = AzureOpenAIGrader(model_config=model_config,
                                   grader_config=oai_string_check_grader)

evaluation = evaluate(
    data=json_file,
    evaluators={
        "general": general_grader,
    },
)
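
The grader configs in this notebook refer to dataset columns through placeholders such as `{{item.query}}`, while literal values such as `"Northwind"` carry no placeholder. The sketch below illustrates the assumed behavior, namely that each placeholder is replaced by the value of that column in the current row. It is a local illustration only, not the service's templating engine; `render` and `PLACEHOLDER` are hypothetical names:

```python
import re

# Matches {{item.<column>}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render(template: str, row: dict) -> str:
    """Replace every {{item.<column>}} placeholder with the row's value for that column."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)
```

A placeholder that names a column absent from the row raises `KeyError`, which makes a mistyped column name visible immediately.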

In [19]:
pprint(evaluation, width=150)

{'metrics': {'general.pass_rate': 0.4},
 'rows': [{'ground_truth': 'Choosing an in-network provider helps you save money and ensures better, more personalized care. '
                           '[Northwind_Health_Plus_Benefits_Details-3.pdf]',
           'outputs.general.general_result': 'pass',
           'outputs.general.passed': True,
           'outputs.general.sample': None,
           'outputs.general.score': 1.0,
           'query': 'What is the importance of choosing the right provider in getting the most value out of your health insurance plan?',
           'response': 'Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer '
                       'better coverage, and support continuity of care, leading to more effective and personalized treatment. '
                       '[Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]'},
          {'ground_truth': 'Check with Northwind
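
The result dictionaries printed in this notebook all share one shape: a `metrics` entry plus a list of `rows` whose keys follow the pattern `outputs.<evaluator>.passed`. A minimal sketch for recomputing a pass rate from the rows, assuming that the reported `pass_rate` is the fraction of rows whose `passed` flag is true; `pass_rate` is a hypothetical helper:

```python
def pass_rate(result: dict, evaluator: str) -> float:
    """Fraction of rows whose 'outputs.<evaluator>.passed' flag is true."""
    key = f"outputs.{evaluator}.passed"
    flags = [bool(row[key]) for row in result["rows"] if key in row]
    if not flags:
        return 0.0  # no rows carry a result for this evaluator
    return sum(flags) / len(flags)
```

For the last cell the call would be `pass_rate(evaluation, "general")`.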