## NLP Evaluators

### BleuScoreEvaluator

BLEU (Bilingual Evaluation Understudy) is a metric widely used in natural language processing (NLP), originally for
machine translation and now also for text summarization and text generation. It measures how closely the
generated text matches the reference text by comparing overlapping n-grams. The BLEU score ranges from 0 to 1,
with higher scores indicating a closer match.

In [1]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu = BleuScoreEvaluator()

In [2]:
result = bleu(response="Tokyo is the capital of Japan.", ground_truth="The capital of Japan is Tokyo.")

print(result)

{'bleu_score': 0.22961813530951883, 'bleu_result': 'fail', 'bleu_threshold': 0.5}
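
To see why this score falls well below the threshold, the clipped n-gram precisions that BLEU combines can be counted by hand. The snippet below is a minimal pure-Python sketch, not the evaluator's implementation; the `tokenize`, `ngrams`, and `clipped_precisions` helpers and the case-sensitive regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: case-sensitive split into words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    # All contiguous n-token windows, in order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precisions(response, ground_truth, max_n=4):
    """Return (matched, total) n-gram counts for n = 1..max_n."""
    hyp, ref = tokenize(response), tokenize(ground_truth)
    counts = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Counter intersection clips each n-gram at its reference count.
        matched = sum((hyp_counts & ref_counts).values())
        counts.append((matched, sum(hyp_counts.values())))
    return counts


precisions = clipped_precisions(
    "Tokyo is the capital of Japan.", "The capital of Japan is Tokyo."
)
for n, (matched, total) in enumerate(precisions, start=1):
    print(f"{n}-gram precision: {matched}/{total}")
```

Under these assumptions the counts are 6/7, 2/6, 1/5, and 0/4: the word order differs and `the` does not match `The`, so the two sentences share no 4-gram. BLEU combines these precisions with a geometric mean and a brevity penalty (1 here, since both sentences have the same length), so a zero 4-gram precision has to be smoothed before a non-zero score can be reported. How the evaluator smooths it is not shown in this notebook.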


### GleuScoreEvaluator

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by
evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for
sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for
use cases such as machine translation, text summarization, and text generation.

In [3]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu = GleuScoreEvaluator()

In [4]:
result = gleu(response="Tokyo is the capital of Japan.", ground_truth="The capital of Japan is Tokyo.")

print(result)

{'gleu_score': 0.4090909090909091, 'gleu_result': 'fail', 'gleu_threshold': 0.5}
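
The precision and recall balance described above can be sketched in a few lines: pool all n-grams of orders 1 to 4, compute precision and recall over that pool, and keep the smaller of the two. This is an illustrative sketch rather than the evaluator's code; the helper names and the case-sensitive regex tokenizer are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: case-sensitive split into words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def all_ngrams(tokens, max_n=4):
    # Pool every n-gram of order 1..max_n into one multiset.
    return Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def gleu_sketch(response, ground_truth, max_n=4):
    hyp = all_ngrams(tokenize(response), max_n)
    ref = all_ngrams(tokenize(ground_truth), max_n)
    # Shared n-grams, clipped at the smaller count on either side.
    overlap = sum((hyp & ref).values())
    hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    # GLEU is the minimum of n-gram precision and recall.
    return min(precision, recall)


gleu_value = gleu_sketch(
    "Tokyo is the capital of Japan.", "The capital of Japan is Tokyo."
)
print(gleu_value)
```

For this pair the sketch finds 9 shared n-grams out of 22 on each side, so precision and recall are both 9/22 (about 0.409), which agrees with the score printed above.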


### MeteorScoreEvaluator

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluator compares generated text
with reference texts, focusing on precision, recall, and content alignment. It addresses limitations of metrics
like BLEU by considering synonyms, stemming, and paraphrasing, which lets it capture meaning and language
variation more accurately. In addition to machine translation and text summarization, paraphrase detection is a
good use case for the METEOR score.

In [5]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)

In [6]:
result = meteor(response="Tokyo is the capital of Japan.", ground_truth="The capital of Japan is Tokyo.")

print(result)

{'meteor_score': 0.9067055393586005, 'meteor_result': 'pass', 'meteor_threshold': 0.5}
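
The roles of `alpha`, `beta`, and `gamma` are easier to see in a simplified version of the METEOR formula: `alpha` weights precision against recall in the harmonic mean, while `gamma` and `beta` scale and shape the penalty for fragmented word order. The sketch below is not the evaluator's implementation; it assumes lowercase exact matching only (no stemming or synonym stages) and a greedy alignment, and all helper names are hypothetical.

```python
import re


def tokenize(text):
    # Assumption: lowercase split into words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def align_exact(hyp, ref):
    # Greedy exact-match alignment: list of (hyp_index, ref_index) pairs.
    used, pairs = set(), []
    for i, tok in enumerate(hyp):
        for j, ref_tok in enumerate(ref):
            if j not in used and tok == ref_tok:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    # A chunk is a run of matches that is contiguous in both sentences.
    chunks, prev = 0, None
    for i, j in pairs:  # pairs are already ordered by hyp index
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    return chunks


def meteor_sketch(response, ground_truth, alpha=0.9, beta=3.0, gamma=0.5):
    hyp, ref = tokenize(response), tokenize(ground_truth)
    pairs = align_exact(hyp, ref)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision, recall = matches / len(hyp), matches / len(ref)
    # Weighted harmonic mean of precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty grows with the number of chunks per match.
    penalty = gamma * (count_chunks(pairs) / matches) ** beta
    return (1 - penalty) * f_mean


meteor_value = meteor_sketch(
    "Tokyo is the capital of Japan.", "The capital of Japan is Tokyo."
)
print(round(meteor_value, 4))
```

In this example every token matches, so precision and recall are both 1, and the score is reduced only by the penalty: the 7 matches fall into 4 chunks, giving 1 - 0.5 * (4/7)^3, or about 0.9067, in line with the result printed above.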


### RougeScoreEvaluator

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
summarization and machine translation. It measures the overlap between generated text and reference summaries.
ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text
coherence and relevance are critical.


In [7]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

In [8]:
result = rouge(response="Tokyo is the capital of Japan.", ground_truth="The capital of Japan is Tokyo.")

print(result)

{'rouge_precision': 1.0, 'rouge_recall': 1.0, 'rouge_f1_score': 1.0, 'rouge_precision_result': 'pass', 'rouge_recall_result': 'pass', 'rouge_f1_score_result': 'pass', 'rouge_precision_threshold': 0.5, 'rouge_recall_threshold': 0.5, 'rouge_f1_score_threshold': 0.5}
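
With `RougeType.ROUGE_1`, the three reported values come from unigram overlap. The following is a minimal sketch of that calculation, not the evaluator's code; the `rouge1_sketch` helper and its tokenizer (lowercase alphanumeric tokens, punctuation dropped) are assumptions.

```python
import re
from collections import Counter


def rouge1_sketch(response, ground_truth):
    # Assumption: lowercase alphanumeric tokens, punctuation dropped.
    hyp = Counter(re.findall(r"[a-z0-9]+", response.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", ground_truth.lower()))
    # Shared unigrams, clipped at the smaller count on either side.
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


rouge_values = rouge1_sketch(
    "Tokyo is the capital of Japan.", "The capital of Japan is Tokyo."
)
print(rouge_values)
```

Both sentences contain the same six words, so unigram precision, recall, and F1 are all 1.0 even though the word order differs. This is why ROUGE-1 passes here while the order-sensitive BLEU and GLEU scores fail.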


## Evaluate a Dataset using NLP Evaluators

The code below uses the Evaluate API with BLEU, GLEU, METEOR, and ROUGE evaluators to evaluate the results on a dataset.

In [9]:
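import os

from dotenv import load_dotenv
from azure.ai.evaluation import evaluate

# Load environment variables from ../../.env
load_dotenv("../../.env")

# Endpoint of the Azure AI Foundry project that receives the results
ai_project_endpoint = os.environ["AI_PROJECT_ENDPOINT"]

# Run all four evaluators over every line of data.jsonl
result = evaluate(
    evaluation_name="NLP Evaluators",
    data="data.jsonl",
    evaluators={
        "bleu": bleu,
        "gleu": gleu,
        "meteor": meteor,
        "rouge": rouge,
    },
    azure_ai_project=ai_project_endpoint,
)

# Link to the evaluation run in the AI Foundry portal
print(f'AI Foundry URL: {result.get("studio_url")}')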


2025-08-07 13:52:26 +0200   25880 execution.bulk     INFO     Finished 50 / 50 lines.
2025-08-07 13:52:26 +0200   25880 execution.bulk     INFO     Average execution time for completed lines: 0.0 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "rouge_20250807_115226_287806"
Run status: "Completed"
Start time: "2025-08-07 11:52:26.287806+00:00"
Duration: "0:00:01.003112"

2025-08-07 13:52:34 +0200   15200 execution.bulk     INFO     Finished 50 / 50 lines.
2025-08-07 13:52:34 +0200   15200 execution.bulk     INFO     Average execution time for completed lines: 0.17 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "gleu_20250807_115226_312344"
Run status: "Completed"
Start time: "2025-08-07 11:52:26.312344+00:00"
Duration: "0:00:08.571440"

2025-08-07 13:52:35 +0200    5060 execution.bulk     INFO     Finished 50 / 50 lines.
2025-08-07 13:52:35 +0200    5060 execution.bulk     INFO     Average execution time for completed lines: 0.18 seconds. Es

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "bleu_20250807_115226_279723"
Run status: "Completed"
Start time: "2025-08-07 11:52:26.279723+00:00"
Duration: "0:00:08.963219"

2025-08-07 13:52:35 +0200   30396 execution.bulk     INFO     Finished 50 / 50 lines.
2025-08-07 13:52:35 +0200   30396 execution.bulk     INFO     Average execution time for completed lines: 0.18 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "meteor_20250807_115226_299815"
Run status: "Completed"
Start time: "2025-08-07 11:52:26.299815+00:00"
Duration: "0:00:09.254112"


{
    "bleu": {
        "status": "Completed",
        "duration": "0:00:08.963219",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null
    },
    "gleu": {
        "status": "Completed",
        "duration": "0:00:08.571440",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null
    },
    "meteor": {
        "status": "Completed",
        "duration": "0:00:09.254112",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null
    },
    "rouge": {
        "status": "Completed",
        "duration": "0:00:01.003112",
        "completed_lines": 50,
        "failed_lines": 0,
        "log_path": null
    }
}


AI Foundry URL: https://ai.azure.com/resource/build/evaluation/86d1f5ff-a164-41d5-b583-905d1e7c5339?wsid=/subscriptions/8babb7f9-50f7-498f-9e0a-8bef4389331d/

{'metrics': {'bleu.binary_aggregate': 0.06,
             'bleu.bleu_score': 0.2725492373855213,
             'bleu.bleu_threshold': 0.5,
             'gleu.binary_aggregate': 0.34,
             'gleu.gleu_score': 0.407072927072927,
             'gleu.gleu_threshold': 0.5,
             'meteor.binary_aggregate': 1.0,
             'meteor.meteor_score': 0.8376872508938054,
             'meteor.meteor_threshold': 0.5,
             'rouge.binary_aggregate': 0.96,
             'rouge.rouge_f1_score': 0.6787035187035185,
             'rouge.rouge_f1_score_threshold': 0.5,
             'rouge.rouge_precision': 0.6612857142857144,
             'rouge.rouge_precision_threshold': 0.5,
             'rouge.rouge_recall': 0.7116190476190476,
             'rouge.rouge_recall_threshold': 0.5},
 'rows': [{'inputs.ground_truth': 'A dog is barking loudly.',
           'inputs.response': 'The dog barks loudly.',
           'line_number': 0,
           'outputs.bleu.bleu_result': 'fail',
           'outpu
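
The `inputs.response` and `inputs.ground_truth` columns in the output suggest that each line of `data.jsonl` is a JSON object with `response` and `ground_truth` fields. The file itself is not shown, so the snippet below is only a sketch of how such a file could be produced; the file name `sample_data.jsonl` is hypothetical, and the second record simply reuses the earlier example sentences.

```python
import json
import os
import tempfile

records = [
    # First record taken from the rows printed above.
    {"response": "The dog barks loudly.", "ground_truth": "A dog is barking loudly."},
    # Illustrative record reusing the earlier example sentences.
    {"response": "Tokyo is the capital of Japan.", "ground_truth": "The capital of Japan is Tokyo."},
]

# Hypothetical file name, written to a temp directory so data.jsonl is untouched.
path = os.path.join(tempfile.gettempdir(), "sample_data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # JSON Lines: one JSON object per line.
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm that every line parses on its own.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["response"])
```

Keeping the field names identical to the keyword arguments used in the single-call examples (`response`, `ground_truth`) means the same evaluator instances can be reused without any extra mapping.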

View the results

In [10]:
from pprint import pprint

pprint(result)
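
Because each key in `metrics` is prefixed with the evaluator name, the aggregate values can be regrouped per evaluator for easier reading. The sketch below works on a literal copy of some of the metrics printed earlier rather than on a live `result`; the `group_metrics` helper is hypothetical, and reading `binary_aggregate` as the share of rows that passed the threshold is an assumption.

```python
# Subset of the aggregate metrics printed in the evaluation output.
metrics = {
    "bleu.binary_aggregate": 0.06,
    "bleu.bleu_score": 0.2725492373855213,
    "gleu.binary_aggregate": 0.34,
    "gleu.gleu_score": 0.407072927072927,
    "meteor.binary_aggregate": 1.0,
    "meteor.meteor_score": 0.8376872508938054,
    "rouge.binary_aggregate": 0.96,
    "rouge.rouge_f1_score": 0.6787035187035185,
}


def group_metrics(flat_metrics):
    # Split "evaluator.metric" keys into a nested {evaluator: {metric: value}} dict.
    grouped = {}
    for key, value in flat_metrics.items():
        evaluator, _, name = key.partition(".")
        grouped.setdefault(evaluator, {})[name] = value
    return grouped


grouped = group_metrics(metrics)
for evaluator, values in grouped.items():
    # Assumption: binary_aggregate is the fraction of rows marked "pass".
    print(f"{evaluator}: pass rate {values['binary_aggregate']:.0%}")
```

With a live run, the same helper could be applied to `result["metrics"]`.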

View the AI Foundry URL

In [11]:
print(f'AI Foundry URL: {result.get("studio_url")}')