# Evaluate with quantitative NLP evaluators

## Objective
This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:
 - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
 - Evaluate a dataset using these evaluators.

## Time
You should expect to spend about 10 minutes running this notebook.

## Before you begin

### Installation
Install the following packages required to execute this notebook.

In [1]:
# Install the packages
%pip install azure-ai-evaluation

Note: you may need to restart the kernel to use updated packages.





In [2]:
import os
from pprint import pprint
from dotenv import load_dotenv
load_dotenv("../.credentials.env")

True

## NLP Evaluators

### BleuScoreEvaluator

BLEU (Bilingual Evaluation Understudy) is a metric that originated in machine translation and is now also widely
used for text summarization and text generation. It measures how closely the generated text matches the
reference text, based on overlapping n-grams. The BLEU score ranges from 0 to 1, with higher scores indicating
a closer match to the reference.

In [3]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu = BleuScoreEvaluator()

In [4]:
result = bleu(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'bleu_score': 0.22961813530951883}
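To make the n-gram matching behind BLEU concrete, here is a minimal sketch of clipped n-gram precision, the building block that BLEU combines across n-gram orders. The helper names `ngrams` and `clipped_precision` are illustrative, and the tokenization (lowercase, final period dropped) is an assumption rather than what `BleuScoreEvaluator` does internally.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(response_tokens, reference_tokens, n):
    """Fraction of response n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    response_counts = Counter(ngrams(response_tokens, n))
    reference_counts = Counter(ngrams(reference_tokens, n))
    total = sum(response_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, reference_counts[gram])
                  for gram, count in response_counts.items())
    return overlap / total


# Hand-tokenized example (assumption: lowercase, final period dropped)
response_tokens = "london is the capital of england".split()
reference_tokens = "the capital of england is london".split()

unigram_precision = clipped_precision(response_tokens, reference_tokens, 1)
bigram_precision = clipped_precision(response_tokens, reference_tokens, 2)
print(unigram_precision, bigram_precision)
```

Every word of the response appears in the reference, so the unigram precision is 1.0, but only three of the five bigrams match. That drop at higher n-gram orders is why word reordering lowers BLEU even when the vocabulary is identical. Full BLEU also applies a brevity penalty and, in many implementations, smoothing, so this sketch is not expected to reproduce the score printed above.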


### GleuScoreEvaluator

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by
evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for
sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for
use cases such as machine translation, text summarization, and text generation.

In [5]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu = GleuScoreEvaluator()

In [6]:
result = gleu(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'gleu_score': 0.4090909090909091}
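The "precision and recall" description can be sketched directly: count the n-grams of orders 1 to 4 shared by both texts, compute precision and recall from that count, and take the smaller of the two. The function names below are illustrative, and the tokenization (case-sensitive, with the period as its own token) is an assumption; the library's implementation may differ.

```python
from collections import Counter


def all_ngrams(tokens, min_n=1, max_n=4):
    """Count every n-gram of order min_n..max_n in a token sequence."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu_sketch(response_tokens, reference_tokens):
    """Return min(precision, recall) over all 1- to 4-gram matches."""
    response_counts = all_ngrams(response_tokens)
    reference_counts = all_ngrams(reference_tokens)
    # Counter intersection keeps the minimum count of each shared n-gram
    matches = sum((response_counts & reference_counts).values())
    response_total = sum(response_counts.values())
    reference_total = sum(reference_counts.values())
    if response_total == 0 or reference_total == 0:
        return 0.0
    precision = matches / response_total
    recall = matches / reference_total
    return min(precision, recall)


# Assumed tokenization: case-sensitive, punctuation split off as a token
response_tokens = ["London", "is", "the", "capital", "of", "England", "."]
reference_tokens = ["The", "capital", "of", "England", "is", "London", "."]

print(gleu_sketch(response_tokens, reference_tokens))
```

Under this tokenization, 9 of the 22 n-grams match (6 unigrams, 2 bigrams, 1 trigram), giving 9/22 ≈ 0.409, which agrees with the score printed above.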


### MeteorScoreEvaluator

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluator compares generated text
to reference text, focusing on precision, recall, and content alignment. It addresses limitations of other
metrics like BLEU by taking synonyms, word stems, and paraphrasing into account, which lets it capture meaning
and language variation more accurately. In addition to machine translation and text summarization, paraphrase
detection is a good use case for the METEOR score.

In [7]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)

In [8]:
result = meteor(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'meteor_score': 0.9067055393586005}
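The roles of `alpha`, `beta`, and `gamma` can be illustrated with a small sketch, assuming the standard METEOR formulation: a weighted harmonic mean of precision and recall, reduced by a fragmentation penalty that grows with the number of contiguous matched "chunks". The function `meteor_from_counts` is hypothetical and takes hand-counted alignment statistics instead of computing the word alignment itself.

```python
def meteor_from_counts(matches, chunks, response_len, reference_len,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine alignment counts into a METEOR-style score.

    matches: number of aligned tokens
    chunks:  number of contiguous, same-order runs those matches form
    """
    if matches == 0:
        return 0.0
    precision = matches / response_len
    recall = matches / reference_len
    # alpha weights the harmonic mean between precision and recall
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # gamma caps the penalty; beta controls how fast it grows with fragmentation
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# Hand-counted for the example sentences (assumption: lowercase exact matches,
# period treated as a token). All 7 tokens match and form 4 chunks:
# "london" | "is" | "the capital of england" | "."
score = meteor_from_counts(matches=7, chunks=4, response_len=7, reference_len=7)
print(score)
```

With these counts, precision and recall are both 1, so the score is 1 - 0.5 * (4/7)^3 ≈ 0.9067, which agrees with the value printed above. The penalty is what separates this reordered sentence from a perfect score.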


### RougeScoreEvaluator

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
summarization and machine translation. It measures the overlap between generated text and reference summaries.
ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
summarization and document comparison are among the best use cases for ROUGE, particularly in scenarios where text
coherence and relevance are critical.


In [9]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

In [10]:
result = rouge(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'rouge_precision': 1.0, 'rouge_recall': 1.0, 'rouge_f1_score': 1.0}
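For `RougeType.ROUGE_1`, the three reported numbers can be sketched from unigram overlap alone. The function `rouge1_sketch` and its tokenization (lowercase, alphanumeric runs only) are assumptions for illustration, not the evaluator's actual code.

```python
import re
from collections import Counter


def rouge1_sketch(response, ground_truth):
    """Return unigram-overlap precision, recall, and F1 as a dict."""
    def tokenize(text):
        # Assumed tokenization: lowercase, keep alphanumeric runs only
        return re.findall(r"[a-z0-9]+", text.lower())

    response_counts = Counter(tokenize(response))
    reference_counts = Counter(tokenize(ground_truth))
    overlap = sum((response_counts & reference_counts).values())
    response_total = sum(response_counts.values())
    reference_total = sum(reference_counts.values())
    precision = overlap / response_total if response_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_sketch(
    response="London is the capital of England.",
    ground_truth="The capital of England is London.",
)
print(scores)
```

Because ROUGE-1 ignores word order, the two sentences share all six words and every value is 1.0, as in the output above. Order-sensitive variants such as ROUGE-2 or ROUGE-L would score this pair lower.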


## Evaluate a Dataset using NLP Evaluators

The code below uses the `evaluate` API with the BLEU, GLEU, METEOR, and ROUGE evaluators to score every row of a dataset.

In [11]:
from azure.ai.evaluation import evaluate
import random

# Random suffix so that repeated runs get distinct evaluation names
random_num = random.randint(1111, 9999)
result = evaluate(
    data="nlp_data.jsonl",
    evaluation_name=f"NLP-demo-{random_num}",
    evaluators={
        "bleu": bleu,
        "gleu": gleu,
        "meteor": meteor,
        "rouge": rouge,
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    azure_ai_project={
        "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
        "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
        "project_name": os.environ.get("AZURE_PROJECT_NAME"),
    },
)

[2024-12-20 10:14:21 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_bleu_bleu_asyncbleuscoreevaluator_bo0c_q9x_20241220_101420_607274, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_bleu_bleu_asyncbleuscoreevaluator_bo0c_q9x_20241220_101420_607274\logs.txt
[2024-12-20 10:14:22 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_meteor_meteor_asyncmeteorscoreevaluator_3hd7ofch_20241220_101420_611426, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_meteor_meteor_asyncmeteorscoreevaluator_3hd7ofch_20241220_101420_611426\logs.txt
[2024-12-20 10:14:22 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_gleu_gleu_asyncgleuscoreevaluator_564dpwb2_20241220_101420_615029, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_gleu_gleu_asyncgleu

Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_bleu_bleu_asyncbleuscoreevaluator_bo0c_q9x_20241220_101420_607274
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_meteor_meteor_asyncmeteorscoreevaluator_3hd7ofch_20241220_101420_611426
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_gleu_gleu_asyncgleuscoreevaluator_564dpwb2_20241220_101420_615029
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_rouge_rouge_asyncrougescoreevaluator_gtb5l99b_20241220_101420_616029
2024-12-20 10:14:22 +0000   61976 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in Ba
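The contents of `nlp_data.jsonl` are not shown in this notebook. Judging from the `inputs.response` and `inputs.ground_truth` columns in the results, each line is presumably a JSON object with `response` and `ground_truth` fields. The sketch below writes a comparable file using only the two rows visible in the results; the file name `nlp_data_sample.jsonl` is hypothetical, and the original file appears to contain additional rows.

```python
import json
from pathlib import Path

# Rows taken from the results shown in this notebook; field names are
# inferred from the "inputs.*" columns and are an assumption.
rows = [
    {"response": "The cat sits on the mat.",
     "ground_truth": "A cat is sitting on the mat."},
    {"response": "She enjoys reading books.",
     "ground_truth": "She loves to read books."},
]

sample_path = Path("nlp_data_sample.jsonl")  # hypothetical file name

# JSON Lines: one JSON object per line
with sample_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm that every line parses on its own
loaded = [json.loads(line)
          for line in sample_path.read_text(encoding="utf-8").splitlines()]
print(loaded)
```

The field names line up with the `response` and `ground_truth` keyword arguments used when calling the evaluators individually, so a file in this shape could be passed as `data` without a column mapping, assuming the inferred field names are correct.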

View the results

In [12]:
from pprint import pprint

pprint(result)

{'metrics': {'bleu.bleu_score': 0.27619794053333335,
             'gleu.gleu_score': 0.34843304843333334,
             'meteor.meteor_score': 0.7349908339666668,
             'rouge.rouge_f1_score': 0.5913715913666667,
             'rouge.rouge_precision': 0.6666666666666666,
             'rouge.rouge_recall': 0.5321428571333334},
 'rows': [{'inputs.ground_truth': 'A cat is sitting on the mat.',
           'inputs.response': 'The cat sits on the mat.',
           'line_number': 0,
           'outputs.bleu.bleu_score': 0.37684991640000004,
           'outputs.gleu.gleu_score': 0.4230769231,
           'outputs.meteor.meteor_score': 0.7454289733,
           'outputs.rouge.rouge_f1_score': 0.6153846154,
           'outputs.rouge.rouge_precision': 0.6666666667000001,
           'outputs.rouge.rouge_recall': 0.5714285714},
          {'inputs.ground_truth': 'She loves to read books.',
           'inputs.response': 'She enjoys reading books.',
           'line_number': 1,
           'outputs.