# 🎯 Goal of the Exercise
In this exercise, you'll explore how to effectively evaluate text-based AI outputs using common **NLP evaluators**, such as **BLEU**, **GLEU**, **METEOR**, and **ROUGE**. Your task is to implement each evaluator, measure the similarity between AI-generated responses and reference texts, and assess the quality of text-generation models.

These NLP evaluators may not be the right metrics for evaluating your Generative AI applications, but they give you a flavor of the evaluation metrics used in traditional Natural Language Processing (NLP) tasks.

This hands-on practice will help you understand:

- How NLP evaluation metrics are practically applied.
- The nuances between different evaluators.
- Best practices to objectively measure text-generation accuracy and quality.

The table below compares the **BLEU**, **GLEU**, **METEOR**, and **ROUGE** evaluators:

| Criteria                 | BLEU                                       | GLEU                                       | METEOR                                       | ROUGE                                         |
|--------------------------|--------------------------------------------|--------------------------------------------|----------------------------------------------|-----------------------------------------------|
| **Primary usage**        | Machine translation                         | Grammar correction / translation fluency   | Machine translation (fluency, semantics)     | Text summarization & information extraction   |
| **Evaluation method**    | Precision-based with brevity penalty        | Minimum of n-gram precision and recall     | Precision, recall, alignment, synonyms, stemming | Recall-focused (but also precision, F1-score)|
| **N-gram consideration** | Overlapping n-grams (Precision only)        | Overlapping n-grams (Precision & Recall)   | Flexible alignment with synonym/stemming support | Overlapping n-grams (Recall-focused)         |
| **Main strengths**       | Simple, widely-used, computationally efficient | Better correlation with grammatical correctness | Strong correlation with human judgment, semantic similarity | Excellent for summarization tasks, recall-oriented |
| **Main weaknesses**      | Ignores recall and fluency; limited semantic awareness | Less standardized in translation evaluations | More complex, slower to compute              | Less suitable for translation evaluation (focuses mainly on recall) |

---

### Quick insights on when to choose each metric:

- **BLEU**: For traditional machine translation benchmarks prioritizing simplicity and precision.
- **GLEU**: For grammatical correctness or fluency-oriented evaluations requiring balanced precision/recall.
- **METEOR**: When semantic similarity, human judgment alignment, and nuanced evaluation (synonyms/stemming) matter significantly.
- **ROUGE**: Primarily for text summarization or extractive tasks emphasizing recall (content coverage) over precision.

### Links to Documentation:
https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators

In [3]:
import os
from dotenv import load_dotenv

load_dotenv('../.env')

azure_ai_project = {
    "subscription_id": os.environ.get("SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RG_NAME"),
    "project_name": os.environ.get("PROJECT_NAME"),
}

## NLP Evaluators

### BleuScoreEvaluator

BLEU (Bilingual Evaluation Understudy) is a metric that originated in machine translation and is also widely
applied to text summarization and text generation use cases. It evaluates how closely the generated text matches
the reference text by comparing overlapping n-grams. The BLEU score ranges from 0 to 1, with higher scores
indicating a closer match.
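
To make the scoring idea concrete, here is a minimal pure-Python sketch of an unsmoothed, single-reference BLEU. It is an illustration only and not the code behind `BleuScoreEvaluator`: the function name `simple_bleu`, the whitespace tokenization, and the default of `max_n=2` are assumptions chosen to keep the example short, and a library implementation may apply smoothing and produce different numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(response, ground_truth, max_n=2):
    """Unsmoothed sentence-level BLEU against one reference (illustrative only)."""
    hyp = response.lower().split()
    ref = ground_truth.lower().split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each response n-gram count by its count in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: one empty n-gram order zeroes the score
        log_precisions.append(math.log(overlap / total))

    # The brevity penalty discourages responses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

With this sketch, the response "the cat" scored against "the cat sat on the mat" has perfect n-gram precision but is penalized for brevity, which shows why BLEU is described as precision-based with a brevity penalty.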

### 🔧 Task: Implement the BLEU Score Evaluator
Fill in the missing code to initialize and use the `BleuScoreEvaluator`.

In [None]:
# TODO: Instantiate BleuScoreEvaluator and use it to evaluate a response vs. a reference ("ground truth") text.

### GleuScoreEvaluator

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by
evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for
sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for
use cases such as machine translation, text summarization, and text generation.
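
The precision-and-recall idea can be sketched in a few lines of pure Python. This is a simplified illustration, not the implementation used by `GleuScoreEvaluator`: the names `all_ngrams` and `simple_gleu` and the whitespace tokenization are assumptions. It pools all n-grams of length 1 to 4 and returns the smaller of precision and recall.

```python
from collections import Counter


def all_ngrams(tokens, min_n=1, max_n=4):
    """Count every n-gram of length min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def simple_gleu(response, ground_truth, max_n=4):
    """Sentence-level GLEU sketch: min(precision, recall) over pooled n-grams."""
    hyp = all_ngrams(response.lower().split(), max_n=max_n)
    ref = all_ngrams(ground_truth.lower().split(), max_n=max_n)
    # Matching n-grams, clipped to the smaller count on either side.
    matches = sum((hyp & ref).values())
    hyp_total = sum(hyp.values())
    ref_total = sum(ref.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / hyp_total
    recall = matches / ref_total
    return min(precision, recall)
```

For the response "the cat" and the reference "the cat sat on the mat", precision is 1.0 but recall is only 3/18, so the sketch returns 1/6. A response cannot score well by being short and safe, nor by being long and padded.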

### 🔧 Task: Implement the GLEU Score Evaluator
Complete the code below to correctly instantiate and apply the `GleuScoreEvaluator`.

In [None]:
# TODO: Instantiate GleuScoreEvaluator and use it to evaluate a response vs. a reference ("ground truth") text.

### MeteorScoreEvaluator

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluator compares generated text
with reference texts, focusing on precision, recall, and content alignment. It addresses limitations of other
metrics like BLEU by considering synonyms, word stems, and paraphrasing, which lets it capture meaning and
language variations more accurately. In addition to machine translation and text summarization, paraphrase
detection is a good use case for the METEOR score.
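
The sketch below illustrates only the precision, recall, and alignment part of METEOR. It is deliberately simplified and is not what `MeteorScoreEvaluator` runs. It matches exact lowercase words only (no synonyms or stemming), and it aligns words greedily, whereas a full METEOR implementation searches for the alignment with the fewest chunks. The function name `simple_meteor` and the default parameter values `alpha=0.9`, `beta=3.0`, `gamma=0.5` are assumptions made for this example.

```python
def simple_meteor(response, ground_truth, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match METEOR sketch with a greedy alignment (illustrative only)."""
    hyp = response.lower().split()
    ref = ground_truth.lower().split()

    # Greedy alignment: pair each response word with the first unused
    # identical reference word.
    used = set()
    alignment = []
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if j not in used and token == ref_token:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Weighted harmonic mean; alpha close to 1 favors recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both texts.
    chunks = 1
    for (prev_i, prev_j), (i, j) in zip(alignment, alignment[1:]):
        if i != prev_i + 1 or j != prev_j + 1:
            chunks += 1

    # More chunks means more reordering, hence a larger fragmentation penalty.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean
```

Because of the fragmentation penalty, a response that contains the same words as the reference in a different order scores lower than an identical response, which is the "explicit ordering" part of the metric's name.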

### 🔧 Task: Implement the METEOR Score Evaluator
Modify the following cell to properly initialize and evaluate using `MeteorScoreEvaluator`.

In [None]:
# TODO: Instantiate MeteorScoreEvaluator and use it to evaluate a response vs. a reference ("ground truth") text.

### RougeScoreEvaluator

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
summarization and machine translation. It measures the overlap between generated text and reference summaries.
ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text
coherence and relevance are critical.
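
As an illustration of the recall-oriented overlap, here is a minimal pure-Python sketch of ROUGE-N. It is not the implementation behind `RougeScoreEvaluator`: the function name `simple_rouge_n`, the whitespace tokenization, and the result keys `precision`, `recall`, and `f1` are assumptions for this example.

```python
from collections import Counter


def simple_rouge_n(response, ground_truth, n=1):
    """ROUGE-N sketch: n-gram overlap reported as precision, recall, and F1."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp = ngrams(response)
    ref = ngrams(ground_truth)
    # Overlapping n-grams, clipped to the smaller count on either side.
    overlap = sum((hyp & ref).values())
    hyp_total = sum(hyp.values())
    ref_total = sum(ref.values())

    precision = overlap / hyp_total if hyp_total else 0.0
    # Recall answers: how much of the reference does the response cover?
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the response "the cat" and the reference "the cat sat on the mat", this sketch gives a unigram precision of 1.0, a recall of 1/3, and an F1 of 0.5. The low recall reflects how little of the reference the short response covers.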


### 🔧 Task: Implement the ROUGE Score Evaluator
Update the code below to correctly use the `RougeScoreEvaluator`.

In [None]:
# TODO: Instantiate RougeScoreEvaluator and use it to evaluate a response vs. a reference ("ground truth") text.

## Evaluate test data using these NLP evaluators

The code below uses the Evaluate API with BLEU, GLEU, METEOR, and ROUGE evaluators to evaluate the results on a dataset.
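
Conceptually, a dataset evaluation scores every row with every evaluator and then aggregates the row scores. The sketch below shows that loop in pure Python. It is a stand-in for illustration and does not call `azure.ai.evaluation`. The helper names `unigram_f1` and `evaluate_jsonl`, the `rows`/`metrics` result layout, and the `response` and `ground_truth` column names are assumptions, and the two inline records are hypothetical sample rows, not the contents of `data.jsonl`.

```python
import io
import json
from collections import Counter


def unigram_f1(response, ground_truth):
    """Toy row-level metric: F1 over overlapping lowercase words."""
    hyp = Counter(response.lower().split())
    ref = Counter(ground_truth.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def evaluate_jsonl(lines, evaluators):
    """Score each JSONL row with each evaluator, then average per metric."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        scores = {
            name: fn(record["response"], record["ground_truth"])
            for name, fn in evaluators.items()
        }
        rows.append({**record, **scores})
    metrics = {
        name: sum(row[name] for row in rows) / len(rows) if rows else 0.0
        for name in evaluators
    }
    return {"rows": rows, "metrics": metrics}


# Hypothetical stand-in for data.jsonl; the real file's columns may differ.
sample = io.StringIO(
    '{"response": "Paris is the capital", "ground_truth": "Paris is the capital"}\n'
    '{"response": "It is Berlin", "ground_truth": "The capital is Paris"}\n'
)
sketch_result = evaluate_jsonl(sample, {"unigram_f1": unigram_f1})
```

Here `sketch_result["rows"]` holds the per-row scores and `sketch_result["metrics"]` holds the averages, so you can inspect both individual failures and the overall trend.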

### 🔧 Task: Evaluate a Dataset
Modify the following code to run evaluations using the evaluators you implemented above.

In [None]:
import os
from dotenv import load_dotenv
from azure.ai.evaluation import evaluate

load_dotenv('../.env')

ai_project_endpoint = os.environ["AI_PROJECT_ENDPOINT"]

# TODO: Call the evaluate function and run the NLP evaluators on the provided test data (data.jsonl).
# Optional: Export the evaluation results to your Azure AI Foundry project so
# you can then visualize them in the Azure AI Foundry Portal.
# Do not forget to be logged in to Azure first: azd auth login

### View the results

In [None]:
from pprint import pprint

pprint(result)