# General purpose and text similarity evaluators 

AI systems might generate textual responses that are incoherent or that lack the general writing quality you want beyond minimum grammatical correctness. To address these issues, use the **Coherence** and **Fluency** evaluators.

If you have a question-answering (QA) scenario with both context and ground truth data in addition to query and response, you can also use **QAEvaluator**, a composite evaluator that combines the relevant evaluators into a single judgment.

It's important to compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth". Use an LLM-judge metric such as **SimilarityEvaluator**, which focuses on the semantic similarity between the generated response and the ground truth, or use metrics from the field of natural language processing (NLP), including **F1 Score, BLEU, GLEU, ROUGE, and METEOR**, which focus on token or n-gram overlap between the two.

> https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators

In [1]:
#%pip install --upgrade azure-ai-evaluation

In [2]:
import datetime
import os
import sys

from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration, BleuScoreEvaluator, CoherenceEvaluator,
    F1ScoreEvaluator, FluencyEvaluator, GleuScoreEvaluator, MeteorScoreEvaluator,
    QAEvaluator, RougeScoreEvaluator, RougeType, SimilarityEvaluator,
)
from dotenv import load_dotenv

In [3]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [4]:
print(f"Today is {datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 26-Jun-2025 12:34:11


In [5]:
load_dotenv("azure.env")

endpoint = os.getenv("endpoint")
key = os.getenv("key")

azure_deployment = "gpt-4.1"
api_version = "2024-10-21"

In [6]:
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=endpoint,
    api_key=key,
    azure_deployment=azure_deployment,
    api_version=api_version,
)

## Coherence evaluation

> Measures logical consistency and flow of responses.

CoherenceEvaluator measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. Higher scores mean better coherence.

In [7]:
coherence_evaluator = CoherenceEvaluator(model_config=model_config, threshold=3)

In [8]:
coherence_evaluator(query="What is the capital of France?", response="The capital of France is Paris")

{'coherence': 4.0,
 'gpt_coherence': 4.0,
 'coherence_reason': 'The response is clear, logically organized, and directly answers the question, but its simplicity does not demonstrate advanced coherence.',
 'coherence_result': 'pass',
 'coherence_threshold': 3}

In [9]:
coherence_evaluator(query="Is Marie Curie is born in Paris?", response="She is living in Paris.")

{'coherence': 2.0,
 'gpt_coherence': 2.0,
 'coherence_reason': 'The response is a fragmented answer with some relevant words but lacks logical structure and does not address the question directly.',
 'coherence_result': 'fail',
 'coherence_threshold': 3}
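
Each result dictionary shown above carries a `<metric>_result` field set to `'pass'` or `'fail'`. As a minimal sketch, the failing metrics can be collected by scanning for that suffix. The helper name `failed_metrics` is hypothetical and is not part of the SDK; the sample dictionary is copied from the output above.

```python
def failed_metrics(result: dict) -> list[str]:
    """Return the names of metrics whose '<metric>_result' value is 'fail'."""
    suffix = "_result"
    return sorted(
        key[: -len(suffix)]
        for key, value in result.items()
        if key.endswith(suffix) and value == "fail"
    )


# Sample copied from the coherence output above (reason text omitted).
sample = {
    "coherence": 2.0,
    "gpt_coherence": 2.0,
    "coherence_result": "fail",
    "coherence_threshold": 3,
}
print(failed_metrics(sample))
```

Because the helper relies only on the key suffix, it should work the same way on composite results that contain several `_result` fields.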

## Fluency evaluator

> Measures natural language quality and readability.

FluencyEvaluator measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.

In [10]:
fluency_evaluator = FluencyEvaluator(model_config=model_config)

In [11]:
fluency_evaluator(response="Victor Hugo is a writer.")

{'fluency': 3.0,
 'gpt_fluency': 3.0,
 'fluency_reason': 'The response is clear and correct, but uses basic vocabulary and a simple sentence structure without complexity or variety.',
 'fluency_result': 'pass',
 'fluency_threshold': 3}

In [12]:
fluency_evaluator(response="Victor Hugo was a renowned French Romantic writer, best known for his novels 'Les Misérables' and 'Notre-Dame de Paris'. He was also a poet, dramatist, and a significant political figure in France.")

{'fluency': 5.0,
 'gpt_fluency': 5.0,
 'fluency_reason': 'The response is well-articulated, uses varied vocabulary, and is grammatically flawless. It is coherent and cohesive, reflecting a high level of fluency.',
 'fluency_result': 'pass',
 'fluency_threshold': 3}

## QA Evaluator

> Comprehensively measures various quality aspects in question answering.

QAEvaluator comprehensively measures several aspects of a question-answering scenario:
- Relevance
- Groundedness
- Fluency
- Coherence
- Similarity
- F1 score

In [13]:
qa_evaluator = QAEvaluator(model_config=model_config)

In [14]:
qa_evaluator(
    query="Where was Marie Curie born?",
    context=
    "Background: 1. Marie Curie was a chemist. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist.",
    response=
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw.")

{'f1_score': 0.631578947368421,
 'f1_result': 'pass',
 'f1_threshold': 3,
 'similarity': 5.0,
 'gpt_similarity': 5.0,
 'similarity_result': 'pass',
 'similarity_threshold': 3,
 'fluency': 3.0,
 'gpt_fluency': 3.0,
 'fluency_reason': 'The response is clear and correct, with adequate vocabulary and sentence structure, but it does not demonstrate advanced fluency or complexity.',
 'fluency_result': 'pass',
 'fluency_threshold': 3,
 'groundedness': 3.0,
 'gpt_groundedness': 3.0,
 'groundedness_reason': 'The response provides correct information, but it is not grounded in the provided context, as the context does not mention Warsaw or her place of birth at all.',
 'groundedness_result': 'pass',
 'groundedness_threshold': 3,
 'relevance': 4.0,
 'gpt_relevance': 4.0,
 'relevance_reason': 'The response fully and accurately answers the question by stating Marie Curie was born in Warsaw, which is all the essential information required.',
 'relevance_result': 'pass',
 'relevance_threshold': 3,
 '

## Similarity Evaluator

> AI-assisted textual similarity measurement.

SimilarityEvaluator measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of the query.

In [15]:
similarity_evaluator = SimilarityEvaluator(model_config=model_config)

In [16]:
response = similarity_evaluator(
    query="How many atoms in water?",
    response=
    "Water, chemically known as H₂O, is composed of two hydrogen atoms and one oxygen atom. Therefore, a single molecule of water contains a total of three atoms.",
    ground_truth="Three atoms: two hydrogen atoms and one oxygen atom")

print(response)

{'similarity': 5.0, 'gpt_similarity': 5.0, 'similarity_result': 'pass', 'similarity_threshold': 3}


In [17]:
response = similarity_evaluator(query="How many atoms in water?",
           response="Water is composed of three hydrogen atoms.",
           ground_truth="Three atoms: two hydrogen atoms and one oxygen atom")

print(response)

{'similarity': 2.0, 'gpt_similarity': 2.0, 'similarity_result': 'fail', 'similarity_threshold': 3}


## F1 score

> Harmonic mean of precision and recall in token overlaps between response and ground truth.

F1ScoreEvaluator measures similarity by the tokens shared between the generated text and the ground truth, balancing precision and recall. The number of words shared between the generation and the ground truth is the basis of the score. Precision is the ratio of the number of shared words to the total number of words in the generation. Recall is the ratio of the number of shared words to the total number of words in the ground truth. The F1 score is the harmonic mean of the two.

In [18]:
f1_evaluator = F1ScoreEvaluator()

In [19]:
answer = f1_evaluator(response="I am walking in the street", ground_truth="Just walking")

print(answer["f1_score"])

0.28571428571428575
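
The score above can be reproduced by hand. The following is a minimal sketch, not the SDK's implementation: it assumes SQuAD-style normalization (lowercasing, stripping punctuation, dropping the articles "a", "an", and "the") and whitespace tokenization, an assumption that is consistent with the value printed above. The helper names `normalize` and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop articles, split on whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    resp_tokens = normalize(response)
    truth_tokens = normalize(ground_truth)
    # Multiset intersection: a token counts as shared at most as often as it occurs in both texts.
    shared = sum((Counter(resp_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(resp_tokens)   # shared words / words in the generation
    recall = shared / len(truth_tokens)     # shared words / words in the ground truth
    return 2 * precision * recall / (precision + recall)


score = token_f1("I am walking in the street", "Just walking")
print(score)  # about 0.2857 (precision 1/5, recall 1/2)
```

Under this assumption, "the" is dropped, so the response has five tokens and one of them ("walking") is shared, which gives a precision of 1/5, a recall of 1/2, and an F1 score of 2/7.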


## BLEU score

> Bilingual Evaluation Understudy score for translation quality measures overlaps in n-grams between response and ground truth.

BleuScoreEvaluator computes the BLEU (Bilingual Evaluation Understudy) score commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text.

In [20]:
bleu_score = BleuScoreEvaluator()

In [21]:
answer = bleu_score(response="I am walking in the street",
                    ground_truth="I am walking in a street")

print(answer["bleu_score"])

0.537284965911771
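
To make the n-gram overlap concrete, here is a minimal sketch of an unsmoothed sentence-level BLEU with uniform weights over 1-grams to 4-grams and the standard brevity penalty. It is an illustration under stated assumptions, not the evaluator's code: tokenization is a plain whitespace split, no smoothing is applied, and the names `ngrams` and `sentence_bleu` are hypothetical. The library may tokenize and smooth differently; for this sentence pair every n-gram order has at least one match, so the unsmoothed value agrees with the output above.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(response: str, ground_truth: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU: brevity penalty times the geometric mean of n-gram precisions."""
    hyp = response.split()
    ref = ground_truth.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, any empty order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize responses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


bleu = sentence_bleu("I am walking in the street", "I am walking in a street")
print(bleu)  # about 0.5373
```

For this pair the n-gram precisions are 5/6, 3/5, 2/4, and 1/3, whose product is 1/12; the fourth root of 1/12 is about 0.5373, and the brevity penalty is 1 because both sentences have six tokens.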


## GLEU score

> Google-BLEU variant for sentence-level assessment measures overlaps in n-grams between response and ground truth.

GleuScoreEvaluator computes the GLEU (Google-BLEU) score. Like BLEU, it measures similarity by the n-grams shared between the generated text and the ground truth, focusing on both precision and recall, but it addresses the drawbacks of the BLEU score by using a per-sentence reward objective.

In [22]:
gleu_score = GleuScoreEvaluator()

In [23]:
answer = gleu_score(response="I am walking in the street",
                    ground_truth="I am walking in a street")

print(answer["gleu_score"])

0.6111111111111112
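
The value above can be checked with a short sketch of the usual GLEU definition: pool all 1-grams to 4-grams, count the clipped matches, and take the minimum of n-gram precision and n-gram recall. This is an illustrative sketch under the assumption of whitespace tokenization; the names `all_ngrams` and `sentence_gleu` are hypothetical and not part of the SDK.

```python
from collections import Counter


def all_ngrams(tokens: list[str], min_n: int = 1, max_n: int = 4) -> Counter:
    """Count every contiguous n-gram with min_n <= n <= max_n in one pooled counter."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(response: str, ground_truth: str) -> float:
    """Minimum of pooled n-gram precision and recall."""
    hyp_counts = all_ngrams(response.split())
    ref_counts = all_ngrams(ground_truth.split())
    matched = sum((hyp_counts & ref_counts).values())  # clipped matches
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = matched / hyp_total
    recall = matched / ref_total
    return min(precision, recall)


gleu = sentence_gleu("I am walking in the street", "I am walking in a street")
print(gleu)  # about 0.6111 (11 matched n-grams out of 18)
```

Each sentence yields 6 + 5 + 4 + 3 = 18 n-grams, of which 5 + 3 + 2 + 1 = 11 match, so both precision and recall are 11/18.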


## ROUGE score

> Recall-Oriented Understudy for Gisting Evaluation measures overlaps in n-grams between response and ground truth.

RougeScoreEvaluator computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.

In [24]:
rouge_evaluator = RougeScoreEvaluator(
    rouge_type=RougeType.ROUGE_L,
    precision_threshold=0.6,
    recall_threshold=0.5,
    f1_score_threshold=0.55,
)

In [25]:
rouge_evaluator(
    response=
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw.")

{'rouge_precision': 0.46153846153846156,
 'rouge_recall': 1.0,
 'rouge_f1_score': 0.631578947368421,
 'rouge_precision_result': 'fail',
 'rouge_recall_result': 'pass',
 'rouge_f1_score_result': 'pass',
 'rouge_precision_threshold': 0.6,
 'rouge_recall_threshold': 0.5,
 'rouge_f1_score_threshold': 0.55}
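
ROUGE-L is based on the longest common subsequence (LCS) of tokens between the response and the ground truth. The sketch below reproduces the three values above under the assumption that tokens are lowercased alphanumeric runs; it is an illustration, not the evaluator's implementation, and the names `tokenize`, `lcs_length`, and `rouge_l` are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs only (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling dynamic-programming row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(response: str, ground_truth: str) -> dict:
    """ROUGE-L precision, recall, and F1 from the LCS length."""
    hyp = tokenize(response)
    ref = tokenize(ground_truth)
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(hyp)  # LCS length / tokens in the response
    recall = lcs / len(ref)     # LCS length / tokens in the ground truth
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l(
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    "Marie Curie was born in Warsaw.",
)
print(scores)
```

Here the whole ground truth (6 tokens) appears in order inside the 13-token response, so recall is 1.0, precision is 6/13, and F1 is 12/19.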

## METEOR score

> Metric for Evaluation of Translation with Explicit Ordering measures overlaps in n-grams between response and ground truth.

MeteorScoreEvaluator measures the similarity by shared n-grams between the generated text and the ground truth, similar to the BLEU score, focusing on precision and recall. But it addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.

In [26]:
meteor_evaluator = MeteorScoreEvaluator(threshold=0.9)

In [27]:
answer = meteor_evaluator(
    response="Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie first moments were in Warsaw.")

print(answer["meteor_score"])

0.5831325301204818
