# 🛍️ | Cora-For-Zava: Explore Quality Evaluators

Welcome! This notebook helps you dive deep into quality evaluators for measuring AI response quality.

## 🛒 Our Zava Scenario
**Cora** is a customer service chatbot for **Zava**, a fictitious retailer of home improvement goods for DIY enthusiasts. Cora must provide accurate, relevant, and coherent responses about hardware and home improvement products to Zava customers. This notebook helps you assess response quality using specialized evaluators for groundedness, relevance, coherence, fluency, and similarity, so that Cora meets the high standards expected in a production retail environment.

## 🎯 What You'll Build

By the end of this notebook, you'll have:
- ✅ Learned what AI-assisted evaluation workflows are
- ✅ Explored the built-in quality evaluators in Azure AI Foundry
- ✅ Run individual quality evaluators with test prompts
- ✅ Used composite evaluators to assess multiple quality metrics
- ✅ Analyzed quality evaluation results

## 💡 What You'll Learn

- What AI-assisted evaluation workflows are and how to run them
- The built-in quality evaluators available in Azure AI Foundry
- How to run quality evaluators on test data
- How to use composite evaluators for comprehensive quality assessment
- How to interpret quality metrics for your AI application

Ready to evaluate quality? Let's get started! 🚀


---

## 1. Authenticate with Azure

To use the Azure AI evaluation SDK, you need to authenticate with Azure. The SDK uses the Azure Identity library to handle authentication. In this lab, we will use the `DefaultAzureCredential` class.

In [None]:
# Verify that you are authenticated
!az ad signed-in-user show

In [None]:
# Generate a default credential
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

# Check: credential created
from pprint import pprint
pprint(credential)

---

## 2. Create the Azure AI Project Configuration

We need to create the Azure AI Project configuration that will be used to upload evaluation results to the Azure AI Foundry portal.

In [None]:
# Get Azure AI project configuration from environment variables
import os
from pprint import pprint

subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group_name = os.environ.get("AZURE_RESOURCE_GROUP")
project_name = os.environ.get("AZURE_AI_PROJECT_NAME")
azure_ai_foundry_name = os.environ.get("AZURE_AI_FOUNDRY_NAME")

# Create the azure_ai_project dictionary (used by some evaluators)
azure_ai_project = {
    "subscription_id": subscription_id,
    "resource_group_name": resource_group_name,
    "project_name": project_name,
}

# Create the azure_ai_project_url (used by ContentSafetyEvaluator)
azure_ai_project_url = f"https://{azure_ai_foundry_name}.services.ai.azure.com/api/projects/{project_name}"

print("Azure AI Project Configuration: Complete")

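---

## 3. Create the Model Configuration

AI-assisted evaluators use a language model as the judge. The model configuration tells the SDK which Azure OpenAI deployment to call.

In [None]:
# Model configuration for AI-assisted evaluators
# in Foundry projects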

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)
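
Because the cell above reads most settings with `os.environ.get`, a missing variable only shows up later as a `None` value inside the configuration. The following is a minimal sketch of an up-front check. It assumes the four variable names used above are all required in your setup, and `missing_settings` is a hypothetical helper, not part of the SDK.

```python
import os

# Names mirror the environment variables read in the cell above.
REQUIRED_SETTINGS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
]


def missing_settings(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All model settings are present.")
```

Running this before building `model_config` points at the variable to fix instead of a failure deep inside an evaluator call.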

---

## 4. Basic Quality Evaluators

AI-assisted quality evaluators typically return a numerical score on a Likert scale (1 to 5), with higher scores indicating better quality. The _threshold_ sets the cutoff for a "pass/fail" rating on that evaluator, which gives you a quick sense of where the primary issues lie.
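
The relationship between score and threshold can be sketched in a few lines of plain Python. This is an illustration, not the SDK's implementation: `label_result` is a hypothetical helper, the rule that a score equal to the threshold passes is an assumption, and the key names mirror the `_result` and `_threshold` suffixes that appear in the sample relevance output in section 6.3.

```python
def label_result(metric, score, threshold):
    """Attach a pass/fail label to a score (assumes score >= threshold passes)."""
    return {
        metric: score,
        f"{metric}_result": "pass" if score >= threshold else "fail",
        f"{metric}_threshold": threshold,
    }


print(label_result("coherence", 4.0, 3))
print(label_result("coherence", 2.0, 3))
```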



In [None]:
# 4.1 Coherence
# ............
# CoherenceEvaluator measures the logical and orderly presentation of ideas in a response, 
# allowing the reader to easily follow and understand the writer's train of thought. 
# A coherent response directly addresses the question with clear connections between 
# sentences and paragraphs, using appropriate transitions and a logical sequence of ideas. 
# Higher scores mean better coherence.

from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
coherence(
    query="Do you sell hammers?", 
    response="Yes, we carry hammers in our tool section. They're located in aisle 7."
)

In [None]:
# 4.2 Fluency
# ..........
# FluencyEvaluator measures the effectiveness and clarity of written communication, 
# focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, 
# and overall readability. It assesses how smoothly ideas are conveyed and how easily 
# the reader can understand the text. It scores the response on its own,
# so no query is passed.

from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config=model_config, threshold=3)
fluency(
    response="We have paint brushes available in various sizes for your project needs."
)

---

## 5. Composite QA Evaluator

QAEvaluator combines several evaluators to assess a question-answering scenario in one call: Relevance, Groundedness, Fluency, Coherence, Similarity, and F1 score.

In [None]:
# 5. QA Evaluation
# .................
# One call returns all six metrics; `context` feeds groundedness and
# `ground_truth` feeds similarity and F1 score.

from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config, threshold=3)
qa_eval(
    query="What type of drill should I buy for woodworking?", 
    context="Product Info: 1. We carry cordless drills ideal for woodworking. 2. Popular brands include DeWalt and Makita. 3. Prices range from $89 to $249.",
    response="For woodworking, I recommend our cordless drills. We have DeWalt and Makita models starting at $89.",
    ground_truth="Cordless drills from DeWalt or Makita are best for woodworking projects."
)

---

## 6. Retrieval Augmented Generation (RAG) Evaluators

A retrieval-augmented generation (RAG) system tries to generate the most relevant answer consistent with grounding documents in response to a user's query. This requires it to _retrieve_ documents that provide grounding context, and _generate_ responses that are relevant, consistent with grounding data, and complete.





### 6.1 Retrieval Evaluator
RetrievalEvaluator measures the textual quality of retrieval results with an LLM without requiring ground truth. This metric focuses on how relevant the context chunks (encoded as a string) are to the query and whether the most relevant context chunks are surfaced at the top of the list.

In [None]:
from azure.ai.evaluation import RetrievalEvaluator

retrieval = RetrievalEvaluator(model_config=model_config, threshold=3)
retrieval(
    query="What paint finish works best for bathrooms?", 
    context="Product Guide: 1. Semi-gloss paint is ideal for high-moisture areas. 2. Satin finish provides good moisture resistance. 3. Matte finish is not recommended for bathrooms.",
)


### 6.2 Groundedness Evaluator 
GroundednessEvaluator measures how well the generated response aligns with the given context (grounding source) and doesn't fabricate content outside of it. 

In [None]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config, threshold=3)
groundedness(
    query="Is this paint safe for children's rooms?", 
    context="Product Details: 1. Our Interior Paint is low-VOC certified. 2. Safe for indoor use including nurseries. 3. Meets EPA safety standards.",
    response="Yes, this paint is low-VOC and safe for children's rooms."
)

### 6.3 Relevance Evaluator

RelevanceEvaluator measures how effectively a response addresses a query. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given query. Higher scores mean better relevance.

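The evaluator returns a dictionary like the following, shown here for the paint query from section 6.2. Exact scores and reasons vary between runs.

In [None]: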
{
    "relevance": 5.0,
    "gpt_relevance": 5.0, 
    "relevance_reason": "The response directly answers the query by confirming the paint is low-VOC and safe for children's rooms, which precisely addresses the customer's question about safety.",
    "relevance_result": "pass", 
    "relevance_threshold": 3
}

In [None]:
from pprint import pprint

from azure.ai.evaluation import RelevanceEvaluator

# This section covers relevance, so use RelevanceEvaluator here
relevance_evaluator = RelevanceEvaluator(model_config=model_config, threshold=3)

result = relevance_evaluator(
    query="Which screwdriver set is best for DIY projects?",
    response="Our 20-piece precision screwdriver set is perfect for DIY work and includes both Phillips and flathead types."
)

pprint(result)

### 6.4 Response Completeness Evaluator

ResponseCompletenessEvaluator captures the recall aspect of response alignment with the expected response: how much of the ground truth the response covers. It complements GroundednessEvaluator, which captures the precision aspect of response alignment with the grounding source.

In [None]:
from azure.ai.evaluation import ResponseCompletenessEvaluator

response_completeness = ResponseCompletenessEvaluator(model_config=model_config, threshold=3)
response_completeness(
    response="Our cordless drill has a 20V battery and weighs 4.5 pounds.",
    ground_truth="The drill features a 20V lithium battery, weighs 4.5 pounds, and includes a 2-year warranty."
)

---

## 7. Textual Similarity Evaluators


These evaluators compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth".
- The SimilarityEvaluator uses an "LLM-as-Judge" (AI-assisted evaluation) approach to score the metric.
- The F1 Score, BLEU, GLEU, ROUGE and METEOR evaluators (NLP-based) use a mathematical approach to score the metric.

### 7.1 Similarity Evaluator
SimilarityEvaluator measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.



In [None]:
from azure.ai.evaluation import SimilarityEvaluator

similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
similarity(
    query="Do you have exterior paint?", 
    response="Yes, we stock weather-resistant exterior paint in multiple colors.",
    ground_truth="We carry exterior paint suitable for outdoor use."
)

### 7.2 F1 Score
F1ScoreEvaluator measures similarity by the tokens shared between the generated text and the ground truth, balancing precision and recall. Precision is the fraction of response tokens that also appear in the ground truth, recall is the fraction of ground-truth tokens that also appear in the response, and the F1 score is their harmonic mean.

In [None]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_score = F1ScoreEvaluator(threshold=0.5)
f1_score(
    response="We stock weather-resistant exterior paint in various colors.",
    ground_truth="We carry exterior paint suitable for outdoor use."
)
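
To make the token-overlap idea concrete, here is a minimal sketch of an F1 computation in plain Python. It assumes lowercasing and whitespace tokenization, which is simpler than whatever normalization the SDK applies, so its numbers may differ from `F1ScoreEvaluator`. `token_f1` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def token_f1(response, ground_truth):
    """Token-overlap F1 using lowercased, whitespace-split tokens."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both texts.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "We stock weather-resistant exterior paint in various colors.",
    "We carry exterior paint suitable for outdoor use.",
)
print(round(score, 3))
```

With this tokenization the two sentences share three of eight tokens each ("we", "exterior", "paint"), so the sketch yields 0.375.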

### 7.3 BLEU Score
BleuScoreEvaluator computes the BLEU (Bilingual Evaluation Understudy) score commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text.

In [None]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_score = BleuScoreEvaluator(threshold=0.3)
bleu_score(
    response="We stock weather-resistant exterior paint in various colors.",
    ground_truth="We carry exterior paint suitable for outdoor use."
)

### 7.4 GLEU Score

GleuScoreEvaluator computes the GLEU (Google-BLEU) score. It measures the similarity by shared n-grams between the generated text and ground truth, similar to the BLEU score, focusing on both precision and recall. But it addresses the drawbacks of the BLEU score using a per-sentence reward objective. The numerical score is a 0-1 float and a higher score is better. 

In [None]:
from azure.ai.evaluation import GleuScoreEvaluator


gleu_score = GleuScoreEvaluator(threshold=0.2)
gleu_score(
    response="We stock weather-resistant exterior paint in various colors.",
    ground_truth="We carry exterior paint suitable for outdoor use."
)
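
The "precision and recall over shared n-grams" idea behind GLEU can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization, n-grams of length 1 to 4, and a single reference; `count_ngrams` and `sentence_gleu_sketch` are hypothetical names, and the SDK's tokenizer may produce different scores.

```python
from collections import Counter


def count_ngrams(tokens, max_n=4):
    """Count all n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu_sketch(response, ground_truth, max_n=4):
    """Minimum of n-gram precision and n-gram recall."""
    hyp = count_ngrams(response.lower().split(), max_n)
    ref = count_ngrams(ground_truth.lower().split(), max_n)
    overlap = sum((hyp & ref).values())
    hyp_total = sum(hyp.values())
    ref_total = sum(ref.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    return min(overlap / hyp_total, overlap / ref_total)


gleu = sentence_gleu_sketch(
    "We stock weather-resistant exterior paint in various colors.",
    "We carry exterior paint suitable for outdoor use.",
)
print(round(gleu, 3))
```

Taking the minimum of precision and recall penalizes a response that is either padded with extra material or missing parts of the reference.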

### 7.5 ROUGE Score
RougeScoreEvaluator computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. The numerical score is a 0-1 float and a higher score is better. 

In [None]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L, precision_threshold=0.6, recall_threshold=0.5, f1_score_threshold=0.55) 
rouge(
    response="We stock weather-resistant exterior paint in various colors.",
    ground_truth="We carry exterior paint suitable for outdoor use."
)
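
`RougeType.ROUGE_L` is based on the longest common subsequence (LCS) between the two texts. The sketch below shows how precision, recall, and F1 follow from the LCS length. It assumes whitespace tokenization and uses hypothetical helper names (`lcs_length`, `rouge_l_sketch`), so treat it as an illustration of the metric rather than the evaluator's exact behavior.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(response, ground_truth):
    """ROUGE-L precision, recall, and F1 from the LCS length."""
    hyp = response.lower().split()
    ref = ground_truth.lower().split()
    lcs = lcs_length(hyp, ref)
    precision = lcs / len(hyp) if hyp else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


scores = rouge_l_sketch(
    "We stock weather-resistant exterior paint in various colors.",
    "We carry exterior paint suitable for outdoor use.",
)
print(scores)
```

Three separate thresholds make sense here because a response can score well on recall (it covers the reference) while scoring poorly on precision (it adds a lot of extra text), or the reverse.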

### 7.6 METEOR Score
MeteorScoreEvaluator measures the similarity by shared n-grams between the generated text and the ground truth, similar to the BLEU score, focusing on precision and recall. But it addresses limitations of other metrics like the BLEU score by considering synonyms, stemming, and paraphrasing for content alignment.

In [None]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor_score = MeteorScoreEvaluator(threshold=0.9)
meteor_score(
    response="We stock weather-resistant exterior paint in various colors.",
    ground_truth="We carry exterior paint suitable for outdoor use."
)

---

## 8. Explore Custom Evaluators



### 8.1 Code-Based Evaluator



In [None]:
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, answer: str, **kwargs):
        contains_block_word = any(word in answer for word in self._blocklist)
        return {"score": contains_block_word}

blocklist_evaluator = BlocklistEvaluator(blocklist=["unavailable", "discontinued", "out of stock"])

# Test custom evaluator 1
result = response_length("Our cordless drill is perfect for home projects.")
print(result)

# Test custom evaluator 2
result = blocklist_evaluator(answer="This drill is available in our store.")
print(result)

# Test custom evaluator 3
result = blocklist_evaluator(answer="Sorry, that item is currently out of stock.")
print(result)
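
Note that `BlocklistEvaluator` above matches substrings case-sensitively, so "Out of stock" would not be flagged. The following variant is a sketch of one way to relax that. The class name, the `response` parameter name, and the output keys are choices made for this illustration and are not prescribed by the SDK.

```python
class CaseInsensitiveBlocklistEvaluator:
    """Flags responses containing any blocked phrase, ignoring case."""

    def __init__(self, blocklist):
        self._blocklist = [phrase.lower() for phrase in blocklist]

    def __call__(self, *, response: str, **kwargs):
        text = response.lower()
        matches = [phrase for phrase in self._blocklist if phrase in text]
        return {"contains_blocked_word": bool(matches), "matched_words": matches}


ci_blocklist = CaseInsensitiveBlocklistEvaluator(
    blocklist=["unavailable", "discontinued", "out of stock"]
)
print(ci_blocklist(response="Sorry, that item is currently Out Of Stock."))
print(ci_blocklist(response="This drill is available in our store."))
```

Returning the matched phrases alongside the boolean makes it easier to see why a row was flagged when you review results later.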

### 8.2 Prompt-Based Evaluator
To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a prompt template. [Learn more here](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/custom-evaluators). You can also create a custom grader using the Azure OpenAI grader with custom prompts. [See example here](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/samples/aoai_score_model_grader_sample.py).

---

## 9. Run Multiple Evaluators



In [None]:
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)

# Create evaluators
# ContentSafetyEvaluator requires the azure_ai_project_url (string) format
content_safety_evaluator = ContentSafetyEvaluator(azure_ai_project=azure_ai_project_url, credential=credential)

# Other evaluators use the model_config
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)


result = evaluate(
    data="42-evaluate-quality.jsonl",
    evaluators={
        "content_safety": content_safety_evaluator,
        "coherence": coherence_evaluator,
        "relevance": relevance_evaluator,
        "groundedness": groundedness_evaluator,
        "fluency": fluency_evaluator,
        "similarity": similarity_evaluator,
    },
    evaluation_name="42-evaluate-quality",
    # column mapping
    evaluator_config={
        "content_safety": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}"
            }
        },
        "coherence": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}"
            }
        },
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.ground_truth}",
                "response": "${data.response}"
            } 
        },
        "relevance": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}"
            } 
        },
        "fluency": {
            "column_mapping": {
                "response": "${data.response}"
            } 
        },
        "similarity": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "ground_truth": "${data.ground_truth}"
            } 
        },
    },

    # Specify the azure_ai_project_url (string format) to push results to portal
    azure_ai_project = azure_ai_project_url,
    
    # Specify the output path to push results also to local file
    output_path="./42-evaluate-quality.results.json"
)
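
The `evaluate()` call above reads a JSONL file and maps its columns with `${data.query}`, `${data.response}`, and `${data.ground_truth}`. If you want to try the run with your own data, the sketch below writes a small file with those three columns. The file name `sample-quality.jsonl` and the helpers are hypothetical, and the rows reuse examples from earlier cells in this notebook.

```python
import json
import tempfile
from pathlib import Path

# One JSON object per line, with the column names used in column_mapping.
rows = [
    {
        "query": "Do you have exterior paint?",
        "response": "Yes, we stock weather-resistant exterior paint in multiple colors.",
        "ground_truth": "We carry exterior paint suitable for outdoor use.",
    },
    {
        "query": "What type of drill should I buy for woodworking?",
        "response": "For woodworking, I recommend our cordless drills. We have DeWalt and Makita models starting at $89.",
        "ground_truth": "Cordless drills from DeWalt or Makita are best for woodworking projects.",
    },
]


def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample_path = Path(tempfile.gettempdir()) / "sample-quality.jsonl"
write_jsonl(rows, sample_path)
loaded = read_jsonl(sample_path)
print(f"{len(loaded)} rows, columns: {sorted(loaded[0])}")
```

Reading the file back before an evaluation run is a cheap way to catch a malformed line or a misspelled column name.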

--- 

### 9.1 View Results Online

Just as before, you can now view the results of the multi-evaluator run using the Evaluation tab in the Azure AI Foundry portal. Here is what you should see:




#### Quality Evaluation

![Quality](./../docs/img/screenshots/lab-02-portal-quality.png)

### 9.2 View Results Locally

Just like before, you can see the results of the multi-evaluator run stored in a local JSON file, written to the `output_path` set above: `42-evaluate-quality.results.json`.
- Open the file in the VS Code editor
- Right click and select "Format Document" to read the results better
- Observe the metrics collected for each row - these are the multiple evaluators running on the same sample prompt/response pair
- Scroll down to the end of the file to see the summary of the evaluation run - get a sense of the overall metrics for the run.
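
If you prefer to inspect the results in code, the sketch below averages the numeric output columns across rows. It rests on assumptions you should check against your own file: that the results contain a top-level `rows` list and that per-row metric keys start with `outputs.`. The sample data is made up purely to exercise the helper, and `average_outputs` is a hypothetical name.

```python
from statistics import mean


def average_outputs(rows, prefix="outputs."):
    """Average every numeric column whose key starts with the prefix."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.startswith(prefix) and is_number:
                columns.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in columns.items()}


# Made-up rows in the assumed shape; replace with json.load() on your file.
sample_results = {
    "rows": [
        {"inputs.query": "q1", "outputs.coherence.coherence": 4, "outputs.fluency.fluency": 3},
        {"inputs.query": "q2", "outputs.coherence.coherence": 5, "outputs.fluency.fluency": 4},
    ]
}

summary = average_outputs(sample_results["rows"])
for name, value in sorted(summary.items()):
    print(f"{name}: {value}")
```

Non-numeric columns such as reasons and pass/fail labels are skipped, so the summary only contains scores and other numeric fields.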

---

## 🎉 | Congratulations!

You have successfully completed the second lab in this module and gained hands-on experience with a core subset of the built-in quality evaluators. You also got a sense of how to create and run a custom evaluator.