# 🎯 Goal of the Exercise  

In this exercise, you'll learn how to evaluate the **quality of AI-generated text responses** using advanced, AI-assisted evaluators such as **Relevance**, **Coherence**, **Fluency**, **Groundedness**, and custom-built evaluators. Your tasks involve implementing, configuring, and applying these evaluators to ensure generated content meets high-quality standards.

Through this exercise, you'll gain practical experience in:

- Identifying the key dimensions of text quality (relevance, coherence, fluency, groundedness).
- Leveraging AI-assisted built-in evaluators (*LLM-as-a-judge*) to assess text outputs.
- Creating custom evaluators tailored to your specific use cases or quality standards.

# Links to documentation

https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/rag-evaluators

https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/custom-evaluators


In [13]:
from IPython.display import clear_output

# Evaluating Response Quality with RelevanceEvaluator

### 🔧 Task: Implement the Relevance Evaluator

Evaluates relevance score for a given query and response or a multi-turn conversation, including reasoning.

The relevance measure assesses the ability of answers to capture the key points of the context. High relevance scores signify the AI system's understanding of the input and its capability to produce coherent and contextually appropriate outputs. Conversely, low relevance scores indicate that generated responses might be off-topic, lacking in context, or insufficient in addressing the user's intended queries. Use the relevance metric when evaluating the AI system's performance in understanding the input and generating contextually appropriate responses.

Relevance scores range from 1 to 5, with 1 being the worst and 5 being the best.

Fill in the missing code to initialize and use the `RelevanceEvaluator`.
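
As a starting point, here is a minimal sketch of one way to wire this up. It assumes the `azure-ai-evaluation` package is installed and that the judge model is an Azure OpenAI deployment. The environment variable names and both helper functions are hypothetical, chosen for illustration only; adapt them to however `model_config` is defined in your setup.

```python
import os


def build_model_config(env=None) -> dict:
    """Collect Azure OpenAI settings for the judge model.

    The environment variable names are assumptions, not part of the exercise.
    """
    env = os.environ if env is None else env
    return {
        "azure_endpoint": env.get("AZURE_OPENAI_ENDPOINT", ""),
        "api_key": env.get("AZURE_OPENAI_API_KEY", ""),
        "azure_deployment": env.get("AZURE_OPENAI_DEPLOYMENT", ""),
    }


def score_relevance(model_config: dict, query: str, response: str) -> dict:
    """Run the built-in relevance evaluator on one query/response pair."""
    # Imported lazily so build_model_config works without the SDK installed.
    from azure.ai.evaluation import RelevanceEvaluator

    evaluator = RelevanceEvaluator(model_config)
    return evaluator(query=query, response=response)
```

Calling `score_relevance(build_model_config(), query=..., response=...)` needs valid credentials and sends a request to the judge model, so it is not executed here.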

In [None]:
# TODO: Instantiate, configure and run the RelevanceEvaluator in order to evaluate
# the relevance of a response compared to the initial query.

# Evaluating Response Quality with CoherenceEvaluator

### 🔧 Task: Implement the Coherence Evaluator

Evaluates coherence score for a given query and response or a multi-turn conversation, including reasoning.

The coherence measure assesses the ability of the language model to generate text that reads naturally, flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability and user-friendliness of a model's generated responses in real-world applications.

Complete the code below to instantiate and use the `CoherenceEvaluator`.

In [None]:
# TODO: Instantiate, configure and run the CoherenceEvaluator in order to evaluate
# the coherence of a response compared to the initial query.

# Evaluating Response Quality with FluencyEvaluator

### 🔧 Task: Implement the Fluency Evaluator

Evaluates the fluency of a given response or a multi-turn conversation, including reasoning.

The fluency measure assesses the extent to which the generated text conforms to grammatical rules, syntactic structures, and appropriate vocabulary usage, resulting in linguistically correct responses.

Fluency scores range from 1 to 5, with 1 being the least fluent and 5 being the most fluent.

Fill in the missing code to initialize and use the `FluencyEvaluator`.

In [None]:
# TODO: Instantiate, configure and run the FluencyEvaluator in order to evaluate
# the fluency of a response in terms of grammar, syntax and vocabulary usage.

# Evaluating Response Quality with GroundednessEvaluator

### 🔧 Task: Implement the Groundedness Evaluator

Evaluates groundedness score for a given query (optional), response, and context or a multi-turn conversation, including reasoning.

The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Even if the responses from LLM are factually correct, they'll be considered ungrounded if they can't be verified against the provided sources (such as your input source or your database). Use the groundedness metric when you need to verify that AI-generated responses align with and are validated by the provided context.

Groundedness scores range from 1 to 5, with 1 being the least grounded and 5 being the most grounded.

Complete the code below to instantiate and use the `GroundednessEvaluator`.

In [None]:
# TODO: Instantiate, configure and run the GroundednessEvaluator in order to evaluate
# the groundedness of a response compared to the retrieved context which is useful in RAG scenarios.

# Creating custom evaluators

## Code-based evaluator

### Function-based evaluator

### 🔧 Task: Implement a Custom Function-based Evaluator
Write a function to measure the length of a response.

**At this stage, you don't need an LLM to assist you with this evaluation.**
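
A function-based evaluator can be as small as a plain Python callable that takes the response and returns a dictionary of metrics. The sketch below is one possible shape; the function name and the metric keys are illustrative choices, not names required by the exercise.

```python
def response_length(response: str, **kwargs) -> dict:
    """Measure a response in characters and whitespace-separated words.

    Extra keyword arguments (for example a query) are accepted and ignored,
    so the function can be called with more inputs than it uses.
    """
    return {
        "response_length_chars": len(response),
        "response_length_words": len(response.split()),
    }


result = response_length(response="The answer is 42.")
print(result)
```

Returning a dictionary rather than a bare number keeps the metric name attached to its value, which helps when several evaluators are combined later.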

In [None]:
# TODO: Instantiate, configure and run a function-based evaluator
# Example "Custom evaluator function to calculate response length"

### Class-based evaluator

### 🔧 Task: Implement a Custom Class-based Evaluator
Create a class-based evaluator that checks responses for blocked words.

**At this stage, you don't need an LLM to assist you with this evaluation.**
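
A class-based evaluator is useful when the check needs configuration, here the list of blocked words. The following is a minimal sketch: the class name, the output keys, and the simple word-matching rule (case-insensitive, whole words only) are all assumptions made for illustration.

```python
import re


class BlocklistEvaluator:
    """Flag responses that contain any word from a configurable blocklist."""

    def __init__(self, blocklist):
        # Normalize once so matching is case-insensitive.
        self._blocklist = {word.lower() for word in blocklist}

    def __call__(self, *, response: str, **kwargs) -> dict:
        # Split the response into lowercase word tokens, dropping punctuation.
        words = set(re.findall(r"[a-z0-9']+", response.lower()))
        found = sorted(words & self._blocklist)
        return {
            "contains_blocked_word": bool(found),
            "blocked_words_found": found,
        }


blocklist_evaluator = BlocklistEvaluator(["password", "secret"])
print(blocklist_evaluator(response="Never share your Password."))
```

The configuration lives in `__init__` and the per-row logic in `__call__`, so one instance can be reused across a whole dataset.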

In [None]:
# TODO: Instantiate, configure and run a class-based evaluator
# Example Custom class-based evaluator to check for blocked words

## Prompt-based evaluators

#### Helpfulness evaluator

In this folder you'll find a custom evaluator named **"helpfulness"**. Its purpose is to evaluate, using an LLM, how *helpful* a given response is.

You'll find two files:
 - ```helpfulness.prompty``` is a prompty file that templatizes your prompt, specifying the model, hyperparameters, instructions, etc. This is where we insert the instructions for the custom evaluator.
 - ```helpfulness.py``` is a Python module that defines the **HelpfulnessEvaluator** class so that it can be called from the notebook.

### 👀 Observe: Using the custom Helpfulness Evaluator

In [None]:
from helpfulness import HelpfulnessEvaluator

helpfulness_evaluator = HelpfulnessEvaluator(model_config)

helpfulness_score = helpfulness_evaluator(
    query="What's the meaning of life?", 
    context="Arthur Schopenhauer was the first to explicitly ask the question, in an essay entitled 'Character'.", 
    response="The answer is 42."
)
print(helpfulness_score)

#### JSON accuracy evaluator

Based on the HelpfulnessEvaluator, implement your own custom evaluator.

**Idea**: create a JSON Schema evaluator. The goal of this custom evaluator is to assess how well a JSON output complies with a given schema.

**jsons**: the ```jsons``` folder contains sample JSON outputs (```complete/poor/very_poor_output.json```) and a JSON schema file (```example_schema.json```). You can use these files to run the evaluation and compare the scores and reasons.

### 🔧 Task: Implement a JSON Schema Evaluator
Load a JSON schema and use it to evaluate JSON objects.
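
The task itself calls for a prompt-based evaluator, but a small deterministic check can serve as a baseline to compare the LLM's scores and reasons against. The sketch below is not a full JSON Schema validator: it only looks at top-level `required` and `properties.*.type`. The schema and instances are made up for illustration and are not the contents of `example_schema.json`.

```python
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _matches(value, type_name) -> bool:
    expected = _JSON_TYPES.get(type_name)
    if expected is None:
        return True  # unknown or missing type: not checked in this sketch
    # In Python, bool is a subclass of int; JSON treats them as distinct.
    if isinstance(value, bool) and type_name in ("number", "integer"):
        return False
    return isinstance(value, expected)


def schema_violations(instance: dict, schema: dict) -> list:
    """List top-level problems: missing required keys and wrong value types."""
    problems = []
    for key in schema.get("required", []):
        if key not in instance:
            problems.append(f"missing required property '{key}'")
    for key, spec in schema.get("properties", {}).items():
        if key in instance and not _matches(instance[key], spec.get("type")):
            problems.append(f"property '{key}' should be of type {spec['type']}")
    return problems


# Hypothetical schema, used only to exercise the function.
demo_schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}
print(schema_violations({"name": "Ada", "age": "36"}, demo_schema))
```

An empty list means no violation was found at the top level; nested objects, enums, and formats are out of scope here and are where the LLM-based evaluator is expected to add value.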

In [None]:
# TODO: Implement JSON schema evaluation
# You're expected to implement a JSON schema evaluation prompty file and a python module
# that will be used here (like the helpfulness module)

# Evaluating a dataset

### 🔧 Task: Evaluate the dataset with built-in and custom evaluators
Fill in the missing code to initialize and run the different evaluators, both built-in (`RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `GroundednessEvaluator`, `RetrievalEvaluator`, etc.) and custom ones.
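
One possible way to structure this step is sketched below, assuming the `azure-ai-evaluation` package and a JSONL dataset whose columns are named `query`, `response`, and `context`. Those column names and both helper functions are assumptions for illustration. The sketch also assumes the dataset already contains responses, so it does not use a target.

```python
def build_evaluator_config() -> dict:
    """Map dataset columns to the inputs each evaluator expects.

    Column names (query, response, context) are assumptions about the data.
    """
    query_response = {"query": "${data.query}", "response": "${data.response}"}
    return {
        "relevance": {"column_mapping": dict(query_response)},
        "fluency": {"column_mapping": {"response": "${data.response}"}},
        "groundedness": {
            "column_mapping": {**query_response, "context": "${data.context}"}
        },
    }


def run_evaluation(data_path: str, model_config: dict):
    """Run several built-in evaluators over a JSONL file."""
    # Imported lazily so build_evaluator_config works without the SDK installed.
    from azure.ai.evaluation import (
        FluencyEvaluator,
        GroundednessEvaluator,
        RelevanceEvaluator,
        evaluate,
    )

    return evaluate(
        data=data_path,
        evaluators={
            "relevance": RelevanceEvaluator(model_config),
            "fluency": FluencyEvaluator(model_config),
            "groundedness": GroundednessEvaluator(model_config),
        },
        evaluator_config=build_evaluator_config(),
    )
```

The keys of `evaluators` and `evaluator_config` must match so that each mapping is applied to the right evaluator. Custom evaluators can be added to both dictionaries in the same way.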

In [None]:
# TODO: Instantiate, configure and run the different evaluators. 
# You can use the model_endpoint module as the target to interact with the model endpoint and get the output to evaluate.
# Example: RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator, HelpfulnessEvaluator
# You'll need to configure the evaluators with the appropriate column_mapping and run them on the data.

In [None]:
# TODO: Display results dataframe