# Evaluate Using Performance & Quality Metrics

Contoso Home Furnishings is developing an app that generates product descriptions for their selection of furniture. The app aims to generate engaging product descriptions based on the manufacturer's specification of the furniture.

In this exercise, you will evaluate the generated product descriptions using performance and quality metrics. Below is an example row of data for the description generated for the Contoso Home Furnishings Dining Chair:

`context`

Dining chair. Wooden seat. Four legs. Backrest. Brown. 18" wide, 20" deep, 35" tall. Holds 250 lbs.

`query`

Given the product specification for the Contoso Home Furnishings Dining Chair, provide a product description.

`ground_truth`

The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18" wide, 20" deep, 35" tall. The dining chair has a weight capacity of 250 lbs.

`response`

Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18" wide, 20" deep, and 35" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option.


## Install the package

The evaluator classes for assessing performance and quality are in the Azure AI Evaluation SDK. We'll begin by installing the package.

In [1]:
%pip install azure-ai-evaluation

Defaulting to user installation because normal site-packages is not writeable
Collecting azure-ai-evaluation
  Downloading azure_ai_evaluation-1.0.1-py3-none-any.whl.metadata (28 kB)
Collecting promptflow-devkit>=1.15.0 (from azure-ai-evaluation)
  Downloading promptflow_devkit-1.16.2-py3-none-any.whl.metadata (5.7 kB)
Collecting promptflow-core>=1.15.0 (from azure-ai-evaluation)
  Downloading promptflow_core-1.16.2-py3-none-any.whl.metadata (2.8 kB)
Collecting pyjwt>=2.8.0 (from azure-ai-evaluation)
  Downloading PyJWT-2.10.1-py3-none-any.whl.metadata (4.0 kB)
Collecting azure-identity>=1.16.0 (from azure-ai-evaluation)
  Downloading azure_identity-1.19.0-py3-none-any.whl.metadata (80 kB)
Collecting nltk>=3.9.1 (from azure-ai-evaluation)
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting cryptography>=2.5 (from azure-identity>=1.16.0->azure-ai-ev

## Import packages

We'll import `os` so that you can access the environment variables that you'll set in the next step.

In [2]:
import os

## Set environment variables to create an instance of the evaluators

We'll now set the environment variables that'll be required to create an instance of the evaluators. You'll need the following:

- Azure OpenAI endpoint
- Azure OpenAI API Key
- Azure deployment

You can locate your **Azure OpenAI endpoint** and **Azure OpenAI API Key** by navigating to **Models + endpoints**, selecting the model, and copying the respective credentials for your model deployment.

In [3]:
os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://ai-ziggynewhub464429846644.openai.azure.com/'
os.environ['AZURE_OPENAI_API_KEY'] = '<your-azure-openai-api-key>'  # never commit a real key to a shared notebook
os.environ['AZURE_OPENAI_DEPLOYMENT'] = 'gpt-4o'
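
If you'd rather not type credentials into a notebook cell, one option is to export the three variables in your shell before starting the notebook and only check for them here. The following is a minimal sketch under that assumption; `read_required_env` is a hypothetical helper and is not part of the Azure AI Evaluation SDK.

```python
import os

# Names of the variables this notebook relies on.
REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)


def read_required_env(names=REQUIRED_VARS, environ=None):
    """Return the requested variables, or fail early with a clear message."""
    environ = os.environ if environ is None else environ
    # Treat unset and empty values the same way.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `read_required_env()` at the top of the notebook surfaces a missing variable immediately, rather than as an authentication error later on.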

## Configure the model_config

The AI-assisted evaluator classes require a `model_config` when you create an instance. Let's configure the `model_config` with the following:

- Azure OpenAI endpoint
- Azure OpenAI API key
- Azure deployment

In [4]:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

## Create variables for the evaluation data

Since we'll be using the same context, query, response, and ground truth for the exercises, we'll create a variable to store each string and pass the variables into our evaluations.

In [5]:
context = "Dining chair. Wooden seat. Four legs. Backrest. Brown. 18\" wide, 20\" deep, 35\" tall. Holds 250 lbs."
query = "Given the product specification for the Contoso Home Furnishings Dining Chair, provide an engaging marketing product description."
ground_truth = "The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18\" wide, 20\" deep, 35\" tall. The dining chair has a weight capacity of 250 lbs."
response = "Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18\" wide, 20\" deep, and 35\" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option."

## Evaluate for Groundedness

Create an instance of the `GroundednessEvaluator` and run the evaluation.



In [6]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness_eval = GroundednessEvaluator(model_config)

groundedness_score = groundedness_eval(
    response=response,
    context=context,
)

print(groundedness_score)

{'groundedness': 3.0, 'gpt_groundedness': 3.0, 'groundedness_reason': 'The RESPONSE accurately reflects the CONTEXT but includes additional details and descriptions that are not supported by the CONTEXT.'}


## Evaluate for Relevance

Create an instance of the `RelevanceEvaluator` and run the evaluation.

In [7]:
from azure.ai.evaluation import RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config)

relevance_score = relevance_eval(
    response=response,
    context=context,
    query=query
)

print(relevance_score)

{'relevance': 4.0, 'gpt_relevance': 4.0, 'relevance_reason': 'The RESPONSE provides a complete and engaging marketing description of the dining chair, addressing all aspects of the QUERY effectively.'}


## Evaluate for Coherence

Create an instance of the `CoherenceEvaluator` and run the evaluation.

In [8]:
from azure.ai.evaluation import CoherenceEvaluator

coherence_eval = CoherenceEvaluator(model_config)

coherence_score = coherence_eval(
    response=response,
    query=query
)

print(coherence_score)

{'coherence': 4.0, 'gpt_coherence': 4.0, 'coherence_reason': 'The RESPONSE is coherent and effectively addresses the QUERY with a logical sequence of ideas and clear connections between sentences. It provides a comprehensive and engaging description of the dining chair, making it suitable for marketing purposes.'}


## Evaluate for Fluency

Create an instance of the `FluencyEvaluator` and run the evaluation.

In [9]:
from azure.ai.evaluation import FluencyEvaluator

fluency_eval = FluencyEvaluator(model_config)

fluency_score = fluency_eval(
    response=response,
    query=query
)

print(fluency_score)

{'fluency': 4.0, 'gpt_fluency': 4.0, 'fluency_reason': 'The RESPONSE demonstrates proficient fluency with well-articulated language, varied vocabulary, and complex sentence structures. It is coherent and cohesive, with minor errors that do not affect understanding. The text flows smoothly, connecting ideas logically.'}


## Evaluate for Similarity

Create an instance of the `SimilarityEvaluator` and run the evaluation.

In [10]:
from azure.ai.evaluation import SimilarityEvaluator

similarity_eval = SimilarityEvaluator(model_config)

similarity_score = similarity_eval(
    response=response,
    query=query,
    ground_truth=ground_truth
)

print(similarity_score)

{'similarity': 5.0, 'gpt_similarity': 5.0}


## Evaluate for F1 Score

Create an instance of the `F1ScoreEvaluator` and run the evaluation.

In [11]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_eval = F1ScoreEvaluator()

f1_score = f1_eval(
    response=response,
    ground_truth=ground_truth
)

print(f1_score)

{'f1_score': 0.35185185185185186}
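
For intuition about what the F1 score captures, here is a minimal sketch of a token-overlap F1. It assumes lowercasing and whitespace tokenization; the SDK's own text normalization may differ, so this hypothetical `token_f1` function is not expected to reproduce the score above exactly.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall (simplified tokenization)."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()

    # Multiset intersection: a shared token counts at most as often
    # as it appears in both texts.
    shared = Counter(response_tokens) & Counter(truth_tokens)
    num_shared = sum(shared.values())
    if num_shared == 0:
        return 0.0

    precision = num_shared / len(response_tokens)  # share of the response that is in the ground truth
    recall = num_shared / len(truth_tokens)        # share of the ground truth that the response covers
    return 2 * precision * recall / (precision + recall)
```

A long, embellished response can cover most of the ground truth (high recall) and still get a modest F1, because the extra wording lowers precision.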


## Evaluate for ROUGE
There are several types of ROUGE metrics: `ROUGE_1`, `ROUGE_2`, `ROUGE_3`, `ROUGE_4`, `ROUGE_5`, and `ROUGE_L`.

The first five types are **ROUGE-N** metrics, which measure the overlap of n-grams (contiguous sequences of 'n' words) between the generated summary and the reference summary. For example, `ROUGE_1` measures the overlap of unigrams (single words), and `ROUGE_2` measures the overlap of bigrams (two-word sequences). The SDK supports up to 5-grams.
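
The n-gram overlap idea can be sketched in a few lines. This is an illustrative approximation that assumes lowercasing and whitespace tokenization; `ngram_counts` and `rouge_n` are hypothetical helpers, not the SDK's implementation, so their numbers may differ from the evaluator's.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(response: str, ground_truth: str, n: int = 1) -> dict:
    """Precision, recall, and F1 over overlapping n-grams."""
    response_ngrams = ngram_counts(response.lower().split(), n)
    truth_ngrams = ngram_counts(ground_truth.lower().split(), n)

    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts.
    overlap = sum((response_ngrams & truth_ngrams).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / sum(response_ngrams.values())
    recall = overlap / sum(truth_ngrams.values())
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

Raising `n` makes the metric stricter, because longer word sequences must match exactly.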

`ROUGE_L` measures the longest common subsequence (LCS) between the generated and reference summaries. Because LCS rewards words that appear in the same order without requiring them to be adjacent, `ROUGE_L` is effective at capturing sentence-level structure.
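
To make the LCS idea concrete, here is a small dynamic-programming sketch. It assumes lowercasing and whitespace tokenization, and `lcs_length` and `rouge_l` are hypothetical helpers written for illustration rather than the SDK's own code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                # Extend the best subsequence that ended before both tokens.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping one token from either side.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(response: str, ground_truth: str) -> dict:
    """Precision, recall, and F1 based on the LCS length."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    lcs = lcs_length(response_tokens, truth_tokens)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(response_tokens)
    recall = lcs / len(truth_tokens)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

Unlike n-gram overlap, the matched words here do not have to be adjacent; they only have to appear in the same order in both texts.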

Create an instance of the `RougeScoreEvaluator` and run the evaluation.

In [12]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge_eval = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

rouge_score = rouge_eval(
    response=response,
    ground_truth=ground_truth,
)

print(rouge_score)

{'rouge_precision': 0.2777777777777778, 'rouge_recall': 0.78125, 'rouge_f1_score': 0.40983606557377056}


## Evaluate for BLEU

Create an instance of the `BleuScoreEvaluator` and run the evaluation.

**Note**: The initial run may install a package. If this occurs, run the cell once more to receive the BLEU score.

In [14]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_eval = BleuScoreEvaluator()

bleu_score = bleu_eval(
    response=response,
    ground_truth=ground_truth
)

print(bleu_score)

{'bleu_score': 0.10903931692423613}


## Evaluate for METEOR

The METEOR metric takes `alpha`, `beta`, and `gamma` parameters, which control the balance between precision and recall and the penalty for incorrect word order (fragmentation penalty). These parameters influence how the final METEOR score is calculated, helping fine-tune its sensitivity to different aspects of translation or summary quality.
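
The role of each parameter is easier to see in the commonly used METEOR formulation, sketched below. This is an assumption about the general shape of the formula (the one used by NLTK's `meteor_score`, as far as I know), not a statement about the SDK's internals. The hypothetical `meteor_from_counts` helper takes already-counted word matches and chunks as inputs and skips the alignment step (stemming, synonyms) that a real implementation performs.

```python
def meteor_from_counts(matches, response_len, reference_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine match counts into a METEOR-style score.

    matches: number of aligned words between response and reference
    chunks:  number of contiguous, in-order groups the matches fall into
             (fewer chunks means better word order)
    """
    if matches == 0:
        return 0.0

    precision = matches / response_len
    recall = matches / reference_len

    # alpha weights recall against precision in the harmonic mean;
    # values near 1 make the mean follow recall more closely.
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)

    # gamma caps the size of the fragmentation penalty;
    # beta controls how sharply it grows as matches become more scattered.
    penalty = gamma * (chunks / matches) ** beta

    return (1 - penalty) * f_mean
```

With the same matches, spreading them over more chunks lowers the score, and setting `gamma=0` removes the word-order penalty entirely.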

Create an instance of the `MeteorScoreEvaluator` and run the evaluation.

In [15]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor_eval = MeteorScoreEvaluator(
    alpha=0.9,
    beta=3.0,
    gamma=0.5
)

meteor_score = meteor_eval(
    response=response,
    ground_truth=ground_truth,
)

print(meteor_score)

{'meteor_score': 0.5252285661368535}


## Evaluate for GLEU

Create an instance of the `GleuScoreEvaluator` and run the evaluation.

In [16]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu_eval = GleuScoreEvaluator()

gleu_score = gleu_eval(
    response=response,
    ground_truth=ground_truth,
)

print(gleu_score)

{'gleu_score': 0.13658536585365855}


## Evaluate on a test dataset

We can evaluate an entire dataset with the `evaluate` function, which can also run several evaluators in a single pass. In our case, we're going to run a few evaluators on the product description dataset in the `performance-quality-data.jsonl` file. We'll also output the results to a new `evaluation_results.json` file.
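
A JSONL file stores one JSON object per line. As a rough sketch, the hypothetical `load_jsonl` helper below reads such a file and checks that each row carries the four columns used in this exercise (`query`, `response`, `context`, `ground_truth`). It is a pre-flight check written for illustration and is not part of the SDK.

```python
import json

# Columns that the evaluators in this exercise read from each row.
REQUIRED_COLUMNS = ("query", "response", "context", "ground_truth")


def load_jsonl(path):
    """Read a JSONL file and verify that every row has the expected columns."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [column for column in REQUIRED_COLUMNS if column not in row]
            if missing:
                raise ValueError(f"line {line_number} is missing columns: {missing}")
            rows.append(row)
    return rows
```

Running a check like this before a batch evaluation catches malformed rows before any model calls are made.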

Let's run an evaluation using the `Relevance`, `Groundedness`, and `Fluency` evaluators.

In [17]:
from azure.ai.evaluation import evaluate
import json

path = "performance-quality-data.jsonl"

result = evaluate(
    data=path, # provide your data here
    evaluators={
        "relevance": relevance_eval,
        "groundedness": groundedness_eval,
        "fluency": fluency_eval
    },
    # column mapping
    evaluator_config={
        "default": {
            "query": "${data.query}",
            "response": "${data.response}",
            "context": "${data.context}",
            "ground_truth": "${data.ground_truth}"
        }
    },
    # write the results to a local file
    output_path="./evaluation_results.json"
)



Starting prompt flow service...
Starting prompt flow service...
Starting prompt flow service...
Start prompt flow service on 127.0.0.1:23333, version: 1.16.2.
Start prompt flow service on 127.0.0.1:23333, version: 1.16.2.
Start prompt flow service on 127.0.0.1:23333, version: 1.16.2.


[2024-12-04 16:25:22 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_cxo68abo_20241204_162511_976504, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_cxo68abo_20241204_162511_976504/logs.txt
[2024-12-04 16:25:22 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_q731411a_20241204_162511_972218, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_q731411a_20241204_162511_972218/logs.txt


You can stop the prompt flow service with the following command:'pf service stop'.

You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_q731411a_20241204_162511_972218
You can stop the prompt flow service with the following command:'pf service stop'.

You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_cxo68abo_20241204_162511_976504
You can stop the prompt flow service with the following command:'pf service stop'.

You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_w0jo7wsn_20241204_162511_973791


[2024-12-04 16:25:22 +0000][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_w0jo7wsn_20241204_162511_973791, log path: /home/vscode/.promptflow/.runs/azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_w0jo7wsn_20241204_162511_973791/logs.txt


2024-12-04 16:25:22 +0000    2592 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-12-04 16:25:25 +0000    2592 execution.bulk     INFO     Finished 1 / 3 lines.
2024-12-04 16:25:25 +0000    2592 execution.bulk     INFO     Average execution time for completed lines: 3.11 seconds. Estimated time for incomplete lines: 6.22 seconds.
2024-12-04 16:25:26 +0000    2592 execution.bulk     INFO     Finished 2 / 3 lines.
2024-12-04 16:25:26 +0000    2592 execution.bulk     INFO     Average execution time for completed lines: 2.05 seconds. Estimated time for incomplete lines: 2.05 seconds.
2024-12-04 16:25:27 +0000    2592 execution.bulk     INFO     Finished 3 / 3 lines.
2024-12-04 16:25:27 +0000    2592 execution.bulk     INFO     Average execution time for completed lines: 1.53 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_cxo68abo_20

## Print the results with Pretty Print

Now that we've run the evaluation, let's print the results using Pretty Print, which displays data in a structured and visually appealing way, making it easier to read and understand.

In [18]:
from pprint import pprint
pprint(result)

{'metrics': {'fluency.fluency': 4.333333333333333,
             'fluency.gpt_fluency': 4.333333333333333,
             'groundedness.gpt_groundedness': 3.0,
             'groundedness.groundedness': 3.0,
             'relevance.gpt_relevance': 3.6666666666666665,
             'relevance.relevance': 3.6666666666666665},
 'rows': [{'inputs.context': 'Couch. Fabric upholstery. Three seats. Wooden '
                             'frame. Grey. 85" wide, 35" deep, 32" tall. Holds '
                             '750 lbs.',
           'inputs.ground_truth': 'The couch has a wood frame with gray '
                                  'upholstered fabric. There are 3 seats on '
                                  'the couch which can accommodate 750 lbs. '
                                  'The dimensions are 85" wide, 35" deep, 32" '
                                  'tall.',
           'inputs.query': 'Given the product specfication for the Contoso '
                           'Home Furnishings Couc

## Print the results as table

We can also display the results as a table using `pandas`.

In [19]:
import pandas as pd
pd.DataFrame(result["rows"])

| | inputs.query | inputs.response | inputs.context | inputs.ground_truth | outputs.relevance.relevance | outputs.relevance.gpt_relevance | outputs.relevance.relevance_reason | outputs.groundedness.groundedness | outputs.groundedness.gpt_groundedness | outputs.groundedness.groundedness_reason | outputs.fluency.fluency | outputs.fluency.gpt_fluency | outputs.fluency.fluency_reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Given the product specfication for the Contoso... | Sink into comfort with this stylish grey three... | Couch. Fabric upholstery. Three seats. Wooden ... | The couch has a wood frame with gray upholster... | 4 | 4 | The RESPONSE fully addresses the QUERY with ac... | 3 | 3 | The RESPONSE accurately reflects the CONTEXT b... | 4 | 4 | The RESPONSE demonstrates proficient fluency w... |
| 1 | Given the product specfication for the Contoso... | Elevate your living space with this modern rou... | Coffee table. Glass top. Metal frame. Round. B... | The coffee table has a metal frame and glass t... | 4 | 4 | The RESPONSE fully addresses the QUERY by prov... | 3 | 3 | The RESPONSE accurately includes all the detai... | 5 | 5 | The response should get a high score because i... |
| 2 | Given the product specfication for the Contoso... | Boost your productivity with this versatile de... | Desk. Wooden surface. Metal legs. Adjustable h... | The desk has a wooden surface and metal legs. ... | 3 | 3 | The RESPONSE provides a complete description o... | 3 | 3 | The RESPONSE accurately reflects the informati... | 4 | 4 | The RESPONSE demonstrates proficient fluency w... |
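
The aggregate values under `metrics` are consistent with a simple mean of the per-row scores. The sketch below re-derives them from the three rows in the table; the `aggregate` helper is hypothetical, and the mean is an inference from these numbers rather than a documented guarantee of how the SDK aggregates.

```python
from statistics import mean

# Per-row scores copied from the table above.
rows = [
    {"outputs.relevance.relevance": 4, "outputs.groundedness.groundedness": 3, "outputs.fluency.fluency": 4},
    {"outputs.relevance.relevance": 4, "outputs.groundedness.groundedness": 3, "outputs.fluency.fluency": 5},
    {"outputs.relevance.relevance": 3, "outputs.groundedness.groundedness": 3, "outputs.fluency.fluency": 4},
]


def aggregate(score_rows):
    """Average each score column and drop the 'outputs.' prefix from its name."""
    prefix = "outputs."
    result = {}
    for column in score_rows[0]:
        name = column[len(prefix):] if column.startswith(prefix) else column
        result[name] = mean(row[column] for row in score_rows)
    return result


summary = aggregate(rows)
print(summary)
```

The relevance mean works out to 11 / 3 and the fluency mean to 13 / 3, which line up with the `3.67` and `4.33` values reported under `metrics`.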


## Delete resources

If you've finished exploring Azure AI Services, delete the Azure resource that you created during the workshop.

**Note**: You may be prompted to delete your deployed model(s) before deleting the resource group.