# Local Evaluation for testing and development of metrics

Inspiration: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk

## Documentation

Azure AI Evaluation client library for Python<br>
https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation

Evaluate your Generative AI application locally with the Azure AI Evaluation SDK<br>
https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk


## Dependencies

In [16]:
#%%cmd
#pip install azure-identity azure-ai-projects azure-ai-ml azure-ai-evaluation

## Setup

### Common packages

In [2]:
import os
import dotenv
from pathlib import Path

### Global settings

In [3]:
# Global variables
PRIVATE = False
DATA_DIR = Path("data")
TMP_DIR = Path("tmp")

### Load environment variables

In [4]:
# Load environment variables from the .env file,
# or from private.env if PRIVATE is True.
# override=True lets values from the file replace variables already set in the environment.
dotenv.load_dotenv('.env' if not PRIVATE else 'private.env', override=True)

True

### Config dictionaries used by Azure AI SDK

In [5]:
# Configuration for Azure AI Foundry project
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP_AI"),
    "project_name": os.environ.get("AZURE_AI_PROJECT_NAME"),
}

# Configuration for Azure OpenAI model
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
    "type": "azure_openai"
}
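
Because `os.environ.get` returns `None` for a missing variable, a typo in the `.env` file only surfaces later, when an evaluator is called. A minimal sketch of an early check follows; `missing_keys` and `example_config` are hypothetical names introduced here for illustration and are not part of the Azure SDK.

```python
def missing_keys(config: dict) -> list[str]:
    """Return the keys whose value is None or an empty string."""
    return [key for key, value in config.items() if value in (None, "")]


# Stand-in dictionary for illustration; in the notebook you would pass
# azure_ai_project or model_config instead.
example_config = {
    "azure_endpoint": "https://example.invalid",
    "api_key": None,
    "type": "azure_openai",
}

print(missing_keys(example_config))
```

Running the same check on `azure_ai_project` and `model_config` right after they are built gives a list of the settings that still need a value.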

### Azure Credentials

In [6]:
# https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

## Built-in Evaluators

In [7]:
from azure.ai.evaluation import GroundednessEvaluator, GroundednessProEvaluator

# https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.groundednessevaluator
groundedness_eval = GroundednessEvaluator(model_config)

# https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.groundednessproevaluator
groundedness_pro_eval = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)


Class GroundednessProEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


### Local testing of built-in evaluators

In [8]:
_query_response_data = dict(
    query="Which tent is the most waterproof?",
    context="The Alpine Explorer Tent is the second most water-proof of all tents available.",
    response="The Alpine Explorer Tent is the most waterproof."
)

In [9]:
# Running Groundedness Evaluator on a query and response pair
groundedness_score = groundedness_eval(
    **_query_response_data
)
print(groundedness_score)

{'groundedness': 3.0, 'gpt_groundedness': 3.0, 'groundedness_reason': 'The response contradicts the context by stating that the Alpine Explorer Tent is the most waterproof, while the context specifies it is the second most waterproof.'}


In [10]:
groundedness_pro_score = groundedness_pro_eval(
    **_query_response_data
)
print(groundedness_pro_score)

{'groundedness_pro_label': False, 'groundedness_pro_reason': '\'The Alpine Explorer Tent is the most waterproof.\' is ungrounded because "The Alpine Explorer Tent is the second most water-proof of all tents available." Thus, it is not the most waterproof. It\'s contradiction.'}


## Custom evaluators

### Simple deterministic custom evaluator

In [11]:
from custom.answer_len.answer_length import AnswerLengthEvaluator

answer_length_evaluator = AnswerLengthEvaluator()
answer_length = answer_length_evaluator(answer="What is the speed of light?")

print(answer_length)

{'answer_length': 27}
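
The source of `custom/answer_len/answer_length.py` is not shown in this notebook. A minimal sketch of what such a deterministic evaluator could look like, assuming it is a plain callable class that returns a dictionary with the character count (which matches the output above):

```python
class AnswerLengthEvaluator:
    """Deterministic evaluator: reports the character count of an answer."""

    def __call__(self, *, answer: str, **kwargs) -> dict:
        # Keyword-only argument, so a column mapping can target it by name.
        return {"answer_length": len(answer)}


answer_length_evaluator = AnswerLengthEvaluator()
print(answer_length_evaluator(answer="What is the speed of light?"))
# {'answer_length': 27}
```

Returning a dictionary rather than a bare number keeps the metric name attached to the value.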


### LLM based custom evaluator

In [None]:
# Import your prompt-based custom evaluator
from custom.friendliness.friend import FriendlinessEvaluator

friendliness_evaluator = FriendlinessEvaluator(model_config=model_config)
friendliness_score = friendliness_evaluator(
    response="I will not apologize for my behavior!"
)
friendliness_score

{'score': 3,
 'reason': 'The response is neutral, providing factual information without any warmth or hostility.'}
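
The implementation of `custom/friendliness/friend.py` is not shown in this notebook. The sketch below only illustrates the general shape of a prompt-based evaluator: build a prompt, call a model, and parse a JSON reply into `score` and `reason`. Everything in it is an assumption made for illustration: the class name `FriendlinessSketch`, the prompt text, and the `complete` callable that stands in for the model call.

```python
import json
from typing import Callable


class FriendlinessSketch:
    """Hypothetical prompt-based evaluator; not the notebook's FriendlinessEvaluator."""

    PROMPT = (
        "Rate the friendliness of the response from 1 (hostile) to 5 (very friendly). "
        'Reply with JSON only: {{"score": <int>, "reason": "<text>"}}.\n'
        "Response: {response}"
    )

    def __init__(self, complete: Callable[[str], str]):
        # `complete` is any function that sends a prompt to a model and returns its text.
        self._complete = complete

    def __call__(self, *, response: str, **kwargs) -> dict:
        raw = self._complete(self.PROMPT.format(response=response))
        try:
            parsed = json.loads(raw)
            return {"score": int(parsed["score"]), "reason": str(parsed["reason"])}
        except (ValueError, KeyError, TypeError):
            # The model did not return the requested JSON; keep the raw text for debugging.
            return {"score": None, "reason": raw}


def fake_model(prompt: str) -> str:
    """Canned reply so the sketch runs without any model endpoint."""
    return '{"score": 2, "reason": "Defensive tone."}'


sketch_result = FriendlinessSketch(fake_model)(response="I will not apologize for my behavior!")
print(sketch_result)
```

Injecting the model call as a function makes the parsing logic testable offline; a real evaluator would replace `fake_model` with a call to the deployment described by `model_config`.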

## Use local compute for evaluation and Azure AI Foundry for tracking results 
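
The `evaluate` call in this section reads `data/data.jsonl`, whose contents are not shown in this notebook. A minimal sketch of producing a file in that shape, assuming one JSON object per line with the `query`, `context`, and `response` columns that the column mappings refer to. The row reuses the tent example from the local test, and the file name `example_data.jsonl` is illustrative.

```python
import json
import tempfile
from pathlib import Path

rows = [
    {
        "query": "Which tent is the most waterproof?",
        "context": "The Alpine Explorer Tent is the second most water-proof of all tents available.",
        "response": "The Alpine Explorer Tent is the most waterproof.",
    },
]

# Written to a temporary directory so the sketch does not overwrite real data.
out_path = Path(tempfile.mkdtemp()) / "example_data.jsonl"
with out_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm each line parses as a standalone JSON object.
loaded = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(loaded == rows)
```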

In [None]:
from azure.ai.evaluation import evaluate

# https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python#functions
result = evaluate(
    data=DATA_DIR / "data.jsonl",

    # Specific evaluators to use
    evaluators={
        "groundedness": groundedness_eval,
        "groundedness_pro": groundedness_pro_eval,
        "answer_length": answer_length_evaluator,
        "friendliness": friendliness_evaluator
    },
    
    # Column mapping for each evaluator
    # The column mapping maps the columns in your data to the input names expected by the evaluator.
    # Skip it if your data already uses the default column names expected by the evaluators.
    # For example, the default column names for the GroundednessEvaluator are "query", "context", and "response".
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }, 
        },
        "groundedness_pro": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }, 
        },
        "answer_length": {
            "column_mapping": {
                "answer": "${data.response}"
            } 
        },
        "friendliness": {
            "column_mapping": {
                "response": "${data.response}"
            } 
        }

    },
    
    # Provide your Azure AI project information to track the evaluation results in your Azure AI project
    azure_ai_project=azure_ai_project,
    
    # Optionally provide an output path for a JSON file containing the metric summary, row-level data, and Azure AI project URL
    output_path=TMP_DIR / "local_eval_results.json"
)

[2025-04-17 12:50:13 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_0m7mps0g_20250417_125011_852524, log path: C:\Users\rohoff\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_0m7mps0g_20250417_125011_852524\logs.txt
[2025-04-17 12:50:13 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run custom_friendliness_friend_friendlinessevaluator_5mkg98gu_20250417_125011_861524, log path: C:\Users\rohoff\.promptflow\.runs\custom_friendliness_friend_friendlinessevaluator_5mkg98gu_20250417_125011_861524\logs.txt
[2025-04-17 12:50:13 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run custom_answer_len_answer_length_answerlengthevaluator_7p5q4ab_20250417_125011_875525, log path: C:\Users\rohoff\.promptflow\.runs\custom_answer_len_answer_length_answerlengthevaluator_7p5q4ab_20250417_125011_875525\logs.txt
[2025-04-17 12:50:13

2025-04-17 12:50:13 +0200    8456 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-17 12:50:15 +0200    8456 execution.bulk     INFO     Finished 1 / 2 lines.
2025-04-17 12:50:15 +0200    8456 execution.bulk     INFO     Average execution time for completed lines: 1.9 seconds. Estimated time for incomplete lines: 1.9 seconds.
2025-04-17 12:50:15 +0200    8456 execution.bulk     INFO     Finished 2 / 2 lines.
2025-04-17 12:50:15 +0200    8456 execution.bulk     INFO     Average execution time for completed lines: 0.97 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_biuirwie_20250417_125011_854521"
Run status: "Completed"
Start time: "2025-04-17 12:50:11.848522+02:00"
Duration: "0:00:03.579328"
Output path: "C:\Users\rohoff\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_biuirwie_20250417_1250

----------------------------------------------------------------
AI project URI:  https://ai.azure.com/build/evaluation/b6bac1bd-dbde-47be-b361-a94a5ee4a414?wsid=/subscriptions/c11caebe-ea81-4036-9e58-ccf406d87ead/resourceGroups/sbn-ai-prod-swc/providers/Microsoft.MachineLearningServices/workspaces/10_stable
----------------------------------------------------------------
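
The `column_mapping` entries passed to `evaluate` use `${data.<column>}` references to say which data column feeds which evaluator input. The sketch below illustrates that meaning only; it is not the SDK's implementation, and `resolve_mapping` is a hypothetical helper that handles just the `${data.<column>}` form used in this notebook.

```python
import re

# Matches references of the form ${data.column_name}.
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(column_mapping: dict, row: dict) -> dict:
    """Turn a column mapping plus one data row into evaluator keyword arguments."""
    kwargs = {}
    for param, reference in column_mapping.items():
        match = _DATA_REF.match(reference)
        if match is None:
            raise ValueError(f"Unsupported reference: {reference!r}")
        kwargs[param] = row[match.group(1)]
    return kwargs


row = {
    "query": "Which tent is the most waterproof?",
    "response": "The Alpine Explorer Tent is the most waterproof.",
}
# The answer-length evaluator expects `answer`, while the data column is called `response`.
print(resolve_mapping({"answer": "${data.response}"}, row))
```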


In [15]:
print("----------------------------------------------------------------")
print("AI project URI: ", result["studio_url"])
print("----------------------------------------------------------------")
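
Besides `studio_url`, the dictionary returned by `evaluate` carries aggregate values under `metrics` and per-row records under `rows`. A minimal sketch of a summary helper follows. `summarize` and `example_result` are hypothetical, and the metric and column names inside `example_result` are placeholders, so check the keys of your own `result` before relying on them.

```python
def summarize(result: dict) -> list[str]:
    """Build one line per aggregate metric plus a row count."""
    lines = [f"{name}: {value}" for name, value in sorted(result.get("metrics", {}).items())]
    lines.append(f"rows evaluated: {len(result.get('rows', []))}")
    return lines


# Stand-in result for illustration; key names inside "metrics" and "rows" are placeholders.
example_result = {
    "metrics": {"answer_length.answer_length": 27.0},
    "rows": [{"inputs.response": "What is the speed of light?"}],
    "studio_url": None,
}

for line in summarize(example_result):
    print(line)
```

In the notebook, `summarize(result)` can be called right after the `evaluate` cell to get a quick overview without opening the project URL.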