# Local Development -- evaluators and testing evaluation

Inspiration: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk

## Documentation

Azure AI Evaluation client library for Python<br>
https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation

Evaluate your Generative AI application locally with the Azure AI Evaluation SDK<br>
https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk


### Environment setup
Run the setup script: `python 00_setup.py`

## Setup

### Common packages

In [1]:
import os
import dotenv
from pathlib import Path

### Global settings

In [2]:
# Global variables
PRIVATE = False
DATA_DIR = Path("data")
TMP_DIR = Path("tmp")

### Load environment variables

In [3]:
# Load environment variables from the .env file (or from private.env if PRIVATE is True).
# override=True lets values from the file replace variables already set in the environment.
dotenv.load_dotenv('.env' if not PRIVATE else 'private.env', override=True)

True

### Config dictionaries used by Azure AI SDK

In [4]:
# Configuration for Azure AI Foundry project
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP_AI"),
    "project_name": os.environ.get("AZURE_AI_FOUNDRY_PROJECT_NAME"),
}

# Configuration for Azure OpenAI model
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
    "type": "azure_openai"
}
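
Because `os.environ.get` returns `None` for unset variables, a missing entry in the `.env` file only surfaces later as an authentication or connection error. A minimal sketch of an early check follows; `missing_env_vars` and `REQUIRED_ENV_VARS` are hypothetical names introduced here, and the list simply repeats the variables read by the two dictionaries above.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names whose values are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Hypothetical list: the variables read by the config dictionaries above
REQUIRED_ENV_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP_AI",
    "AZURE_AI_FOUNDRY_PROJECT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```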

### Azure Credentials

In [5]:
# https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

## Built-in Evaluators

In [6]:
from azure.ai.evaluation import GroundednessEvaluator

# https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.groundednessevaluator
groundedness_eval = GroundednessEvaluator(model_config)

### Local testing of built-in evaluators

In [7]:
_query_response_data = dict(
    query="Which tent is the most waterproof?",
    context="The Alpine Explorer Tent is the second most water-proof of all tents available.",
    response="The Alpine Explorer Tent is the most waterproof."
)

In [8]:
# Running Groundedness Evaluator on a query and response pair
groundedness_score = groundedness_eval(
    **_query_response_data
)
print(groundedness_score)

{'groundedness': 3.0, 'gpt_groundedness': 3.0, 'groundedness_reason': 'The RESPONSE is incorrect because it contradicts the CONTEXT, which explicitly states that the Alpine Explorer Tent is the second most waterproof tent, not the most waterproof.', 'groundedness_result': 'pass', 'groundedness_threshold': 3}
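
Note that the result is `pass` even though the reason describes a contradiction: the score of 3.0 equals the reported threshold of 3. The sketch below reproduces that relationship under the assumption that a score at or above the threshold counts as a pass; `passes_threshold` is a hypothetical helper, not part of the SDK.

```python
def passes_threshold(score, threshold):
    """Assumed rule: a score at or above the threshold counts as a pass."""
    return "pass" if score >= threshold else "fail"

# Values taken from the evaluator output above
print(passes_threshold(3.0, 3))
```

If borderline scores like this one should fail, the threshold would need to be raised rather than left at the value shown in the output.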


## Custom evaluators

### Simple deterministic custom evaluator

In [9]:
from custom.answer_len.answer_length import AnswerLengthEvaluator

answer_length_evaluator = AnswerLengthEvaluator()
answer_length = answer_length_evaluator(answer="What is the speed of light?")

print(answer_length)

{'answer_length': 27}
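
The source of `custom/answer_len/answer_length.py` is not shown in this notebook. A minimal sketch of what such a deterministic evaluator might look like, assuming it is a plain callable class that returns a dictionary of metrics:

```python
class AnswerLengthEvaluator:
    """Sketch of a code-based evaluator: measures the answer length in characters."""

    def __init__(self):
        pass

    def __call__(self, *, answer: str, **kwargs):
        # Keyword-only arguments let evaluate() pass mapped columns by name
        return {"answer_length": len(answer)}


evaluator = AnswerLengthEvaluator()
print(evaluator(answer="What is the speed of light?"))
```

The keys of the returned dictionary become the metric names, which is why the output above contains `answer_length`.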


### LLM based custom evaluator

In [10]:
# Import your prompt-based custom evaluator
from custom.friendliness.friend import FriendlinessEvaluator

friendliness_evaluator = FriendlinessEvaluator(model_config=model_config)
friendliness_score = friendliness_evaluator(
    response="I will not apologize for my behavior!"
)
friendliness_score

{'score': 2,
 'reason': 'The response is defensive and lacks warmth or approachability, though it is not overtly hostile.'}
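
The implementation of `FriendlinessEvaluator` is not shown here. The general pattern of an LLM-based evaluator can be sketched as a wrapper that sends the response to a model and parses a JSON reply. In the sketch below, `JsonScoreEvaluator` and `fake_llm` are hypothetical names, and the model call is replaced by a stub so the example runs without credentials.

```python
import json

class JsonScoreEvaluator:
    """Hypothetical wrapper: calls an LLM-like callable and parses its JSON reply."""

    def __init__(self, llm):
        # `llm` is any callable that accepts `response=` and returns a JSON string
        self._llm = llm

    def __call__(self, *, response: str, **kwargs):
        raw = self._llm(response=response)
        try:
            return json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            # Fall back to the raw reply when it is not valid JSON
            return {"raw": raw}


# Stand-in for a real model call, used only to make the sketch runnable
def fake_llm(response):
    return json.dumps({"score": 2, "reason": "stubbed reply"})


evaluator = JsonScoreEvaluator(fake_llm)
print(evaluator(response="I will not apologize for my behavior!"))
```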

## Use local compute for evaluation and Azure AI Foundry for tracking results 

In [11]:
from azure.ai.evaluation import evaluate

# https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python#functions
result = evaluate(
    data=DATA_DIR / "science-trivia__context_response_feedback_v12.jsonl",

    # Specific evaluators to use
    evaluators={
        "Groundedness": groundedness_eval,
        "Answer_length": answer_length_evaluator,
        "Friendliness": friendliness_evaluator
    },
    
    # Column mapping for each evaluator
    # The column mapping maps the columns in your data to the inputs expected by each evaluator.
    # Skip it if your data already uses the default column names expected by the evaluators.
    # For example, the default column names for the GroundednessEvaluator are "query", "context", and "response"
    evaluator_config={
        "Groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }, 
        },
        "Answer_length": {
            "column_mapping": {
                "answer": "${data.response}"
            } 
        },
        "Friendliness": {
            "column_mapping": {
                "response": "${data.response}"
            } 
        }

    },
    
    # Provide your Azure AI project information to track your evaluation results in your Azure AI project
    azure_ai_project = azure_ai_project,
    
    # Optionally provide an output path to save a JSON file with the metric summary, row-level data, and the Azure AI project URL
    # output_path=TMP_DIR / "local-eval-result.json"
)

[2025-06-23 16:19:44 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_Friendliness_20250623_161944_318377, log path: C:\Users\rohoff\.promptflow\.runs\azure_ai_evaluation_evaluators_Friendliness_20250623_161944_318377\logs.txt
[2025-06-23 16:19:44 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_Groundedness_20250623_161944_314849, log path: C:\Users\rohoff\.promptflow\.runs\azure_ai_evaluation_evaluators_Groundedness_20250623_161944_314849\logs.txt
[2025-06-23 16:19:44 +0200][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_Answer_length_20250623_161944_316377, log path: C:\Users\rohoff\.promptflow\.runs\azure_ai_evaluation_evaluators_Answer_length_20250623_161944_316377\logs.txt


2025-06-23 16:19:45 +0200   47736 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-23 16:19:45 +0200   47736 execution.bulk     INFO     The timeout for the batch run is 3600 seconds.
2025-06-23 16:19:45 +0200   47736 execution.bulk     INFO     Current system's available memory is 7191.7109375MB, memory consumption of current process is 293.32421875MB, estimated available worker count is 7191.7109375/293.32421875 = 24
2025-06-23 16:19:45 +0200   47736 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 75, 'estimated_worker_count_based_on_memory_usage': 24}.
2025-06-23 16:19:55 +0200   47736 execution.bulk     INFO     Process name(SpawnProcess-5)-Process id(20160)-Line number(1) start execution.
2025-06-23 16:19:55 +0200   47736 execution.bulk     INFO     Process name(SpawnProcess-3)-Process id(49864)-Line number(0) start exe

----------------------------------------------------------------
AI project URI:  https://ai.azure.com/build/evaluation/b45e7f9c-5848-489d-ae10-fdf86b5f0805?wsid=/subscriptions/c11caebe-ea81-4036-9e58-ccf406d87ead/resourceGroups/mdwsrh/providers/Microsoft.MachineLearningServices/workspaces/rohoff-0016
----------------------------------------------------------------
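
The `column_mapping` entries passed to `evaluate` use references such as `${data.response}` to pick a column from each row of the input file. The sketch below illustrates that idea for a single row. It is not the SDK's implementation: `resolve_column_mapping` is a hypothetical helper that only understands the `${data.<column>}` form used in this notebook.

```python
import re

# Matches references of the form ${data.<column>}
_DATA_REF = re.compile(r"^\$\{data\.(.+)\}$")

def resolve_column_mapping(row, column_mapping):
    """Build the keyword arguments for an evaluator from one data row."""
    kwargs = {}
    for param, ref in column_mapping.items():
        match = _DATA_REF.match(ref)
        if match is None:
            raise ValueError(f"Unsupported reference: {ref}")
        kwargs[param] = row[match.group(1)]
    return kwargs


row = {"query": "q", "context": "c", "response": "r"}
print(resolve_column_mapping(row, {"answer": "${data.response}"}))
```

This is why the `Answer_length` evaluator, whose parameter is named `answer`, can still consume the `response` column of the data file.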


## Check out evaluation in AI Foundry

In [None]:
print("----------------------------------------------------------------")
print("AI project URI: ", result["studio_url"])
print("----------------------------------------------------------------")


## Write evaluated data to file

In [18]:
import json

_file = TMP_DIR / 'science-trivia__context_response_feedback_v12_locally_evaluated.jsonl'

# Make sure the output directory exists
TMP_DIR.mkdir(parents=True, exist_ok=True)

# Write one JSON object per line (JSONL); mode 'w' overwrites any previous file
with open(_file, 'w', encoding='utf-8') as file:
    for item in result['rows']:
        file.write(json.dumps(item) + '\n')
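
Once the rows are on disk, they can be read back for a quick sanity check. The sketch below averages one numeric field across a JSONL file; `mean_of_column` is a hypothetical helper, and the field name you pass depends on how the rows in `result['rows']` are keyed.

```python
import json

def mean_of_column(jsonl_path, key):
    """Average a numeric field over all rows of a JSONL file.

    Rows that lack the field or hold a non-numeric value are skipped.
    Returns None when no row contributes a value.
    """
    values = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            value = json.loads(line).get(key)
            if isinstance(value, (int, float)):
                values.append(value)
    return sum(values) / len(values) if values else None
```

For example, `mean_of_column(_file, "outputs.Answer_length.answer_length")` would average the answer lengths, assuming the rows use that key; inspect one row of `result['rows']` first to confirm the actual key names.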