# Retrieval evaluators

A **retrieval-augmented generation (RAG)** system tries to generate the most relevant answer, consistent with its grounding documents, in response to a user's query. At a high level, the query triggers a search over the corpus of grounding documents, and the retrieved results serve as grounding context for the AI model to generate a response. It's important to evaluate:

- The relevance of the retrieval results to the user's query: use Document Retrieval if you have labels for query-specific document relevance (query relevance judgments, or qrels), which allow more accurate measurements. Use Retrieval if you only have the retrieved context, don't have such labels, and can tolerate a less fine-grained measurement.
- The consistency of the generated response with the grounding documents: use Groundedness if you want to customize the definition of groundedness in the open-source LLM-judge prompt, or Groundedness Pro if you want a straightforward definition.
- The relevance of the final response to the query: use Relevance if you don't have ground truth, and Response Completeness if you have ground truth and don't want your response to miss critical information.

A good way to think about Groundedness and Response Completeness is in terms of precision and recall. Groundedness captures the precision aspect: the response shouldn't contain content outside the grounding context. Response completeness captures the recall aspect: the response shouldn't miss critical information compared to the expected response (ground truth).
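
To make the precision/recall analogy concrete, here is a toy sketch that treats the response, the grounding context, and the ground truth as sets of atomic claims. The claim sets and both helper functions are hypothetical illustrations; the evaluators in this notebook rely on an LLM judge or a service, not on set arithmetic.

```python
def precision_like_groundedness(response: set[str], context: set[str]) -> float:
    """Fraction of response claims that are supported by the grounding context."""
    if not response:
        return 1.0  # an empty response asserts nothing unsupported
    return len(response & context) / len(response)


def recall_like_completeness(response: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth claims that the response covers."""
    if not ground_truth:
        return 1.0  # nothing was expected, so nothing is missing
    return len(response & ground_truth) / len(ground_truth)


context = {"born in Warsaw", "born on November 7, 1867"}
ground_truth = {"born in Warsaw", "born on November 7, 1867"}

# Fully grounded, but misses half of the expected information.
short_response = {"born in Warsaw"}
# Covers everything expected, but adds one fabricated claim.
padded_response = {"born in Warsaw", "born on November 7, 1867", "won an Olympic medal"}

short_scores = (
    precision_like_groundedness(short_response, context),    # 1.0
    recall_like_completeness(short_response, ground_truth),  # 0.5
)
padded_scores = (
    precision_like_groundedness(padded_response, context),    # 2/3
    recall_like_completeness(padded_response, ground_truth),  # 1.0
)
```

The two responses fail in opposite directions, which is why the two metrics are best read together.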

> https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/rag-evaluators

In [1]:
import datetime
import os
import sys

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    DocumentRetrievalEvaluator,
    GroundednessEvaluator,
    GroundednessProEvaluator,
    RelevanceEvaluator,
    ResponseCompletenessEvaluator,
    RetrievalEvaluator,
)
from dotenv import load_dotenv

In [2]:
sys.version

'3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]'

In [3]:
print(f"Today is {datetime.datetime.today().strftime('%d-%b-%Y %H:%M:%S')}")

Today is 26-Jun-2025 12:30:11


In [4]:
# Load the Azure OpenAI endpoint, key, and Foundry project settings from a local env file
load_dotenv("azure.env")

endpoint = os.getenv("endpoint")
key = os.getenv("key")
azure_foundry_project = os.getenv("azure_foundry_project")
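
A missing variable comes back as `None` without any warning, and the problem only shows up later as an authentication or connection error. A minimal sketch of an early check, assuming the same variable names as in the cell above; `require_settings` is a hypothetical helper, not part of any library:

```python
import os
from typing import Mapping


def require_settings(names: list[str], env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the requested settings, or raise if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Usage, after load_dotenv("azure.env") has run:
# settings = require_settings(["endpoint", "key", "azure_foundry_project"])
```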

In [5]:
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=endpoint,
    api_key=key,
    azure_deployment="gpt-4.1",
    api_version="2024-10-21",
)

## Retrieval evaluator

> Measures how effectively the system retrieves relevant information.

Retrieval quality matters because of its upstream role in RAG: if retrieval is poor and the response requires corpus-specific knowledge, the LLM is less likely to give a satisfactory answer. RetrievalEvaluator uses an LLM to measure the textual quality of retrieval results without requiring ground truth (also known as query relevance judgments). That is its advantage over DocumentRetrievalEvaluator, which measures ndcg, xdcg, fidelity, and other classical information retrieval metrics that do require ground truth. The metric focuses on how relevant the context chunks (encoded as a string) are to the query and on whether the most relevant chunks are surfaced at the top of the list.
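
Because the metric also looks at whether the most relevant chunks come first, the order in which chunks are joined into the `context` string matters. A minimal sketch, assuming your retriever returns `(text, score)` pairs; the `build_context` helper, the scores, and the numbered "Background:" layout are illustrative choices that mirror the Marie Curie example in this section, not a format the evaluator requires:

```python
def build_context(chunks: list[tuple[str, float]], top_k: int = 3) -> str:
    """Join the top_k highest-scoring chunks into one numbered context string."""
    ranked = sorted(chunks, key=lambda chunk: chunk[1], reverse=True)[:top_k]
    numbered = " ".join(f"{rank}. {text}" for rank, (text, _) in enumerate(ranked, start=1))
    return f"Background: {numbered}"


# Hypothetical retriever output, deliberately not in ranked order.
chunks = [
    ("Marie Curie is a French scientist.", 0.41),
    ("Marie Curie was born in Warsaw.", 0.93),
    ("Marie Curie was born on November 7, 1867.", 0.67),
]
context = build_context(chunks)
```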

In [6]:
retrieval_evaluator = RetrievalEvaluator(model_config=model_config, threshold=3)

In [7]:
retrieval_evaluator(
    query="Where was Marie Curie born?",
    context=
    "Background: 1. Marie Curie was born in Warsaw. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist. ",
)

{'retrieval': 5.0,
 'gpt_retrieval': 5.0,
 'retrieval_reason': 'The context directly and immediately answers the query with the most relevant information at the top, with no external knowledge or bias introduced.',
 'retrieval_result': 'pass',
 'retrieval_threshold': 3}
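
The LLM-judged evaluators in this notebook return dictionaries with a common naming pattern: the score under the metric name, plus `<metric>_result`, `<metric>_threshold`, and `<metric>_reason`. A small sketch for collecting those fields into one record, assuming that pattern holds; `summarize` is a hypothetical helper, and any field an evaluator doesn't return comes back as `None`:

```python
from typing import Any


def summarize(result: dict[str, Any], metric: str) -> dict[str, Any]:
    """Collect score, pass/fail result, threshold, and reason for one metric."""
    return {
        "metric": metric,
        "score": result.get(metric),
        "result": result.get(f"{metric}_result"),
        "threshold": result.get(f"{metric}_threshold"),
        "reason": result.get(f"{metric}_reason"),
    }


# The RetrievalEvaluator output shown above, copied as a literal.
retrieval_output = {
    "retrieval": 5.0,
    "gpt_retrieval": 5.0,
    "retrieval_reason": "The context directly and immediately answers the query with the most relevant information at the top, with no external knowledge or bias introduced.",
    "retrieval_result": "pass",
    "retrieval_threshold": 3,
}
summary = summarize(retrieval_output, "retrieval")
```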

## Document retrieval evaluator

> Measures accuracy in retrieval results given ground truth.

Retrieval quality matters because of its upstream role in RAG: if retrieval is poor and the response requires corpus-specific knowledge, the LLM is less likely to give a satisfactory answer. It's therefore important to use DocumentRetrievalEvaluator not only to evaluate retrieval quality but also to optimize your search parameters for RAG.

In [8]:
# these query_relevance_label values are assigned by your human or LLM judges
retrieval_ground_truth = [
    {
        "document_id": "1",
        "query_relevance_label": 4
    },
    {
        "document_id": "2",
        "query_relevance_label": 2
    },
    {
        "document_id": "3",
        "query_relevance_label": 3
    },
    {
        "document_id": "4",
        "query_relevance_label": 1
    },
    {
        "document_id": "5",
        "query_relevance_label": 0
    },
]
# the min and max of the label scores are inputs to document retrieval evaluator
ground_truth_label_min = 0
ground_truth_label_max = 4

# these relevance scores come from your search retrieval system
retrieved_documents = [
    {
        "document_id": "2",
        "relevance_score": 45.1
    },
    {
        "document_id": "6",
        "relevance_score": 35.8
    },
    {
        "document_id": "3",
        "relevance_score": 29.2
    },
    {
        "document_id": "5",
        "relevance_score": 25.4
    },
    {
        "document_id": "7",
        "relevance_score": 18.8
    },
]

document_retrieval_evaluator = DocumentRetrievalEvaluator(
    ground_truth_label_min=ground_truth_label_min,
    ground_truth_label_max=ground_truth_label_max,
    ndcg_threshold=0.5,
    xdcg_threshold=50.0,
    fidelity_threshold=0.5,
    top1_relevance_threshold=50.0,
    top3_max_relevance_threshold=50.0,
    total_retrieved_documents_threshold=50,
    total_ground_truth_documents_threshold=50)

document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth,
                             retrieved_documents=retrieved_documents)

{'ndcg@3': 0.31075932533963707,
 'xdcg@3': 39.285714285714285,
 'fidelity': 0.39285714285714285,
 'top1_relevance': 2,
 'top3_max_relevance': 3,
 'holes': 2,
 'holes_ratio': 0.4,
 'total_retrieved_documents': 5,
 'total_ground_truth_documents': 5,
 'ndcg@3_result': 'fail',
 'ndcg@3_threshold': 0.5,
 'ndcg@3_higher_is_better': True,
 'xdcg@3_result': 'fail',
 'xdcg@3_threshold': 50.0,
 'xdcg@3_higher_is_better': True,
 'fidelity_result': 'fail',
 'fidelity_threshold': 0.5,
 'fidelity_higher_is_better': True,
 'top1_relevance_result': 'fail',
 'top1_relevance_threshold': 50.0,
 'top1_relevance_higher_is_better': True,
 'top3_max_relevance_result': 'fail',
 'top3_max_relevance_threshold': 50.0,
 'top3_max_relevance_higher_is_better': True,
 'holes_result': 'fail',
 'holes_threshold': 0,
 'holes_higher_is_better': False,
 'holes_ratio_result': 'fail',
 'holes_ratio_threshold': 0,
 'holes_ratio_higher_is_better': False,
 'total_retrieved_documents_result': 'fail',
 'total_retrieved_document
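
Several values in the output above can be reproduced by hand, which helps when interpreting them. The sketch below assumes an exponential gain of `2**label - 1`, a log2 rank discount, and zero gain for retrieved documents that have no ground-truth label ("holes", here documents 6 and 7). These assumptions were chosen because they match the printed `ndcg@3`; they are not taken from the library's source. The sketch does not attempt `xdcg@3` or `fidelity`.

```python
import math

# Same labels and retrieval order as the cell above.
labels = {"1": 4, "2": 2, "3": 3, "4": 1, "5": 0}
retrieved_ids = ["2", "6", "3", "5", "7"]  # already sorted by relevance_score


def dcg(gains: list[int]) -> float:
    """Discounted cumulative gain with exponential gain and log2 discount (assumed form)."""
    return sum((2**gain - 1) / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


k = 3
# Assumption: an unlabeled retrieved document (a hole) contributes zero gain.
retrieved_gains = [labels.get(doc_id, 0) for doc_id in retrieved_ids]
ideal_gains = sorted(labels.values(), reverse=True)

ndcg_at_3 = dcg(retrieved_gains[:k]) / dcg(ideal_gains[:k])

holes = sum(doc_id not in labels for doc_id in retrieved_ids)
holes_ratio = holes / len(retrieved_ids)
top1_relevance = labels.get(retrieved_ids[0])
top3_max_relevance = max(labels[doc_id] for doc_id in retrieved_ids[:k] if doc_id in labels)

print(round(ndcg_at_3, 6), holes, holes_ratio, top1_relevance, top3_max_relevance)
# 0.310759 2 0.4 2 3
```

The low `ndcg@3` follows directly from the ranking: the best-labeled document (id 1, label 4) was never retrieved, and an unlabeled document occupies the second position.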

## Groundedness Evaluator

> Measures how consistent the response is with respect to the retrieved context.

It's important to evaluate how grounded the response is in the context, because AI models can fabricate content or generate irrelevant responses. GroundednessEvaluator measures how well the generated response aligns with the given context (the grounding source) without fabricating content outside of it. This metric captures the precision aspect of response alignment with the grounding source. A lower score means the response is irrelevant to the query or contains fabricated, inaccurate content from outside the context. This metric complements ResponseCompletenessEvaluator, which captures the recall aspect of response alignment with the expected response.

In [9]:
groundedness_evaluator = GroundednessEvaluator(
    model_config=model_config, threshold=3)

groundedness_evaluator(
    query="Is Marie Curie is born in Paris?",
    context="Background: 1. Marie Curie is born on November 7, 1867. 2. Marie Curie is born in Warsaw.",
    response="No, Marie Curie is born in Warsaw.")

{'groundedness': 5.0,
 'gpt_groundedness': 5.0,
 'groundedness_reason': 'The response is fully correct and complete, directly using the context to answer the query.',
 'groundedness_result': 'pass',
 'groundedness_threshold': 3}

## Groundedness Pro Evaluator

> Measures whether the response is consistent with respect to the retrieved context.

AI systems can fabricate content or generate irrelevant responses outside the given context. Powered by Azure AI Content Safety, GroundednessProEvaluator detects whether the generated text response is consistent or accurate with respect to the given context in a retrieval-augmented generation question-and-answering scenario. It checks whether the response adheres closely to the context in order to answer the query, avoiding speculation or fabrication, and outputs a binary label.

In [10]:
groundedness_pro_evaluator = GroundednessProEvaluator(
    azure_ai_project=azure_foundry_project,
    credential=DefaultAzureCredential())

Class GroundednessProEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [11]:
groundedness_pro_evaluator(
    query="Is Marie Curie is born in Paris?",
    context=
    "Background: 1. Marie Curie is born on November 7, 1867. 2. Marie Curie is born in Warsaw.",
    response="No, Marie Curie is born in Warsaw.")

{'groundedness_pro_reason': 'All Contents are grounded',
 'groundedness_pro_label': True,
 'groundedness_pro_score': 1,
 'groundedness_pro_threshold': 5,
 'groundedness_pro_result': 'pass'}

## Relevance Evaluator

> Measures how relevant the response is with respect to the query.

It's important to evaluate the final response because AI models can generate responses that are irrelevant to the user's query. To address this, you can use RelevanceEvaluator, which measures how effectively a response addresses a query. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given query. Higher scores mean better relevance.

In [12]:
relevance_evaluator = RelevanceEvaluator(model_config=model_config, threshold=3)

In [13]:
relevance_evaluator(query="Is Marie Curie is born in Paris?",
                    response="No, Marie Curie is born in Warsaw.")

{'relevance': 4.0,
 'gpt_relevance': 4.0,
 'relevance_reason': 'The response fully and accurately answers the question, providing all essential details for understanding.',
 'relevance_result': 'pass',
 'relevance_threshold': 3}

## Response Completeness Evaluator

> Measures to what extent the response is complete (not missing critical information) with respect to the ground truth.

AI systems can fabricate content or generate irrelevant responses outside the given context. Given a ground truth response, ResponseCompletenessEvaluator captures the recall aspect of response alignment with the expected response. It complements GroundednessEvaluator, which captures the precision aspect of response alignment with the grounding source.

In [14]:
response_completeness_evaluator = ResponseCompletenessEvaluator(model_config=model_config, threshold=3)

Class ResponseCompletenessEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [15]:
response_completeness_evaluator(
    response=
    "Based on the retrieved documents, the shareholder meeting discussed the operational efficiency of the company and financing options.",
    ground_truth=
    "The shareholder meeting discussed the compensation package of the company CEO."
)

{'response_completeness': 1,
 'response_completeness_result': 'fail',
 'response_completeness_threshold': 3,
 'response_completeness_reason': "The response does not include any information about the CEO's compensation package, which is the sole topic in the ground truth. Therefore, it is fully incomplete."}