# Evaluate a Fabric Data Agent

In this notebook, we'll walk through how to evaluate a Fabric Data Agent using the `fabric-data-agent-sdk`. We'll cover the full workflow, including:

- ✅ Creating a new Data Agent from the SDK
- 🗂️ Adding data sources and selecting relevant tables
- 📋 Defining a ground truth dataset with questions and expected answers
- 🧪 Running an automated evaluation to compare actual vs. expected responses
- 📈 Reviewing evaluation summaries and detailed results

This end-to-end example is designed to help you validate the accuracy of your Data Agent and iterate on improvements with structured feedback.

Let's get started!


> **Prerequisite: Load Sample Data into the Lakehouse**

Before running this notebook, make sure you’ve created a Lakehouse and loaded the sample **AdventureWorks** dataset.

Follow the steps in the official guide to create the Lakehouse and populate it with sample tables:
👉 [Create a Lakehouse with AdventureWorksLH](https://learn.microsoft.com/en-us/fabric/data-science/data-agent-scenario#create-a-lakehouse-with-adventureworkslh)

This ensures that the required tables are available for your Data Agent to access during evaluation.


## Install Fabric Data Agent SDK

Before we begin, install the latest version of the `fabric-data-agent-sdk`. This SDK provides all the tools you need to create, configure, and evaluate your Data Agent programmatically.

Run the following cell to install or upgrade the SDK in your notebook environment:


In [None]:
%pip install -U fabric-data-agent-sdk

## Connect to a Data Agent

Now that our data is available in the Lakehouse, we’ll create a new **Fabric Data Agent** using the Python SDK.

In this step:
- We define a name for the agent (e.g., `"AdvWorksDataAgent"`)
- We use `create_data_agent()` to create a new agent instance
- Alternatively, we use `FabricDataAgentManagement()` to connect to an existing agent with the same name

This agent will be configured to understand your data and respond to natural language questions.

In [None]:
from fabric.dataagent.client import (
    FabricDataAgentManagement,
    create_data_agent,
    delete_data_agent,
)

# Define the name for the Data Agent
data_agent_name = "AdvWorksDataAgent"

# Create a new Data Agent (run this once)
data_agent = create_data_agent(data_agent_name)

# If the Data Agent already exists, use this instead to connect:
# data_agent = FabricDataAgentManagement(data_agent_name)


In this step, we configure the Data Agent to work with a **Lakehouse** data source.

- We specify the Lakehouse name (e.g., `EvaluationLH`)
- Optionally, we register it with the agent if it hasn’t been added yet
- We then select specific tables from the `dbo` schema that the agent should use to answer questions

These tables will form the structured foundation the agent relies on to generate accurate responses.

In [None]:
# Add a Lakehouse as the data source for the agent
lakehouse_name = "EvaluationLH"

# Supported types include: "lakehouse", "kqldatabase", "datawarehouse", or "semanticmodel"
data_agent.add_datasource(lakehouse_name, type="lakehouse")

# Retrieve the data source object (assumes one was added)
datasource = data_agent.get_datasources()[0]

# Select relevant tables from the Lakehouse (schema: dbo)
tables = [
    "dimcustomer",
    "dimdate",
    "dimgeography",
    "dimproduct",
    "dimproductcategory",
    "dimpromotion",
    "dimreseller",
    "dimsalesterritory",
    "factinternetsales",
    "factresellersales",
]
for table in tables:
    datasource.select("dbo", table)

# Publish the data agent
data_agent.publish()



## Define ground truth questions and expected answers

To evaluate the accuracy of your Data Agent, you'll need a test dataset consisting of natural language questions and their expected answers.

In this step:
- We define a small set of ground truth examples using a pandas DataFrame
- Each row contains a `question` and the `expected_answer`
- You can customize these examples based on the data and use cases relevant to your agent

Optionally, you can load this dataset from a CSV file if you're working with a larger or pre-curated set of evaluation cases.
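
One detail worth checking before loading ground truth from a CSV file is quoting: several expected answers in this notebook contain commas (for example `45,694.7`), so those fields must be quoted to stay in one column. The sketch below uses Python's standard `csv` module and an in-memory buffer instead of a real file. The two rows are copied from this notebook's examples, and nothing here is part of the SDK.

```python
import csv
import io

# Two ground truth rows copied from this notebook's examples.
rows = [
    ("What were our total sales in 2014?", "45,694.7"),
    ("What is the most sold product?", "Mountain-200 Black, 42"),
]

# Write the rows in "question,expected_answer" format.
# csv.writer quotes any field that contains a comma.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question", "expected_answer"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and confirm each answer survived as a single field.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
for record in parsed:
    print(record["question"], "->", record["expected_answer"])
```

`pd.read_csv` also honors double-quoted fields by default, so a file written this way should load into the two expected columns.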

In [None]:
import pandas as pd

# Create a DataFrame with the columns "question" and "expected_answer".
# Update the questions and expected answers to match your own data.
df = pd.DataFrame(columns=["question", "expected_answer"],
                  data=[
                    ["What were our total sales in 2014?", "45,694.7"],
                    ["What is the most sold product?", "Mountain-200 Black, 42"],
                    ["What are the most expensive items that have never been sold?", "Road-450 Red, 60"],
                ])

# You can also load the data from a CSV file in the format "question,expected_answer"
# input_file_path = "/lakehouse/default/Files/Data/Input/groundtruth.csv"
# df = pd.read_csv(input_file_path)


## Configure Evaluation Parameters

Before running the evaluation, we define a few optional parameters to control where and how results are stored:

- `workspace_name`: (Optional) Set this if your Data Agent is located in a different workspace.
- `table_name`: (Optional) The base name of the output table where evaluation results are stored. Defaults to `evaluation_output`. The evaluation creates two tables:
  - `<table_name>`: A summary of the evaluation results.
  - `<table_name>_steps`: A detailed log of reasoning steps for each question.
- `data_agent_stage`: (Optional) Set to `"sandbox"/"draft"` or `"production"/"publishing"` depending on which version of the agent you want to evaluate. Defaults to production.
- `max_workers`: (Optional) Maximum number of workers that run in parallel. Defaults to 5.
- `no_of_variations`: (Optional) Number of times to evaluate each question. Defaults to 1.

These settings help you organize and retrieve evaluation outputs from your Lakehouse environment.
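
As a rough illustration of the naming rule above, the sketch below derives the two output table names from a base name and keeps the optional settings together in one dictionary. `output_table_names` and `eval_settings` are hypothetical names introduced here for illustration; they are not part of the SDK.

```python
def output_table_names(table_name="evaluation_output"):
    """Return the (summary, steps) table names for a given base name.

    Hypothetical helper: it only mirrors the naming rule described above.
    """
    return table_name, f"{table_name}_steps"


# Collect the optional settings in one dict so they can be reviewed together.
eval_settings = {
    "workspace_name": None,
    "table_name": "demo_evaluation_output",
    "data_agent_stage": "sandbox",
    "max_workers": 4,
    "no_of_variations": 3,
}

summary_table, steps_table = output_table_names(eval_settings["table_name"])
print(f"Summary table: {summary_table}")
print(f"Steps table:   {steps_table}")
```

Because the dictionary keys match the keyword arguments used in this notebook, the settings could be passed as `evaluate_data_agent(df, data_agent_name, **eval_settings)`.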


In [None]:
from fabric.dataagent.evaluation import evaluate_data_agent


# Workspace name (optional): set this if the Data Agent is in a different workspace
workspace_name = None

# Table name (optional) for storing the evaluation results. The default is 'evaluation_output'.
# The evaluation creates two tables: <table_name> for the evaluation output
# and <table_name>_steps for the detailed steps.
table_name = "demo_evaluation_output"

# Data Agent stage, i.e., sandbox or production. The default is production.
data_agent_stage = "sandbox"

# Number of parallel workers to use for evaluation
max_workers = 4

# Number of variations to generate for each question
no_of_variations = 3

## Run the Evaluation

Now we're ready to evaluate the Data Agent using the ground truth dataset we defined earlier.

The `evaluate_data_agent()` function will:
- Run each question against the Data Agent
- Compare the actual response to the expected answer
- Log results and reasoning steps to the specified Lakehouse tables

It returns a unique `evaluation_id` which you can use to retrieve summaries or detailed results later.

Let's run the evaluation and capture the ID for this run.


In [None]:
# Evaluate the Data Agent. Returns the unique ID for the evaluation run
evaluation_id = evaluate_data_agent(
    df,
    data_agent_name,
    workspace_name=workspace_name,
    table_name=table_name,
    data_agent_stage=data_agent_stage,
    max_workers=max_workers,
    no_of_variations=no_of_variations,
)

print(f"Unique Id for the current evaluation run: {evaluation_id}")

## View evaluation summary

After the evaluation run completes, you can retrieve a high-level summary using the `get_evaluation_summary()` function.

This summary includes:
- Total number of questions evaluated
- Counts of correct, incorrect, and unclear responses
- Overall accuracy metrics

Use this step to quickly assess how well your Data Agent performed.
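
To make the summary metrics concrete, here is a minimal sketch of how counts and an accuracy figure could be derived from a list of outcomes. The outcome list is made up for illustration, and treating accuracy as the share of `true` outcomes is an assumption; the SDK may compute its metrics differently.

```python
from collections import Counter

# Hypothetical per-response outcomes; a real run stores these in the output table.
outcomes = ["true", "true", "false", "unclear", "true", "false"]

counts = Counter(outcomes)
total = len(outcomes)

# Assumption: accuracy is the share of responses judged "true".
accuracy = counts["true"] / total if total else 0.0

print(f"Total evaluated: {total}")
print(f"true={counts['true']} false={counts['false']} unclear={counts['unclear']}")
print(f"Accuracy: {accuracy:.1%}")
```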


In [None]:
# Import the function to retrieve evaluation summaries
from fabric.dataagent.evaluation import get_evaluation_summary

# Retrieve the summary of the evaluation results using the specified table name
# This returns a DataFrame with aggregated metrics like counts of true/false/unclear responses
eval_summary_df = get_evaluation_summary(table_name, verbose=True)


## View evaluation summary per question

You can retrieve a high-level summary per question using the `get_evaluation_summary_per_question()` function. It accepts the following parameters:
- `evaluation_id`: (Optional) Unique identifier of the evaluation run. If omitted, the summary covers all evaluation runs in the output table.
- `table_name`: (Optional) The base name of the output table where evaluation results are stored. Defaults to `evaluation_output`.
- `verbose`: (Optional) Set to `True` to print a summary alongside the DataFrame.

This summary includes:
- Total number of questions evaluated
- Counts of correct, incorrect, and unclear responses per question
- Overall accuracy metrics

Use this step to see which questions your Data Agent answers reliably and which ones need attention.
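
When each question is evaluated more than once (`no_of_variations` greater than 1), a per-question view can reveal answers that change between runs. The sketch below groups made-up `(question, outcome)` pairs and flags questions with mixed outcomes. The data and the `unstable` list are illustrative assumptions, not SDK output.

```python
from collections import Counter, defaultdict

# Hypothetical (question, outcome) pairs, three evaluations per question.
results = [
    ("What were our total sales in 2014?", "true"),
    ("What were our total sales in 2014?", "true"),
    ("What were our total sales in 2014?", "true"),
    ("What is the most sold product?", "true"),
    ("What is the most sold product?", "false"),
    ("What is the most sold product?", "unclear"),
]

# Tally outcomes separately for each question.
per_question = defaultdict(Counter)
for question, outcome in results:
    per_question[question][outcome] += 1

# A question is flagged as unstable when its runs produced more than one outcome.
unstable = [q for q, tally in per_question.items() if len(tally) > 1]

for question, tally in per_question.items():
    runs = sum(tally.values())
    print(f"{question}: {tally['true']}/{runs} judged true")
print("Unstable questions:", unstable)
```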


In [None]:
# Import the function to retrieve evaluation summaries
from fabric.dataagent.evaluation import get_evaluation_summary_per_question

# Retrieve the summary of the evaluation results per question using the specified evaluation_id and table name
# This returns a DataFrame with aggregated metrics like counts of true/false/unclear responses
eval_summary_df = get_evaluation_summary_per_question(evaluation_id, table_name, verbose=True)


### Display the DataFrame
To display the DataFrame without truncating long column values, use the following cell.

In [None]:
import pandas as pd

with pd.option_context('display.max_colwidth', None):
    display(eval_summary_df)

## Retrieve Detailed Evaluation Results

To analyze the agent's performance question-by-question, use the `get_evaluation_details()` function.

This provides a detailed view of:
- The original question
- The expected answer
- The agent's actual response
- The evaluation outcome (`true`, `false`, or `unclear`)
- A link to the Fabric thread (accessible only to the evaluator)

You can also control:
- `get_all_rows`: Set to `True` to return both successful and failed evaluations (defaults to `False`, which returns only failed cases).
- `verbose`: Set to `True` to print a summary alongside the DataFrame.

This is especially useful for debugging incorrect responses and improving your agent's accuracy over time.
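
The effect of `get_all_rows` can be pictured with a small filter over plain dictionaries. This is only a sketch: the `evaluation` key, the sample rows, and the `select_rows` helper are hypothetical, and it assumes that "failed" covers both `false` and `unclear` outcomes, which the SDK may define differently.

```python
# Hypothetical detail rows; the real DataFrame has more columns than shown here.
details = [
    {"question": "What were our total sales in 2014?", "evaluation": "true"},
    {"question": "What is the most sold product?", "evaluation": "false"},
    {
        "question": "What are the most expensive items that have never been sold?",
        "evaluation": "unclear",
    },
]


def select_rows(rows, get_all_rows=False):
    """Return all rows, or only those not judged "true".

    Assumption: "failed" covers both "false" and "unclear" outcomes.
    """
    if get_all_rows:
        return list(rows)
    return [row for row in rows if row["evaluation"] != "true"]


failed = select_rows(details)
everything = select_rows(details, get_all_rows=True)
print(f"{len(failed)} of {len(everything)} rows need review")
for row in failed:
    print(f"- [{row['evaluation']}] {row['question']}")
```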


In [None]:
# Import the function to retrieve detailed evaluation results
from fabric.dataagent.evaluation import get_evaluation_details

# Unique identifier for the evaluation run (already captured earlier)
# You can hardcode an ID here if needed
# evaluation_id = 'd36ce205-a88d-42bd-927d-260ec2e2a479'

# Whether to return all evaluation results (True) or only failed ones (False, default)
get_all_rows = True

# Whether to print a summary of the evaluation results to the console (optional)
verbose = True

# Fetch detailed evaluation results as a DataFrame
# This includes question, expected answer, actual answer, evaluation status, and diagnostic info
eval_details_df = get_evaluation_details(
    evaluation_id,
    table_name,
    get_all_rows=get_all_rows,
    verbose=verbose
)


## Use a custom prompt to evaluate agent responses

In some cases, simple string matching may not be sufficient to determine whether the agent's response is correct, especially when responses vary in format but are semantically equivalent.

You can define a **custom critic prompt** using the `critic_prompt` parameter in `evaluate_data_agent()`. This prompt will be used by an LLM to decide whether the actual answer is equivalent to the expected answer.

The prompt must include the following placeholders:
- `{query}`: The original user question
- `{expected_answer}`: The expected result
- `{actual_answer}`: The agent's generated response

Once the evaluation is complete, you can retrieve the summary results using `get_evaluation_summary()` and track the run using the printed `evaluation_id`.

This method gives you more flexibility in how you assess correctness, especially for complex or domain-specific outputs.
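
Before starting a run, it can help to confirm that a prompt contains all three placeholders. The sketch below checks a template with Python's standard `string.Formatter` and previews the filled-in text with `str.format`. `missing_placeholders` is a hypothetical helper, and the assumption that the SDK fills the prompt in `str.format` style is not confirmed by this notebook.

```python
from string import Formatter

REQUIRED_PLACEHOLDERS = {"query", "expected_answer", "actual_answer"}


def missing_placeholders(prompt):
    """Return the required placeholders that do not appear in the prompt."""
    found = {name for _, name, _, _ in Formatter().parse(prompt) if name}
    return REQUIRED_PLACEHOLDERS - found


prompt = (
    "Query: {query}\n"
    "Expected Answer: {expected_answer}\n"
    "Actual Answer: {actual_answer}\n"
    "Is the actual answer equivalent to the expected answer?"
)

missing = missing_placeholders(prompt)
print("Missing placeholders:", sorted(missing))

# Fill the template with str.format to preview what the critic would read.
preview = prompt.format(
    query="What is the most sold product?",
    expected_answer="Mountain-200 Black, 42",
    actual_answer="(agent response goes here)",
)
print(preview)
```

A non-empty result from `missing_placeholders` indicates that the prompt should be corrected before it is passed as `critic_prompt`.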


In [None]:
from fabric.dataagent.evaluation import evaluate_data_agent, get_evaluation_summary

# Define a custom prompt to evaluate whether the agent's actual response matches the expected answer.
# The prompt should include placeholders: {query}, {expected_answer}, and {actual_answer}
critic_prompt = """
        Given the following query, expected answer, and actual answer, please determine if the actual answer is equivalent to the expected answer. If they are equivalent, respond with 'yes'.

        Query: {query}

        Expected Answer:
        {expected_answer}

        Actual Answer:
        {actual_answer}

        Is the actual answer equivalent to the expected answer?
        """

# Evaluate the Data Agent using the custom critic prompt
# Returns a unique evaluation ID for tracking and analysis
evaluation_id_critic = evaluate_data_agent(
    df,
    data_agent_name,
    critic_prompt=critic_prompt,
    table_name=table_name,
    data_agent_stage="sandbox"
)

# Retrieve the summary of this evaluation run
eval_summary_df_critic = get_evaluation_summary(table_name)

# Display the unique ID for reference
print(f"Unique Id for the current evaluation run: {evaluation_id_critic}")

eval_summary_df_critic
