In [0]:
%load_ext autoreload
%autoreload 2
# Enables autoreload; learn more at https://docs.databricks.com/en/files/workspace-modules.html#autoreload-for-python-modules
# To disable autoreload, run %autoreload 0

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Hands-On Lab: Building Agent Systems with Databricks

## Part 2 - Agent Evaluation
Now that we've created an agent, how do we evaluate its performance?
For the second part, we're going to create a product support agent so we can focus on evaluation.
This agent will use a RAG approach to help answer questions about products using the product documentation.

### 2.1 Define our new Agent and retriever tool
- [**agent.py**]($./agent.py): An example Agent has been configured - first we'll explore this file and understand the building blocks
- **Vector Search**: We've created a Vector Search endpoint that can be queried to find related documentation about a specific product.
- **Create Retriever Function**: Define some properties about our retriever and package it so it can be called by our LLM.
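
To make the retriever idea concrete, here is a minimal sketch of how a retriever can be packaged as a tool an LLM can call. It is not the code in `agent.py`: the in-memory `DOCS` list, the keyword-overlap scoring, and the names `search_product_docs` and `RETRIEVER_TOOL_SPEC` are hypothetical stand-ins for the Vector Search endpoint and its index.

```python
import re

# Hypothetical in-memory stand-in for the product documentation index.
DOCS = [
    {"product": "Aria Modern Bookshelf",
     "text": "Available in natural oak, black, and white finishes."},
    {"product": "Aurora Oak Coffee Table",
     "text": "Clean with a soft, slightly damp cloth. Avoid abrasive cleaners."},
]


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_product_docs(query: str, num_results: int = 1) -> list[dict]:
    """Rank documents by keyword overlap (a toy substitute for vector similarity)."""
    query_tokens = _tokens(query)
    scored = [
        (len(query_tokens & _tokens(doc["product"] + " " + doc["text"])), doc)
        for doc in DOCS
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no tokens with the query.
    return [doc for score, doc in scored[:num_results] if score > 0]


# Function-calling style description that tells the LLM when and how to call the tool.
RETRIEVER_TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "search_product_docs",
        "description": "Retrieve product documentation relevant to a customer question.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

hits = search_product_docs("What color options are available for the Aria Modern Bookshelf?")
print(hits[0]["product"])
```

The tool description matters as much as the function body: the LLM only sees the name, description, and parameter schema when deciding whether to retrieve.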

### 2.2 Create Evaluation Dataset
- We've provided an example evaluation dataset - though you can also generate this [synthetically](https://www.databricks.com/blog/streamline-ai-agent-evaluation-with-new-synthetic-data-capabilities).

### 2.3 Run mlflow.genai.evaluate()
- MLflow will take your evaluation dataset and test your agent's responses against it
- LLM Judges will score the outputs and collect everything in a nice UI for review
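
Conceptually, an evaluation run is a loop: call the agent on each dataset row, then apply every scorer to the result. The sketch below illustrates that loop in plain Python. It is not how MLflow implements `mlflow.genai.evaluate()`; `run_evaluation`, `fake_agent`, and the two deterministic scorers are hypothetical, and real LLM judges are model calls rather than simple functions.

```python
def run_evaluation(data, predict_fn, scorers):
    """Call predict_fn on each row's inputs, then apply every scorer to the output."""
    rows = []
    for record in data:
        # The keys of "inputs" are passed as keyword arguments to predict_fn.
        output = predict_fn(**record["inputs"])
        scores = {
            name: scorer(record["inputs"], output)
            for name, scorer in scorers.items()
        }
        rows.append({"inputs": record["inputs"], "output": output, "scores": scores})
    return rows


def fake_agent(query: str) -> str:
    """Hypothetical stand-in for the real agent."""
    return f"Here is what I found about: {query}"


def mentions_query(inputs, output) -> bool:
    """Toy scorer: does the answer repeat the question text?"""
    return inputs["query"].lower() in output.lower()


def is_short(inputs, output) -> bool:
    """Toy scorer: is the answer under 200 characters?"""
    return len(output) < 200


results = run_evaluation(
    data=[{"inputs": {"query": "jacket sizes"}}],
    predict_fn=fake_agent,
    scorers={"mentions_query": mentions_query, "is_short": is_short},
)
print(results[0]["scores"])
```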

### 2.4 Make Needed Improvements and re-run Evaluations
- Take feedback from our evaluation run and change the prompt
- Run evals again and see the improvement!

In [0]:
%pip install -U -qqqq backoff databricks-openai uv databricks-agents mlflow-skinny[databricks] unitycatalog-langchain[databricks] databricks-langchain

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-server 1.23.4 requires anyio<4,>=3.1.0, but you have anyio 4.12.1 which is incompatible.
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
dbutils.library.restartPython()

In [0]:
# --- Lab hygiene: suppress known non-actionable warnings ---
import warnings, logging

# Pydantic v2 serializer warnings (safe to ignore for this lab)
warnings.filterwarnings(
    "ignore",
    message=r"^Pydantic serializer warnings:",
    category=UserWarning,
    module=r"pydantic\..*",
)

# MLflow tracing warnings (otel not fully enabled in this runtime)
logging.getLogger("mlflow.tracing").setLevel(logging.ERROR)
logging.getLogger("mlflow.tracing.fluent").setLevel(logging.ERROR)

In [0]:
%reload_ext autoreload

In [0]:
from agent import AGENT

AGENT.predict({"input": [{"role": "user", "content": "Hello, what do you do?"}]})



ResponsesAgentResponse(tool_choice=None, truncation=None, id=None, created_at=None, error=None, incomplete_details=None, instructions=None, metadata=None, model=None, object='response', output=[OutputItem(type='reasoning', summary=[{'type': 'summary_text', 'text': 'The user asks "Hello, what do you do?" They want to know what the assistant does. As a customer success specialist for Databricks lab. We should respond explaining role. No need to call any tool.'}], id='chatcmpl_b5de0041-b105-4d0f-bfca-c66047ea4c42'), OutputItem(type='message', id='chatcmpl_b5de0041-b105-4d0f-bfca-c66047ea4c42', content=[{'text': 'Hi there! I’m a Customer Success Specialist for the Databricks Labs program. My role is to help you get the most out of the Databricks platform and any lab‑related resources we provide. \n\nHere’s a quick rundown of what I can do for you:\n\n- **Answer product‑related questions** about Spark, Delta Lake, MLflow, Unity Catalog, and other Databricks components.  \n- **Guide you thro

Trace(trace_id=tr-848126690fd65db04f2d619119dd9d64)

### Log the `agent` as an MLflow model
Log the agent as code from the [agent.py]($./agent.py) file. See [MLflow - Models from Code](https://mlflow.org/docs/latest/models.html#models-from-code).

In [0]:
# Determine Databricks resources to specify for automatic auth passthrough at deployment time
import mlflow
from agent import VECTOR_SEARCH_TOOLS, LLM_ENDPOINT_NAME
from databricks_openai import UCFunctionToolkit, VectorSearchRetrieverTool
from mlflow.models.resources import DatabricksFunction, DatabricksServingEndpoint
from unitycatalog.ai.langchain.toolkit import UnityCatalogTool


resources = [DatabricksServingEndpoint(endpoint_name=LLM_ENDPOINT_NAME)]
for tool in VECTOR_SEARCH_TOOLS:
    if isinstance(tool, VectorSearchRetrieverTool):
        resources.extend(tool.resources)
    elif isinstance(tool, UnityCatalogTool):
        resources.append(DatabricksFunction(function_name=tool.uc_function_name))

input_example = {
    "input": [
        {"role": "user", "content": "What color options are available for the Aria Modern Bookshelf?"}
    ],
}

with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        artifact_path="agent",
        python_model="agent.py",  
        input_example=input_example,
        resources=resources,
        extra_pip_requirements=[
            "mlflow>=3.1.3",
            "databricks-agents>=1.1.0",
            "databricks-openai",
        ],
    )

🔗 View Logged Model at: https://dbc-7aabc6fb-674f.cloud.databricks.com/ml/experiments/3390865860258818/models/m-63886b62e213408db61c7e91b4f5fa81?o=7474655838774581
2026/02/13 09:29:07 INFO mlflow.pyfunc: Predicting on input example to validate output


In [0]:
# Load the model and create a prediction function
# Use the URI returned by log_model instead of hand-building a runs:/ path
logged_model_uri = logged_agent_info.model_uri
loaded_model = mlflow.pyfunc.load_model(logged_model_uri)

def predict_wrapper(query):
    model_input = {
        "input": [
            {"role": "user", "content": query}
        ]
    }
    response = loaded_model.predict(model_input)
    # Find the last output item of type "message"
    message = next(
        (item for item in reversed(response["output"]) if item.get("type") == "message"),
        None
    )
    if message and "content" in message:
        # Find the first content item of type "output_text"
        content_item = next(
            (c for c in message["content"] if c.get("type") == "output_text"),
            None
        )
        if content_item:
            return content_item["text"]
    return None

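
The wrapper above depends on the shape of the agent's response: a list of output items in which a `reasoning` item can precede the final `message`. The sketch below exercises the same extraction logic on a hand-written sample so the shape is easy to see. `extract_output_text` and `sample_response` are hypothetical, and the sample text is invented for illustration.

```python
def extract_output_text(response: dict):
    """Return the first output_text of the last message item, or None if absent."""
    for item in reversed(response.get("output", [])):
        if item.get("type") != "message":
            continue  # skip reasoning items and any other non-message output
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                return part["text"]
        return None  # the last message had no output_text part
    return None


# Hand-written sample mirroring the structure of the agent response shown earlier.
sample_response = {
    "output": [
        {"type": "reasoning",
         "summary": [{"type": "summary_text", "text": "No tool call needed."}]},
        {"type": "message",
         "content": [{"type": "output_text", "text": "Hi there!"}]},
    ]
}

print(extract_output_text(sample_response))
```

Returning `None` instead of raising keeps one malformed response from aborting a whole evaluation run, at the cost of scoring that row on an empty answer.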

## Evaluate the agent with [Agent Evaluation](https://docs.databricks.com/generative-ai/agent-evaluation/index.html)

You can edit the requests or expected facts in your evaluation dataset and rerun the evaluation as you iterate on your agent, using MLflow to track the computed quality metrics.

In [0]:
import pandas as pd

data = {
    "request": [
        "What color options are available for the Aria Modern Bookshelf?",
        "How should I clean the Aurora Oak Coffee Table to avoid damaging it?",
        "What sizes are available for the StormShield Pro Men's Weatherproof Jacket?"
    ],
    "expected_facts": [
        [
            "The Aria Modern Bookshelf is available in natural oak finish",
            "The Aria Modern Bookshelf is available in black finish",
            "The Aria Modern Bookshelf is available in white finish"
        ],
        [
            "Use a soft, slightly damp cloth for cleaning.",
            "Avoid using abrasive cleaners."
        ],
        [
            "The available sizes for the StormShield Pro Men's Weatherproof Jacket are Small, Medium, Large, XL, and XXL."
        ]
    ]
}

eval_dataset = pd.DataFrame(data)
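
When you edit the dataset by hand, it is easy to leave the two lists out of step. The following is a small, hypothetical sanity check (the `validate_eval_data` helper is not part of the lab or of MLflow) that works on a dictionary shaped like `data` above.

```python
def validate_eval_data(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    requests = data.get("request", [])
    facts = data.get("expected_facts", [])
    if len(requests) != len(facts):
        problems.append(
            f"{len(requests)} requests but {len(facts)} expected_facts entries"
        )
    for index, (request, fact_list) in enumerate(zip(requests, facts)):
        if not request.strip():
            problems.append(f"row {index}: empty request")
        if not fact_list:
            problems.append(f"row {index}: no expected facts")
    return problems


# Made-up example with two deliberate mistakes in the second row.
broken = {"request": ["How do I clean the table?", " "],
          "expected_facts": [["Use a soft, slightly damp cloth for cleaning."], []]}
print(validate_eval_data(broken))
```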

In [0]:
from mlflow.genai.scorers import RetrievalGroundedness, RelevanceToQuery, Safety, Guidelines
import mlflow.genai

eval_data = []
for request, facts in zip(data["request"], data["expected_facts"]):
    eval_data.append({
        "inputs": {
            "query": request  # Key must match the predict_wrapper parameter name
        },
        # mlflow.genai.evaluate reads ground truth from the "expectations" field
        "expectations": {"expected_facts": facts},
    })

# Define custom scorers tailored to product information evaluation
scorers = [
    #RetrievalGroundedness(),  # Pre-defined judge that checks against retrieval results
    RelevanceToQuery(),  # Checks if answer is relevant to the question
    Safety(),  # Checks for harmful or inappropriate content
    Guidelines(
        guidelines="""Response must be clear and direct:
        - Answers the exact question asked
        - Uses lists for options, steps for instructions
        - No marketing fluff or extra background
        - Does not tell user to contact customer support
        - Concise but complete.""",
        name="clarity_and_structure",
    ),
    #Guidelines(
    #    guidelines="""Response must include ALL expected facts:
    #    - Lists ALL colors/sizes if relevant (not partial lists)
    #    - States EXACT specs if relevant (e.g., "5 ATM" not "water resistant")
    #    - Includes ALL cleaning steps if asked
    #    Fails if ANY fact is missing or wrong.""",
    #    name="completeness_and_accuracy",
    #)
]

In [0]:
print("Running evaluation...")
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_wrapper, 
        scorers=scorers,
    )

Running evaluation...


2026/02/13 09:31:08 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.
2026/02/13 09:31:08 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.



## Let's go back to the [agent.py]($./agent.py) file, change our prompt to better fit how we'd like it to respond, and re-evaluate.
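
One way to judge whether a prompt change helped is to compare per-scorer pass rates between the two evaluation runs. The sketch below assumes you have already reduced each run to a list of `{scorer_name: passed}` dictionaries, one per dataset row. The helper names and the pass/fail values are invented for illustration and are not results from this lab.

```python
def pass_rates(rows: list[dict]) -> dict:
    """Fraction of rows that passed, per scorer."""
    totals = {}
    for row in rows:
        for name, passed in row.items():
            hits, count = totals.get(name, (0, 0))
            totals[name] = (hits + int(passed), count + 1)
    return {name: hits / count for name, (hits, count) in totals.items()}


def compare_runs(before: list[dict], after: list[dict]) -> dict:
    """Change in pass rate (after minus before) for scorers present in both runs."""
    before_rates, after_rates = pass_rates(before), pass_rates(after)
    return {
        name: round(after_rates[name] - before_rates[name], 3)
        for name in before_rates
        if name in after_rates
    }


# Invented placeholder values, not real evaluation output.
before = [
    {"clarity_and_structure": False, "safety": True},
    {"clarity_and_structure": False, "safety": True},
    {"clarity_and_structure": True, "safety": True},
]
after = [{"clarity_and_structure": True, "safety": True}] * 3

print(compare_runs(before, after))
```

With only three rows, a single flipped judgment moves a pass rate by a third, so treat small differences as noise until the dataset is larger.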

## Register the model to Unity Catalog

Update `catalog_name`, `schema_name`, and `model_name` below to register the MLflow model to Unity Catalog.

In [0]:
from databricks.sdk import WorkspaceClient
import os

mlflow.set_registry_uri("databricks-uc")

# Use the workspace client to retrieve information about the current user
w = WorkspaceClient()
display_name = w.current_user.me().display_name
# Drop any "@domain" suffix; the lab names each user's catalog after this prefix
username = display_name.split("@")[0]

# The lab environment has already created this catalog and schema
catalog_name = username
schema_name = "agents"

# TODO: define the catalog, schema, and model name for your UC model
model_name = "product_agent"
UC_MODEL_NAME = f"{catalog_name}.{schema_name}.{model_name}"

# register the model to UC
uc_registered_model_info = mlflow.register_model(model_uri=logged_agent_info.model_uri, name=UC_MODEL_NAME)

Successfully registered model 'labuser13792508_1770974336.agents.product_agent'.



🔗 Created version '1' of model 'labuser13792508_1770974336.agents.product_agent': https://dbc-7aabc6fb-674f.cloud.databricks.com/explore/data/models/labuser13792508_1770974336/agents/product_agent/version/1?o=7474655838774581
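
Unity Catalog model names have three levels, `catalog.schema.model`, as the registered name above shows. If you adapt this notebook to your own workspace, a small guard like the hypothetical `build_uc_model_name` below can catch an empty or malformed part before `mlflow.register_model` is called. The "no dots inside a part" rule is a simplifying assumption, not the full Unity Catalog naming specification.

```python
def build_uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Join the three parts into catalog.schema.model after basic validation."""
    parts = {"catalog": catalog, "schema": schema, "model": model}
    for label, value in parts.items():
        if not value or not value.strip():
            raise ValueError(f"{label} must not be empty")
        if "." in value:
            # Assumption: a dot inside a part would be read as a level separator.
            raise ValueError(f"{label} {value!r} must not contain '.'")
    return ".".join(parts.values())


# "my_catalog" is a placeholder; the schema and model names come from this notebook.
print(build_uc_model_name("my_catalog", "agents", "product_agent"))
```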


In [0]:
from IPython.display import display, HTML

# Retrieve the Databricks host URL
workspace_url = spark.conf.get('spark.databricks.workspaceUrl')

# Create HTML link to created agent
html_link = f'<a href="https://{workspace_url}/explore/data/models/{catalog_name}/{schema_name}/{model_name}" target="_blank">Go to Unity Catalog to see Registered Agent</a>'
display(HTML(html_link))

## Deploy the agent

##### Note: This is disabled for lab users but will work on your own workspace

In [0]:
from databricks import agents

# Deploy the model to the review app and a model serving endpoint.
# In the lab environment this call is disabled; the agent has already been deployed for you.
agents.deploy(UC_MODEL_NAME, uc_registered_model_info.version, tags = {"endpointSource": "DI Days"})


For more information, see: https://docs.databricks.com/aws/en/generative-ai/agent-framework/feedback-model



    Deployment of labuser13792508_1770974336.agents.product_agent version 1 initiated.  This can take up to 15 minutes and the Review App & Query Endpoint will not work until this deployment finishes.

    View status: https://dbc-7aabc6fb-674f.cloud.databricks.com/ml/endpoints/agents_labuser13792508_1770974336-agents-product_agent/?o=7474655838774581
    Review App: https://dbc-7aabc6fb-674f.cloud.databricks.com/ml/review-v2/c9b541e1f1354889ac47c0dff1eba4b2/chat?o=7474655838774581

You can refer back to the links above from the endpoint detail page at https://dbc-7aabc6fb-674f.cloud.databricks.com/ml/endpoints/agents_labuser13792508_1770974336-agents-product_agent/?o=7474655838774581.

To set up monitoring for your deployed agent, see:
https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/production-monitoring


Deployment(model_name='labuser13792508_1770974336.agents.product_agent', model_version='1', endpoint_name='agents_labuser13792508_1770974336-agents-product_agent', served_entity_name='labuser13792508_1770974336-agents-product_agent_1', query_endpoint='https://dbc-7aabc6fb-674f.cloud.databricks.com/serving-endpoints/agents_labuser13792508_1770974336-agents-product_agent/served-models/labuser13792508_1770974336-agents-product_agent_1/invocations?o=7474655838774581', endpoint_url='https://dbc-7aabc6fb-674f.cloud.databricks.com/ml/endpoints/agents_labuser13792508_1770974336-agents-product_agent/?o=7474655838774581', review_app_url='https://dbc-7aabc6fb-674f.cloud.databricks.com/ml/review-v2/c9b541e1f1354889ac47c0dff1eba4b2/chat?o=7474655838774581')
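
Once the deployment finishes, the serving endpoint can be queried over HTTPS. The sketch below only builds the request and does not send it. It assumes the endpoint accepts the same `input` payload used with `AGENT.predict` earlier and is reachable at `/serving-endpoints/<endpoint_name>/invocations`. The host, endpoint name, and token are placeholders, and `build_invocation_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_invocation_request(workspace_host: str, endpoint_name: str,
                             token: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an agent serving endpoint."""
    url = f"https://{workspace_host}/serving-endpoints/{endpoint_name}/invocations"
    # Assumed payload shape: same structure passed to AGENT.predict in this notebook.
    payload = {"input": [{"role": "user", "content": question}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; substitute your own workspace host, endpoint name, and token.
request = build_invocation_request(
    "example.cloud.databricks.com",
    "my-agent-endpoint",
    "<your-token>",
    "What color options are available for the Aria Modern Bookshelf?",
)
print(request.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen` and parse the JSON body. Keep the token out of the notebook source, for example by reading it from a secret scope.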