# Hands-on Lab: Building an Agent System with Databricks

**Please use Serverless Compute with Environment Version 1.**

## Part 2 - Agent Evaluation
Now that we have created an agent, how do we evaluate its performance?
In Part 2, we will focus on evaluation by creating a product support agent.
This agent uses retrieval-augmented generation (RAG) over product documents to answer questions about the products.

### 2.1 Define New Agent and Retriever Tools
- [**agent.py**]($./agent.py): A sample agent is set up. First, review this file to understand its components.
- **Vector Search**: A vector search endpoint has been created to search for documents related to specific products.
- **Create Retriever Functions**: Define the properties of the retriever and package it as a tool that the LLM can call.

### 2.2 Create Evaluation Dataset
- A sample evaluation dataset is provided, but it is also possible to [generate one synthetically](https://www.databricks.com/jp/blog/streamline-ai-agent-evaluation-with-new-synthetic-data-capabilities).

### 2.3 Run mlflow.genai.evaluate()
- MLflow runs the agent against each request in the evaluation dataset.
- LLM judges score the outputs, and the results are summarized in an easy-to-read UI.

### 2.4 Make Necessary Improvements and Re-evaluate
- Use feedback from the evaluation results to adjust the agent, for example its prompt or retriever settings.
- Re-evaluate to confirm improvements!

In [0]:
%pip install -U -qqqq mlflow-skinny[databricks] langgraph==0.3.4 databricks-langchain databricks-agents uv
dbutils.library.restartPython()

In [0]:
%run ../config

## Agent Configuration via Environment Variables

Set parameters for `agent.py` from environment variables.

In [0]:
from databricks.sdk import WorkspaceClient
import os
import re

# Get information about the current user using the workspace client
w = WorkspaceClient()
user_email = w.current_user.me().emails[0].value
username = user_email.split('@')[0]
username = re.sub(r'[^a-zA-Z0-9_]', '_', username) # Replace special characters with underscores

# Specify schema
user_schema_name = f"agents_lab_{username}" # Per-user schema
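
To make the sanitization step concrete, the sketch below applies the same regular expression to a made-up email address. The helper name `schema_name_for` and the sample address are illustrative assumptions and are not part of the lab code.

```python
import re

def schema_name_for(email: str) -> str:
    """Derive a per-user schema name from an email address (hypothetical helper)."""
    local_part = email.split("@")[0]
    # Same pattern as in the cell above: anything outside [a-zA-Z0-9_] becomes "_"
    safe = re.sub(r"[^a-zA-Z0-9_]", "_", local_part)
    return f"agents_lab_{safe}"

print(schema_name_for("jane.doe+lab@example.com"))  # agents_lab_jane_doe_lab
```

Dots, plus signs, and hyphens all collapse to underscores, so two different addresses can map to the same schema name. That is unlikely in a small lab but worth knowing.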

In [0]:
# LLM endpoint name
os.environ["LLM_ENDPOINT_NAME"] = "databricks-claude-3-7-sonnet"

# UC function tools
os.environ["UC_TOOL_NAMES"] = f"{catalog_name}.{user_schema_name}.*"

# Vector Search name
os.environ["VS_NAME"] = f"{catalog_name}.{system_schema_name}.product_docs_index"

print("Environment variables set:")
print(f"LLM_ENDPOINT_NAME: {os.environ.get('LLM_ENDPOINT_NAME')}")
print(f"UC_TOOL_NAMES: {os.environ.get('UC_TOOL_NAMES')}")
print(f"VS_NAME: {os.environ.get('VS_NAME')}")
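
The contents of `agent.py` are not shown here, so the following is only a sketch of one plausible way a module could consume these three variables. The `AgentConfig` class, the `load_agent_config` helper, the comma-separated handling of `UC_TOOL_NAMES`, and the placeholder catalog and schema names are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping

REQUIRED_VARS = ("LLM_ENDPOINT_NAME", "UC_TOOL_NAMES", "VS_NAME")

@dataclass(frozen=True)
class AgentConfig:
    llm_endpoint_name: str
    uc_tool_names: List[str]
    vs_name: str

def load_agent_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read agent settings from a mapping, failing early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return AgentConfig(
        llm_endpoint_name=env["LLM_ENDPOINT_NAME"],
        # Assumption: several tool patterns could be passed comma-separated
        uc_tool_names=[n.strip() for n in env["UC_TOOL_NAMES"].split(",")],
        vs_name=env["VS_NAME"],
    )

# Example with placeholder values (catalog and schema names are made up)
config = load_agent_config({
    "LLM_ENDPOINT_NAME": "databricks-claude-3-7-sonnet",
    "UC_TOOL_NAMES": "my_catalog.agents_lab_jane.*",
    "VS_NAME": "my_catalog.my_schema.product_docs_index",
})
print(config.uc_tool_names)
```

Failing early on a missing variable tends to give a clearer error than a later failure inside the agent, which matters once the same code runs on a serving endpoint.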

In [0]:
from agent import AGENT

AGENT.predict({"messages": [{"role": "user", "content": "Please provide troubleshooting tips for the Soundwave X5 Pro headphones."}]})

In [0]:
AGENT.predict({"messages": [{"role": "user", "content": "What is today's date?"}]})

In [0]:
from IPython.display import Image, display

# Visualize the agent's graph structure
display(Image(AGENT.agent.get_graph().draw_mermaid_png()))

## Log `agent` as an MLflow Model
Log the agent as code from the [agent.py]($./agent.py) file. For more details, refer to [MLflow - Models from Code](https://mlflow.org/docs/latest/models.html#models-from-code).

In [0]:
# Determine Databricks resources for automatic authentication passthrough at deployment
import mlflow
from agent import tools, LLM_ENDPOINT_NAME
from databricks_langchain import VectorSearchRetrieverTool
from mlflow.models.resources import DatabricksFunction, DatabricksServingEndpoint
from unitycatalog.ai.langchain.toolkit import UnityCatalogTool

resources = [DatabricksServingEndpoint(endpoint_name=LLM_ENDPOINT_NAME)]
for tool in tools:
    if isinstance(tool, VectorSearchRetrieverTool):
        resources.extend(tool.resources)
    elif isinstance(tool, UnityCatalogTool):
        resources.append(DatabricksFunction(function_name=tool.uc_function_name))

input_example = {
    "messages": [
        {
            "role": "user",
            "content": "What color options are available for the Aria Modern Bookshelf?"
        }
    ]
}

with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        name="agent",
        python_model="agent.py",
        input_example=input_example,
        resources=resources,
        extra_pip_requirements=[
            "databricks-connect"
        ]
    )

In [0]:
# Load the model and create a prediction function
logged_model_uri = f"runs:/{logged_agent_info.run_id}/agent"
loaded_model = mlflow.pyfunc.load_model(logged_model_uri)

def predict_wrapper(query):
    # Format input for chat-style model
    model_input = {
        "messages": [{"role": "user", "content": query}]
    }
    response = loaded_model.predict(model_input)
    
    messages = response['messages']
    return messages[-1]['content']
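
The wrapper's input and output shapes can be checked without a deployed model. The sketch below swaps in a stub object; `StubModel` and `make_predict_fn` are hypothetical names, and the canned transcript is invented for illustration.

```python
class StubModel:
    """Stand-in for the loaded MLflow model; returns a fixed chat transcript."""

    def predict(self, model_input):
        question = model_input["messages"][-1]["content"]
        return {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": "(intermediate tool call)"},
                {"role": "assistant", "content": f"Answer to: {question}"},
            ]
        }

def make_predict_fn(model):
    """Build a query -> final answer function around any object with .predict()."""
    def predict_fn(query):
        response = model.predict({"messages": [{"role": "user", "content": query}]})
        # Only the last message is returned; intermediate steps are dropped
        return response["messages"][-1]["content"]
    return predict_fn

predict_fn = make_predict_fn(StubModel())
print(predict_fn("What sizes are available?"))  # Answer to: What sizes are available?
```

The parameter name `query` matters because the evaluation records later pass their inputs by keyword, so the key under `inputs` has to match it.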

## Evaluate the Agent with [Agent Evaluation](https://docs.databricks.com/aws/ja/generative-ai/agent-evaluation)

Edit the requests and expected responses in the evaluation dataset, run the evaluation iteratively, and track the calculated quality metrics using MLflow.

In [0]:
import pandas as pd

data = {
    "request": [
        "What color options are available for the Aria Modern Bookshelf?",
        "How can I clean the Aurora Oak Coffee Table without damaging it?",
        "How should I clean the BlendMaster Elite 4000 after use?",
        "What color variations are available for the Flexi-Comfort Office Desk?",
        "What sizes are available for the StormShield Pro Men's Waterproof Jacket?"
    ],
    "expected_facts": [
        [
            "The Aria Modern Bookshelf is available in natural oak finish.",
            "The Aria Modern Bookshelf is available in black finish.",
            "The Aria Modern Bookshelf is available in white finish."
        ],
        [
            "Wipe with a soft, slightly damp cloth.",
            "Do not use abrasive cleaners."
        ],
        [
            "Rinse the jar of the BlendMaster Elite 4000.",
            "Rinse with lukewarm water.",
            "Clean after each use."
        ],
        [
            "The Flexi-Comfort Office Desk is available in 3 colors."
        ],
        [
            "The StormShield Pro Men's Waterproof Jacket is available in sizes S, M, L, XL, XXL."
        ]
    ]
}

eval_dataset = pd.DataFrame(data)
display(eval_dataset)
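
When editing the dataset by hand, it is easy to add a request without its facts. The following sketch is a small sanity check on the dictionary layout used above; `validate_eval_data` is a hypothetical helper and is not part of MLflow or the lab.

```python
def validate_eval_data(data: dict) -> list:
    """Return a list of problems found in a {'request': [...], 'expected_facts': [...]} dict."""
    problems = []
    requests = data.get("request", [])
    facts = data.get("expected_facts", [])
    if len(requests) != len(facts):
        problems.append(f"{len(requests)} requests but {len(facts)} fact lists")
    for i, (request, fact_list) in enumerate(zip(requests, facts)):
        if not request.strip():
            problems.append(f"row {i}: empty request")
        if not fact_list:
            problems.append(f"row {i}: no expected facts")
    if len(set(requests)) != len(requests):
        problems.append("duplicate requests")
    return problems

sample = {
    "request": ["What sizes are available?", "How do I clean it?"],
    "expected_facts": [["Sizes S, M, L."]],  # second fact list is missing
}
print(validate_eval_data(sample))
```

A check like this is useful because pairing the two lists with `zip` silently drops unmatched entries instead of raising an error.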

Define the [scorers](https://docs.databricks.com/gcp/ja/mlflow3/genai/eval-monitor/custom-judge/meets-guidelines) that the LLM judge will apply.

In [0]:
from mlflow.genai.scorers import Guidelines, Safety
import mlflow.genai

# Create evaluation dataset
eval_data = []
for request, facts in zip(data["request"], data["expected_facts"]):
    eval_data.append({
        "inputs": {
            "query": request  # Match the function argument
        },
        "expected_response": "\n".join(facts)
    })

# Define evaluation scorers specialized for product information.
# Each Guidelines scorer is a natural-language rubric that the LLM judge applies to a response.
scorers = [
    Guidelines(
        guidelines="""Responses must include all expected facts:
        - List all colors and sizes if applicable (partial lists are not allowed)
        - Provide exact specifications if applicable (e.g., '5 ATM'; vague expressions are not allowed)
        - If asked about cleaning procedures, include all steps
        If any fact is missing or incorrect, the response fails.""",
        name="completeness_and_accuracy",
    ),
    Guidelines(
        guidelines="""Responses must be clear and direct:
        - Answer the question precisely
        - List options in list format, steps in step format
        - No marketing language or unnecessary background
        - Be concise and complete.""",
        name="relevance_and_structure",
    ),
    Guidelines(
        guidelines="""Responses must not deviate from the topic:
        - Only answer about the product asked
        - Do not add fictional features or colors
        - Do not include general advice
        - Use the product name exactly as stated in the request.""",
        name="product_specificity",
    ),
]

Run the evaluation.

In [0]:
print("Running evaluation...")
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_wrapper, 
        scorers=scorers
    )
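
Besides reading the MLflow UI, it can help to reduce the verdicts to one pass rate per scorer. The sketch below assumes the per-row verdicts have already been exported as dictionaries mapping scorer name to `"yes"` or `"no"`; the actual layout of the evaluation result object is not shown here, and the sample verdicts are made up.

```python
from collections import defaultdict

def pass_rates(rows):
    """Fraction of 'yes' verdicts per scorer across all evaluated rows."""
    totals, passes = defaultdict(int), defaultdict(int)
    for row in rows:
        for scorer_name, verdict in row.items():
            totals[scorer_name] += 1
            passes[scorer_name] += int(verdict == "yes")
    return {name: passes[name] / totals[name] for name in totals}

# Made-up verdicts for three requests, purely for illustration
rows = [
    {"completeness_and_accuracy": "yes", "relevance_and_structure": "no"},
    {"completeness_and_accuracy": "yes", "relevance_and_structure": "yes"},
    {"completeness_and_accuracy": "no", "relevance_and_structure": "no"},
]
for name, rate in pass_rates(rows).items():
    print(f"{name}: {rate:.0%}")
```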

## Return to the [agent.py]($./agent.py) file and modify the prompt to reduce marketing exaggerations.

In [0]:
with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        name="agent",
        python_model="agent.py",
        input_example=input_example,
        resources=resources,
        extra_pip_requirements=[
            "databricks-connect"
        ]
    )

# Load the model and create a prediction function
logged_model_uri = f"runs:/{logged_agent_info.run_id}/agent"
loaded_model = mlflow.pyfunc.load_model(logged_model_uri)

def predict_wrapper(query):
    # Format input for chat-style model
    model_input = {
        "messages": [{"role": "user", "content": query}]
    }
    response = loaded_model.predict(model_input)
    
    messages = response['messages']
    return messages[-1]['content']
  
print("Running evaluation...")
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_wrapper, 
        scorers=scorers,
    )
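
To confirm that a prompt change actually helped, compare the two runs scorer by scorer instead of relying on an overall impression. This is a minimal sketch with invented pass rates; `compare_runs` is a hypothetical helper, and real values would come from the two evaluation runs.

```python
def compare_runs(baseline: dict, revised: dict) -> dict:
    """Per-scorer change in pass rate (revised minus baseline) for scorers present in both."""
    return {
        name: round(revised[name] - baseline[name], 3)
        for name in baseline
        if name in revised
    }

# Invented pass rates, for illustration only
baseline = {"completeness_and_accuracy": 0.6, "relevance_and_structure": 0.4}
revised = {"completeness_and_accuracy": 0.6, "relevance_and_structure": 0.8}

for name, delta in compare_runs(baseline, revised).items():
    status = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{name}: {delta:+.2f} ({status})")
```

Looking at each scorer separately makes regressions visible: a prompt that removes marketing language could, for instance, also make answers less complete.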

## Register the Model in Unity Catalog

The MLflow model is registered in Unity Catalog under a three-level name built from `catalog_name`, `user_schema_name`, and `model_name`. Update these values below if needed.

In [0]:
mlflow.set_registry_uri("databricks-uc")

# Define catalog, schema, and model name for UC model
model_name = "product_agent"
UC_MODEL_NAME = f"{catalog_name}.{user_schema_name}.{model_name}"

# Register the model in UC
uc_registered_model_info = mlflow.register_model(model_uri=logged_agent_info.model_uri, name=UC_MODEL_NAME)
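
A Unity Catalog model name has three levels: catalog, schema, and model. The sketch below builds such a name and rejects parts that contain unexpected characters. The character restriction is a conservative convention chosen for this example and is not Unity Catalog's full naming rule; `uc_model_name` is a hypothetical helper.

```python
import re

# Conservative assumption: letters, digits, and underscores only
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")

def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Join catalog, schema, and model into a three-level name after a basic check."""
    for label, part in (("catalog", catalog), ("schema", schema), ("model", model)):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"Invalid {label} name: {part!r}")
    return f"{catalog}.{schema}.{model}"

print(uc_model_name("my_catalog", "agents_lab_jane_doe", "product_agent"))
# my_catalog.agents_lab_jane_doe.product_agent
```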

Access the model version and check the agent's lineage in the **Dependencies** tab.

In [0]:
from IPython.display import display, HTML

# Get the Databricks host URL
workspace_url = spark.conf.get('spark.databricks.workspaceUrl')

# Create an HTML link to the created agent
html_link = f'<a href="https://{workspace_url}/explore/data/models/{catalog_name}/{user_schema_name}/{model_name}" target="_blank">View registered agent in Unity Catalog</a>'
display(HTML(html_link))

## Agent Deployment

Pass the environment variables set earlier to the deployment so that the served agent sees the same configuration, then deploy the agent to a model serving endpoint.

In [0]:
from databricks import agents

# Define environment variables as a dictionary
environment_vars = {
    "LLM_ENDPOINT_NAME": os.environ["LLM_ENDPOINT_NAME"],
    "UC_TOOL_NAMES": os.environ["UC_TOOL_NAMES"],
    "VS_NAME": os.environ["VS_NAME"],
}

# Deploy the model to the review app and model serving endpoint
endpoint_info = agents.deploy(
    UC_MODEL_NAME,
    uc_registered_model_info.version,
    tags={"endpointSource": "Agent Lab"},
    environment_vars=environment_vars,
    timeout=900,  # Extended to 15 minutes
)

Run the following cell and access the displayed link. Once the endpoint is **Ready**, select **Use > Try in Playground** in the top right corner to test it in the Playground.

Sample query: `What color variations are available for the Flexi-Comfort Office Desk?`

In [0]:
from IPython.display import display, HTML

serving_endpoint_url = f"https://{workspace_url}/ml/endpoints/{endpoint_info.endpoint_name}"
html_endpoint_link = f'<a href="{serving_endpoint_url}" target="_blank">View Serving Endpoint</a>'
display(HTML(html_endpoint_link))
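
Instead of refreshing the endpoint page by hand, readiness can also be polled in code. The sketch below is a generic polling loop and not a Databricks API: `get_state` is a hypothetical callable that you would supply, and the `"READY"` and `"NOT_READY"` strings are assumed state names.

```python
import time

def wait_until_ready(get_state, timeout_s=900.0, poll_interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns 'READY' or the timeout elapses.

    Returns True if the ready state was observed, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "READY":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)

# Simulated endpoint that becomes ready on the third check
states = iter(["NOT_READY", "NOT_READY", "READY"])
print(wait_until_ready(lambda: next(states), timeout_s=5, poll_interval_s=0))  # True
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. In practice the defaults are fine.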

### Delete the Serving Endpoint

Since serving endpoints incur charges, delete the endpoint once the hands-on lab is complete.

In [0]:
# Delete agent deployment
from databricks import agents

agents.delete_deployment(
    model_name=UC_MODEL_NAME,
    model_version=uc_registered_model_info.version
)

## Future Directions

- Further [evaluation](https://docs.databricks.com/aws/ja/mlflow3/genai/getting-started/eval) and [monitoring](https://docs.databricks.com/aws/ja/mlflow3/genai/eval-monitor/) for improvements
- Integration with [apps](https://docs.databricks.com/aws/ja/dev-tools/databricks-apps/get-started)