# Evaluate the bot

Run this notebook on the serverless cluster.

Select Serverless environment v3 from the settings button in the right-side menu.

![Serverless env](https://raw.githubusercontent.com/brickops/databricks-botops-course/refs/heads/main/images/serverless_env_wide.png)

## Evaluate the agent with Agent Evaluation

Use Mosaic AI Agent Evaluation to evaluate the agent's responses based on expected responses and other evaluation criteria. Use the evaluation criteria you specify to guide iterations, using MLflow to track the computed quality metrics.
See the Databricks documentation ([AWS](https://docs.databricks.com/aws/generative-ai/agent-evaluation) | [Azure](https://learn.microsoft.com/azure/databricks/generative-ai/agent-evaluation/)).


To evaluate your tool calls, add custom metrics. See Databricks documentation ([AWS](https://docs.databricks.com/en/generative-ai/agent-evaluation/custom-metrics.html#evaluating-tool-calls) | [Azure](https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-evaluation/custom-metrics#evaluating-tool-calls)).

## Install dependencies

In [0]:
%run ../../../libs/botops/buildsetup_serverless

### Set up the eval dataset

This is often known as the golden dataset.

In [0]:
import pandas as pd

eval_examples = [
    {
        "request": {"messages": [{"role": "user", "content": "Which borough has most taxi trips?"}]},
        "expected_response": None,
    },
    {
        "request": {"messages": [{"role": "user", "content": "Which borough has least taxi trips?"}]},
        "expected_response": "Staten Island",
    }
]

eval_dataset = pd.DataFrame(eval_examples)
display(eval_dataset)
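
Before running the evaluation, it can help to confirm that every example has the shape used above. The following is a minimal sketch in plain Python. `validate_example` is a hypothetical helper, not part of Agent Evaluation, and it only checks the fields that appear in this notebook (`request.messages` with `role` and `content`, plus an optional `expected_response`).

```python
def validate_example(example):
    """Return a list of problems found in one evaluation example (empty if none)."""
    if not isinstance(example, dict):
        return ["example must be a dict"]
    request = example.get("request")
    if not isinstance(request, dict):
        return ["'request' must be a dict"]
    messages = request.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'request.messages' must be a non-empty list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} must be a dict")
            continue
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f"message {i} is missing a string '{key}'")

    # expected_response is optional in this notebook: either a string or None.
    expected = example.get("expected_response")
    if expected is not None and not isinstance(expected, str):
        problems.append("'expected_response' must be a string or None")
    return problems


# Illustrative data mirroring the examples in this notebook.
sample_examples = [
    {
        "request": {"messages": [{"role": "user", "content": "Which borough has most taxi trips?"}]},
        "expected_response": None,
    },
    {
        "request": {"messages": [{"role": "user", "content": "Which borough has least taxi trips?"}]},
        "expected_response": "Staten Island",
    },
]

for index, example in enumerate(sample_examples):
    print(index, validate_example(example))
```

An empty list for every row means the examples match the shape this notebook relies on.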

#### Set the bot name

Set the bot name to the name of the current folder.

In [0]:
BOT_NAME = folder()


#### Task: Set the correct tripbot run id

Replace the run ID below with the ID of the run where you logged the model.

In [0]:
TRIPBOT_RUN_ID = "b7027f2a2e824a768e49ef26d960c372"
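
A pasted run ID is easy to truncate or to confuse with a full `runs:/...` URI. The following is a minimal sanity check, assuming run IDs look like the one above: 32 lowercase hexadecimal characters. `looks_like_run_id` is a hypothetical helper, not an MLflow function, and it does not verify that the run actually exists.

```python
import re

# Assumption: a run ID is 32 lowercase hexadecimal characters,
# like the example ID used in this notebook.
RUN_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def looks_like_run_id(value):
    """Return True if value has the shape of a run ID (format check only)."""
    return isinstance(value, str) and RUN_ID_PATTERN.fullmatch(value) is not None


candidate_run_id = "b7027f2a2e824a768e49ef26d960c372"
if not looks_like_run_id(candidate_run_id):
    raise ValueError(f"Unexpected run ID format: {candidate_run_id!r}")
```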

In [0]:
import mlflow

# with mlflow.start_run(run_id=logged_agent_info.run_id):
with mlflow.start_run(run_id=TRIPBOT_RUN_ID):
    eval_results = mlflow.evaluate(
        f"runs:/{TRIPBOT_RUN_ID}/{BOT_NAME}",
        data=eval_dataset,  # Your evaluation dataset
        model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation
    )

In [0]:
# Aggregate quality metrics computed by the evaluation
eval_results.metrics

## Investigate the eval results

Press the `View evaluation results` button in the output above.

![View evaluation results](https://github.com/paalvibe/databricks-botops-course/blob/main/images/view_evaluation_results.png?raw=true)

Study the Evaluation results.

Why is there only a *correctness* result for the question about "Which borough has the least taxi trips"? Write your reply below.
Start the cell with `%md` to make it a markdown cell.

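Because correctness is only computed for examples that have an `expected_response`. The first question has none, since most of the trips in the data do not have a borough (i.e. it is null).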
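
The link between `expected_response` and the correctness result can be checked against the dataset itself. The following is a minimal sketch, assuming correctness is only judged for rows whose `expected_response` is not null. `split_by_expected_response` is a hypothetical helper, and the data below simply mirrors the eval examples defined earlier.

```python
def split_by_expected_response(examples):
    """Split questions into those with and without an expected response."""
    with_expected, without_expected = [], []
    for example in examples:
        # Use the last message in the request as the question text.
        question = example["request"]["messages"][-1]["content"]
        if example.get("expected_response") is None:
            without_expected.append(question)
        else:
            with_expected.append(question)
    return with_expected, without_expected


# Illustrative copy of the eval examples from this notebook.
golden_examples = [
    {
        "request": {"messages": [{"role": "user", "content": "Which borough has most taxi trips?"}]},
        "expected_response": None,
    },
    {
        "request": {"messages": [{"role": "user", "content": "Which borough has least taxi trips?"}]},
        "expected_response": "Staten Island",
    },
]

with_expected, without_expected = split_by_expected_response(golden_examples)
print("Has expected response:", with_expected)
print("No expected response:", without_expected)
```

Only the questions in the first list have a reference answer to compare against.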

What does the Traces tab contain? Write your reply below.

The traces of the actual evaluation calls, i.e. the agent invocations made for each request.

What does the Artifacts tab contain? Write your reply below.