# Turn the Review App logs into an Evaluation Set

The Review App captures feedback from your users.

This feedback is saved in two tables within your schema.

In this notebook, we show how to extract the Review App logs into an Evaluation Set. It is important to review each row and make sure the data quality is high, e.g., the question is logical and the response makes sense.

Each type of request is turned into an evaluation row as follows:

1. Requests with a 👍:
    - `request`: As entered by the user
    - `expected_response`: If the user edited the response, the edited version is used; otherwise, the model's generated response.
2. Requests with a 👎:
    - `request`: As entered by the user
    - `expected_response`: If the user edited the response, the edited version is used; otherwise, null.
3. Requests without any feedback:
    - `request`: As entered by the user

Across all types of requests, if the user gave a 👍 to a chunk from the `retrieved_context`, the `doc_uri` of that chunk is included in `expected_retrieved_context` for the question.
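
The rules above can be sketched in plain Python. This is only an illustration: the function name, its arguments, and the shape of a chunk (`{"doc_uri": ..., "rating": ...}`) are assumptions made for this example, not the Review App's actual log schema.

```python
from typing import Dict, List, Optional


def build_eval_row(
    request: str,
    response: str,
    rating: Optional[str],            # "positive", "negative", or None (hypothetical encoding)
    suggested_output: Optional[str],  # the user's edited response, if any
    retrieved_context: List[Dict],    # hypothetical chunk shape: {"doc_uri": ..., "rating": ...}
) -> Dict:
    """Apply the three feedback rules to one logged request."""
    edited = suggested_output or None  # treat an empty string like a missing edit
    row = {"request": request}
    if rating == "positive":
        # Thumbs up: prefer the edited response, else the model's response
        row["expected_response"] = edited or response
    elif rating == "negative":
        # Thumbs down: only an edited response is trusted, else null
        row["expected_response"] = edited
    # No feedback: only the request is kept

    # Chunks with a thumbs up become the expected retrieved context
    liked = [c["doc_uri"] for c in retrieved_context if c.get("rating") == "positive"]
    if liked:
        row["expected_retrieved_context"] = [{"doc_uri": uri} for uri in liked]
    return row


example = build_eval_row(
    "What is Spark?",
    "Spark is an engine.",
    "positive",
    "",
    [{"doc_uri": "a.html", "rating": "positive"}, {"doc_uri": "b.html", "rating": "negative"}],
)
print(example)
```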


In [0]:
%pip install --quiet -U databricks-agents mlflow mlflow-skinny databricks-sdk==0.23.0
dbutils.library.restartPython()

In [None]:
%run "./0. Init"


## 1.1/ Extracting the logs 


*Note: for now, this part requires a few SQL queries, provided in this notebook, to properly format the Review App logs into an evaluation dataset.*

*We'll update this notebook soon with a simpler version - stay tuned!*


In [0]:
from databricks import agents
import mlflow

browser_url = mlflow.utils.databricks_utils.get_browser_hostname()

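# Get the name of the Inference Tables where logs are stored
active_deployments = agents.list_deployments()
active_deployment = next((item for item in active_deployments if item.model_name == MODEL_NAME_FQN), None)
# Fail early with a clear message instead of an AttributeError in the next cell
if active_deployment is None:
    raise ValueError(f"No active deployment found for model {MODEL_NAME_FQN}")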

In [0]:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
print(active_deployment)
endpoint = w.serving_endpoints.get(active_deployment.endpoint_name)

# Right after a deployment the config may still be pending, so fall back to pending_config
try:
    endpoint_config = endpoint.config.auto_capture_config
except AttributeError:
    endpoint_config = endpoint.pending_config.auto_capture_config

inference_table_name = endpoint_config.state.payload_table.name
inference_table_catalog = endpoint_config.catalog_name
inference_table_schema = endpoint_config.schema_name

# Cleanly formatted tables
assessment_table = f"{inference_table_catalog}.{inference_table_schema}.`{inference_table_name}_assessment_logs`"
request_table = f"{inference_table_catalog}.{inference_table_schema}.`{inference_table_name}_request_logs`"

# Note: you might have to wait a bit for the tables to be ready
print(f"Request logs: {request_table}")
requests_df = spark.table(request_table)
print(f"Assessment logs: {assessment_table}")
# Temporary helper to extract the table - see _resources/00-init-advanced
assessment_df = deduplicate_assessments_table(assessment_table)
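
The `deduplicate_assessments_table` helper is defined in the init notebook and is not shown here. As a rough illustration only, one plausible reading of "deduplicate" is to keep the most recent assessment for each request and reviewer. The sketch below uses plain Python dictionaries with hypothetical field names (`request_id`, `source_id`, `timestamp`); the real helper works on the Spark table and may behave differently.

```python
def latest_assessment_per_reviewer(assessments):
    """Keep only the newest assessment for each (request_id, reviewer) pair.

    Field names are hypothetical and chosen for this example only.
    """
    latest = {}
    for a in assessments:
        key = (a["request_id"], a["source_id"])
        # Replace the stored entry only when this one is more recent
        if key not in latest or a["timestamp"] > latest[key]["timestamp"]:
            latest[key] = a
    return list(latest.values())


logs = [
    {"request_id": "r1", "source_id": "u1", "timestamp": 1, "rating": "negative"},
    {"request_id": "r1", "source_id": "u1", "timestamp": 2, "rating": "positive"},
    {"request_id": "r2", "source_id": "u1", "timestamp": 1, "rating": "positive"},
]
deduplicated = latest_assessment_per_reviewer(logs)
print(deduplicated)
```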

In [0]:
requests_with_feedback_df = requests_df.join(assessment_df, requests_df.databricks_request_id == assessment_df.request_id, "left")
display(requests_with_feedback_df.select("request_raw", "trace", "source", "text_assessment", "retrieval_assessments"))

In [0]:

requests_with_feedback_df.createOrReplaceTempView('latest_assessments')
eval_dataset = spark.sql(f"""
-- Thumbs up.  Use the model's generated response as the expected_response
select
  a.request_id,
  r.request,
  r.response as expected_response,
  'thumbs_up' as type,
  a.source.id as user_id
from
  latest_assessments as a
  join {request_table} as r on a.request_id = r.databricks_request_id
where
  a.text_assessment.ratings ["answer_correct"].value = "positive"
union all
  --Thumbs down.  If edited, use that as the expected_response.
select
  a.request_id,
  r.request,
  IF(
    a.text_assessment.suggested_output != "",
    a.text_assessment.suggested_output,
    NULL
  ) as expected_response,
  'thumbs_down' as type,
  a.source.id as user_id
from
  latest_assessments as a
  join {request_table} as r on a.request_id = r.databricks_request_id
where
  a.text_assessment.ratings ["answer_correct"].value = "negative"
union all
  -- No feedback.  Include the request, but no expected_response
select
  a.request_id,
  r.request,
  IF(
    a.text_assessment.suggested_output != "",
    a.text_assessment.suggested_output,
    NULL
  ) as expected_response,
  'no_feedback_provided' as type,
  a.source.id as user_id
from
  latest_assessments as a
  join {request_table} as r on a.request_id = r.databricks_request_id
where
  -- A missing rating is NULL, and NULL != "..." is never true, so check it explicitly
  a.text_assessment.ratings ["answer_correct"].value is null
  or a.text_assessment.ratings ["answer_correct"].value not in ("negative", "positive")
  """)
display(eval_dataset)
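
One subtlety of the "no feedback" branch is worth illustrating: in SQL, comparing `NULL` with `!=` yields `NULL` rather than true, so a request whose rating is missing is not matched by an inequality filter alone. The sketch below demonstrates this with Python's built-in `sqlite3` on a toy table (the table and column names are made up for the example); Spark SQL applies the same three-valued logic to comparisons with `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assessments (request_id TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO assessments VALUES (?, ?)",
    [("r1", "positive"), ("r2", "negative"), ("r3", None)],
)

# Inequality only: NULL != 'positive' evaluates to NULL, so r3 is filtered out
inequality_only = conn.execute(
    "SELECT request_id FROM assessments "
    "WHERE rating != 'negative' AND rating != 'positive'"
).fetchall()

# An explicit IS NULL check keeps the request that has no rating
with_null_check = conn.execute(
    "SELECT request_id FROM assessments "
    "WHERE rating IS NULL OR rating NOT IN ('negative', 'positive')"
).fetchall()

print(inequality_only)  # []
print(with_null_check)  # [('r3',)]
```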

## 1.2/ Our eval dataset is now ready!

The Review App makes it easy to build your evaluation dataset.

*Note: the Review App logs may take some time to become available. If the dataset is empty, wait a bit and run the cells above again.*

To simplify the demo, so that you don't have to craft your own eval dataset, we saved a pre-generated, ready-to-use one for you. We'll use it for the rest of the demo.
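
Whichever evaluation set you use, it is worth checking each row before running an evaluation. Below is a minimal sketch of such a check on rows represented as Python dictionaries. The function name and the specific checks are assumptions for illustration, and they do not replace a manual review of whether each question and answer actually makes sense.

```python
def check_eval_rows(rows):
    """Return a list of human-readable problems found in the evaluation rows."""
    problems = []
    for i, row in enumerate(rows):
        request = row.get("request")
        if not isinstance(request, str) or not request.strip():
            problems.append(f"row {i}: missing or empty request")
        # expected_response may be absent or null, but should not be a blank string
        expected = row.get("expected_response")
        if expected is not None and not str(expected).strip():
            problems.append(f"row {i}: expected_response is blank")
        # Each expected context entry should point at a document
        for chunk in row.get("expected_retrieved_context") or []:
            if not chunk.get("doc_uri"):
                problems.append(f"row {i}: context entry without doc_uri")
    return problems


rows = [
    {"request": "How do I create a cluster?", "expected_response": "Use the Compute page."},
    {"request": "  ", "expected_response": ""},
    {"request": "q", "expected_retrieved_context": [{"doc_uri": ""}]},
]
problems = check_eval_rows(rows)
print(problems)
```

For a small Spark DataFrame, the rows could be obtained with something like `[r.asDict(recursive=True) for r in eval_dataset.collect()]`.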

In [0]:
eval_dataset = spark.table("hackathon_eval_set").limit(10)
display(eval_dataset)

## Load the correct Python environment for the model


In [0]:
# Retrieve the model we want to evaluate
model = get_latest_model(MODEL_NAME_FQN)
# Local path to the requirements file listing the model's dependencies
pip_requirements = mlflow.pyfunc.get_model_dependencies(f"runs:/{model.run_id}/chain")

## Run our evaluation from the dataset

In [0]:
with mlflow.start_run(run_name="hackathon_eval_dataset"):
    # Evaluate the logged model
    eval_results = mlflow.evaluate(
        data=eval_dataset,
        model=f'runs:/{model.run_id}/chain',
        model_type="databricks-agent",
    )

### This is looking good, let's tag our model as production-ready

After reviewing the model's correctness, and ideally comparing its behavior to previous versions, we can flag our model as ready to be deployed.

*Note: Evaluation can be automated as part of an MLOps step: once you deploy a new Chatbot version with a new prompt, run the evaluation job and benchmark your model's behavior against the previous version.*
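
As a rough sketch of what such a benchmark step could look like, the function below compares two dictionaries of metrics and reports the ones that got worse. The function name, the metric names, and the tolerance are hypothetical, and the sketch assumes that higher is better for every metric. In the notebook, the two dictionaries could come from `eval_results.metrics` of two evaluation runs.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return metrics where the candidate scores lower than the baseline
    by more than `tolerance` (assumes higher is better for every metric)."""
    regressions = {}
    for name, old in baseline.items():
        new = candidate.get(name)
        # Metrics missing from the candidate are skipped rather than flagged
        if new is not None and old - new > tolerance:
            regressions[name] = (old, new)
    return regressions


# Hypothetical metric names and values, for illustration only
baseline = {"answer_correct_rate": 0.80, "groundedness_rate": 0.90}
candidate = {"answer_correct_rate": 0.85, "groundedness_rate": 0.70}
regressions = find_regressions(baseline, candidate)
print(regressions)
```

An automated job could then refuse to move the `prod` alias whenever the returned dictionary is not empty.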

In [0]:
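# Uncomment to point the "prod" alias at the evaluated model version
# from mlflow import MlflowClient
# client = MlflowClient()
# client.set_registered_model_alias(name=MODEL_NAME_FQN, alias="prod", version=model.version)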