# Groundedness Evaluation for LLM Responses

This notebook is a basic example of evaluating how well grounded LLM responses (specifically from a Dialogflow CX agent) are, given a test set of question/answer pairs. Each question is sent to the agent as the test input, and the generated response is compared against the answer as ground truth. The result is a boolean value plus an explanation of the evaluation.

# SCRAPI Installation & GCP Setup

Installs the SCRAPI library for scripting Dialogflow CX and sets up a connection to the needed projects and agents.

In [None]:
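import importlib.util

# (import name, pip name) pairs for the packages to install
packages = [
    ("dfcx_scrapi", "dfcx-scrapi"),
    ("google.auth", "google-auth"),
    ("google.cloud.aiplatform", "google-cloud-aiplatform"),
    ("pandas", "pandas"),
    ("tqdm", "tqdm"),
]


def is_installed(import_name: str) -> bool:
    # find_spec raises ModuleNotFoundError when a parent package is missing
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        return False


# Install dependencies
install = False
for import_name, pip_name in packages:
    if not is_installed(import_name):
        print(f"Installing {pip_name}...")
        install = True
        %pip install {pip_name} -U -q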

# Restart kernel if needed
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

print("Done installing.")

# Authentication

There are a few different methods for authenticating this notebook.


### Local Auth

To run this locally, make sure you have the Google Cloud SDK installed and Application Default Credentials configured. No additional authorization should be necessary.
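
As a quick local sanity check, the sketch below looks for a file-based Application Default Credentials source using only the standard library. The helper name `find_adc_file` is hypothetical, and the sketch assumes the default gcloud configuration directory; it does not detect credentials provided by a metadata server (for example on Compute Engine).

```python
import os
import sys
from pathlib import Path
from typing import Optional


def find_adc_file() -> Optional[Path]:
    """Return the credentials file ADC would most likely read, or None.

    Only file-based credentials are checked (hypothetical helper).
    """
    # 1. An explicit path in GOOGLE_APPLICATION_CREDENTIALS takes priority.
    explicit = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None

    # 2. Otherwise look for the file written by
    #    `gcloud auth application-default login` (default config dir assumed).
    if sys.platform == "win32":
        config_dir = Path(os.environ.get("APPDATA", "")) / "gcloud"
    else:
        config_dir = Path.home() / ".config" / "gcloud"
    well_known = config_dir / "application_default_credentials.json"
    return well_known if well_known.is_file() else None


adc_file = find_adc_file()
if adc_file:
    print(f"ADC file: {adc_file}")
else:
    print("No ADC file found; try `gcloud auth application-default login`.")
```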

### Auth in Google Colab

To authenticate as the current Google user in a Colab notebook, run this cell:

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab.auth import authenticate_user
    authenticate_user()

### Alternate User Login

Use the cell below instead to authenticate manually as a different Google user (service account impersonation to come).
- Run it and follow the instructions: open the link and sign in with your *target* GCP account.
- After you click through to allow access, you will get a code; copy it back into the field printed by the cell below.
- You may see a warning, but it is safe to ignore.

In [None]:
# import os
# os.environ["PROJECT_ID"] = SOURCE_PROJECT_NAME

# # Interactive prompt to auth in browser
# !gcloud auth application-default login --no-launch-browser

# # Set the quota project to the `PROJECT_ID` above
# !gcloud auth application-default set-quota-project $PROJECT_ID

# Define Evaluator Class

No code is run here; this cell only defines the class. A usage example follows below.

In [232]:
import time
import vertexai
from vertexai.generative_models import GenerativeModel
import pandas as pd
from typing import Any, Optional
from tqdm import tqdm
from dfcx_scrapi.core.sessions import Sessions
import proto
import json
from google.auth import default
import re

class BotEvaluator:
    """
    A class for evaluating a chatbot's performance.

    Args:
        project_id (str): The ID of the project.
        vertexai_region (str): The region of the Vertex AI instance.
        agent_id (str): The full ID of the agent.
            Format: `projects/<PROJECT_ID>/locations/<REGION>/agents/<AGENT_ID>`.
        sessions_client (Optional[Sessions]): The sessions client. Defaults to None.

    Attributes:
        agent_id (str): The ID of the agent.
        sessions_client (Sessions): The DFCX sessions client.

    Methods:
        evaluate: Evaluates the chatbot's performance on a dataset.
        _run_test: Runs a test on the chatbot using a dataset.
        _get_response: Retrieves the chatbot's response for a given query.
        format_response_messages: Formats the response messages received from the chatbot.
        check_response_groundedness: Checks the groundedness of the chatbot's response.

    """
    def __init__(
        self,
        project_id: str,
        vertexai_region: str,
        agent_id: str,
        sessions_client: Optional[Sessions] = None,
    ) -> None:

        self.agent_id = agent_id
        self.sessions_client = sessions_client or Sessions()

        vertexai.init(
            project=project_id,
            location=vertexai_region,
            credentials=default()[0],
        )

    def evaluate(
        self,
        dataset: Optional[pd.DataFrame] = None,
        input_columns: Optional[list[str]] = None,
        check_groundedness: bool = False,
        ground_truth_column: Optional[str] = None,
        groundedness_model: Optional[str] = None,
    ) -> pd.DataFrame:
        """
        Evaluates the chatbot's performance on a dataset.

        Args:
            dataset (Optional[pd.DataFrame]): The dataset to evaluate. Defaults to None.
            input_columns (Optional[list[str]]): The input columns to use. Defaults to None.
            check_groundedness (bool): Whether to check the groundedness of the responses. Defaults to False.
            ground_truth_column (Optional[str]): The column containing the ground truth values. Defaults to None.
            groundedness_model (Optional[str]): The LM to use for checking groundedness. Defaults to None.

        Returns:
            pd.DataFrame: The evaluated results.

        """

        if dataset is None:
            with open("test_set.csv", "r") as f:
                dataset = pd.read_csv(f)

        if input_columns is None:
            input_columns = ["question"]

        test_results = self._run_test(
            dataset=dataset,
            input_columns=input_columns,
            check_groundedness=check_groundedness,
            ground_truth_column=ground_truth_column,
            groundedness_model=groundedness_model,
        )

        results = pd.merge(dataset, test_results, left_index=True, right_index=True)

        return results


    def _run_test(
        self,
        dataset: pd.DataFrame,
        input_columns: str | list[str],
        check_groundedness: bool = False,
        ground_truth_column: Optional[str] = None,
        groundedness_model: Optional[str] = None,
    ) -> pd.DataFrame:
        """
        Runs a test on the chatbot using a dataset.

        Args:
            dataset (pd.DataFrame): The dataset to use for testing.
            input_columns (str | list[str]): The input columns to use.
            check_groundedness (bool): Whether to check the groundedness of the responses. Defaults to False.
            ground_truth_column (Optional[str]): The column containing the ground truth values. Defaults to None.
            groundedness_model (Optional[str]): The groundedness model to use. Defaults to None.

        Returns:
            pd.DataFrame: The test results.

        """

        if check_groundedness and ground_truth_column is None:
            raise ValueError(
                "Provide a `ground_truth_column` to check groundedness."
            )


        if isinstance(input_columns, str):
            input_columns = [input_columns]

        fields = [
            "response",
            "time",
        ]

        if check_groundedness:
            fields.extend(["groundedness", "groundedness_reasoning"])

        columns = [
            f"{input}_{field}"
            for input in input_columns
            for field in fields
        ]

        results = pd.DataFrame(
            index=dataset.index,
            columns=columns
        )

        for id, example in tqdm(dataset.iterrows()):

            for input in input_columns:

                response, response_time = self._get_response(
                    query=example[input]
                )

                if check_groundedness:
                    groundedness = self.check_response_groundedness(
                        prediction=response.get("response"),
                        ground_truth=example[ground_truth_column],
                        lm=groundedness_model,
                    )
                    results.at[id, f"{input}_groundedness"] = groundedness["truthfulness"]
                    results.at[id, f"{input}_groundedness_reasoning"] = groundedness["reasoning"]

                results.at[id, f"{input}_response"] = response.get("response")
                results.at[id, f"{input}_time"] = response_time

                time.sleep(0.5)

        return results

    def _get_response(self, query: str) -> tuple[dict, float]:
        """
        Retrieves the chatbot's response for a given query.

        Args:
            query (str): The query to send to the chatbot.

        Returns:
            tuple[dict, float]: A tuple containing the response dictionary and the response time.

        """
        session_id = self.sessions_client.build_session_id(agent_id=self.agent_id)
        output = {}

        response_start = time.time()
        res = self.sessions_client.detect_intent(agent_id=self.agent_id, session_id=session_id, text=query)
        turn = proto.Message.to_dict(
            res,
            use_integers_for_enums=False,
            including_default_value_fields=False,
            preserving_proto_field_name=True,
        )
        response_time = time.time() - response_start

        output["response"] = self.format_response_messages(turn)

        # Could also grab params to check from the response
        # output["param_name"] = turn.get("parameters", {}).get("param_name")

        return output, response_time


    @staticmethod
    def format_response_messages(response: dict[str, Any]) -> str:
        """
        Formats the response messages received from the chatbot.

        Args:
            response (dict[str, Any]): The response dictionary.

        Returns:
            str: The formatted response messages.

        """
        # Empty repeated fields are omitted by to_dict, so the key may be missing
        messages = response.get("response_messages", [])

        if len(messages) == 0:
            return ""

        all_messages = []
        for m in messages:
            match m:
                case {"text": {"text": texts}}:
                    all_messages.append("\n".join(texts))
                case {"payload": {"ujet": {"buttons": buttons}}}:
                    all_messages.append("\n".join(f"[ {b['title']} ]" for b in buttons))
                case {"end_interaction": info}:
                    all_messages.append("<END SESSION>")
                case _:
                    # Keep unrecognized message types as text so the join below works
                    all_messages.append(str(m))

        print(all_messages)
        return "\n".join(all_messages)


    @staticmethod
    def check_response_groundedness(
        prediction: str,
        ground_truth: str,
        lm: Optional[GenerativeModel | str] = "gemini-1.5-pro-001",
        safety_settings: Optional[dict[str, Any]] = None,
        generation_config: Optional[dict[str, Any]] = None,
    ) -> dict[str, bool | str]:
        """
        Checks the groundedness of the chatbot's response.

        Args:
            prediction (str): The predicted response.
            ground_truth (str): The ground truth response.
            lm (GenerativeModel | str): The groundedness model to use. Defaults to "gemini-1.5-pro-001".
            safety_settings (Optional[dict[str, Any]]): The safety settings for the model. Defaults to None.
            generation_config (Optional[dict[str, Any]]): The generation config for the model. Defaults to None.

        Returns:
            dict[str, bool | str]: A dictionary containing the groundedness information.

        """
        if lm is None:
            lm = "gemini-1.5-pro-001"

        if isinstance(lm, str):
            lm = GenerativeModel(lm)

        if generation_config is None:
            generation_config = {
                "max_output_tokens": 128,
                "temperature": 0,
                "response_mime_type": "application/json",
            }

        prompt = f"""Given the following prediction and ground truth, determine if the prediction is true or false. Provide a clear and concise reasoning for your decision.
**NOTE:**
- The prediction might use slightly different wording than the ground truth. Compare the meaning of the prediction with that of the ground truth to determine its truthfulness.

**Input:**
- **Prediction:** {prediction}
- **Ground Truth:** {ground_truth}

**Output Format:**
{{"truthfulness": [true/false], "reasoning": [Provide a concise explanation of why the prediction is true or false based on the ground truth context.]}}

**Example:**

**Prediction:** The capital of France is Paris.
**Ground Truth:** France is a country located in Western Europe. Its capital city is Paris.

**Example Output:**
{{"truthfulness": true, "reasoning": "The ground truth explicitly states that Paris is the capital city of France, which aligns with the prediction."}}

REMEMBER:
- ONLY OUTPUT YOUR ANSWER FORMATTED AS A DICTIONARY.
- DO NOT OUTPUT ANY OTHER TEXT.
- DO NOT INCLUDE CODE BLOCKS.
- THE FIRST CHARACTER OF YOUR RESPONSE SHOULD BE '{{'.
"""
        result = lm.generate_content(
            prompt,
            safety_settings=safety_settings,
            generation_config=generation_config,
        )

        return json.loads(result.candidates[0].content.parts[0].text.strip())
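
`check_response_groundedness` hands the model's reply straight to `json.loads`, which raises if the reply is wrapped in a code fence or uses Python-style literals such as `True`. A more tolerant parser could look like the sketch below. The helper `parse_judgement` is hypothetical and not part of the class; it assumes the reply is a single dictionary with a `truthfulness` key.

```python
import ast
import json
from typing import Any


def parse_judgement(raw: str) -> dict[str, Any]:
    """Parse a judge reply into a dict with `truthfulness` and `reasoning`."""
    text = raw.strip()

    # Drop a surrounding Markdown code fence if the model added one anyway.
    fence = "`" * 3
    if text.startswith(fence):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:].strip()

    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, e.g. {"truthfulness": True}.
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError) as err:
            raise ValueError(f"Unparseable judge output: {raw!r}") from err

    if not isinstance(parsed, dict) or "truthfulness" not in parsed:
        raise ValueError(f"Unexpected judge output: {raw!r}")

    # Normalize the verdict to a real bool, including string verdicts.
    value = parsed["truthfulness"]
    if isinstance(value, str):
        value = value.strip().lower() == "true"
    parsed["truthfulness"] = bool(value)
    parsed.setdefault("reasoning", "")
    return parsed
```

Raising `ValueError` on anything unexpected keeps a malformed reply from silently counting as grounded or ungrounded.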

# Usage Example

In [227]:
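# Initialize evaluator class
# PROJECT_ID and PROJECT_REGION are not defined in this notebook; set them before running this cell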
evaluator = BotEvaluator(
    project_id=PROJECT_ID,
    vertexai_region=PROJECT_REGION,
    agent_id="projects/ai-ml-team-sandbox/locations/us-east1/agents/fac86b77-f640-41f6-937b-22f037d6cc22",
)


In [228]:
# Load and clean/transform your data separately (e.g. with pd.read_csv()); a one-row example is built inline here
dataset = pd.DataFrame([["What's the capital of Oklahoma?", "Oklahoma City"]], columns=["question", "truth"])

# Alternatively, save the data as test_set.csv; evaluate() reads that file when no dataset is passed
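
For the `test_set.csv` route, a file with the same two columns can be written with the standard library, as in the minimal sketch below. The rows are hypothetical placeholders, and the column names `question` and `truth` simply match the inline example above.

```python
import csv

# Hypothetical example rows; replace them with your own question/answer pairs.
rows = [
    {"question": "What's the capital of Oklahoma?", "truth": "Oklahoma City"},
    {"question": "What's the capital of France?", "truth": "Paris"},
]

# newline="" is the documented way to open files for the csv module.
with open("test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "truth"])
    writer.writeheader()
    writer.writerows(rows)
```

When relying on the default file, `ground_truth_column="truth"` still has to be passed to `evaluate()` if groundedness is checked, since that argument defaults to `None`.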

In [229]:
# Returns a copy of the original data with merged evaluation
evaluation_results = evaluator.evaluate(
    dataset=dataset,
    input_columns=["question"],
    check_groundedness=True,
    ground_truth_column="truth",
)

0it [00:00, ?it/s]

['Oklahoma City OK', 'END SESSION']


1it [00:03,  3.38s/it]


In [None]:
# Check results
evaluation_results.head()
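
The merged frame holds the original columns plus one group of result columns per input column. The sketch below mirrors the naming scheme used in `_run_test`; `result_columns` is a hypothetical helper written only to illustrate it.

```python
def result_columns(input_columns: list[str], check_groundedness: bool) -> list[str]:
    """Mirror the result column naming used by `BotEvaluator._run_test`."""
    fields = ["response", "time"]
    if check_groundedness:
        fields += ["groundedness", "groundedness_reasoning"]
    # One "<input column>_<field>" column per combination.
    return [f"{column}_{field}" for column in input_columns for field in fields]


print(result_columns(["question"], check_groundedness=True))
# ['question_response', 'question_time', 'question_groundedness', 'question_groundedness_reasoning']
```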

In [None]:
from datetime import datetime
import pytz

# Write results out to a timestamped file

with open(
    f"test_results_{datetime.now(pytz.utc).isoformat('T', 'seconds')}.csv",
    "w",
) as f:
    evaluation_results.to_csv(f, index=True)
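
The ISO 8601 timestamp in the filename above contains colons, which Windows does not allow in filenames. If portability matters, a colon-free variant could be built as in this sketch, which uses only the standard library. The helper name `results_filename` and the compact timestamp format are assumptions, not part of the original notebook.

```python
from datetime import datetime, timezone


def results_filename(prefix: str = "test_results") -> str:
    """Build a UTC-timestamped CSV filename without colons."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}_{stamp}.csv"


print(results_filename())
```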