# IBM watsonx.governance Evaluation Studio - Tracking and Comparing AI Application Quality Evaluations

## Scenario Overview

Consider a development team building a RAG-based chatbot for their application, implemented either as a simple LLM prompt or as a more advanced agentic RAG application. The underlying model powering the system is, say, **Gemma**, hosted on **Google Vertex AI**.

During the development phase, the team evaluates the chatbot by asking a set of questions. For each question:
- The relevant **context** is retrieved,
- A **response** is generated by the chatbot, and
- A **ground truth answer** (available during development) is used for comparison.

In addition to this **test data set**, the team has prepared a **validation set** of questions, which are also run against the application.

To assess the quality of the responses, the team uses the **IBM watsonx.governance SDK** to compute:
- RAG-specific metrics such as **faithfulness**, **context relevance**, and **answer relevance**
- **Readability metrics** of the generated responses

Metrics are computed for both the test (development) set and the validation set.

To **track these evaluations**, the team uses **Evaluation Studio** in **watsonx.governance**:
- An **AI Experiment** is created.
- Both evaluation runs (development and validation) are logged under this experiment.
- The experiment runs are then compared using **AI Evaluation**.
- Finally, the results and comparisons are **visualized through the Evaluation Studio UI** for further insights.

This notebook walks through this end-to-end workflow.


![image.png](attachment:fd02a955-1018-48c6-a85f-9a82e17a1ebc.png)

#### Required pip installations

In [None]:
!pip install -U "ibm-watsonx-gov[agentic,visualization]" | tail -n 1
!pip install -U "torch (>=2.1.0,<3.0.0)" | tail -n 1
!pip install opentelemetry-exporter-otlp-proto-http==1.34.1 | tail -n 1

In [1]:
import warnings
warnings.filterwarnings('ignore')

### Credentials to run watsonx.gov SDK and AI Experiments

In [None]:
WATSONX_APIKEY = "[Your IBM IAM API Key]"
WATSONX_PROJECT_ID = "[Your IBM watsonx Project ID]"
WXG_INSTANCE_ID = '[Your IBM watsonx.governance instance ID]'
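As an alternative to pasting secrets into the notebook, the credentials could be read from environment variables. The following is a minimal sketch, not part of the original notebook; the helper `read_setting` is hypothetical, and using the same names for the environment variables as for the notebook variables is an assumption.

```python
import os
from typing import Optional


def read_setting(name: str, default: Optional[str] = None) -> str:
    """Return the environment variable `name`, or raise if it is not set.

    `read_setting` is a hypothetical helper used for illustration only.
    """
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return value


# Hypothetical usage (the environment variable names are assumptions):
# WATSONX_APIKEY = read_setting("WATSONX_APIKEY")
# WATSONX_PROJECT_ID = read_setting("WATSONX_PROJECT_ID")
# WXG_INSTANCE_ID = read_setting("WXG_INSTANCE_ID")
```

Failing early with a clear message avoids sending placeholder strings to the service.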

### Initialize the watsonx.gov API client object

In [3]:
from ibm_watsonx_gov.ai_experiments.ai_experiments_client import AIExperimentsClient
from ibm_watsonx_gov.entities.ai_experiment import AIExperimentRun, AIExperiment
from ibm_watsonx_gov.clients.api_client import APIClient
from ibm_watsonx_gov.entities.credentials import Credentials
credentials = Credentials(api_key=WATSONX_APIKEY,
                          service_instance_id=WXG_INSTANCE_ID)

# Initializing APIClient
api_client = APIClient(credentials=credentials)

In [4]:
import pandas as pd
import numpy as np

**Note:** This notebook does not cover running the application or executing prompts against the underlying model. It assumes that the development team has already processed the questions using the LLM prompt or the agentic application and has obtained the corresponding contexts, generated answers, and ground truth answers.

### Test data 

In [5]:
!rm -fr banking_rag_chatbot_qna_1.csv
!wget "https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_1.csv"

--2025-08-08 05:52:22--  https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3995 (3.9K) [text/plain]
Saving to: ‘banking_rag_chatbot_qna_1.csv’


2025-08-08 05:52:22 (10.9 MB/s) - ‘banking_rag_chatbot_qna_1.csv’ saved [3995/3995]



In [6]:
banking_rag_chatbot_qna_1 = pd.read_csv("banking_rag_chatbot_qna_1.csv")
banking_rag_chatbot_qna_1.head()

| | question | context | answer | grouth_truth |
|---|---|---|---|---|
| 0 | What is the minimum balance required for a sav... | The minimum average monthly balance (AMB) requ... | The minimum balance required is 10000 for urba... | Depending on the branch location, the required... |
| 1 | Can I open a fixed deposit online? | Yes, customers with access to internet banking... | Yes, you can open a fixed deposit online via i... | Yes, fixed deposits can be opened online via i... |
| 2 | How can I block my lost debit card? | If your debit card is lost or stolen, it is cr... | You should block your lost debit card immediat... | Immediately block your lost debit card using m... |
| 3 | What is the interest rate on personal loans? | The bank offers personal loans at interest rat... | Personal loan interest rates range between 10.... | Interest rates for personal loans range from 1... |
| 4 | Are NRIs eligible for home loans? | Non-Resident Indians (NRIs) are eligible to ap... | Yes, NRIs can get home loans in India for up t... | NRIs can apply for home loans up to 20 years t... |


### Validation data

In [7]:
!rm -fr banking_rag_chatbot_qna_2.csv
!wget "https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_2.csv"

--2025-08-08 05:52:25--  https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3308 (3.2K) [text/plain]
Saving to: ‘banking_rag_chatbot_qna_2.csv’


2025-08-08 05:52:25 (10.4 MB/s) - ‘banking_rag_chatbot_qna_2.csv’ saved [3308/3308]



In [8]:
banking_rag_chatbot_qna_2 = pd.read_csv("banking_rag_chatbot_qna_2.csv")
banking_rag_chatbot_qna_2.head()

| | question | context | answer | grouth_truth |
|---|---|---|---|---|
| 0 | How can I apply for a credit card online? | To apply for a credit card online, customers c... | You can apply for a credit card online by subm... | Visit the bank’s website or app, choose a card... |
| 1 | What are the features of mobile banking? | The mobile banking app offers a wide range of ... | Mobile banking offers fund transfers, bill pay... | Mobile banking supports transfers, bill paymen... |
| 2 | How do I close my bank account? | To close your bank account, visit the nearest ... | To close your account, visit the branch with a... | You must visit the branch with ID proof and su... |
| 3 | Can I link multiple bank accounts to one UPI ID? | Yes, UPI apps allow users to link multiple ban... | Yes, you can link multiple accounts to one UPI... | Yes, UPI apps support linking multiple account... |
| 4 | What happens if I miss an EMI payment? | Missing an EMI (Equated Monthly Installment) p... | If you miss an EMI, it can affect your credit ... | Missing an EMI leads to penalties, interest, a... |


# IBM watsonx.gov SDK evaluation starts

### Configurations

- Define the `GenAIConfiguration`, specifying which fields represent the **question**, the **context**, the **response**, and the **ground truth**.
- Configure the use of **RAG evaluation metrics**, which require the **input question**, **retrieved context**, **generated response**, and the **ground truth answer**.
- *(Note: Only a subset of available metrics is shown here. For the complete list, refer to the [watsonx.governance documentation](https://ibm.github.io/ibm-watsonx-gov/index.html).)*
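Because the configuration refers to DataFrame columns by name, it can help to confirm the columns exist before evaluating. The snippet below is a minimal sketch using only pandas; `missing_columns` is a hypothetical helper, and the column list mirrors the headers shown in the data previews (including the dataset's own spelling `grouth_truth`).

```python
import pandas as pd

# Column names as they appear in the CSV previews; "grouth_truth" is the
# dataset's own spelling and must be matched exactly.
REQUIRED_COLUMNS = ["question", "context", "answer", "grouth_truth"]


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `df`."""
    return [col for col in required if col not in df.columns]


# Toy frame that deliberately lacks the reference column
sample = pd.DataFrame({"question": ["q"], "context": ["c"], "answer": ["a"]})
print(missing_columns(sample))  # ['grouth_truth']
```

An empty list means every field named in the configuration is present in the data.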


In [9]:
from ibm_watsonx_gov.config import GenAIConfiguration
from ibm_watsonx_gov.metrics import AnswerRelevanceMetric
from ibm_watsonx_gov.metrics import AnswerSimilarityMetric
from ibm_watsonx_gov.metrics import FaithfulnessMetric
from ibm_watsonx_gov.metrics import ContextRelevanceMetric
from ibm_watsonx_gov.metrics import TextGradeLevelMetric
from ibm_watsonx_gov.metrics import TextReadingEaseMetric
from ibm_watsonx_gov.entities.enums import TaskType, MetricGroup

config = GenAIConfiguration(
    input_fields=["question"],
    context_fields=["context"],
    output_fields=["answer"],
    reference_fields=["grouth_truth"]
)

metrics = [
    AnswerRelevanceMetric(),
    AnswerSimilarityMetric(),
    FaithfulnessMetric(),
    ContextRelevanceMetric(),
    TextGradeLevelMetric(),
    TextReadingEaseMetric(),
]

### Evaluate the Metrics
- Create a `MetricsEvaluator`.
- Evaluate the metrics against the data produced by the underlying prompt or agent.

In [10]:
from ibm_watsonx_gov.evaluators import MetricsEvaluator

# Reuse the APIClient created earlier instead of constructing a second one
evaluator = MetricsEvaluator(
    api_client=api_client,
    configuration=config,
)

## Evaluate on Test Data

In [11]:
evaluation_results_1 = evaluator.evaluate(
    data=banking_rag_chatbot_qna_1,
    metrics=metrics
)
evaluation_results_df_1 = evaluation_results_1.to_df()
evaluation_results_df_1

| | answer_relevance.token_recall | answer_similarity.token_recall | faithfulness.token_k_precision | context_relevance.token_precision | text_grade_level.flesch_kincaid_grade | text_reading_ease.flesch_reading_ease |
|---|---|---|---|---|---|---|
| 0 | 0.625 | 0.5 | 0.75 | 0.625 | 8.637 | 55.4035 |
| 1 | 0.833333 | 0.48 | 0.708333 | 0.833333 | 10.525 | 41.8675 |
| 2 | 0.5 | 0.578947 | 0.823529 | 0.625 | 10.475294 | 50.238824 |
| 3 | 0.428571 | 0.4 | 0.684211 | 0.714286 | 6.132632 | 62.375088 |
| 4 | 0.666667 | 0.5 | 0.695652 | 1.0 | 10.486667 | 62.625 |
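The column suffixes `token_recall` and `token_precision` suggest token-overlap scoring. The sketch below shows token-overlap recall and precision in general terms, assuming a simple lowercase word tokenizer. The SDK's actual tokenization, normalization, and choice of which texts are compared for each metric are not shown in this notebook, so this is an illustration only and is not expected to reproduce the values above.

```python
import re


def tokens(text: str) -> set:
    """Lowercase alphanumeric word tokens as a set (an assumed, simplistic tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    ref = tokens(reference)
    return len(ref & tokens(candidate)) / len(ref) if ref else 0.0


def token_precision(reference: str, candidate: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand = tokens(candidate)
    return len(tokens(reference) & cand) / len(cand) if cand else 0.0


# Made-up sentences for illustration; they share the tokens fixed, can, online
reference = "fixed deposits can be opened online"
candidate = "you can open a fixed deposit online"
print(token_recall(reference, candidate))     # 0.5 (3 of 6 reference tokens)
print(token_precision(reference, candidate))  # 3 of 7 candidate tokens
```

Exact-match tokens penalize inflections such as "deposit" versus "deposits", which is one reason token-overlap scores can look low for answers that are semantically correct.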


## Evaluate on Validation Data

In [12]:
evaluation_results_2 = evaluator.evaluate(
    data=banking_rag_chatbot_qna_2,
    metrics=metrics
)
evaluation_results_df_2 = evaluation_results_2.to_df()
evaluation_results_df_2

| | answer_relevance.token_recall | answer_similarity.token_recall | faithfulness.token_k_precision | context_relevance.token_precision | text_grade_level.flesch_kincaid_grade | text_reading_ease.flesch_reading_ease |
|---|---|---|---|---|---|---|
| 0 | 0.75 | 0.346154 | 0.736842 | 0.75 | 10.019048 | 60.634286 |
| 1 | 0.333333 | 0.458333 | 0.833333 | 0.666667 | 8.897778 | 47.3 |
| 2 | 0.285714 | 0.391304 | 0.875 | 0.428571 | 7.818889 | 71.065 |
| 3 | 0.8 | 0.409091 | 0.666667 | 0.8 | 5.852222 | 85.165 |
| 4 | 0.428571 | 0.368421 | 0.5 | 0.285714 | 6.725263 | 80.686842 |
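The last two columns carry the suffixes `flesch_kincaid_grade` and `flesch_reading_ease`. As a rough guide to reading them, the sketch below applies the standard published Flesch formulas to word, sentence, and syllable counts. How the SDK counts sentences and syllables is not shown in this notebook, so the counts here are made up and the results are illustrative only.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Made-up counts: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 3), round(grade, 2))  # 59.635 9.91
```

The two scores move in opposite directions: longer sentences and more syllables per word lower the reading ease and raise the grade level.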


## Utility method to average the record-level metrics and build the run-level metric results

In [13]:
def construct_run_metrics(df):
    """Average each record-level metric column of `df` and return a list of run-level metric dicts."""

    # Define column mapping and grouping
    metric_mapping = {
        'answer_relevance.token_recall': ('answer_relevance', 'answer_quality'),
        'answer_similarity.token_recall': ('answer_similarity', 'answer_quality'),
        'faithfulness.token_k_precision': ('faithfulness', 'answer_quality'),
        'context_relevance.token_precision': ('context_relevance', 'retrieval_quality'),
        'text_grade_level.flesch_kincaid_grade': ('text_grade_level', 'readability'),
        'text_reading_ease.flesch_reading_ease': ('text_reading_ease', 'readability')
    }
    
    # Aggregate and build JSON list
    run_metric_results = []
    for col, (new_name, group) in metric_mapping.items():
        if col in df.columns:
            avg_value = df[col].mean()
            count = df[col].count()
            run_metric_results.append({
                "name": new_name,
                "value": round(avg_value, 4),
                "group": group,
                "count": int(count)
            })

    return run_metric_results
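One detail of this aggregation worth knowing: pandas skips missing values by default in `mean()`, and `count()` reports only the non-missing records. The toy series below (made-up scores, not from the datasets) is a minimal sketch of that behavior.

```python
import numpy as np
import pandas as pd

# Three records, one of which has no score
scores = pd.Series([0.5, np.nan, 1.0], name="faithfulness.token_k_precision")

mean_value = scores.mean()    # NaN is skipped: (0.5 + 1.0) / 2
count_value = scores.count()  # counts only non-missing values

print(round(float(mean_value), 4), int(count_value))  # 0.75 2
```

If a metric fails for some records, the reported `count` will therefore be lower than the number of rows in the dataset.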

## Test Data Run metrics

In [14]:
run_1_metric_results = construct_run_metrics(evaluation_results_df_1)
# Preview result
import json
print(json.dumps(run_1_metric_results, indent=2))

[
  {
    "name": "answer_relevance",
    "value": 0.6107,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "answer_similarity",
    "value": 0.4918,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "faithfulness",
    "value": 0.7323,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "context_relevance",
    "value": 0.7595,
    "group": "retrieval_quality",
    "count": 5
  },
  {
    "name": "text_grade_level",
    "value": 9.2513,
    "group": "readability",
    "count": 5
  },
  {
    "name": "text_reading_ease",
    "value": 54.502,
    "group": "readability",
    "count": 5
  }
]


## Validation Data Run Metrics

In [15]:
run_2_metric_results = construct_run_metrics(evaluation_results_df_2)
# Preview result
import json
print(json.dumps(run_2_metric_results, indent=2))

[
  {
    "name": "answer_relevance",
    "value": 0.5195,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "answer_similarity",
    "value": 0.3947,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "faithfulness",
    "value": 0.7224,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "context_relevance",
    "value": 0.5862,
    "group": "retrieval_quality",
    "count": 5
  },
  {
    "name": "text_grade_level",
    "value": 7.8626,
    "group": "readability",
    "count": 5
  },
  {
    "name": "text_reading_ease",
    "value": 68.9702,
    "group": "readability",
    "count": 5
  }
]
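Before logging the runs, the two sets of averages can be compared locally. The snippet below is a minimal sketch using pandas; the values are copied from the two outputs above (only `name` and `value` are kept), and `to_series` is a hypothetical helper.

```python
import pandas as pd

# Run-level averages copied from the outputs above
test_results = [
    {"name": "answer_relevance", "value": 0.6107},
    {"name": "answer_similarity", "value": 0.4918},
    {"name": "faithfulness", "value": 0.7323},
    {"name": "context_relevance", "value": 0.7595},
    {"name": "text_grade_level", "value": 9.2513},
    {"name": "text_reading_ease", "value": 54.502},
]
validation_results = [
    {"name": "answer_relevance", "value": 0.5195},
    {"name": "answer_similarity", "value": 0.3947},
    {"name": "faithfulness", "value": 0.7224},
    {"name": "context_relevance", "value": 0.5862},
    {"name": "text_grade_level", "value": 7.8626},
    {"name": "text_reading_ease", "value": 68.9702},
]


def to_series(results: list) -> pd.Series:
    """Turn a list of {"name", "value"} dicts into a Series indexed by metric name."""
    return pd.Series({item["name"]: item["value"] for item in results})


comparison = pd.DataFrame({
    "test": to_series(test_results),
    "validation": to_series(validation_results),
})
# Positive delta means the validation run scored higher than the test run
comparison["delta"] = (comparison["validation"] - comparison["test"]).round(4)
print(comparison)
```

On these numbers the largest quality drop is in `context_relevance` (0.7595 to 0.5862), while the validation answers score as easier to read.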


# Creating AI Experiment asset to capture Metrics Evaluation Runs

In [16]:
# Creating AI Experiment asset
name = "Banking ChatBot AI Experiment"
description = "Evaluation of a RAG-based chatbot for common banking queries, using manually curated test cases across account services, loans, cards, digital banking, and UPI. Each case includes the user question, retrieved context, model answer, and human-verified ground truth for quality benchmarking."

ai_experiment_client = AIExperimentsClient(api_client=api_client, project_id=WATSONX_PROJECT_ID)
ai_experiment = AIExperiment(name=name, 
                             description=description,
                             component_type="prompt",
                             component_name="Test prompt")

ai_experiment_asset = ai_experiment_client.create(ai_experiment)
ai_experiment_id = ai_experiment_asset.asset_id

Created AI experiment asset with id d59a6174-1f22-424c-9171-149a908844d0.



In [None]:
ai_experiment_asset.to_json()

## Create the Experiment Run for the Test Data

In [17]:
import uuid
experiment_run_1_details = AIExperimentRun(
                            run_id=str(uuid.uuid4()),
                            run_name="ChatBot Test Data Metrics Evaluation",
                            nodes=[],
                            duration=10
                        )
experiment_run_1_details

AIExperimentRun(run_id='ad786851-00f3-42c2-a1db-d8918392aefd', run_name='ChatBot Test Data Metrics Evaluation', created_at='', created_by='', test_data={}, tracked=False, id_deleted=False, attachment_id='', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={})

## Create the Experiment Run for the Validation Data

In [18]:
import uuid
experiment_run_2_details = AIExperimentRun(
                            run_id=str(uuid.uuid4()),
                            run_name="ChatBot Validation Data Metrics Evaluation",
                            nodes=[],
                            duration=10
                        )
experiment_run_2_details

AIExperimentRun(run_id='6e505dc0-660b-4bf7-94f3-4417e53baae1', run_name='ChatBot Validation Data Metrics Evaluation', created_at='', created_by='', test_data={}, tracked=False, id_deleted=False, attachment_id='', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={})

## Associate the Test Data Run metrics with the experiment

In [19]:
ai_experiment_client.update(
    ai_experiment_id=ai_experiment_id,
    experiment_run_details=experiment_run_1_details,
    evaluation_results=run_1_metric_results
)


Storing evaluation result for experiment run ad786851-00f3-42c2-a1db-d8918392aefd of AI experiment d59a6174-1f22-424c-9171-149a908844d0.

Creating attachment for asset d59a6174-1f22-424c-9171-149a908844d0.

Successfully created attachment dd62c319-0953-4f5d-9523-15d7a711da62 for asset d59a6174-1f22-424c-9171-149a908844d0. Time taken: 1.829805850982666.

Updated experiment run details for run ChatBot Test Data Metrics Evaluation of AI experiment d59a6174-1f22-424c-9171-149a908844d0.

Updated AI experiment asset d59a6174-1f22-424c-9171-149a908844d0.



AIExperiment(container_id='', container_type='', container_name='', name='AI Experiment for Agent', description='', asset_type='', created_at='', owner_id='', asset_id='', creator_id='', component_id='', component_type='prompt', component_name='Test prompt', runs=[AIExperimentRun(run_id='ad786851-00f3-42c2-a1db-d8918392aefd', run_name='ChatBot Test Data Metrics Evaluation', created_at='2025-08-08T05:53:09Z', created_by='IBMid-550002SR1C', test_data={'total_rows': 5}, tracked=False, id_deleted=False, attachment_id='dd62c319-0953-4f5d-9523-15d7a711da62', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={})])

## Associate the Validation Data Run metrics with the experiment

In [20]:
ai_experiment_client.update(
    ai_experiment_id=ai_experiment_id,
    experiment_run_details=experiment_run_2_details,
    evaluation_results=run_2_metric_results
)


Storing evaluation result for experiment run 6e505dc0-660b-4bf7-94f3-4417e53baae1 of AI experiment d59a6174-1f22-424c-9171-149a908844d0.

Creating attachment for asset d59a6174-1f22-424c-9171-149a908844d0.

Successfully created attachment 7ab6a3f2-ad85-4032-abd6-5fb1ae9ab221 for asset d59a6174-1f22-424c-9171-149a908844d0. Time taken: 1.9616460800170898.

Updated experiment run details for run ChatBot Validation Data Metrics Evaluation of AI experiment d59a6174-1f22-424c-9171-149a908844d0.

Updated AI experiment asset d59a6174-1f22-424c-9171-149a908844d0.



AIExperiment(container_id='', container_type='', container_name='', name='AI Experiment for Agent', description='', asset_type='', created_at='', owner_id='', asset_id='', creator_id='', component_id='', component_type='prompt', component_name='Test prompt', runs=[AIExperimentRun(run_id='ad786851-00f3-42c2-a1db-d8918392aefd', run_name='ChatBot Test Data Metrics Evaluation', created_at='2025-08-08T05:53:09Z', created_by='IBMid-550002SR1C', test_data={'total_rows': 5}, tracked=False, id_deleted=False, attachment_id='dd62c319-0953-4f5d-9523-15d7a711da62', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={}), AIExperimentRun(run_id='6e505dc0-660b-4bf7-94f3-4417e53baae1', run_name='ChatBot Validation Data Metrics Evaluation', created_at='2025-08-08T05:53:13Z', created_by='IBMid-550002SR1C', test_data={'total_rows': 5}, tracked=False, id_deleted=False, attachment_id='7ab6a3f2-ad85-4032-abd6-5fb1ae9ab221', nodes=[], description='', source_name='

### Check whether the runs are associated with the experiment

In [21]:
ai_experiment = ai_experiment_client.get(ai_experiment_id)
ai_experiment.to_json()

Retrieved AI experiment asset d59a6174-1f22-424c-9171-149a908844d0.



{'container_id': '8486795b-14ed-4275-8256-dd18ce99f0d8',
 'container_type': 'project_id',
 'container_name': 'QuestDiagDemos',
 'name': 'Banking ChatBot AI Experiment',
 'description': 'Evaluation of a RAG-based chatbot for common banking queries, using manually curated test cases across account services, loans, cards, digital banking, and UPI. Each case includes the user question, retrieved context, model answer, and human-verified ground truth for quality benchmarking.',
 'asset_type': 'ai_experiment',
 'created_at': '2025-08-08T05:52:59Z',
 'owner_id': 'IBMid-550002SR1C',
 'asset_id': 'd59a6174-1f22-424c-9171-149a908844d0',
 'creator_id': 'IBMid-550002SR1C',
 'component_id': '',
 'component_type': 'prompt',
 'component_name': 'Test prompt',
 'runs': [{'run_id': 'ad786851-00f3-42c2-a1db-d8918392aefd',
   'run_name': 'ChatBot Test Data Metrics Evaluation',
   'created_at': '2025-08-08T05:53:09Z',
   'created_by': 'IBMid-550002SR1C',
   'test_data': {'total_rows': 5},
   'tracked': Fal
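Rather than reading the full JSON by eye, the run names can be pulled out of the dictionary. The snippet below is a minimal sketch that works on a plain dict shaped like the output above; the sample is abbreviated to a few keys, with run IDs and names copied from this notebook's outputs, and `run_names` is a hypothetical helper.

```python
def run_names(experiment: dict) -> list:
    """Return the names of all runs recorded in an experiment dictionary."""
    return [run["run_name"] for run in experiment.get("runs", [])]


# Abbreviated sample mirroring the structure of the output above
experiment_json = {
    "name": "Banking ChatBot AI Experiment",
    "runs": [
        {"run_id": "ad786851-00f3-42c2-a1db-d8918392aefd",
         "run_name": "ChatBot Test Data Metrics Evaluation"},
        {"run_id": "6e505dc0-660b-4bf7-94f3-4417e53baae1",
         "run_name": "ChatBot Validation Data Metrics Evaluation"},
    ],
}

expected = {
    "ChatBot Test Data Metrics Evaluation",
    "ChatBot Validation Data Metrics Evaluation",
}
missing = expected - set(run_names(experiment_json))
print(missing or "All expected runs are associated with the experiment.")
```

In the notebook, the dictionary returned by `ai_experiment.to_json()` could be passed to `run_names` in place of the sample.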

## Create the AI Evaluation asset from the experiment, which creates the watsonx.governance Evaluation Studio asset

In [23]:
from ibm_watsonx_gov.entities.ai_evaluation import EvaluationConfig
from ibm_watsonx_gov.entities.ai_evaluation import AIEvaluationAsset
ai_evaluation = ai_experiment_client.create_ai_evaluation_asset(    
    ai_experiment_ids=[ai_experiment_id],
    ai_evaluation_details=AIEvaluationAsset(
        name="Bank ChatBot Quality Evaluation Experiment",
        description="Bank ChatBot Quality Evaluation Experiment",
        evaluation_configuration=EvaluationConfig()
            ))

Retrieved AI experiment asset d59a6174-1f22-424c-9171-149a908844d0.

Created AI Evaluation asset with id 896eb383-6a68-46c1-a061-edee76038fa4.


In [24]:
ai_evaluation.to_json()

{'container_id': '8486795b-14ed-4275-8256-dd18ce99f0d8',
 'container_type': 'project',
 'container_name': 'Bank ChatBot Quality Evaluation Experiment',
 'name': 'Bank ChatBot Quality Evaluation Experiment',
 'description': 'Bank ChatBot Quality Evaluation Experiment',
 'asset_type': 'ai_evaluation',
 'created_at': '2025-08-08T05:54:11Z',
 'owner_id': 'IBMid-550002SR1C',
 'asset_id': '896eb383-6a68-46c1-a061-edee76038fa4',
 'creator_id': 'IBMid-550002SR1C',
 'asset_details': {'task_ids': [],
  'operational_space_id': 'development',
  'input_data_type': 'unstructured_text',
  'evaluation_asset_type': 'ai_experiment'},
 'evaluation_configuration': {'monitors': {'agentic_ai_quality': {'parameters': {'metrics_configuration': {}}}},
  'evaluation_assets': [{'id': 'd59a6174-1f22-424c-9171-149a908844d0',
    'container_id': '8486795b-14ed-4275-8256-dd18ce99f0d8',
    'container_type': 'project',
    'name': 'Banking ChatBot AI Experiment',
    'run_id': 'ad786851-00f3-42c2-a1db-d8918392aefd',


Use the `href` link in the output above to open this experiment in Evaluation Studio.

Author: ravi.chamarthy@in.ibm.com