# IBM watsonx.governance Evaluation Studio - Tracking and Comparing AI Application Quality Evaluations

## Scenario Overview

Consider a development team building a RAG-based chatbot for their application, either with a plain LLM prompt or as a more advanced agentic RAG system. <br>
The underlying model powering the system is, say, the newly included **gpt-oss-120b** model, hosted on **IBM watsonx.ai**.

During the development phase, the team evaluates the chatbot by asking a set of questions. For each question:
- The relevant **context** is retrieved,
- A **response** is generated by the chatbot, and
- A **ground truth answer** (typically available during development) is used for comparison.

In addition to this **test data set**, the team has prepared a **validation set** of questions, which are also run against the application.

To assess the quality of the responses, the team uses the **IBM watsonx.governance SDK** to compute:
- RAG-specific metrics such as **faithfulness**, **context relevance**, and **answer relevance**
- **Readability metrics** of the generated responses

Metrics are computed for both the test (development) set and the validation set.

To **track these evaluations**, the team uses **Evaluation Studio** in **watsonx.governance**:
- An **AI Experiment** is created.
- Both evaluation runs (development and validation) are logged under this experiment.
- The experiment runs are then compared using **AI Evaluation**.
- Finally, the results and comparisons are **visualized through the Evaluation Studio UI** for further insights.

This notebook walks through this end-to-end workflow, evaluating and comparing the metrics computed on the output of the gpt-oss-120b chat API.

![image.png](attachment:4e52ca31-ccab-43e6-9c58-9e24b9cc1b42.png)

#### Required pip installations for IBM watsonx.governance SDK

In [None]:
!pip install -U "ibm-watsonx-gov[metrics]" | tail -n 1

In [1]:
import warnings
warnings.filterwarnings('ignore')

### Credentials to run watsonx.gov SDK and AI Experiments

In [2]:
WX_AI_URL = 'https://us-south.ml.cloud.ibm.com'
IAM_API_KEY = "[Your IBM IAM API Key]"
project_id = "[Your IBM watsonx Project ID]"
model_id = "openai/gpt-oss-120b"
WXG_INSTANCE_ID = '[Your IBM watsonx.governance instance ID]'
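Hardcoding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of reading them from environment variables instead is shown below. The helper `read_setting` and the variable names `WATSONX_APIKEY` and `WATSONX_URL` are assumptions made for illustration, not names required by the SDK.

```python
import os

def read_setting(name, default=None, env=os.environ):
    """Return a non-empty setting from env, falling back to default; fail loudly otherwise."""
    value = env.get(name) or default
    if not value:
        raise KeyError(f"Missing required setting: {name}")
    return value

# Demo with an in-memory mapping so the sketch runs anywhere.
# In the notebook, omit the env argument so os.environ is used.
demo_env = {"WATSONX_APIKEY": "example-key"}  # hypothetical variable name and value
demo_api_key = read_setting("WATSONX_APIKEY", env=demo_env)
demo_url = read_setting(
    "WATSONX_URL",  # hypothetical variable name
    default="https://us-south.ml.cloud.ibm.com",
    env=demo_env,
)
print(demo_url)
```

With this pattern, the values assigned to `IAM_API_KEY`, `project_id`, and `WXG_INSTANCE_ID` never need to appear in the notebook itself.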

### Initialize the watsonx.ai API client object and ModelInference

In [3]:
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

model = ModelInference(
    api_client = APIClient(
        credentials = Credentials(
            url = WX_AI_URL,
            api_key = IAM_API_KEY
        )
    ),
    model_id=model_id,
    project_id=project_id,
    verify=False  # disables TLS certificate verification; acceptable only for local experiments
)

### Utility method to call OpenAI gpt-oss-120b via the ModelInference API

In [4]:
def infer_openai_gpt_oss_120b(context, question):
    # Ideally, with the chat API, the context would not be embedded in the prompt; it would be passed as part of the conversation history or as context parameters.
    # However, context parameters are only available for chat prompt deployments, not for the chat API.
    # Also, gpt-oss-120b is not yet available for the text generation API, so the context is appended to the system instruction and the model is asked to use it to generate the answer.

    content = "You are a helpful assistant. Use the provided context to answer the question. \
    If you do not know the answer, just say 'I do not know' rather than providing made up answers. \
    Answer the question in a human readable format without any HTML, Markup, or Markdown content. \
    Make sure the output has a maximum of 100 tokens. " + "\n\n" + context

    messages = [
      {
        "role": "system",
        "content": content
      },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": question
          }
        ]
      }
    ]

    response = model.chat(messages=messages)
    response_text = response['choices'][0]['message']['content']
    return response_text
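Because the function above both builds the prompt and calls the model, the prompt cannot be inspected without a live watsonx.ai connection. One possible refactoring, sketched here under that assumption, separates message construction into its own function. The names `SYSTEM_INSTRUCTION` and `build_messages` are hypothetical and the instruction text is shortened for the example.

```python
# Shortened stand-in for the system instruction used in the notebook.
SYSTEM_INSTRUCTION = (
    "You are a helpful assistant. Use the provided context to answer the question. "
    "If you do not know the answer, just say 'I do not know'."
)

def build_messages(context, question):
    """Build the chat message list: instruction plus context as system, question as user."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION + "\n\n" + context},
        {"role": "user", "content": [{"type": "text", "text": question}]},
    ]

# Example inputs made up for this sketch.
messages = build_messages(
    context="Fixed deposits can be opened through internet banking.",
    question="Can I open a fixed deposit online?",
)
print(messages[0]["content"])
```

The resulting list has the same shape as the `messages` variable passed to `model.chat` above, so it could be handed to that call unchanged.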

## Test data 

In [5]:
!rm -fr banking_rag_chatbot_qna_1.csv
!wget "https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_1.csv"

--2025-08-09 18:22:16--  https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3995 (3.9K) [text/plain]
Saving to: ‘banking_rag_chatbot_qna_1.csv’


2025-08-09 18:22:16 (24.3 MB/s) - ‘banking_rag_chatbot_qna_1.csv’ saved [3995/3995]



In [6]:
import pandas as pd
banking_rag_chatbot_qna_1 = pd.read_csv("banking_rag_chatbot_qna_1.csv")
banking_rag_chatbot_qna_1.head()

Unnamed: 0,question,context,answer,grouth_truth
0,What is the minimum balance required for a sav...,The minimum average monthly balance (AMB) requ...,The minimum balance required is 10000 for urba...,"Depending on the branch location, the required..."
1,Can I open a fixed deposit online?,"Yes, customers with access to internet banking...","Yes, you can open a fixed deposit online via i...","Yes, fixed deposits can be opened online via i..."
2,How can I block my lost debit card?,"If your debit card is lost or stolen, it is cr...",You should block your lost debit card immediat...,Immediately block your lost debit card using m...
3,What is the interest rate on personal loans?,The bank offers personal loans at interest rat...,Personal loan interest rates range between 10....,Interest rates for personal loans range from 1...
4,Are NRIs eligible for home loans?,Non-Resident Indians (NRIs) are eligible to ap...,"Yes, NRIs can get home loans in India for up t...",NRIs can apply for home loans up to 20 years t...


## Test Data - Using the context and the question, invoke the GPT OSS 120b model to get the response

In [7]:
banking_rag_chatbot_qna_1["gpt_oss_120b_generated_answer"] = banking_rag_chatbot_qna_1.apply(
    lambda row: infer_openai_gpt_oss_120b(row["context"], row["question"]),
    axis=1
)

In [8]:
banking_rag_chatbot_qna_1

Unnamed: 0,question,context,answer,grouth_truth,gpt_oss_120b_generated_answer
0,What is the minimum balance required for a sav...,The minimum average monthly balance (AMB) requ...,The minimum balance required is 10000 for urba...,"Depending on the branch location, the required...",The required minimum average monthly balance f...
1,Can I open a fixed deposit online?,"Yes, customers with access to internet banking...","Yes, you can open a fixed deposit online via i...","Yes, fixed deposits can be opened online via i...",Yes. If you have internet or mobile banking ac...
2,How can I block my lost debit card?,"If your debit card is lost or stolen, it is cr...",You should block your lost debit card immediat...,Immediately block your lost debit card using m...,You can block a lost debit card instantly by:\...
3,What is the interest rate on personal loans?,The bank offers personal loans at interest rat...,Personal loan interest rates range between 10....,Interest rates for personal loans range from 1...,The personal‑loan interest rate varies between...
4,Are NRIs eligible for home loans?,Non-Resident Indians (NRIs) are eligible to ap...,"Yes, NRIs can get home loans in India for up t...",NRIs can apply for home loans up to 20 years t...,Yes. NRIs can obtain home loans in India for u...


## Validation data 

In [9]:
!rm -fr banking_rag_chatbot_qna_2.csv
!wget "https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_2.csv"

--2025-08-09 18:22:29--  https://raw.githubusercontent.com/ravichamarthy/watsonx-explorations/refs/heads/main/banking_rag_chatbot_qna_2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3308 (3.2K) [text/plain]
Saving to: ‘banking_rag_chatbot_qna_2.csv’


2025-08-09 18:22:29 (19.7 MB/s) - ‘banking_rag_chatbot_qna_2.csv’ saved [3308/3308]



In [10]:
banking_rag_chatbot_qna_2 = pd.read_csv("banking_rag_chatbot_qna_2.csv")
banking_rag_chatbot_qna_2.head()

Unnamed: 0,question,context,answer,grouth_truth
0,How can I apply for a credit card online?,"To apply for a credit card online, customers c...",You can apply for a credit card online by subm...,"Visit the bank’s website or app, choose a card..."
1,What are the features of mobile banking?,The mobile banking app offers a wide range of ...,"Mobile banking offers fund transfers, bill pay...","Mobile banking supports transfers, bill paymen..."
2,How do I close my bank account?,"To close your bank account, visit the nearest ...","To close your account, visit the branch with a...",You must visit the branch with ID proof and su...
3,Can I link multiple bank accounts to one UPI ID?,"Yes, UPI apps allow users to link multiple ban...","Yes, you can link multiple accounts to one UPI...","Yes, UPI apps support linking multiple account..."
4,What happens if I miss an EMI payment?,Missing an EMI (Equated Monthly Installment) p...,"If you miss an EMI, it can affect your credit ...","Missing an EMI leads to penalties, interest, a..."


## Validation Data - Using the context and the question, invoke the GPT OSS 120b model to get the response

In [11]:
banking_rag_chatbot_qna_2["gpt_oss_120b_generated_answer"] = banking_rag_chatbot_qna_2.apply(
    lambda row: infer_openai_gpt_oss_120b(row["context"], row["question"]),
    axis=1
)

In [12]:
banking_rag_chatbot_qna_2

Unnamed: 0,question,context,answer,grouth_truth,gpt_oss_120b_generated_answer
0,How can I apply for a credit card online?,"To apply for a credit card online, customers c...",You can apply for a credit card online by subm...,"Visit the bank’s website or app, choose a card...",Visit your bank’s official website or open its...
1,What are the features of mobile banking?,The mobile banking app offers a wide range of ...,"Mobile banking offers fund transfers, bill pay...","Mobile banking supports transfers, bill paymen...",The mobile banking app provides:\n\n- Balance ...
2,How do I close my bank account?,"To close your bank account, visit the nearest ...","To close your account, visit the branch with a...",You must visit the branch with ID proof and su...,"To close your bank account, visit the nearest ..."
3,Can I link multiple bank accounts to one UPI ID?,"Yes, UPI apps allow users to link multiple ban...","Yes, you can link multiple accounts to one UPI...","Yes, UPI apps support linking multiple account...",Yes. With UPI you can link several of your ban...
4,What happens if I miss an EMI payment?,Missing an EMI (Equated Monthly Installment) p...,"If you miss an EMI, it can affect your credit ...","Missing an EMI leads to penalties, interest, a...",Missing an EMI can trigger a late‑payment fee ...


# IBM watsonx.gov SDK evaluation starts

### Configurations

- Define the `GenAIConfiguration`, specifying which fields represent the **question**, the **context**, the **response**, and the **ground truth**.
- Configure the use of **RAG evaluation metrics**, which require the **input question**, **retrieved context**, **generated response**, and the **ground truth answer**.
- *(Note: Only a subset of available metrics is shown here. For the complete list, refer to the [watsonx.governance documentation](https://ibm.github.io/ibm-watsonx-gov/index.html).)*
- Note that the metrics here are computed with string-based lexical (token overlap) methods only, so the scores may not reflect answer quality well. For better results, compute the RAG metrics with the watsonx.gov smaller models or with Granite Guardian.
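To give a feel for what a string-based lexical score measures, here is a minimal sketch of token recall and token precision between two texts. It is an illustration of the general idea only and is not the SDK's implementation; the tokenizer, the function names, and the example answer are assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_recall(reference, candidate):
    """Fraction of reference tokens that also appear in the candidate."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return len(ref & cand) / len(ref) if ref else 0.0

def token_precision(reference, candidate):
    """Fraction of candidate tokens that also appear in the reference."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return len(ref & cand) / len(cand) if cand else 0.0

question = "Can I open a fixed deposit online?"
answer = "Yes, you can open a fixed deposit online."  # made-up answer for the example

print(round(token_recall(question, answer), 4))
print(round(token_precision(question, answer), 4))
```

Scores of this kind reward shared vocabulary rather than shared meaning, which is why a correct but differently worded answer can score low.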

In [14]:
from ibm_watsonx_gov.config import GenAIConfiguration
from ibm_watsonx_gov.metrics import AnswerRelevanceMetric
from ibm_watsonx_gov.metrics import AnswerSimilarityMetric
from ibm_watsonx_gov.metrics import FaithfulnessMetric
from ibm_watsonx_gov.metrics import ContextRelevanceMetric
from ibm_watsonx_gov.metrics import TextGradeLevelMetric
from ibm_watsonx_gov.metrics import TextReadingEaseMetric
from ibm_watsonx_gov.entities.enums import TaskType, MetricGroup
from ibm_watsonx_gov.clients.api_client import APIClient
from ibm_watsonx_gov.entities.credentials import Credentials

config = GenAIConfiguration(
    input_fields=["question"],
    context_fields=["context"],
    output_fields=["gpt_oss_120b_generated_answer"],
    reference_fields=["grouth_truth"]
)

metrics = [
    AnswerRelevanceMetric(),
    AnswerSimilarityMetric(),
    FaithfulnessMetric(),
    ContextRelevanceMetric(),
    TextGradeLevelMetric(),
    TextReadingEaseMetric(),
]
api_client = APIClient(
    credentials=Credentials(
        api_key=IAM_API_KEY,
        service_instance_id=WXG_INSTANCE_ID
    )
)

### Evaluate the Metrics
- Create MetricsEvaluator
- Evaluate the metrics against the data produced from the underlying prompt or Agent

In [15]:
from ibm_watsonx_gov.evaluators import MetricsEvaluator

evaluator = MetricsEvaluator(
    api_client = api_client,
    configuration=config,
)

## Evaluate on Test Data

In [16]:
evaluation_results_1 = evaluator.evaluate(
    data=banking_rag_chatbot_qna_1,
    metrics=metrics
)
evaluation_results_df_1 = evaluation_results_1.to_df()
evaluation_results_df_1

Unnamed: 0,answer_relevance.token_recall,answer_similarity.token_recall,faithfulness.token_k_precision,context_relevance.token_precision,text_grade_level.flesch_kincaid_grade,text_reading_ease.flesch_reading_ease
0,0.75,0.409091,0.56,0.625,15.558571,33.386429
1,0.833333,0.72,0.764706,0.833333,9.560185,50.565833
2,0.625,0.736842,0.566038,0.625,7.576667,62.115
3,0.428571,0.6,0.708333,0.714286,5.888962,68.574827
4,0.833333,0.678571,0.735849,1.0,10.865,50.413333


## Evaluate on Validation Data

In [17]:
evaluation_results_2 = evaluator.evaluate(
    data=banking_rag_chatbot_qna_2,
    metrics=metrics
)
evaluation_results_df_2 = evaluation_results_2.to_df()
evaluation_results_df_2

Unnamed: 0,answer_relevance.token_recall,answer_similarity.token_recall,faithfulness.token_k_precision,context_relevance.token_precision,text_grade_level.flesch_kincaid_grade,text_reading_ease.flesch_reading_ease
0,0.125,0.846154,0.446154,0.75,8.682857,53.590476
1,0.333333,0.541667,0.647059,0.666667,23.352353,-11.804412
2,0.428571,0.652174,0.578125,0.428571,10.995882,46.506471
3,0.7,0.590909,0.588235,0.8,7.55283,72.378805
4,0.285714,0.789474,0.625,0.285714,12.567273,44.145909
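The column names suggest the readability scores are Flesch Reading Ease and Flesch-Kincaid grade level. Assuming the standard formulas are used (the source does not confirm this), the sketch below shows how a negative reading ease can arise. The word, sentence, and syllable counts are hypothetical and were not extracted from the data.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade: approximate US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts: one long sentence with many multi-syllable words.
words, sentences, syllables = 34, 1, 74
ease = flesch_reading_ease(words, sentences, syllables)
grade = flesch_kincaid_grade(words, sentences, syllables)
print(round(ease, 2), round(grade, 2))
```

These counts give values close to row 1 of the validation results (about -11.80 and 23.35). One plausible reading is that a bulleted answer is being scored as a single long sentence, though that is an inference rather than something the output states.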


## Utility method to average the individual record level metrics, and construct the Experiment Run object

In [18]:
def construct_run_metrics(df):
    import pandas as pd
    
    # Define column mapping and grouping
    metric_mapping = {
        'answer_relevance.token_recall': ('answer_relevance', 'answer_quality'),
        'answer_similarity.token_recall': ('answer_similarity', 'answer_quality'),
        'faithfulness.token_k_precision': ('faithfulness', 'answer_quality'),
        'context_relevance.token_precision': ('context_relevance', 'retrieval_quality'),
        'text_grade_level.flesch_kincaid_grade': ('text_grade_level', 'readability'),
        'text_reading_ease.flesch_reading_ease': ('text_reading_ease', 'readability')
    }
    
    # Aggregate and build JSON list
    run_metric_results = []
    for col, (new_name, group) in metric_mapping.items():
        if col in df.columns:
            avg_value = df[col].mean()
            count = df[col].count()
            run_metric_results.append({
                "name": new_name,
                "value": round(avg_value, 4),
                "group": group,
                "count": int(count)
            })

    return run_metric_results


## Test Data Run metrics

In [19]:
run_1_metric_results = construct_run_metrics(evaluation_results_df_1)
# Preview result
import json
print(json.dumps(run_1_metric_results, indent=2))

[
  {
    "name": "answer_relevance",
    "value": 0.694,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "answer_similarity",
    "value": 0.6289,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "faithfulness",
    "value": 0.667,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "context_relevance",
    "value": 0.7595,
    "group": "retrieval_quality",
    "count": 5
  },
  {
    "name": "text_grade_level",
    "value": 9.8899,
    "group": "readability",
    "count": 5
  },
  {
    "name": "text_reading_ease",
    "value": 53.0111,
    "group": "readability",
    "count": 5
  }
]


## Validation Data Run Metrics

In [20]:
run_2_metric_results = construct_run_metrics(evaluation_results_df_2)
# Preview result
import json
print(json.dumps(run_2_metric_results, indent=2))

[
  {
    "name": "answer_relevance",
    "value": 0.3745,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "answer_similarity",
    "value": 0.6841,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "faithfulness",
    "value": 0.5769,
    "group": "answer_quality",
    "count": 5
  },
  {
    "name": "context_relevance",
    "value": 0.5862,
    "group": "retrieval_quality",
    "count": 5
  },
  {
    "name": "text_grade_level",
    "value": 12.6302,
    "group": "readability",
    "count": 5
  },
  {
    "name": "text_reading_ease",
    "value": 40.9634,
    "group": "readability",
    "count": 5
  }
]
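Before logging the runs, the two aggregated results can be compared directly in the notebook. The sketch below computes per-metric deltas between the test and validation runs using a few of the averages printed above. The helper `compare_runs` is a hypothetical name, and only the `name` and `value` keys are used.

```python
def compare_runs(baseline, candidate):
    """Return per-metric rows with both values and the candidate-minus-baseline delta."""
    base = {m["name"]: m["value"] for m in baseline}
    rows = []
    for m in candidate:
        if m["name"] in base:
            rows.append({
                "name": m["name"],
                "test": base[m["name"]],
                "validation": m["value"],
                "delta": round(m["value"] - base[m["name"]], 4),
            })
    return rows

# Averages copied from the two run outputs above.
test_run = [
    {"name": "answer_relevance", "value": 0.694},
    {"name": "answer_similarity", "value": 0.6289},
    {"name": "faithfulness", "value": 0.667},
    {"name": "context_relevance", "value": 0.7595},
]
validation_run = [
    {"name": "answer_relevance", "value": 0.3745},
    {"name": "answer_similarity", "value": 0.6841},
    {"name": "faithfulness", "value": 0.5769},
    {"name": "context_relevance", "value": 0.5862},
]

comparison = compare_runs(test_run, validation_run)
for row in comparison:
    print(f'{row["name"]:<20} {row["test"]:>8} {row["validation"]:>8} {row["delta"]:>8}')
```

In the notebook, `run_1_metric_results` and `run_2_metric_results` could be passed in place of the literal lists. With only five records per run, the deltas are indicative rather than statistically meaningful.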


# Creating AI Experiment asset to capture Metrics Evaluation Runs

In [21]:
from ibm_watsonx_gov.ai_experiments.ai_experiments_client import AIExperimentsClient
from ibm_watsonx_gov.entities.ai_experiment import AIExperimentRun, AIExperiment

# Creating AI Experiment asset
name = "Banking ChatBot AI Experiment - gpt-oss-120b-evals"
description = "Evaluation of a RAG-based chatbot for common banking queries, using manually curated test cases across account services, loans, cards, digital banking, and UPI. Each case includes the user question, retrieved context, model answer, and human-verified ground truth for quality benchmarking."

ai_experiment_client = AIExperimentsClient(api_client=api_client, project_id=project_id)
ai_experiment = AIExperiment(name=name, 
                             description=description,
                             component_type="prompt",
                             component_name="Test prompt")

ai_experiment_asset = ai_experiment_client.create(ai_experiment)
ai_experiment_id = ai_experiment_asset.asset_id

Created AI experiment asset with id 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.



## Create the Experiment Run for the Test Data and associate metrics with the experiment

In [22]:
import uuid
experiment_run_1_details = AIExperimentRun(
                            run_id=str(uuid.uuid4()),
                            run_name="ChatBot Test Data Metrics Evaluation",
                            nodes=[],
                            duration=10
                        )
experiment_run_1_details

AIExperimentRun(run_id='9bc48fb6-d431-4353-8778-fdc3ea17643c', run_name='ChatBot Test Data Metrics Evaluation', created_at='', created_by='', test_data={}, tracked=False, id_deleted=False, attachment_id='', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={})
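The `duration=10` above is a fixed placeholder. If the actual elapsed time of an evaluation is wanted instead, it could be measured around the call, as in the sketch below. The helper `timed` is hypothetical, and treating `duration` as whole seconds is an assumption the source does not state.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would wrap the evaluator call.
result, elapsed = timed(sum, range(1000))
duration = int(round(elapsed))  # assumption: the field expects whole seconds
print(result, duration)
```

In the notebook this could look like `timed(evaluator.evaluate, data=banking_rag_chatbot_qna_1, metrics=metrics)`, with the rounded value passed as `duration`.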

## Associate the Test Data Run metrics with the experiment

In [23]:
ai_experiment_client.update(
    ai_experiment_id=ai_experiment_id,
    experiment_run_details=experiment_run_1_details,
    evaluation_results=run_1_metric_results
)


Storing evaluation result for experiment run 9bc48fb6-d431-4353-8778-fdc3ea17643c of AI experiment 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Creating attachment for asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Successfully created attachment d978a3e9-f25d-48cf-b811-5bbf4bef00b1 for asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3. Time taken: 1.562204122543335.

Updated experiment run details for run ChatBot Test Data Metrics Evaluation of AI experiment 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Updated AI experiment asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.



AIExperiment(container_id='', container_type='', container_name='', name='AI Experiment for Agent', description='', asset_type='', created_at='', owner_id='', asset_id='', creator_id='', component_id='', component_type='prompt', component_name='Test prompt', runs=[AIExperimentRun(run_id='9bc48fb6-d431-4353-8778-fdc3ea17643c', run_name='ChatBot Test Data Metrics Evaluation', created_at='2025-08-09T18:25:42Z', created_by='IBMid-550002SR1C', test_data={'total_rows': 5}, tracked=False, id_deleted=False, attachment_id='d978a3e9-f25d-48cf-b811-5bbf4bef00b1', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={})])

## Create the Experiment Run for the Validation Data and associate metrics with the experiment

In [24]:
import uuid
experiment_run_2_details = AIExperimentRun(
                            run_id=str(uuid.uuid4()),
                            run_name="ChatBot Validation Data Metrics Evaluation",
                            nodes=[],
                            duration=10
                        )

## Associate the Validation Data Run metrics with the experiment

In [25]:
ai_experiment_client.update(
    ai_experiment_id=ai_experiment_id,
    experiment_run_details=experiment_run_2_details,
    evaluation_results=run_2_metric_results
)


Storing evaluation result for experiment run 8f1ae536-e1a3-422e-8ae7-2d4d413d38df of AI experiment 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Creating attachment for asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Successfully created attachment 91d2a479-993c-4533-a960-cb73566c4501 for asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3. Time taken: 1.2365059852600098.

Updated experiment run details for run ChatBot Validation Data Metrics Evaluation of AI experiment 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Updated AI experiment asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.



AIExperiment(container_id='', container_type='', container_name='', name='AI Experiment for Agent', description='', asset_type='', created_at='', owner_id='', asset_id='', creator_id='', component_id='', component_type='prompt', component_name='Test prompt', runs=[AIExperimentRun(run_id='9bc48fb6-d431-4353-8778-fdc3ea17643c', run_name='ChatBot Test Data Metrics Evaluation', created_at='2025-08-09T18:25:42Z', created_by='IBMid-550002SR1C', test_data={'total_rows': 5}, tracked=False, id_deleted=False, attachment_id='d978a3e9-f25d-48cf-b811-5bbf4bef00b1', nodes=[], description='', source_name='', source_url='', duration=10, custom_tags=[], properties={}), AIExperimentRun(run_id='8f1ae536-e1a3-422e-8ae7-2d4d413d38df', run_name='ChatBot Validation Data Metrics Evaluation', created_at='2025-08-09T18:26:03Z', created_by='IBMid-550002SR1C', test_data={'total_rows': 5}, tracked=False, id_deleted=False, attachment_id='91d2a479-993c-4533-a960-cb73566c4501', nodes=[], description='', source_name='

### Check whether the runs are associated or not.

In [26]:
ai_experiment = ai_experiment_client.get(ai_experiment_id)
ai_experiment.to_json()

Retrieved AI experiment asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.



{'container_id': '8486795b-14ed-4275-8256-dd18ce99f0d8',
 'container_type': 'project_id',
 'container_name': 'QuestDiagDemos',
 'name': 'Banking ChatBot AI Experiment - gpt-oss-120b-evals',
 'description': 'Evaluation of a RAG-based chatbot for common banking queries, using manually curated test cases across account services, loans, cards, digital banking, and UPI. Each case includes the user question, retrieved context, model answer, and human-verified ground truth for quality benchmarking.',
 'asset_type': 'ai_experiment',
 'created_at': '2025-08-09T18:25:11Z',
 'owner_id': 'IBMid-550002SR1C',
 'asset_id': '2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3',
 'creator_id': 'IBMid-550002SR1C',
 'component_id': '',
 'component_type': 'prompt',
 'component_name': 'Test prompt',
 'runs': [{'run_id': '9bc48fb6-d431-4353-8778-fdc3ea17643c',
   'run_name': 'ChatBot Test Data Metrics Evaluation',
   'created_at': '2025-08-09T18:25:42Z',
   'created_by': 'IBMid-550002SR1C',
   'test_data': {'total_rows': 

## Create the AI Evaluation asset from the experiment, which creates the watsonx.governance Evaluation Studio asset

In [27]:
from ibm_watsonx_gov.entities.ai_evaluation import EvaluationConfig
from ibm_watsonx_gov.entities.ai_evaluation import AIEvaluationAsset
ai_evaluation = ai_experiment_client.create_ai_evaluation_asset(    
    ai_experiment_ids=[ai_experiment_id],
    ai_evaluation_details=AIEvaluationAsset(
        name="Bank ChatBot Quality Evaluation Experiment - gpt-oss-120b-evals",
        description="Bank ChatBot Quality Evaluation Experiment - gpt-oss-120b-evals",
        evaluation_configuration=EvaluationConfig()
            ))

Retrieved AI experiment asset 2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3.

Created AI Evaluation asset with id a9fdd880-d671-4c06-9a9e-0a3c78f81b5e.


In [28]:
ai_evaluation.to_json()

{'container_id': '8486795b-14ed-4275-8256-dd18ce99f0d8',
 'container_type': 'project',
 'container_name': 'Bank ChatBot Quality Evaluation Experiment - gpt-oss-120b-evals',
 'name': 'Bank ChatBot Quality Evaluation Experiment - gpt-oss-120b-evals',
 'description': 'Bank ChatBot Quality Evaluation Experiment - gpt-oss-120b-evals',
 'asset_type': 'ai_evaluation',
 'created_at': '2025-08-09T18:26:09Z',
 'owner_id': 'IBMid-550002SR1C',
 'asset_id': 'a9fdd880-d671-4c06-9a9e-0a3c78f81b5e',
 'creator_id': 'IBMid-550002SR1C',
 'asset_details': {'task_ids': [],
  'operational_space_id': 'development',
  'input_data_type': 'unstructured_text',
  'evaluation_asset_type': 'ai_experiment'},
 'evaluation_configuration': {'monitors': {'agentic_ai_quality': {'parameters': {'metrics_configuration': {}}}},
  'evaluation_assets': [{'id': '2a5f15f7-978e-41f5-9cc7-cc2f0e3866c3',
    'container_id': '8486795b-14ed-4275-8256-dd18ce99f0d8',
    'container_type': 'project',
    'name': 'Banking ChatBot AI Expe

Use the above href link to navigate to the Evaluation Studio for this Experiment Run.

Author: ravi.chamarthy@in.ibm.com