# Lab: Governing an Azure OpenAI Generative AI Model in watsonx.governance 2.x


In this lab, you will create a *detached* prompt template asset that references a generative AI model in Azure OpenAI so that you can start governing that model in **watsonx.governance**. You will learn how to run inference against this external model using the **OpenAI** Python SDK. You will then configure OpenScale monitors to evaluate the model and obtain generative AI quality and model health metrics.

**Notes**

- Run this notebook with Runtime 22.2 and a Python 3.10 or later runtime environment (e.g., 3.11 or 3.12). If you are viewing this in Watson Studio and do not see "Python 3.10" or later in the upper right corner of your screen, please update the runtime now.
- This notebook assumes you have **access to an Azure OpenAI account that has the `opeanai-gpt-3.5` model deployed**. If you don't have access to this account, try reserving the `Access to Azure OpenAI GPT 3.5 Model` environment available in IBM's [Techzone](https://techzone.ibm.com/) (as of September 2024).
- At some steps in this notebook, you might need to go to the platform and perform some actions using the UI before continuing with the notebook.

- If users wish to execute this notebook for task types other than summarization, please consult [this](https://github.com/IBM/watson-openscale-samples/blob/main/IBM%20Cloud/WML/notebooks/watsonx/README.md) document for guidance on evaluating prompt templates for the available task types.


## Prerequisites

* Service credentials for IBM Watson OpenScale are required.
* If you are **not** using Watson Studio to run this notebook, you need the ID of the project in which you want to create the prompt template asset.

### Contents

- [Notebook Setup](#settingup)
- [1. Creating the Prompt Template](#ptatsetup)
    - [Prompt template](#detached_prompt)
- [2. Evaluating Prompt Template in OpenScale](#ptatsetup)
    - [Risk evaluations for prompt template asset subscription](#evaluate)
    - [Display the Model Risk metrics](#mrmmetric)
    - [Display the Generative AI Quality metrics](#genaimetrics)
    - [Plot rougel and rougelsum metrics against records](#plotproject)
    - [See factsheets information](#factsheetsproject)

## Setup <a name="settingup"></a>

Run the cell below to install the required packages.

In [None]:
!pip install --upgrade datasets==2.10.0 --no-cache | tail -n 1
!pip install --upgrade evaluate --no-cache | tail -n 1
!pip install --upgrade --extra-index-url https://test.pypi.org/simple/ ibm-aigov-facts-client | tail -n 1
!pip install --upgrade "ibm-watson-openscale>=3.0.4" | tail -n 1
!pip install "ibm-watson-machine-learning"
!pip install --upgrade matplotlib | tail -n 1
!pip install --upgrade pydantic==1.10.11 --no-cache | tail -n 1
!pip install --upgrade sacrebleu --no-cache | tail -n 1
!pip install --upgrade sacremoses --no-cache | tail -n 1
!pip install --upgrade textstat --no-cache | tail -n 1
!pip install --upgrade openai rich azure-identity --no-cache | tail -n 1
# !pip install --upgrade transformers --no-cache | tail -n 1

**Note:** You may need to *restart the kernel* to use the updated packages. You don't need to run the cell above again after restarting.

### Provision services and configure credentials

**ACTION:** Replace the `<USER PREFIX>` placeholder of the `USER_PREFIX` variable with a unique identifier such as your name or initials, using only lowercase letters and digits (e.g.: `USER_PREFIX="jd"`)

In [2]:
import re

# TODO: Fill in the `USER_PREFIX` variable with your name, your initials, or any other unique identifier
USER_PREFIX = "<USER PREFIX>" # lowercase letters and digits only (e.g. "johndoe" or "jd")

# Check that the whole prefix meets the requirements (re.match alone would only test the first character)
if re.fullmatch(r'[a-z0-9]+', USER_PREFIX):
    print("Thank you! Your prefix '{}' will be prepended to the names of all the assets you create using this notebook.".format(USER_PREFIX))
else:
    del USER_PREFIX
    raise ValueError("Please re-enter the prefix in this cell using only lower case a-z and 0-9")
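
If your name contains capital letters, spaces, or punctuation, one option is to normalize it before assigning it to `USER_PREFIX`. The helper below is a minimal sketch of that idea; `make_prefix` is a hypothetical name and is not part of the lab materials.

```python
import re

def make_prefix(name: str) -> str:
    """Lowercase the name and drop every character outside a-z and 0-9."""
    prefix = re.sub(r"[^a-z0-9]", "", name.lower())
    if not prefix:
        raise ValueError("The name must contain at least one letter or digit")
    return prefix

print(make_prefix("John Doe"))
```

You could then set `USER_PREFIX = make_prefix("John Doe")` before running the check above.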

Fill in your platform and Azure credentials:

In [None]:
import os
from rich import print
from IPython.display import display, Markdown

CPD_URL = "<EDIT THIS>"
CPD_USERNAME = "<EDIT THIS>"
CPD_API_KEY = "<EDIT THIS>"

AZURE_OPENAI_ENDPOINT = "<EDIT THIS>"
AZURE_OPENAI_DEPLOYMENT_NAME = "<EDIT THIS>"
AZURE_CLIENT_ID = "<EDIT THIS>"
AZURE_CLIENT_SECRET = "<EDIT THIS>"
AZURE_TENANT_ID = "<EDIT THIS>"

PROJECT_ID = os.environ.get('PROJECT_ID', "<YOUR_PROJECT_ID>")
print(f"Your project id is '{PROJECT_ID}'")

### Function to create the access token

This function generates an IAM access token using the provided credentials. The API calls for creating and scoring prompt template assets utilize the token generated by this function.

In [None]:
import requests
import urllib3, json  # noqa: E401
urllib3.disable_warnings()

def generate_access_token():
    headers={}
    headers["Content-Type"] = "application/json"
    headers["Accept"] = "application/json"
    data = {
        "username":CPD_USERNAME,
        "api_key":CPD_API_KEY
    }
    data = json.dumps(data).encode("utf-8")
    url = CPD_URL + "/icp4d-api/v1/authorize"
    response = requests.post(url=url, data=data, headers=headers,verify=False)
    response.raise_for_status()
    json_data = response.json()
    iam_access_token = json_data['token']
    print("Access token generated successfully!")
    return iam_access_token

iam_access_token = generate_access_token()
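
Access tokens expire, so a long-running session may need to call `generate_access_token()` again. The sketch below assumes the token is a JWT that carries a standard `exp` claim (this is an assumption, so check it against your environment); `token_expiry` and `is_expiring` are hypothetical helper names. The payload is only decoded, not verified.

```python
import base64
import json
import time

def token_expiry(token: str) -> float:
    """Return the `exp` claim (seconds since the epoch) of a JWT, without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return float(payload["exp"])

def is_expiring(token: str, margin_seconds: int = 60) -> bool:
    """Return True when the token expires within `margin_seconds`."""
    return token_expiry(token) - time.time() < margin_seconds
```

With these helpers, a cell could refresh the token only when needed, for example `if is_expiring(iam_access_token): iam_access_token = generate_access_token()`.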

## 1. Creating the Prompt Template

The following cell shows the development of a prompt template used to summarize resumes from job applicants. 

We will test inference on Azure OpenAI and create a detached prompt template in our watsonx project that references the model and prompt.

**Action Required : <u>You will need to copy the prompt in section 2, step 14 of the lab instructions to proceed</u>.**
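
The prompt uses a `{text}` placeholder that this notebook fills in with Python's `str.format`. The short sketch below, using a made-up template and resume, shows two details worth knowing if you customize the prompt: braces inside the substituted text are left untouched, while literal braces in the template itself must be doubled.

```python
# Illustrative template with the same {text} placeholder as the lab prompt
template = "Summarize the resume below.\n--- start of text ---\n{text}\n--- end of text ---"

# Braces inside the substituted value are not interpreted by str.format
resume = "Jane Roe, data engineer. Skills: {Python, SQL}."
prompt = template.format(text=resume)
print(prompt)

# A literal brace in the template itself must be written as {{ or }}
json_template = 'Answer as JSON like {{"summary": "..."}} for: {text}'
print(json_template.format(text="short resume"))
```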

In [6]:
# TODO: Go back to step 2.14 of the lab instructions and copy the prompt shown there into the cell below
# Optional: change the prompt to your liking. Use the {text} placeholder to indicate where the resume text should be filled in
PROMPT_TEMPLATE = """
You will be given a resume. Please summarize the resume in 100 words or less.

--- start of text ---
{text}
--- end of text ---
""".strip()

In [7]:
import pandas as pd
import asyncio
from openai import AsyncAzureOpenAI
from azure.identity import ClientSecretCredential, get_bearer_token_provider

assert not PROMPT_TEMPLATE.startswith('<'), 'Please edit the prompt template according to the lab instructions'

def get_azure_token_provider():
    default_scope = "https://cognitiveservices.azure.com/.default"
    credential = ClientSecretCredential(
        tenant_id=os.environ.get('AZURE_TENANT_ID', AZURE_TENANT_ID),
        client_id=os.environ.get('AZURE_CLIENT_ID', AZURE_CLIENT_ID),
        client_secret=os.environ.get('AZURE_CLIENT_SECRET', AZURE_CLIENT_SECRET)
    )
    return get_bearer_token_provider(credential, default_scope)

async def summarize_resume(text:str, max_tokens:int=200, token_provider=None):
    """
    This function uses the Azure OpenAI API to summarize the text of the resume given.
    Usage: `summary = await summarize_resume('[resume text to summarize]')`
    """
    if token_provider is None:
        token_provider = get_azure_token_provider()
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ.get('AZURE_OPENAI_ENDPOINT', AZURE_OPENAI_ENDPOINT),
        api_version="2024-02-15-preview",
        azure_ad_token_provider=token_provider
    )
    model_response = await client.chat.completions.create(
        model=os.environ.get('AZURE_OPENAI_DEPLOYMENT_NAME', AZURE_OPENAI_DEPLOYMENT_NAME),
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text)}],
        max_tokens=max_tokens
    )
    return model_response.choices[0].message.content

async def summarize_batch(resumes:list) -> list:
    """Summarize all the resumes given"""
    token_provider = get_azure_token_provider()
    summaries = await asyncio.gather(
        *[summarize_resume(resume, token_provider=token_provider) for resume in resumes]
    )
    return summaries
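
`summarize_batch` starts all requests at once with `asyncio.gather`. With larger datasets this may run into rate limits on the model deployment. A common way to cap the number of in-flight requests is an `asyncio.Semaphore`; the sketch below illustrates the pattern with a hypothetical helper named `gather_limited`, which is not part of the lab code.

```python
import asyncio

async def gather_limited(coroutine_fn, items, limit=5):
    """Run `coroutine_fn(item)` for every item, at most `limit` at a time, keeping input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await coroutine_fn(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

In this notebook it could be used as `summaries = await gather_limited(summarize_resume, data['Resume'].values, limit=3)`. Note that `summarize_resume` would then create its own token provider on each call unless you bind one with `functools.partial`.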

### Load the resume data

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/CloudPak-Outcomes/Outcomes-Projects/main/watsonx-governance-l4/data/resume_summarization_test_data.csv").head(10)
print(f"{len(data)} rows of data loaded")
data.head()

### Generate the summaries of the resumes

**Note:** This might take a while to finish running

In [None]:
data['generated_text'] = await summarize_batch(data['Resume'].values)
data.head()

Display the results

In [None]:
# you can run this multiple times to show the results from different row samples
def display_result(row):
    print(f"[bold]Resume:[/bold]\n[red]{row.Resume}[/red]")
    print(f"[bold]AI Generated Summary:[/bold]\n[blue1]{row.generated_text}[/blue1]")
    print(f"[bold]Reference (Labeled) Summary:[/bold]\n[green]{row.Summarization}[/green]")

display_result(data.sample().iloc[0])

### Create the detached prompt template <a name="detached_prompt"></a>

Create a detached prompt template in your project for the summarization task that references the Azure OpenAI model.

In [13]:
from ibm_aigov_facts_client import (
    AIGovFactsClient, CloudPakforDataConfig,
    DetachedPromptTemplate, PromptTemplate
)
from ibm_aigov_facts_client.utils.enums import Task

creds = CloudPakforDataConfig(
    service_url=CPD_URL,
    username=CPD_USERNAME,
    api_key=CPD_API_KEY
)
facts_client = AIGovFactsClient(
    cloud_pak_for_data_configs=creds,
    container_id=PROJECT_ID,
    container_type="project",
    disable_tracing=True
)

In [None]:
detached_information = DetachedPromptTemplate(
    prompt_id=USER_PREFIX+"-detached-aoai-prompt",
    model_id=f"azure/{AZURE_OPENAI_DEPLOYMENT_NAME}",
    model_provider="Azure OpenAI",
    model_name="GPT-3.5-turbo",
    model_url=AZURE_OPENAI_ENDPOINT,
    prompt_url="prompt_url",
    prompt_additional_info={"model_owner": "Microsoft", "model_version": "gpt-3.5-turbo-1106"}
)
prompt_name = f"{USER_PREFIX} - Detached prompt for Azure OpenAI GPT-3.5-turbo"
prompt_description = "A detached prompt for summarization using Azure OpenAI's GPT-3.5-turbo model"

# define parameters for PromptTemplate
prompt_template = PromptTemplate(
    input=PROMPT_TEMPLATE,
    prompt_variables={"text": ""},
)
pta_details = facts_client.assets.create_detached_prompt(
    model_id=f"azure/{AZURE_OPENAI_DEPLOYMENT_NAME}",
    task_id=Task.SUMMARIZATION, # 'summarization' task
    name=prompt_name,
    description=prompt_description,
    prompt_details=prompt_template,
    detached_information=detached_information
)
project_pta_id = pta_details.to_dict()["asset_id"]
print(f"Detached Prompt template ID: '{project_pta_id}'")

In [None]:
factsheets_url = f"{CPD_URL.strip('/')}/wx/prompt-details/{project_pta_id}/factsheet?context=wx&project_id={PROJECT_ID}"
display(Markdown(f"[Click here to navigate to the published factsheet in the project]({factsheets_url})"))

**Action Required: <u>Click the link above to go to the newly published factsheet in your watsonx project</u>**
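
The link is assembled with an f-string. If your project ID or base URL might contain unexpected characters or a trailing slash, a slightly more defensive way to build the same URL is sketched below. The URL pattern is the one used in this notebook; `factsheet_url` is a hypothetical helper name and the sample values are placeholders.

```python
from urllib.parse import quote, urlencode

def factsheet_url(base_url: str, asset_id: str, project_id: str) -> str:
    """Build the factsheet link, trimming a trailing slash and encoding the parameters."""
    query = urlencode({"context": "wx", "project_id": project_id})
    return f"{base_url.rstrip('/')}/wx/prompt-details/{quote(asset_id, safe='')}/factsheet?{query}"

print(factsheet_url("https://cpd.example.com/", "1234-abcd", "5678-efgh"))
```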

NOTE: At this point, you should <u>**go back to the lab instructions and follow the steps in section 3**</u> before continuing with the notebook.

## 2. Evaluating the prompt with Watson OpenScale <a name="ptatsetup"></a>

**NOTE:** <u>Make sure you have started tracking the model in your AI use case before running this part of the notebook (section 3 of the lab instructions).</u> 

In this section, we will evaluate the prompt template we've created using Watson OpenScale. We will create the different monitors for GenAI models and evaluate the generative quality metrics and model health metrics.

Note that we don't go into detail about the different monitors and metrics available in OpenScale for GenAI models; session 1 of the training should have covered them. In any case, you can refer to the [OpenScale documentation](https://dataplatform.cloud.ibm.com/docs/content/wsj/model/wos-monitor-gen-quality.html?context=cpdaas) for more information.

### Configure OpenScale

In [None]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator, CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

authenticator = CloudPakForDataAuthenticator(
    url=CPD_URL,
    username=CPD_USERNAME,
    apikey=CPD_API_KEY,
    disable_ssl_verification=True
)
wos_client = APIClient(
    service_url=CPD_URL,
    authenticator=authenticator,
    service_instance_id=None
)
data_mart_id = wos_client.service_instance_id
print(wos_client.version)

### OpenScale instance mapping with the project

When authenticating against Cloud Pak for Data (CPD), an additional step is needed: mapping the project ID (or space ID) to an OpenScale instance.

In [23]:
from ibm_watson_openscale.base_classes import ApiRequestFailure


try:
    wos_client.wos.add_instance_mapping(
        service_instance_id=data_mart_id,
        project_id=PROJECT_ID
    )
except ApiRequestFailure as arf:
    if arf.response.status_code == 409:
        # Instance mapping already exists. Ignore the error and continue
        pass
    else:
        raise arf

### Setup the prompt template asset in project for evaluation with supported monitor dimensions

Prompt template assets from a project are only supported with the `development` operational space ID. Running the cell below creates a development-type subscription from the prompt template asset created within the project.

The available parameters that can be passed for `execute_prompt_setup` function are:

 * `prompt_template_asset_id` : Id of prompt template asset for which subscription needs to be created.
 * `label_column` :  The name of the column containing the ground truth or actual labels.
 * `project_id` : The GUID of the project.
 * `space_id` : The GUID of the space.
 * `deployment_id` : (optional) The GUID of the deployment.
 * `operational_space_id` : The rank of the environment in which the monitoring is happening. Accepted values are `development`, `pre_production`, `production`.
 * `problem_type` : (optional) The task type to monitor for the given prompt template asset.
 * `classification_type` : The classification type `binary`/`multiclass` applicable only for `classification` problem (task) type.
 * `input_data_type` : The input data type.
 * `supporting_monitors` : Monitor configuration for the subscription to be created.
 * `background_mode` : When `True`, the prompt setup operation will be executed in the background

In [None]:
label_column = "reference_summary"
operational_space_id = "development"
problem_type = "summarization"
input_data_type = "unstructured_text"

monitors = {
    "generative_ai_quality": {
        "parameters": {
            "min_sample_size": 10,
            "metrics_configuration": {                    
            }
        }
    }
}
response = wos_client.wos.execute_prompt_setup(
    prompt_template_asset_id=project_pta_id, 
    project_id=PROJECT_ID,
    label_column=label_column,
    operational_space_id=operational_space_id, 
    problem_type=problem_type,
    input_data_type=input_data_type, 
    supporting_monitors=monitors, 
    background_mode=False
)
result = response.result
result.to_dict()
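
The `min_sample_size` of 10 in the monitor configuration matches the 10 rows loaded earlier in this lab. If you evaluate a different number of rows, you will probably want to adjust it. The sketch below wraps the same dictionary shape in a small function; `build_monitors` is a hypothetical helper name, and no metric-specific settings are assumed beyond the empty `metrics_configuration` used in the cell above.

```python
from typing import Optional

def build_monitors(min_sample_size: int, metrics_configuration: Optional[dict] = None) -> dict:
    """Return a `supporting_monitors` dictionary shaped like the one used in this lab."""
    if min_sample_size < 1:
        raise ValueError("min_sample_size must be a positive integer")
    return {
        "generative_ai_quality": {
            "parameters": {
                "min_sample_size": min_sample_size,
                # An empty configuration mirrors the lab's own setup
                "metrics_configuration": dict(metrics_configuration or {}),
            }
        }
    }

monitors_example = build_monitors(min_sample_size=10)
print(monitors_example)
```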

Use the cell below to read the prompt setup task and check its status.

In [None]:
# With an older version of the OpenScale client, use wos_client.monitor_instances.mrm.get_prompt_setup instead
response = wos_client.wos.get_prompt_setup(
    prompt_template_asset_id=project_pta_id,
    project_id=PROJECT_ID
)

result = response.result
result_json = result.to_dict()

if result_json["status"]["state"] == "FINISHED":
    print("Finished prompt setup. The response is {}".format(result_json))
else:
    print("Prompt setup failed. The response is {}".format(result_json))
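
If you run the setup with `background_mode=True`, the status may not be `FINISHED` on the first read. A generic polling loop such as the sketch below can wait for completion. `wait_for_state` is a hypothetical helper; only the `FINISHED` state appears in this notebook, so the failure state names used as defaults here are assumptions.

```python
import time

def wait_for_state(fetch_state, done_states=("FINISHED",), failed_states=("FAILED", "ERROR"),
                   timeout_seconds=300, poll_seconds=5, sleep=time.sleep):
    """Poll `fetch_state()` until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = str(fetch_state()).upper()
        if state in done_states:
            return state
        if state in failed_states:  # assumed names for unsuccessful terminal states
            raise RuntimeError(f"Setup ended in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state} after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

Here `fetch_state` would be a small function that calls `wos_client.wos.get_prompt_setup(...)` as in the cell above and returns `result.to_dict()["status"]["state"]`.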

### Read required IDs from prompt setup response

In [28]:
subscription_id = result_json["subscription_id"]
mrm_monitor_instance_id = result_json["mrm_monitor_instance_id"]

### Show all the monitor instances of the development subscription
The following cell lists the monitors present in the development subscription along with their statuses and other details. Please wait for all the monitors to be in the active state before proceeding.

In [None]:
wos_client.monitor_instances.show(target_target_id=subscription_id)

### Risk evaluations for PTA subscription <a name="evaluate"></a>

### Evaluate the prompt template subscription

The following cell will assess the test data with the subscription of the prompt template asset and produce relevant measurements for the configured monitor.
<!-- **Note:** If you are running this notebook from Watson studio, you may first need to upload your test data to studio and run code snippet to download feedback data file from project to local directory -->

The **Risk Assessment** monitor will evaluate your GenAI model for Text Quality and Model Health. You can read more about the available evaluation metrics for GenAI models [by clicking the link here](https://dataplatform.cloud.ibm.com/docs/content/wsj/model/wos-monitors-overview.html?context=wx).

> **Note:** The risk assessment of a development-type subscription requires an evaluation dataset. The risk evaluation function takes the path to the evaluation dataset as a parameter and uses it to evaluate the configured metric dimensions. If the feature columns in the subscription differ from the column names in the uploaded CSV, you can supply a mapping JSON file that associates the CSV column names with the feature column names in the subscription.

In [30]:
llm_data = data.copy()
llm_data = llm_data[['Resume', 'Summarization', 'generated_text']].rename(columns={"Resume":"text", 'Summarization': 'reference_summary'})
llm_data.to_csv("test_data.csv", index=False)
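
Before uploading the file, it can help to confirm that the CSV header contains the columns this lab relies on: the `text` prompt variable, the `reference_summary` label column, and `generated_text` for the model output. The check below is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper name.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "reference_summary", "generated_text"}

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header, sorted."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(required) - set(header))

# In-memory example that lacks the model output column
sample = io.StringIO("text,reference_summary\nsome resume,some summary\n")
print(missing_columns(sample))  # ['generated_text']
```

For the file written above, you could run `with open("test_data.csv", newline="") as f: print(missing_columns(f))` and expect an empty list.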

In [None]:
test_data_set_name = "data"
test_data_path = "test_data.csv"
content_type = "multipart/form-data"
body = {}
response  = wos_client.monitor_instances.mrm.evaluate_risk(
    monitor_instance_id=mrm_monitor_instance_id,
    test_data_set_name=test_data_set_name, 
    test_data_path=test_data_path,
    content_type=content_type,
    body=body,
    project_id=PROJECT_ID,
    includes_model_output=True,
    background_mode=False
)

### Read the risk evaluation response

After finishing the risk evaluation, the evaluation results should be available for review:

In [None]:
response  = wos_client.monitor_instances.mrm.get_risk_evaluation(mrm_monitor_instance_id, project_id=PROJECT_ID)
response.result.to_dict()

### Display the Model Risk metrics <a name="mrmmetric"></a>

Having calculated the measurements for the Foundation Model subscription, the MRM metrics generated for this subscription should now be available for review:

In [None]:
wos_client.monitor_instances.show_metrics(monitor_instance_id=mrm_monitor_instance_id, project_id=PROJECT_ID)

### Display the Generative AI Quality metrics <a name="genaimetrics"></a>

[Read the documentation here if you want to know more about each metric](https://dataplatform.cloud.ibm.com/docs/content/wsj/model/wos-monitor-gen-quality.html?context=wx)

In [None]:
# Get the ID of the generative AI quality monitor
monitor_definition_id = "generative_ai_quality"
result = wos_client.monitor_instances.list(
    data_mart_id=data_mart_id,
    monitor_definition_id=monitor_definition_id,
    target_target_id=subscription_id,
    project_id=PROJECT_ID
).result
result_json = result.to_dict()
genaiquality_monitor_id = result_json["monitor_instances"][0]["metadata"]["id"]
genaiquality_monitor_id
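
Indexing `["monitor_instances"][0]` raises an unhelpful `IndexError` when no monitor instance exists yet. A slightly more defensive lookup is sketched below; `first_monitor_instance_id` is a hypothetical helper, and the sample dictionary only imitates the shape read in the cell above, with a placeholder ID.

```python
def first_monitor_instance_id(listing: dict) -> str:
    """Return the ID of the first monitor instance in a listing dictionary."""
    instances = listing.get("monitor_instances") or []
    if not instances:
        raise LookupError("No monitor instance found; check that the prompt setup has finished")
    return instances[0]["metadata"]["id"]

# Placeholder data shaped like the response used in this notebook
sample_listing = {"monitor_instances": [{"metadata": {"id": "example-monitor-id"}}]}
print(first_monitor_instance_id(sample_listing))
```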

Displaying the GenAIQ monitor metrics generated through the risk evaluation.

In [None]:
wos_client.monitor_instances.show_metrics(monitor_instance_id=genaiquality_monitor_id, project_id=PROJECT_ID)

### Display record level metrics for Generative AI Quality 

Read the dataset id for generative ai quality dataset

In [None]:
result = wos_client.data_sets.list(
    target_target_id=subscription_id,
    target_target_type="subscription",
    type="gen_ai_quality_metrics"
).result

genaiq_dataset_id = result.data_sets[0].metadata.id
genaiq_dataset_id

Displaying record level metrics for generative ai quality – there should be one record per row of data evaluated (10 total)

In [None]:
wos_client.data_sets.show_records(data_set_id=genaiq_dataset_id)

### Plot rougel and rougelsum metrics against records <a name="plotproject"></a>

In [None]:
import matplotlib.pyplot as plt

result = wos_client.data_sets.get_list_of_records(data_set_id = genaiq_dataset_id).result
x = []
y_rougel = []
y_rougelsum = []
for each in result["records"]:
    x.append(each["metadata"]["id"][-5:]) # Reading only last 5 characters to fit in the display
    y_rougel.append(each["entity"]["values"]["rougel"])
    y_rougelsum.append(each["entity"]["values"]["rougelsum"])

plt.scatter(x, y_rougel, marker='o')

# Adding labels and title
plt.xlabel("X-axis - Record id (last 5 characters)")
plt.ylabel("Y-axis - ROUGEL")
plt.title("rougel vs record id")

# Display the graph
plt.show()

Plot rougelsum metrics against records

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x, y_rougelsum, marker="o")

# Adding labels and title
plt.xlabel("X-axis - Record ID (last 5 characters)")
plt.ylabel("Y-axis - ROUGELSUM")
plt.title("rougelsum vs record ID")

# Display the graph
plt.show()
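
Alongside the scatter plots, a few summary statistics can make it easier to compare runs. The sketch below uses the standard `statistics` module; `summarize_scores` is a hypothetical helper and the example values are made up for illustration.

```python
import statistics

def summarize_scores(scores):
    """Return the minimum, mean, median, and maximum of a list of metric values."""
    values = [float(score) for score in scores]
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }

example_scores = [0.25, 0.5, 0.75]  # made-up values, not real evaluation results
print(summarize_scores(example_scores))
```

In this notebook you could call `summarize_scores(y_rougel)` and `summarize_scores(y_rougelsum)` on the lists built for the plots.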

### Navigate to see the published facts in the project <a name="factsheetsproject"></a>

In [None]:
factsheets_url = f"{CPD_URL.strip('/')}/wx/prompt-details/{project_pta_id}/factsheet?context=wx&project_id={PROJECT_ID}"
display(Markdown(f"[Click here to navigate to the published facts in your watsonx project]({factsheets_url})"))

**Action Required: <u>Click the link above to visualize the genAI metrics in your factsheet</u>**

### Note: Production Monitoring

Recall from session 1 that you set up production monitors for a watsonx.ai model for a RAG use case and manually logged the payload data to OpenScale's datamart, which is a set of tables in the Db2 database connected to the monitoring service. The same process applies to external models, but with an additional caveat:

- For watsonx.ai models deployed in the same environment as your watsonx.governance services, that data is automatically written to the datamart every time you score the model, without any further effort or code required. For third-party LLMs (or watsonx.ai LLMs hosted in other environments), that data must be written to the datamart using API calls.


Thus, in a production environment, the prompts that you send to the model for inference and the model's response need to be uploaded to OpenScale to continuously trigger evaluations and keep the metrics up-to-date. You can do this by, for example, wrapping the calls to the LLM such that they send payload data to OpenScale or by setting up a separate pipeline that does it in batch (the details will depend on your specific use case). **The image below provides an illustration of how this would work:**

![governance-genai-models-drawio](https://dsws-public-data.s3.eu-de.cloud-object-storage.appdomain.cloud/governance-genai-models-drawio.png)

- You can refer to your session 1 lab notes if you are interested in setting up production monitoring for GenAI models. Additionally, [this OpenScale sample notebook](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/openscale-apis.html?context=cpdaas) also provides a good reference specific to detached prompt templates.
- In the next lab of this session, we will go into more detail about setting up production monitoring.
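
To make the wrapping approach described in this section more concrete, here is a minimal sketch of a wrapper that records each prompt and response and hands them over in batches. It deliberately does not call any OpenScale API: `PayloadLogger` is a hypothetical class, `send_batch` stands for whatever upload function your pipeline provides, and the `response_time_ms` field is an illustrative assumption.

```python
import time
from typing import Callable, Dict, List

class PayloadLogger:
    """Wrap an LLM call and buffer request/response records for batch upload."""

    def __init__(self, generate: Callable[[str], str],
                 send_batch: Callable[[List[Dict]], None], batch_size: int = 10):
        self.generate = generate
        self.send_batch = send_batch  # caller-supplied upload function (assumption)
        self.batch_size = batch_size
        self.buffer: List[Dict] = []

    def __call__(self, text: str) -> str:
        started = time.perf_counter()
        output = self.generate(text)
        self.buffer.append({
            "text": text,
            "generated_text": output,
            "response_time_ms": round((time.perf_counter() - started) * 1000),
        })
        # Hand over a full batch as soon as it is available
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return output

    def flush(self) -> None:
        """Send buffered records, if any, and start a new buffer."""
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []
```

Calling `flush()` at shutdown ensures that a partially filled batch is not lost.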

## Congratulations!

You have finished the first hands-on lab for this session. Please continue to the next lab and don't forget to share your feedback with the lab instructors.