<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Evaluation/Evaluation%20For%20GenAI.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FEvaluation%2FEvaluation%2520For%2520GenAI.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Evaluation/Evaluation%20For%20GenAI.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Evaluation/Evaluation%20For%20GenAI.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Evaluation For GenAI

In machine learning, evaluation involves computing the predicted outcome and comparing it to the known actual outcome. With generative AI, all responses are predicted values. How do we assess these for their accuracy in designated tasks such as summarization, question answering, code writing, generating dialogue, and other, even custom, tasks? That is the goal of this workflow!

In machine learning, the comparison of predicted values to actual known values comes down to a metric that is computed and then used to judge the ability of the model. This could be a confusion matrix, accuracy score, F1 score, precision, and more for a classification model. It could be MAE, RMSE, MSE, and others for a regression model. But what about text?
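
For reference, the classical metrics named above reduce to a few lines of arithmetic once a known label or value exists for every prediction. The sketch below uses made-up labels and values (they are illustrative only and do not come from this notebook) and only the standard library.

```python
import math

# Hypothetical binary labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Confusion-matrix counts for the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical regression targets and predictions
actual = [3.0, 5.0, 2.0]
predicted = [2.5, 5.0, 4.0]
errors = [p - a for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

print(accuracy, precision, recall, f1)
print(mae, mse, rmse)
```

Each of these needs a single correct answer per row, which is exactly what free-form generated text usually lacks.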

To [evaluate generative AI responses](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview), we first need to decide what we are comparing them to:

- **Compare to another response: Model-Based Pairwise** - compare two responses and pick the better one
    - Use a model as a judge to compare responses based on a metric
    - Compare two models, two versions of a model, two different system instructions, ...
- **Compare to criteria: Model-Based Pointwise** - judge a response against evaluation criteria
    - Use a model as a judge to evaluate responses for a metric on a rating scale
- **Compare to a ground truth reference: Computation-Based** - compute metrics that directly compare response text to reference text
    - Similarity using **Lexicon-based metrics** like Exact Match, BLEU, and ROUGE
    - Classification metrics like F1-score, Accuracy on aggregated responses
    - Embedding-based comparison like the distance between results in an embedding space
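
To make the computation-based family concrete, here is a minimal sketch of three such comparisons: exact match, a unigram-overlap F1 (in the spirit of ROUGE-1), and cosine similarity between two vectors. The helper names and the simple tokenizer are assumptions made for illustration; this is not the evaluation service's implementation, and the service's scores may be computed differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def exact_match(response, reference):
    """1.0 when the two strings are identical after trimming whitespace, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def unigram_f1(response, reference):
    """F1 of the clipped unigram overlap between response and reference."""
    resp, ref = Counter(tokenize(response)), Counter(tokenize(reference))
    overlap = sum((resp & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (e.g. embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

reference = 'Plastic colored dreams, Clicking, building, towers rise, Worlds born piece by piece.'
response = 'Small blocks, endless worlds,\nShips sail, dragons take flight,\nDreams click into place.'

print(exact_match(response, reference))
print(unigram_f1(response, reference))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

The two haiku share only the tokens `worlds` and `dreams`, so the overlap score is low even though both are reasonable answers to the same prompt. That gap is one reason to also consider model-based metrics for creative tasks.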

The following workflow shows how to use evaluations to optimize prompts by rewriting system instructions automatically:

- [Optimize Prompts Using Evaluation Metrics](./Optimize%20Prompts%20Using%20Evaluation%20Metrics.ipynb)

---
## Colab Setup

To run this notebook in Colab, run the cells in this section. Otherwise, skip this section.

The cell below authenticates to GCP (follow the prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of import name, install name, and an optional minimum version. If the import name is not found, or the installed version is older than the minimum, the package is installed quietly for the current user using its install name.

In [30]:
# tuples of (import name, install name, optional min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.78.0'),
    ('pandas', 'pandas')
]

import importlib.util
import importlib.metadata
from packaging.version import Version # assumed available; typically installed alongside Jupyter

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by install (distribution) name and compare as versions, not strings
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user
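
One thing to keep in mind when checking a minimum version: version strings compared with `<` are ordered character by character, which can give the wrong answer when a component gains a digit. The sketch below shows the difference using a hypothetical helper, `parse_version`, that assumes purely numeric dotted versions; for real-world version strings (pre-releases, local versions) `packaging.version.Version` is the more robust choice.

```python
def parse_version(version):
    """Convert a dotted numeric version like '1.78.0' into a tuple of ints.

    Assumes every component is numeric; tags such as '1.0rc1' are not handled.
    """
    return tuple(int(part) for part in version.split('.'))

installed, required = '1.9.0', '1.78.0'

# String comparison: the character '9' sorts after '7', so 1.9.0 looks newer than 1.78.0
string_says_outdated = installed < required

# Numeric comparison: 9 < 78, so 1.9.0 is correctly seen as older
numeric_says_outdated = parse_version(installed) < parse_version(required)

print(string_says_outdated, numeric_says_outdated)
```
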

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, resume running the notebook from the cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

inputs:

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'evaluation'

BUCKET = PROJECT_ID # change to Bucket name if not the same as the Project ID

packages:

In [31]:
# package imports
from IPython.display import Markdown
import pandas as pd

# vertex ai imports
from google.cloud import aiplatform
import vertexai
import vertexai.generative_models # for Gemini Models
import vertexai.evaluation 

In [9]:
aiplatform.__version__

'1.78.0'

clients:

In [10]:
vertexai.init(project = PROJECT_ID, location = REGION)

---
## Gemini Models

Select one (or more) of the [supported Gemini models](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#supported-models) and read more about the characteristics of each [here](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).


### Setup Model

This notebook uses the following models:
- [Gemini 1.5 Flash model with version 002](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-1.5-flash)
- [Gemini 1.5 Pro model with version 002](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-1.5-pro)


In [42]:
gemini_flash = vertexai.generative_models.GenerativeModel("gemini-1.5-flash-002")
gemini_pro = vertexai.generative_models.GenerativeModel("gemini-1.5-pro-002")

### Prompt With Text

In [43]:
Markdown(gemini_flash.generate_content('Write a creative Haiku about Lego bricks.').text)

Plastic colored dreams,
Clicking, building, towers rise,
Worlds born piece by piece. 


In [44]:
Markdown(gemini_pro.generate_content('Write a creative Haiku about Lego bricks.').text)

Colorful plastic
Clicking together they build
Worlds in small hands bloom


### Multiple Responses For Prompt

In [45]:
flash_responses = gemini_flash.generate_content(
    contents = 'Write a creative Haiku about Lego bricks ability to foster imagination.',
    generation_config = vertexai.generative_models.GenerationConfig(
        candidate_count = 8,
        temperature = 2.0
    )
)
Markdown(''.join(['- ' + r.text for r in flash_responses.candidates]))

- Small blocks, endless worlds,
Dragons built from colored dreams,
Mind's eye takes the flight. 
- Small blocks, endless worlds,
Ships sail, dragons take flight,
Dreams click into place.
- Small brick, endless worlds,
Ships sail, dragons take flight now,
Mind's eye builds the dream. 
- Small blocks, worlds unfold,
Pirate ships or castles rise,
Dreams click into place. 
- Small blocks, endless worlds,
Dragons built from tiny squares,
Dreams take plastic form. 
- Small brick, endless worlds,
Ships and castles rise and fall,
Dreams take plastic form.
- Small blocks, worlds unfold,
Pirate ship or dragon's lair,
Mind's eye takes the lead. 
- Small brick, boundless dream,
Worlds rise from clicking plastic,
Mind's eye takes the lead. 


# Evaluation Workflows

## Workflow 1: Pointwise Evaluation With Multiple Prebuilt Metrics

In [74]:
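eval_dataset = pd.DataFrame(
    dict(
        prompt = ['Write a Haiku about Lego bricks.'] * len(flash_responses.candidates),
        # NOTE: this column is named `responses`, not `response`, so it is carried along as an extra column;
        # the `response` column that gets scored is generated from `prompt` by the model passed to evaluate()
        responses = [r.text for r in flash_responses.candidates]
    )
)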

In [93]:
pointwise_task = vertexai.evaluation.EvalTask(
    dataset = eval_dataset,
    metrics = [
        vertexai.evaluation.MetricPromptTemplateExamples.Pointwise.FLUENCY,
        vertexai.evaluation.MetricPromptTemplateExamples.Pointwise.COHERENCE,
    ],
)

In [94]:
pointwise_result = pointwise_task.evaluate(
    model = gemini_pro
)

Generating a total of 8 responses from Gemini model gemini-1.5-pro-002.


100%|██████████| 8/8 [00:00<00:00,  9.24it/s]

All 8 responses are successfully generated from Gemini model gemini-1.5-pro-002.
Multithreaded Batch Inference took: 0.8736460581421852 seconds.
Computing metrics with a total of 16 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 16/16 [00:19<00:00,  1.21s/it]

All 16 metric requests are successfully computed.
Evaluation Took:19.36790792644024 seconds





In [95]:
pointwise_result.summary_metrics

{'row_count': 8,
 'fluency/mean': 5.0,
 'fluency/std': 0.0,
 'coherence/mean': 5.0,
 'coherence/std': 0.0}

In [96]:
pointwise_result.metrics_table

| | prompt | responses | response | fluency/explanation | fluency/score | coherence/explanation | coherence/score |
|---|---|---|---|---|---|---|---|
| 0 | Write a Haiku about Lego bricks. | Small blocks, endless worlds,\nDragons built f... | Colorful plastic\nClick together, worlds are b... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose is to write a Haiku about ... | 5.0 |
| 1 | Write a Haiku about Lego bricks. | Small blocks, endless worlds,\nShips sail, dra... | Colorful plastic\nClick together, worlds are b... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose of the prompt is to genera... | 5.0 |
| 2 | Write a Haiku about Lego bricks. | Small brick, endless worlds,\nShips sail, drag... | Colorful plastic\nClick together, worlds are b... | STEP 1: Assess grammar correctness: The respon... | 5.0 | The AI response provides a haiku that is topic... | 5.0 |
| 3 | Write a Haiku about Lego bricks. | Small blocks, worlds unfold,\nPirate ships or ... | Colorful plastic\nClick together, worlds we bu... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose is to write a Haiku about ... | 5.0 |
| 4 | Write a Haiku about Lego bricks. | Small blocks, endless worlds,\nDragons built f... | Colorful plastic\nClick together, worlds we bu... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose of the prompt is to genera... | 5.0 |
| 5 | Write a Haiku about Lego bricks. | Small brick, endless worlds,\nShips and castle... | Colorful plastic\nClick together, worlds we bu... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose is to write a Haiku about ... | 5.0 |
| 6 | Write a Haiku about Lego bricks. | Small blocks, worlds unfold,\nPirate ship or d... | Colorful bright bricks\nClick together, worlds... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose of the prompt is to genera... | 5.0 |
| 7 | Write a Haiku about Lego bricks. | Small brick, boundless dream,\nWorlds rise fro... | Colorful plastic\nClick together, worlds we bu... | STEP 1: Assess grammar correctness: The respon... | 5.0 | STEP 1: The purpose is to write a Haiku about ... | 5.0 |


## Workflow 2: Pointwise Evaluation With Custom Metric

In [46]:
haiku_quality = vertexai.evaluation.PointwiseMetric(
    metric = 'custom_text_quality',
    metric_prompt_template = vertexai.evaluation.PointwiseMetricPromptTemplate(
        criteria = dict(
            haiku_rules = 'Has three lines.  First and third lines have five syllables.  Second line has seven syllables.',
            imagination = 'The content fosters a spirit of imagination.',
            fun_out_loud = 'The text sounds fun to read and even easy to memorize based on its meter when spoken.'
        ),
        rating_rubric = dict([
            (3, 'The response is exceptional at all criteria'),
            (2, 'The response is exceptional at two criteria'),
            (1, 'The response is exceptional at one criterion'),
            (0, 'The response adheres to all criteria'),
            (-1, 'The response fails to adhere to one or more criteria')
        ])
    )
)

The `input_variables` parameter is empty. Only the `response` column is used for computing this model-based metric.
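
Conceptually, a pointwise metric turns the criteria and rating rubric into instructions for a judge model. The sketch below is not the SDK's actual template: `render_judge_prompt` is a hypothetical helper, and the section layout and shortened criteria text are assumptions chosen only to illustrate how the two dictionaries could be laid out in a prompt.

```python
def render_judge_prompt(criteria, rating_rubric, response):
    """Hypothetical helper: lay out criteria and rubric as a judge prompt."""
    lines = [
        '# Instruction',
        'Rate the response against the criteria using the rating rubric.',
        '',
        '## Criteria',
    ]
    lines += [f'{name}: {description}' for name, description in criteria.items()]
    lines += ['', '## Rating Rubric']
    # List the highest score first so the scale reads top-down
    lines += [f'{score}: {meaning}'
              for score, meaning in sorted(rating_rubric.items(), reverse=True)]
    lines += ['', '## Response', response]
    return '\n'.join(lines)

# Shortened, illustrative criteria and rubric
criteria = dict(
    haiku_rules = 'Has three lines with five, seven, and five syllables.',
    imagination = 'The content fosters a spirit of imagination.',
)
rating_rubric = {
    1: 'The response is exceptional at one or more criteria.',
    0: 'The response adheres to all criteria.',
    -1: 'The response fails one or more criteria.',
}

prompt = render_judge_prompt(criteria, rating_rubric, 'Small blocks, endless worlds')
print(prompt)
```

Because this sketch references no input columns, only the response text is substituted into the prompt, which matches the message above about `input_variables` being empty.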


In [47]:
eval_dataset = pd.DataFrame(
    dict(
        response = [r.text for r in flash_responses.candidates]
    )
)

In [48]:
eval_task = vertexai.evaluation.EvalTask(
    dataset = eval_dataset,
    metrics = [haiku_quality]
)

In [49]:
eval_results = eval_task.evaluate()

Computing metrics with a total of 8 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 8/8 [00:11<00:00,  1.46s/it]

All 8 metric requests are successfully computed.
Evaluation Took:11.713130806572735 seconds





In [50]:
eval_results.metadata

In [51]:
eval_results.summary_metrics

{'row_count': 8,
 'custom_text_quality/mean': 1.6875,
 'custom_text_quality/std': 0.45806269065645216}

In [52]:
eval_results.metrics_table

| | response | custom_text_quality/explanation | custom_text_quality/score |
|---|---|---|---|
| 0 | Small blocks, endless worlds,\nDragons built f... | fun_out_loud: The text has a sing-songy meter ... | 2.0 |
| 1 | Small blocks, endless worlds,\nShips sail, dra... | fun_out_loud: The response is fun to read out ... | 2.0 |
| 2 | Small brick, endless worlds,\nShips sail, drag... | fun_out_loud: The cadence is acceptable and ea... | 1.0 |
| 3 | Small blocks, worlds unfold,\nPirate ships or ... | fun_out_loud: The response is fun to read out ... | 2.0 |
| 4 | Small blocks, endless worlds,\nDragons built f... | fun_out_loud: The response is written in a met... | 1.5 |
| 5 | Small brick, endless worlds,\nShips and castle... | fun_out_loud: The response's meter is not bad.... | 2.0 |
| 6 | Small blocks, worlds unfold,\nPirate ship or d... | fun_out_loud: The response is somewhat sing-so... | 1.0 |
| 7 | Small brick, boundless dream,\nWorlds rise fro... | fun_out_loud: The response provides a poem wit... | 2.0 |
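
The summary metrics appear to be plain aggregates of the per-row scores. As a quick check, the sketch below copies the eight `custom_text_quality/score` values from the table above and recomputes the mean and the sample standard deviation (n - 1 denominator) with the standard library; both agree with the reported `custom_text_quality/mean` and `custom_text_quality/std`.

```python
import statistics

# Per-row scores copied from the metrics table above
scores = [2.0, 2.0, 1.0, 2.0, 1.5, 2.0, 1.0, 2.0]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

print(mean, std)
```

With only eight rows and a judge model that is not deterministic, differences of a fraction of a point between runs should be read with caution.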


## Workflow 3: Listing Available Model-Based Metrics

**References:**
- [Metric prompt templates for model-based evaluation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates)

In [90]:
vertexai.evaluation.MetricPromptTemplateExamples.Pairwise.__dict__.keys()

dict_keys(['__module__', '__doc__', 'FLUENCY', 'COHERENCE', 'SAFETY', 'GROUNDEDNESS', 'INSTRUCTION_FOLLOWING', 'VERBOSITY', 'TEXT_QUALITY', 'SUMMARIZATION_QUALITY', 'QUESTION_ANSWERING_QUALITY', 'MULTI_TURN_CHAT_QUALITY', 'MULTI_TURN_SAFETY_QUALITY', '__dict__', '__weakref__'])

In [91]:
vertexai.evaluation.MetricPromptTemplateExamples.Pointwise.__dict__.keys()

dict_keys(['__module__', '__doc__', 'FLUENCY', 'COHERENCE', 'SAFETY', 'GROUNDEDNESS', 'INSTRUCTION_FOLLOWING', 'VERBOSITY', 'TEXT_QUALITY', 'SUMMARIZATION_QUALITY', 'QUESTION_ANSWERING_QUALITY', 'MULTI_TURN_CHAT_QUALITY', 'MULTI_TURN_SAFETY_QUALITY', '__dict__', '__weakref__'])

In [53]:
vertexai.evaluation.MetricPromptTemplateExamples.list_example_metric_names()

['coherence',
 'fluency',
 'safety',
 'groundedness',
 'instruction_following',
 'verbosity',
 'text_quality',
 'summarization_quality',
 'question_answering_quality',
 'multi_turn_chat_quality',
 'multi_turn_safety',
 'pairwise_coherence',
 'pairwise_fluency',
 'pairwise_safety',
 'pairwise_groundedness',
 'pairwise_instruction_following',
 'pairwise_verbosity',
 'pairwise_text_quality',
 'pairwise_summarization_quality',
 'pairwise_question_answering_quality',
 'pairwise_multi_turn_chat_quality',
 'pairwise_multi_turn_safety']

In [64]:
print(vertexai.evaluation.MetricPromptTemplateExamples.get_prompt_template('pairwise_text_quality'))


# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare the results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing the Text Quality of each model's response, which measures how effectively the text conveys clear, accurate, and engaging information that directly addresses the user's prompt, considering factors like fluency, coherence, relevance, and conciseness.

## Criteria
Coherence: The re

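## Workflow 4: Pairwise Evaluation With Modified Prebuilt Metric

Compare two models and let the evaluation service make the generation calls: for each prompt it generates a response from the candidate model (passed to `evaluate`) and from the baseline model (set on the metric), then a judge model picks the better response.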

In [68]:
pairwise_quality = vertexai.evaluation.PairwiseMetric(
    metric = 'my_text_quality',
    metric_prompt_template = vertexai.evaluation.MetricPromptTemplateExamples.get_prompt_template('pairwise_text_quality'),
    baseline_model = gemini_flash
)
eval_dataset = pd.DataFrame(dict(prompt = ['Write a Haiku about Lego bricks.']))
pairwise_task = vertexai.evaluation.EvalTask(
    dataset = eval_dataset,
    metrics = [pairwise_quality],
)
pairwise_result = pairwise_task.evaluate(
    model = gemini_pro
)

Generating a total of 1 responses from Gemini model gemini-1.5-pro-002.


100%|██████████| 1/1 [00:00<00:00,  1.28it/s]

All 1 responses are successfully generated from Gemini model gemini-1.5-pro-002.
Multithreaded Batch Inference took: 0.7885037232190371 seconds.
Generating a total of 1 responses from Gemini model gemini-1.5-flash-002.



100%|██████████| 1/1 [00:00<00:00,  2.86it/s]

All 1 responses are successfully generated from Gemini model gemini-1.5-flash-002.
Multithreaded Batch Inference took: 0.3570019705221057 seconds.
Computing metrics with a total of 1 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 1/1 [00:02<00:00,  2.17s/it]

All 1 metric requests are successfully computed.
Evaluation Took:2.183914008550346 seconds





In [69]:
pairwise_result.metrics_table

| | prompt | response | baseline_model_response | my_text_quality/explanation | my_text_quality/pairwise_choice |
|---|---|---|---|---|---|
| 0 | Write a Haiku about Lego bricks. | Colorful plastic\nClick together, worlds are b... | Colorful plastic,\nEndless worlds in tiny bric... | Both responses fulfill the prompt but BASELINE... | BASELINE |
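
With a single prompt the pairwise result is a single vote. Over a larger dataset, the per-row choices can be tallied into win rates. The sketch below uses a hypothetical list of choices; `BASELINE` appears in the output above, while the `CANDIDATE` and `TIE` labels and the `win_rates` helper are assumptions made for illustration.

```python
from collections import Counter

def win_rates(choices, labels=('CANDIDATE', 'BASELINE', 'TIE')):
    """Share of rows won by each side; labels with no rows get 0.0."""
    counts = Counter(choices)
    total = len(choices)
    return {label: counts[label] / total for label in labels}

# Hypothetical per-row pairwise choices, for illustration only
choices = ['BASELINE', 'CANDIDATE', 'BASELINE', 'TIE']

rates = win_rates(choices)
print(rates)
```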


## Workflow 5: Evaluation with Multiple Computation-Based Metrics

In [97]:
eval_dataset = pd.DataFrame(
    dict(
        reference = ['Plastic colored dreams, Clicking, building, towers rise, Worlds born piece by piece.'] * len(flash_responses.candidates),
        response = [r.text for r in flash_responses.candidates]
    )
)

In [101]:
eval_task = vertexai.evaluation.EvalTask(
    dataset = eval_dataset,
    metrics = ['bleu', 'exact_match', 'rouge']
)

In [102]:
eval_result = eval_task.evaluate()

Computing metrics with a total of 24 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 24/24 [00:23<00:00,  1.04it/s]

All 24 metric requests are successfully computed.
Evaluation Took:23.085264557041228 seconds





In [103]:
eval_result.summary_metrics

{'row_count': 8,
 'bleu/mean': 0.042436428375,
 'bleu/std': 0.020690220151082194,
 'exact_match/mean': 0.0,
 'exact_match/std': 0.0,
 'rouge/mean': 0.15374033437499998,
 'rouge/std': 0.07371710453463654}

In [104]:
eval_result.metrics_table

| | reference | response | bleu/score | exact_match/score | rouge/score |
|---|---|---|---|---|---|
| 0 | Plastic colored dreams, Clicking, building, to... | Small blocks, endless worlds,\nDragons built f... | 0.084754 | 0.0 | 0.222222 |
| 1 | Plastic colored dreams, Clicking, building, to... | Small blocks, endless worlds,\nShips sail, dra... | 0.032115 | 0.0 | 0.16 |
| 2 | Plastic colored dreams, Clicking, building, to... | Small brick, endless worlds,\nShips sail, drag... | 0.025828 | 0.0 | 0.071429 |
| 3 | Plastic colored dreams, Clicking, building, to... | Small blocks, worlds unfold,\nPirate ships or ... | 0.057514 | 0.0 | 0.24 |
| 4 | Plastic colored dreams, Clicking, building, to... | Small blocks, endless worlds,\nDragons built f... | 0.032342 | 0.0 | 0.16 |
| 5 | Plastic colored dreams, Clicking, building, to... | Small brick, endless worlds,\nShips and castle... | 0.032115 | 0.0 | 0.230769 |
| 6 | Plastic colored dreams, Clicking, building, to... | Small blocks, worlds unfold,\nPirate ship or d... | 0.024427 | 0.0 | 0.071429 |
| 7 | Plastic colored dreams, Clicking, building, to... | Small brick, boundless dream,\nWorlds rise fro... | 0.050395 | 0.0 | 0.074074 |
