# Model Evaluations (Week 4)

+ Unhappy: 👎
+ Anxious: 😬
+ Curious: 🤔
+ Happy: 👍

## Objectives

For this week's activities, we must do the following:

- [ ] Set up an evaluation pipeline to compare Gemini, Gemma, and/or a tuned model.
- [ ] Export evaluation to a "table" (BQ?).
- [ ] Set up a rapid evaluation pipeline to see the specific performance of a model.

Nice to haves:

- [ ] Limit context passed to Gemma model based upon token count
- [ ] Train Gemma model on Guanaco dataset
- [ ] Upgrade ALL the things to Genkit
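
For the first nice-to-have, here is a minimal sketch of trimming context to a token budget. It assumes whitespace-separated words are a good enough stand-in for tokens and that the most recent text is the part worth keeping; neither assumption has been checked against Gemma's real tokenizer, and `truncate_to_token_budget` is a hypothetical helper name.

```python
def truncate_to_token_budget(text: str, max_tokens: int) -> str:
    """Keep only the last max_tokens whitespace-separated tokens of text."""
    if max_tokens <= 0:
        return ""
    tokens = text.split()  # assumption: words approximate model tokens
    return " ".join(tokens[-max_tokens:])

context = "user: hi model: hello user: I want to see ruins in Greece"
trimmed = truncate_to_token_budget(context, max_tokens=7)
```

A real tokenizer would likely count more tokens than this word split, so the budget should probably be set conservatively.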

Sources:

+ https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-quickstart

## Step 0. Install and import libraries

In [64]:
%%writefile -a requirements.txt
google-cloud-aiplatform[evaluation]
google-cloud-bigquery
bigframes
pandas-io
pandas-gbq

Appending to requirements.txt


In [65]:
!pip install -qr requirements.txt

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-api-python-client 1.8.0 requires google-api-core<2dev,>=1.13.0, but you have google-api-core 2.22.0 which is incompatible.

In [66]:
import pandas as pd
from pandas.io import gbq

import bigframes.pandas as bpd
from google.cloud import bigquery

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import EvalTask, Rouge, PointwiseMetric, PointwiseMetricPromptTemplate, MetricPromptTemplateExamples

In [18]:
PROJECT_ID = !gcloud config get-value project
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-west1"
bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = LOCATION

## Step 1. Learn about evaluation metrics

In [5]:
fluency_text = """
Sentences flow smoothly and are easy to read, avoiding awkward
phrasing or run-on sentences. Ideas and sentences connect
logically, using transitions effectively where needed.
"""

consistency_text = """
Text remains consistent across sentences. If the user asks
about ruins in Rome, the answer describes Roman ruins or
ancient sites in Italy.
"""

evals_quickstart = PointwiseMetric(
    metric="evals_quickstart",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": fluency_text,
            "consistency": consistency_text,
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria",
            "-1": "The response falls short on both criteria",
        },
    ),
)

The `input_variables` parameter is empty. Only the `response` column is used for computing this model-based metric.


In [6]:
responses = [
    "Greece is an exciting place to go! Do you want to see ruins or experience the culture?",
    "Greece is an exciting place to go",
    "There are many places that you could go",
]
eval_dataset = pd.DataFrame({
    "response" : responses,
})

In [9]:
experiment_name = "evalsquickstart"
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[evals_quickstart],
    experiment=experiment_name
)

pointwise_result = eval_task.evaluate()

Associating projects/1025771077852/locations/us-central1/metadataStores/default/contexts/evalsquickstart-b6ee2454-289e-4162-8c07-3472f0892d17 to Experiment: evalsquickstart


Computing metrics with a total of 3 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 3/3 [00:10<00:00,  3.39s/it]

All 3 metric requests are successfully computed.
Evaluation Took:10.312111223000102 seconds





In [12]:
results = pointwise_result.metrics_table

In [15]:
results.loc[0]["evals_quickstart/explanation"]

'Consistency: The response is consistent in focusing on Greece as a travel destination and offers options related to ruins and culture, fitting with the theme of travel or tourism.  Fluency: The sentences are fluent and easy to read. It uses a question effectively to engage the user and encourage further interaction.  Overall, the response is well-written and easy to understand.'

## Step 2. Identify metric set

I want to test the models against the following prompt types and metrics:

1. [Open Domain Question Answering](https://www.promptingguide.ai/prompts/question-answering/open-domain)

   + `In this conversation between a human and the AI, the AI is helpful and friendly, and when it does not know the answer it says \"I don’t know\".\n`
   
2. [Closed Domain Question Answering](https://www.promptingguide.ai/prompts/question-answering/closed-domain)

   + `The user wants to travel to Greece to see ancient ruins. The AI is a helpful travel guide. Please provide 3 to 5 destination suggestions.`

3. [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)

4. [Groundedness](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates#pointwise_groundedness)

5. [Coherence](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates#pointwise_coherence)
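
To make the ROUGE item above concrete, here is a rough sketch of what ROUGE-1 measures: unigram overlap between a response and a golden reference. It assumes lowercase whitespace tokenization with no stemming, so scores from a real ROUGE implementation (including the one behind `Rouge` in Vertex AI) may differ. `rouge1_scores` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def rouge1_scores(response: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 on lowercase whitespace tokens."""
    response_counts = Counter(response.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((response_counts & reference_counts).values())
    precision = overlap / max(sum(response_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1_scores(
    "Visit the Acropolis in Athens",
    "The Acropolis in Athens is a must",
)
```

Four of the five response words appear in the reference, so precision is high while recall is lower. This is why ROUGE needs a golden `reference` for every row.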

In [19]:
# My set of metrics
open_domain = '''
In this conversation between a human and the AI, the AI is helpful and friendly, 
and when it does not know the answer it says \"I don’t know\".\n
'''

closed_domain = '''
The user wants to travel to a country to see historical landmarks and archaeological sites.
The AI is a helpful travel guide. Please provide 3 to 5 destination suggestions.
'''

prompteng_metrics = PointwiseMetric(
    metric="prompteng_metrics",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "open_domain": open_domain,
            "closed_domain": closed_domain,
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0.5": "The response performs well on one criterion but not the other.",
            "0": "The response falls short on both criteria.",
        },
    ),
)

# ROUGE requires a set of goldens: the dataset must have a `reference` column!
rouge = Rouge(rouge_type="rouge1")

# ROUGE expects inputs to look something like this, with one golden answer in
# `reference` per prompt (the goldens are not written yet). The `response`
# column is generated by the model under evaluation.
# eval_dataset = pd.DataFrame({
#     "prompt": [
#         "I want to see the ancient ruins in Greece.",
#         "Help me plan my trip to Japan.",
#         "I'm going to the Yucatan peninsula in Mexico. What is there to see?",
#         "What are the most interesting places to go in Egypt?",
#         "I want to visit historical sites in England and Scotland.",
#     ],
#     "reference": [
#     ],
# })

metrics = [
    prompteng_metrics,
    rouge,
    MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
    MetricPromptTemplateExamples.Pointwise.COHERENCE,
]

The `input_variables` parameter is empty. Only the `response` column is used for computing this model-based metric.


In [20]:
eval_dataset = pd.DataFrame({
    "prompt": [
        "I want to see the ancient ruins in Greece.",
        "Help me plan my trip to Japan.",
        "I'm going to the Yucatan peninsula in Mexico. What is there to see?",
        "What are the most interesting places to go in Egypt?",
        "I want to visit historical sites in England and Scotland.",
    ],
})

### Sources

+ [`EvalTask` reference doc](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.evaluation.EvalTask)
+ THE MOTHER LODE: [Metric prompt templates](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates)
+ ANOTHER MOTHER LODE: [google-research repository](https://github.com/google-research/google-research/tree/master)

## Step 3. Generate baseline from OOTB Gemini

In [21]:
# Generate a baseline from OOTB Gemini 1.5 Flash
candidate_model = GenerativeModel("gemini-1.5-flash-001")
pointwise_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
)
pointwise_result = pointwise_eval_task.evaluate(
    model=candidate_model,
)

Generating a total of 5 responses from Gemini model gemini-1.5-flash-001.


100%|██████████| 5/5 [00:04<00:00,  1.25it/s]

All 5 responses are successfully generated from Gemini model gemini-1.5-flash-001.
Multithreaded Batch Inference took: 4.017185878999953 seconds.
Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 20/20 [01:21<00:00,  4.06s/it]

All 20 metric requests are successfully computed.
Evaluation Took:81.25386331500022 seconds





In [22]:
results = pointwise_result.metrics_table

In [23]:
print(results)

                                                                prompt  \
0                           I want to see the ancient ruins in Greece.   
1                                       Help me plan my trip to Japan.   
2  I'm going to the Yucatan peninsula in Mexico. What is there to see?   
3                 What are the most interesting places to go in Egypt?   
4            I want to visit historical sites in England and Scotland.   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

In [24]:
golden_responses = results[['prompt','response']]

In [25]:
pd.set_option('display.max_colwidth', None)
golden_eval_dataset = golden_responses.rename(columns={"response": "reference"})
golden_eval_dataset.head()

                                                                prompt  \
0                           I want to see the ancient ruins in Greece.   
1                                       Help me plan my trip to Japan.   
2  I'm going to the Yucatan peninsula in Mexico. What is there to see?   
3                 What are the most interesting places to go in Egypt?   
4            I want to visit historical sites in England and Scotland.   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             

## Step 4. Evaluate tuned Gemini model 

In [11]:
tuned_model_endpoint = "1926929312049528832"
tuned_model_name = f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{tuned_model_endpoint}"
tuned_gemini_model = GenerativeModel(tuned_model_name)
tuned_gemini_task = EvalTask(
    dataset=golden_eval_dataset,
    metrics=metrics,
)
tuned_gemini_result = tuned_gemini_task.evaluate(
    model=tuned_gemini_model,
)

Generating a total of 5 responses from Gemini model 1926929312049528832.


100%|██████████| 5/5 [00:06<00:00,  1.37s/it]

All 5 responses are successfully generated from Gemini model 1926929312049528832.
Multithreaded Batch Inference took: 6.8616636139998946 seconds.





Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 20/20 [01:22<00:00,  4.15s/it]

All 20 metric requests are successfully computed.
Evaluation Took:82.99699030299996 seconds
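
With a baseline run and a tuned run in hand, the two can be compared metric by metric. The sketch below assumes each evaluation result exposes a `summary_metrics` dict with keys such as `coherence/mean`. The key names and numbers here are hypothetical placeholders, not results from this notebook, and `compare_summaries` is a made-up helper.

```python
def compare_summaries(baseline: dict, candidate: dict, suffix: str = "/mean") -> dict:
    """Return candidate-minus-baseline deltas for metrics present in both summaries."""
    deltas = {}
    for key in sorted(baseline.keys() & candidate.keys()):
        if key.endswith(suffix):
            deltas[key] = candidate[key] - baseline[key]
    return deltas

# Hypothetical numbers for illustration only.
baseline_summary = {"row_count": 5, "coherence/mean": 4.0, "groundedness/mean": 1.0}
tuned_summary = {"row_count": 5, "coherence/mean": 3.5, "groundedness/mean": 1.0}
deltas = compare_summaries(baseline_summary, tuned_summary)
```

In the notebook this would be called as `compare_summaries(pointwise_result.summary_metrics, tuned_gemini_result.summary_metrics)`, where a negative delta means the tuned model scored lower than the baseline.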





In [12]:
tuned_gemini_result.metrics_table.head()

                                                                prompt  \
0                           I want to see the ancient ruins in Greece.   
1                                       Help me plan my trip to Japan.   
2  I'm going to the Yucatan peninsula in Mexico. What is there to see?   
3                 What are the most interesting places to go in Egypt?   
4            I want to visit historical sites in England and Scotland.   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             

In [55]:
# See if I can get a response from my tuned model :/
tuned_gemini_model.generate_content("I want to see ruins in ancient Greece")

candidates {
  content {
    role: "model"
    parts {
      text: "Greece has many sites of historical and cultural significance, including ancient ruins from the Classical period. The Acropolis of Athens is a prominent example and includes structures such as the Parthenon and Erechtheion. Other significant sites include the ancient cities of Delphi, Olympia, and Mycenae, which feature ruins of temples, theaters, and other important buildings.\n\nIf you\'re interested in visiting ancient ruins in Greece, I recommend researching specific sites that are of interest to you and planning a trip to explore those locations.\n\n"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.8702068510509673
}
usage_metadata {
  prompt_token_count: 8
  candidates_token_count: 105
  total_token_count: 113
}

+ 👎👎 The first attempt to generate responses from the tuned model fails because
  the model returns nothing. I've seen the same behavior from the tuned Gemini
  model in the app: you basically have to send the first user input twice
  before it responds.
  
  - tl;dr: I'll need to retune this model so that it generates actual responses the first time :/
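
Until the model is retuned, a small retry wrapper could paper over the empty first response. This is only a sketch: `generate_with_retry` is a hypothetical helper, and `fake_generate` stands in for the tuned model by returning an empty string on its first call.

```python
from typing import Callable

def generate_with_retry(generate: Callable[[str], str], prompt: str, max_attempts: int = 2) -> str:
    """Call generate(prompt) until it returns non-empty text or attempts run out."""
    for _ in range(max_attempts):
        text = generate(prompt)
        if text and text.strip():
            return text
    raise RuntimeError(f"No response after {max_attempts} attempts")

# Stand-in for the tuned model: empty on the first call, an answer on the second.
calls = []

def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return "" if len(calls) == 1 else "Delphi, Olympia, and Mycenae."

answer = generate_with_retry(fake_generate, "I want to see ruins in ancient Greece")
```

With the SDK, `generate` could be something like `lambda p: tuned_gemini_model.generate_content(p).text`. I'm assuming the empty case can be detected this way; if `.text` raises when there is no content, the wrapper would need to catch that as well.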

## Step 5. Evaluate Gemma model

+ 👎👎  The code tends to hang on this task, which freezes the entire notebook.
  - According to the cell output, getting responses from the model isn't the problem; the run
    hangs while computing the metrics.
  - I think I'll need to move this code into a process that supports long-running operations.
+ It looks like evaluation pipeline jobs don't support the same set of metrics as pointwise
  and pairwise evaluations. They only provide exact match, ROUGE-L, and BLEU.
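
As a stopgap before moving the evaluation into a proper long-running process, the call could be wrapped so the notebook stops waiting after a deadline. A minimal sketch, assuming a daemon thread is acceptable: it does not cancel the underlying work, it only returns control to the notebook. `run_with_timeout` is a hypothetical helper and the lambdas stand in for the evaluation call.

```python
import threading
import time

def run_with_timeout(fn, timeout_s: float):
    """Run fn() in a daemon thread and stop waiting after timeout_s seconds."""
    outcome = {}

    def target():
        try:
            outcome["value"] = fn()
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The worker keeps running in the background; we only stop waiting for it.
        raise TimeoutError(f"No result after {timeout_s} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]

# Stand-ins for an evaluation call: one returns quickly, one "hangs".
quick_result = run_with_timeout(lambda: "done", timeout_s=1.0)

try:
    run_with_timeout(lambda: time.sleep(0.5), timeout_s=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
```

In the notebook, `fn` would be something like `lambda: gemma_eval_task.evaluate(model=...)` with a timeout of several minutes.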

In [13]:
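gemma_model_endpoint = "3122353538139684864"
# Use the Gemma endpoint ID here. Note: the output recorded below came from a run
# that pointed at the tuned Gemini endpoint (1926929312049528832).
gemma_model_name = f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{gemma_model_endpoint}"
# Assumption: the Gemma endpoint can be called through GenerativeModel (not verified here).
gemma_model = GenerativeModel(gemma_model_name)
gemma_eval_task = EvalTask(
    dataset=golden_eval_dataset,
    metrics=metrics,
)
gemma_result = gemma_eval_task.evaluate(
    model=gemma_model,
)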

Generating a total of 5 responses from Gemini model 1926929312049528832.


100%|██████████| 5/5 [00:06<00:00,  1.28s/it]

All 5 responses are successfully generated from Gemini model 1926929312049528832.
Multithreaded Batch Inference took: 6.431891494999945 seconds.
Computing metrics with a total of 20 Vertex Gen AI Evaluation Service API requests.



100%|██████████| 20/20 [07:53<00:00,  3.96s/it]

KeyboardInterrupt: 

In [14]:
gemma_result.metrics_table.head()

NameError: name 'gemma_result' is not defined

### Sources

+ [Vertex Evaluation Pipeline](https://cloud.google.com/vertex-ai/generative-ai/docs/models/computation-based-eval-pipeline#supported_models)


## Step 6. Write goldens/groundtruth to BQ

+ I want to store the best answers from the OOTB Gemini model as the "groundtruth"
  or "goldens" set of query/responses for my models. I've decided to use BigQuery
  to store this set. 
+ The documentation for BigFrames is hard to parse. It has a mix of conceptual content, reference content,
  how-tos, etc.
+ I think this is a bug -- when I create my dataset and table using a location-locked client, the resulting
  datasets and tables should be located in that location. However, the BigFrames API keeps giving me this error:
  
```py
ValueError: Current session is in us-west1 but dataset 'PROJECT_ID.myherodotus' is located in US
```

  - I checked where this dataset was created and confirmed that it is reported as being in the us-west1 region.
  - Le sigh -- I was able to create the table easily using the `pandas-gbq` library instead.
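
One way to narrow this down is to ask BigQuery which location it reports for the dataset and compare that with the session location. The sketch below relies on `Client.get_dataset(...).location` from `google-cloud-bigquery`. The `fake_client` and the `my-project` ID are stand-ins so the example runs without credentials; the helper names are hypothetical.

```python
from types import SimpleNamespace

def dataset_location(client, dataset_id: str) -> str:
    """Return the location BigQuery reports for a dataset, e.g. 'US' or 'us-west1'."""
    return client.get_dataset(dataset_id).location

def same_location(a: str, b: str) -> bool:
    """Compare location names case-insensitively."""
    return a.strip().lower() == b.strip().lower()

# Stand-in client for illustration; in the notebook, pass the real bq_client.
fake_client = SimpleNamespace(
    get_dataset=lambda dataset_id: SimpleNamespace(location="US")
)

reported = dataset_location(fake_client, "my-project.myherodotus")
matches_session = same_location(reported, "us-west1")
```

If the real client also reports `US`, the dataset probably existed in the multi-region before the `create_dataset(..., exists_ok=True)` call, which would leave its location unchanged.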

In [67]:
# Save a local copy of the goldens, then load them into BigQuery with pandas-gbq
golden_eval_dataset.to_csv("goldens.csv")

golden_eval_dataset.to_gbq(destination_table="myherodotus.goldens20241104",
                           project_id=PROJECT_ID,
                           if_exists='fail')

100%|██████████| 1/1 [00:00<00:00, 8065.97it/s]


In [55]:
# Create the dataset and table
bq_client = bigquery.Client(location=LOCATION)

dataset = bigquery.Dataset(f"{PROJECT_ID}.myherodotus")
dataset.location = LOCATION

bq_client.create_dataset(dataset, exists_ok=True)

Dataset(DatasetReference('erschmid-test-291318', 'myherodotus'))

In [56]:
goldens_table_name = f"{PROJECT_ID}.myherodotus.goldens20241104"
goldens_table = bigquery.Table(goldens_table_name)

actual_goldens_table = bq_client.create_table(goldens_table, exists_ok=True)

In [None]:
# DOES NOT WORK :/
#bq_goldens_dataset = bpd.read_gbq(goldens_table_name)

### Sources

+ [BigQuery DataFrames](https://cloud.google.com/bigquery/docs/use-bigquery-dataframes)
+ 👍👍 [Create dataset sample](https://cloud.google.com/bigquery/docs/samples/bigquery-create-dataset?hl=en#bigquery_create_dataset-python)

## Step 7. Get golden dataset out of BQ

In [69]:
# Get a pd.DataFrame out of BQ
sql = f"""
SELECT prompt, reference
FROM {goldens_table_name}
"""

df = bq_client.query_and_wait(sql).to_dataframe()

In [70]:
df.head()

                                                                prompt  \
0                           I want to see the ancient ruins in Greece.   
1                                       Help me plan my trip to Japan.   
2  I'm going to the Yucatan peninsula in Mexico. What is there to see?   
3                 What are the most interesting places to go in Egypt?   
4            I want to visit historical sites in England and Scotland.   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             