# Compare Model Performance using the Generative AI Evaluation Service: Challenge Lab

## GENAI063

Link: https://partner.cloudskillsboost.google/course_templates/1130/labs/528777

## Objective
This lab challenges you to run a model-based, pairwise evaluation of two models that are given the same tasks. You will use the Generative AI Evaluation Service to run this evaluation.

Link: https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview

## Your Challenge
You have been contracted by a movie production studio that wants to prepare for a series of low-budget short films. They’ve asked you to develop a generative AI tool to help them, and they’ve provided you with:
 - Some unstructured notes on different phases of production
 - A rate card which describes hourly rates for different crew positions on the films

You know that Gemini Flash is a faster and lower-cost alternative to Gemini Pro, so you’d like to quantify its performance to see if it would be an adequate alternative to Gemini Pro on these complex tasks.

## Task 1. Initialize Gen AI in a Colab Enterprise notebook

In [1]:
%pip install --upgrade --quiet google-genai nest-asyncio==1.5.9


Clicking on the caret should reveal a set of menus above the notebook. Select **Runtime > Restart Session**. When asked to confirm, select Yes. The runtime will restart, indicated by clearing the green checkmark and the cell run order integer next to the cell you ran above.

In [1]:
import pandas as pd
from inspect import cleandoc
from IPython.display import display, Markdown

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.evaluation import (
    MetricPromptTemplateExamples,
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

pd.set_option("display.max_colwidth", None)

In [2]:
# Initialize the Vertex AI SDK with vertexai.init(), using the us-west1 location.
# vertexai was already imported in the cell above.
PROJECT_ID = "qwiklabs-gcp-04-ed6c67b5afd4"
LOCATION = "us-west1"
vertexai.init(project=PROJECT_ID, location=LOCATION)

## Task 2. Explore example data and generate a document

In this task, you will set up some sample data for a film production, including crew rates and shooting schedules, and then define questions for a large language model to answer.

1. Run the following code in a new cell to instantiate some example data. The calls to cleandoc() remove the indentation and the extra leading and trailing lines that make the multi-line strings readable in the code.

In [3]:
hourly_rates = cleandoc("""
  Screenwriter: $40
  Actor: $25
  Director: $30
  Camera Operator: $35
  Sound Engineer: $20
  Editor: $30
  """)

planning_notes = cleandoc("""
 Phases of Production:
   Writing:
   The Screenwriter will write the script.
   They need 72 hours to do so.


 Pre-Production:
   The Director needs time to analyze the script.
   They will work on it for 36 hours.
   The Camera Operator will join the director for 24 hours of planning.


 Production Phase 1
   The first three days of filming will require the director, 4 actors, the camera operator, and the sound engineer


 Production Phase 2
   The next three days of filming will require the director, 8 actors, the camera operator, and the sound engineer


 Post-Production
   The editor will take 64 hours to edit the film.
   The director will work with the editor for 24 hours during this phase.
""")

2. Run the following code to define the tasks we would like the model to help us with.

In [4]:
tasks = [
    """What is the cost of each phase of production?
    If days are mentioned, assume an 8 hour work day.""",

    """How many days will each phase require? Assume an
    8 hour work day. If multiple people are working in parallel,
    do not add those times together, but only use the longest time.
    Also include a count of the total number of days of the entire
    project.""",

    """Prepare a text schedule for all phases of the film starting
    on Feb 3, 2025. The whole crew should be off Saturdays
    and Sundays."""
]
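When reading the model responses to the first task, it can help to have a hand-computed reference. The sketch below works out the cost of each phase from the rate card and planning notes. It assumes every crew member is paid for a full 8-hour day on each filming day, and the data layout and variable names are illustrative rather than part of the lab.

```python
HOURS_PER_DAY = 8

# Hourly rates in dollars, copied from the rate card.
rates = {
    "Screenwriter": 40,
    "Actor": 25,
    "Director": 30,
    "Camera Operator": 35,
    "Sound Engineer": 20,
    "Editor": 30,
}

# (role, headcount, hours per person) for each phase, transcribed from the notes.
# Assumption: each three-day production phase means 3 * 8 = 24 paid hours per person.
filming_hours = 3 * HOURS_PER_DAY
phases = {
    "Writing": [("Screenwriter", 1, 72)],
    "Pre-Production": [("Director", 1, 36), ("Camera Operator", 1, 24)],
    "Production Phase 1": [
        ("Director", 1, filming_hours),
        ("Actor", 4, filming_hours),
        ("Camera Operator", 1, filming_hours),
        ("Sound Engineer", 1, filming_hours),
    ],
    "Production Phase 2": [
        ("Director", 1, filming_hours),
        ("Actor", 8, filming_hours),
        ("Camera Operator", 1, filming_hours),
        ("Sound Engineer", 1, filming_hours),
    ],
    "Post-Production": [("Editor", 1, 64), ("Director", 1, 24)],
}

# Cost of a phase = sum of rate * headcount * hours over its crew entries.
phase_costs = {
    phase: sum(rates[role] * count * hours for role, count, hours in crew)
    for phase, crew in phases.items()
}
total_cost = sum(phase_costs.values())

for phase, cost in phase_costs.items():
    print(f"{phase}: ${cost:,}")
print(f"Total: ${total_cost:,}")
```

Under these assumptions the phases come to $2,880, $1,920, $4,440, $6,840, and $2,640, for a total of $18,720.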

3. Next, define a prompt template.

In [6]:
prompt_template = cleandoc("""
<instructions>
  Prepare a document to fulfill the task based on the context provided.
</instructions>
<task>
  {task}
</task>
<context>
  {context}
</context>
  """)

4. Compare the lower-cost Gemini Flash against Gemini Pro on these instruction tasks to determine which you should use for this project. Instantiate a model variable **llm_pro** to contain a generative model using **gemini-2.5-pro-preview-05-06** and a model variable **llm_flash** to contain a generative model using **gemini-2.0-flash-001**.

5. Add a generation configuration to each model to set the **temperature to 0**.

In [13]:
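# Gemini Pro. The lab text names gemini-2.5-pro-preview-05-06;
# this notebook uses the gemini-2.5-pro model ID instead.
llm_pro = GenerativeModel(
  "gemini-2.5-pro",
  generation_config={
      "temperature": 0,  # required by the lab
      "top_p": 0.4,      # extra setting, not required by the lab
  },
)

# Gemini Flash, with the same generation settings as Gemini Pro.
llm_flash = GenerativeModel(
  "gemini-2.0-flash-001",
  generation_config={
      "temperature": 0,
      "top_p": 0.4,
  },
)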

6. Combine hourly_rates and planning_notes (with a pair of line breaks as a separator) to form a context chunk.

In [11]:
context = hourly_rates + "\n\n" + planning_notes
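The context string relies on cleandoc() having already stripped the shared indentation and the leading and trailing blank lines from both pieces. A minimal sketch of that behavior, using a made-up two-line string rather than the lab data:

```python
from inspect import cleandoc

# A made-up string with the same shape as the lab data:
# a leading newline, indented lines, and a whitespace-only last line.
raw = """
  Screenwriter: $40
    Actor: $25
  """

cleaned = cleandoc(raw)

# The common two-space margin is removed, relative indentation is kept,
# and the empty first and last lines are dropped.
print(repr(cleaned))
```

Only the indentation shared by all non-blank lines is removed, which is why the nested lines of planning_notes keep some leading spaces in the printed prompts.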

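7. Using the **prompt_template** and the **context**, generate a response to the second task (tasks[1]) for each model (llm_pro and llm_flash). Wrap the response text in the Markdown() class imported from IPython.display to render Gemini's responses, which are often formatted as Markdown strings.

Note: prompt_template is a plain string with {task} and {context} placeholders, so it can be filled with prompt_template.format(task=tasks[1], context=context) and the result passed to generate_content(). The two cells below skip the template and concatenate the context and the task directly, so the outputs shown come from that untemplated prompt.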
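One way to use the template is Python's built-in str.format(), which fills the {task} and {context} placeholders by name. The sketch below uses short stand-in strings (example_task and example_context are hypothetical) so that it runs without a model; in the notebook the same call would take tasks[1] and context.

```python
from inspect import cleandoc

prompt_template = cleandoc("""
<instructions>
  Prepare a document to fulfill the task based on the context provided.
</instructions>
<task>
  {task}
</task>
<context>
  {context}
</context>
  """)

# Stand-in values for illustration only.
example_task = "How many days will each phase require?"
example_context = "Screenwriter: $40\n\nThe Screenwriter needs 72 hours."

# Fill the named placeholders to get the final prompt string.
prompt = prompt_template.format(task=example_task, context=example_context)
print(prompt)
```

The resulting string can then be passed to a model, for example llm_pro.generate_content(prompt).text, in place of the plain concatenation.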

In [29]:
# Check results for "llm_pro" model:
Markdown(llm_pro.generate_content(context + "\n\n" + str(tasks[1])).text)

Based on the information provided and assuming an 8-hour workday, here is the breakdown of days for each phase:

**Writing:**
The Screenwriter works for 72 hours.
*   72 hours / 8 hours per day = **9 days**

**Pre-Production:**
The Director works for 36 hours, and the Camera Operator works in parallel for 24 of those hours. The phase is determined by the longest time commitment.
*   36 hours / 8 hours per day = 4.5 days. Since a partial day is still a workday, this phase requires **5 days**.

**Production Phase 1:**
This phase is explicitly stated to last for **3 days**.

**Production Phase 2:**
This phase is explicitly stated to last for **3 days**.

**Post-Production:**
The Editor works for 64 hours, and the Director works in parallel for 24 of those hours. The phase is determined by the longest time commitment.
*   64 hours / 8 hours per day = **8 days**

---

### **Total Project Duration**

*   **Writing:** 9 days
*   **Pre-Production:** 5 days
*   **Production Phase 1:** 3 days
*   **Production Phase 2:** 3 days
*   **Post-Production:** 8 days

**Total Project Days: 28 days**

In [28]:
# Check results for "llm_flash" model:
Markdown(llm_flash.generate_content(context + "\n\n" + str(tasks[1])).text)

Okay, let's break down the project timeline and costs:

**Phase Durations (in days):**

*   **Writing:**
    *   Screenwriter: 72 hours / 8 hours/day = 9 days

*   **Pre-Production:**
    *   Director: 36 hours / 8 hours/day = 4.5 days
    *   Camera Operator: 24 hours / 8 hours/day = 3 days
    *   Since they work in parallel, the phase duration is the longer of the two: 4.5 days

*   **Production Phase 1:**
    *   Given: 3 days

*   **Production Phase 2:**
    *   Given: 3 days

*   **Post-Production:**
    *   Editor: 64 hours / 8 hours/day = 8 days
    *   Director: 24 hours / 8 hours/day = 3 days
    *   Since they work in parallel, the phase duration is the longer of the two: 8 days

**Total Project Duration:**

9 + 4.5 + 3 + 3 + 8 = 27.5 days

**Summary:**

*   **Writing:** 9 days
*   **Pre-Production:** 4.5 days
*   **Production Phase 1:** 3 days
*   **Production Phase 2:** 3 days
*   **Post-Production:** 8 days
*   **Total Project Duration:** 27.5 days
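The two responses differ only in how they treat the 4.5-day Pre-Production phase: Gemini Pro rounds it up to 5 whole days (28 in total), while Gemini Flash keeps the fraction (27.5 in total). The sketch below reproduces both totals from the planning notes; the dictionary layout and variable names are illustrative, not part of the lab.

```python
import math

HOURS_PER_DAY = 8

# Hours worked in parallel within each phase, transcribed from the planning notes.
# Each production phase is given as three 8-hour days.
parallel_hours = {
    "Writing": [72],
    "Pre-Production": [36, 24],
    "Production Phase 1": [3 * HOURS_PER_DAY],
    "Production Phase 2": [3 * HOURS_PER_DAY],
    "Post-Production": [64, 24],
}

# Parallel work is not added together: only the longest time counts.
fractional_days = {
    phase: max(hours) / HOURS_PER_DAY for phase, hours in parallel_hours.items()
}
# Treat any partial day as a full working day.
whole_days = {phase: math.ceil(days) for phase, days in fractional_days.items()}

print(sum(fractional_days.values()))  # 27.5
print(sum(whole_days.values()))       # 28
```

Which total counts as correct depends on whether a partial day is billed as a full day, which the task prompt does not specify.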


## Task 3. Prepare the Evaluation Dataset and EvalTask
In this task, you will set up the data and scoring method to evaluate the models.

1. You will evaluate the models' responses against each other by using the Pairwise question answering quality metric. Note the user input fields in curly braces in the metric's prompt template, which are required to evaluate this metric. You will use the Gemini Pro responses as your **baseline model response** and your Gemini Flash responses as your **responses**.

2. Prepare a Pandas DataFrame with the fields needed for evaluation.

In [23]:
# Each full prompt combines one task with the shared context
# (hourly rates plus planning notes).
full_prompts = [task + "\n\n" + context for task in tasks]

print(full_prompts[1])

How many days will each phase require? Assume an 
    8 hour work day. If multiple people are working in parallel, 
    do not add those times together, but only use the longest time. 
    Also include a count of the total number of days of the entire 
    project.

Screenwriter: $40
Actor: $25
Director: $30
Camera Operator: $35
Sound Engineer: $20
Editor: $30

Phases of Production:
  Writing:
  The Screenwriter will write the script.
  They need 72 hours to do so.


  Pre-Production:
  The Director needs time to analyze the script.
  They will work on it for 36 hours.
  The Camera Operator will join the director for 24 hours of planning.


  Production Phase 1
  The first three days of filming will require the director, 4 actors, the camera operator, and the sound engineer


  Production Phase 2
  The next three days of filming will require the director, 8 actors, the camera operator, and the sound engineer


  Post-Production
  The editor will take 64 hours to edit the film.
  The di

In [24]:
eval_dataset = pd.DataFrame({
    "prompt": full_prompts[:2],
})

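# eval_dataset[0] looks up a column labeled 0 and raises KeyError: 0.
# Use .iloc[0] to select the first row by position instead.
print(eval_dataset.iloc[0])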
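The DataFrame above holds only the prompts, while step 1 also calls for a baseline model response and a response for each prompt. The sketch below assembles such rows as plain dictionaries. The helper build_eval_records and the two stand-in generator functions are hypothetical, and the column names prompt, baseline_model_response, and response are an assumption based on the fields named in step 1; check them against the metric's prompt template before relying on them.

```python
def build_eval_records(prompts, baseline_fn, candidate_fn):
    """Return one row per prompt with the baseline and candidate responses."""
    records = []
    for prompt in prompts:
        records.append({
            "prompt": prompt,
            "baseline_model_response": baseline_fn(prompt),
            "response": candidate_fn(prompt),
        })
    return records


# Stand-in generators so the sketch runs without calling a model.
def fake_pro(prompt):
    return "PRO answer to: " + prompt


def fake_flash(prompt):
    return "FLASH answer to: " + prompt


records = build_eval_records(["task one", "task two"], fake_pro, fake_flash)
for row in records:
    print(row)
```

In the notebook, the stand-ins could be replaced by functions that return llm_pro.generate_content(prompt).text and llm_flash.generate_content(prompt).text, and pd.DataFrame(records) would turn the rows into the evaluation dataset.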