# Compare Model Performance using the Generative AI Evaluation Service: Challenge Lab

## GENAI063


## Task 1. Initialize Gen AI in a Colab Enterprise notebook
In this task, you will set up a Colab Enterprise notebook and initialize Gen AI so that the notebook can connect to it and generate creative text content.

1. In the Google Cloud Console, navigate to Vertex AI > Colab Enterprise.

2. If prompted, enable the required APIs.

3. Click on + to create a new notebook.

Note: While GCP Colab Enterprise Notebooks might default to the us-central1 region, it's crucial to create your notebook in the same region where the lab environment is provisioned. You can find the lab's region on the left-hand side of the lab interface.

4. Rename the notebook to cymbal-indie-film.

5. Paste the following code into the top cell and run it with Shift + Return.

Note: If you don’t already have an active notebook runtime, running a cell in a Colab Enterprise notebook will trigger it to create one for you and connect the notebook to it. When a runtime is allocated for the first time, you may be presented with a pop-up window to authorize the environment to act as your Qwiklabs student account.

6. After the cell completes running, indicated by a checkmark to the left of the cell, the packages should be installed. To use them, we’ll restart the runtime. Click on the downward-pointing caret in the upper right of the notebook.

7. Clicking on the caret should reveal a set of menus above the notebook. Select Runtime > Restart Session. When asked to confirm, select Yes. The runtime will restart, indicated by clearing the green checkmark and the cell run order integer next to the cell you ran above.

8. Click + Code to add a new code cell and paste the following code into it. Press Shift + Return to run the cell.

9. In a new code block, initialize Gen AI with vertexai.init(). Use the us-central1 location and run the cell.

10. Save the notebook.

In [1]:
%pip install --upgrade --quiet google-genai nest-asyncio==1.5.9


Select Runtime > Restart Session.
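
After the restart, it can be worth confirming that the packages from the install cell are visible to the new session. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the lab.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the %pip install cell above.
for name in ("google-genai", "nest-asyncio"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a package shows as not installed, re-running the install cell and restarting the session again is the usual first step.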

In [1]:
import pandas as pd
from inspect import cleandoc
from IPython.display import display, Markdown

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.evaluation import (
    MetricPromptTemplateExamples,
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

pd.set_option("display.max_colwidth", None)

In [2]:
PROJECT_ID = "qwiklabs-gcp-00-f92ddae5cd8d"
LOCATION = "us-central1"
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)

## Task 2. Explore example data and generate a document
In this task, you will set up sample data for a film production, including crew rates and shooting schedules, and then define questions for a large language model to answer.

1. Run the following code in a new cell to instantiate some example data. The calls to cleandoc() help remove the indents and extra lines used for making the multi-line strings readable in the code.

In [3]:
hourly_rates = cleandoc("""
  Screenwriter: $40
  Actor: $25
  Director: $30
  Camera Operator: $35
  Sound Engineer: $20
  Editor: $30
  """)

planning_notes = cleandoc("""
 Phases of Production:
   Writing:
   The Screenwriter will write the script.
   They need 72 hours to do so.


   Pre-Production:
   The Director needs time to analyze the script.
   They will work on it for 36 hours.
   The Camera Operator will join the director for 24 hours of planning.


   Production Phase 1
   The first three days of filming will require the director, 4 actors, the camera operator, and the sound engineer


   Production Phase 2
   The next three days of filming will require the director, 8 actors, the camera operator, and the sound engineer


   Post-Production
   The editor will take 64 hours to edit the film.
   The director will work with the editor for 24 hours during this phase.
""")

2. Run the following code to define the content we would like the model to help us with.

In [4]:
tasks = [
    """What is the cost of each phase of production?
    If days are mentioned, assume an 8 hour work day.""",

    """How many days will each phase require? Assume an
    8 hour work day. If multiple people are working in parallel,
    do not add those times together, but only use the longest time.
    Also include a count of the total number of days of the entire
    project.""",

    """Prepare a text schedule for all phases of the film starting
    on Feb 3, 2025. The whole crew should be off Saturdays
    and Sundays."""
]
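
The first task has an answer that can be computed directly from the sample data, which gives a reference for judging the models' responses. The sketch below assumes that everyone named in a production phase works full 8-hour days and that each person is billed for exactly the hours stated; the dictionary layout and the `phase_cost` helper are illustrative, not part of the lab.

```python
HOURS_PER_DAY = 8
FILMING_HOURS = 3 * HOURS_PER_DAY  # each production phase lasts three days

# Hourly rates copied from hourly_rates.
rates = {
    "Screenwriter": 40, "Actor": 25, "Director": 30,
    "Camera Operator": 35, "Sound Engineer": 20, "Editor": 30,
}

# (role, headcount, hours per person) for each phase, read from planning_notes.
phases = {
    "Writing": [("Screenwriter", 1, 72)],
    "Pre-Production": [("Director", 1, 36), ("Camera Operator", 1, 24)],
    "Production Phase 1": [
        ("Director", 1, FILMING_HOURS), ("Actor", 4, FILMING_HOURS),
        ("Camera Operator", 1, FILMING_HOURS), ("Sound Engineer", 1, FILMING_HOURS),
    ],
    "Production Phase 2": [
        ("Director", 1, FILMING_HOURS), ("Actor", 8, FILMING_HOURS),
        ("Camera Operator", 1, FILMING_HOURS), ("Sound Engineer", 1, FILMING_HOURS),
    ],
    "Post-Production": [("Editor", 1, 64), ("Director", 1, 24)],
}


def phase_cost(staffing):
    """Sum rate * headcount * hours over everyone working in a phase."""
    return sum(rates[role] * count * hours for role, count, hours in staffing)


costs = {name: phase_cost(staffing) for name, staffing in phases.items()}
total_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name}: ${cost:,}")
print(f"Total: ${total_cost:,}")
```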

3. Next, define a prompt template.

In [5]:
prompt_template = cleandoc("""
<instructions>Prepare a document to fulfill the task based on the context provided.</instructions>
<task>{task}</task>
<context>{context}</context>""")

4. You will compare the lower-cost Gemini Flash against Gemini Pro on these instruction tasks to determine which one to use for this project. Instantiate a model variable llm_pro containing a generative model that uses gemini-2.5-pro-preview-05-06, and a model variable llm_flash containing a generative model that uses gemini-2.0-flash-001. The cell below uses the gemini-2.5-pro identifier in place of the preview name.

5. Add a generation configuration to each model to set the temperature to 0.

In [6]:
llm_pro = GenerativeModel(
  "gemini-2.5-pro",
  generation_config={
      "temperature": 0,
      "top_p": 0.4,
  },
)

llm_flash = GenerativeModel(
  "gemini-2.0-flash-001",
  generation_config={
      "temperature": 0,
      "top_p": 0.4,
  },
)

6. Combine hourly_rates and planning_notes (with a pair of line breaks as a separator) to form a context chunk.

In [7]:
context = hourly_rates + "\n\n" + planning_notes

7. Using the prompt template and the context, generate a response to the second task (tasks[1]) for each model (llm_pro and llm_flash). Use the Markdown() class imported from IPython.display to wrap the response text to render Gemini's responses, which are often formatted as Markdown strings.

In [8]:
prompt = prompt_template.format(context=context, task=str(tasks[1]))
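
The cell above formats a prompt for tasks[1] only, while the evaluation in Task 3 needs one prompt per task. A minimal sketch of building all of them with the same template follows; the `context` and `tasks` values here are shortened stand-ins so the snippet runs on its own.

```python
from inspect import cleandoc

# Same template shape as prompt_template in the notebook.
prompt_template = cleandoc("""
    <instructions>Prepare a document to fulfill the task based on the context provided.</instructions>
    <task>{task}</task>
    <context>{context}</context>""")

# Shortened stand-ins for the notebook's context and tasks.
context = "Screenwriter: $40\n\nWriting:\nThe Screenwriter needs 72 hours."
tasks = [
    "What is the cost of each phase of production?",
    "How many days will each phase require?",
]

# One fully formatted prompt per task, in the same order as `tasks`.
prompts = [prompt_template.format(task=task, context=context) for task in tasks]
print(prompts[1])
```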

In [9]:
# Check results for "llm_pro" model:
Markdown(llm_pro.generate_content(prompt).text)

Based on the context provided, here is a breakdown of the duration for each project phase and the total project duration, assuming an 8-hour workday.

### **Phase Durations**

*   **Writing:**
    The Screenwriter requires 72 hours.
    *Calculation: 72 hours / 8 hours per day = **9 days***

*   **Pre-Production:**
    The Director will work for 36 hours, and the Camera Operator will work for 24 hours in parallel. The longest duration is used.
    *Calculation: 36 hours / 8 hours per day = **4.5 days***

*   **Production Phase 1:**
    This phase is explicitly stated to last for **3 days**.

*   **Production Phase 2:**
    This phase is explicitly stated to last for **3 days**.

*   **Post-Production:**
    The Editor will work for 64 hours, and the Director will work for 24 hours in parallel. The longest duration is used.
    *Calculation: 64 hours / 8 hours per day = **8 days***

---

### **Total Project Duration**

The total number of days for the entire project is the sum of all phases.

*   Writing: 9 days
*   Pre-Production: 4.5 days
*   Production Phase 1: 3 days
*   Production Phase 2: 3 days
*   Post-Production: 8 days

**Total Project Days: 27.5 days**

In [10]:
# Check results for "llm_flash" model:
Markdown(llm_flash.generate_content(prompt).text)

## Project Timeline - Days Per Phase

Here's a breakdown of the project timeline, calculated with 8-hour workdays and parallel tasks considered:

**Phase Breakdown:**

*   **Writing:**
    *   Screenwriter: 72 hours / 8 hours/day = **9 days**

*   **Pre-Production:**
    *   Director: 36 hours / 8 hours/day = 4.5 days
    *   Camera Operator: 24 hours / 8 hours/day = 3 days
    *   Longest Time: 4.5 days
    *   **Total: 4.5 days**

*   **Production Phase 1:**
    *   Director, 4 Actors, Camera Operator, Sound Engineer: 3 days
    *   **Total: 3 days**

*   **Production Phase 2:**
    *   Director, 8 Actors, Camera Operator, Sound Engineer: 3 days
    *   **Total: 3 days**

*   **Post-Production:**
    *   Editor: 64 hours / 8 hours/day = 8 days
    *   Director: 24 hours / 8 hours/day = 3 days
    *   Longest Time: 8 days
    *   **Total: 8 days**

**Total Project Days:**

9 days (Writing) + 4.5 days (Pre-Production) + 3 days (Production Phase 1) + 3 days (Production Phase 2) + 8 days (Post-Production) = **27.5 days**
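
Both models arrive at the same day counts, and these can be checked by hand. The sketch below redoes the arithmetic from planning_notes under the rule stated in the task (parallel work counts once, using the longest time); the `phase_hours` layout is illustrative.

```python
HOURS_PER_DAY = 8

# Hours worked by each person in a phase, taken from planning_notes.
# The two production phases are stated directly as three days each.
phase_hours = {
    "Writing": [72],
    "Pre-Production": [36, 24],
    "Production Phase 1": [3 * HOURS_PER_DAY],
    "Production Phase 2": [3 * HOURS_PER_DAY],
    "Post-Production": [64, 24],
}

# Parallel work is not added up: the longest stretch sets the phase length.
phase_days = {
    name: max(hours) / HOURS_PER_DAY for name, hours in phase_hours.items()
}
total_days = sum(phase_days.values())

for name, days in phase_days.items():
    print(f"{name}: {days:g} days")
print(f"Total: {total_days:g} days")
```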


## Task 3. Prepare the Evaluation Dataset and EvalTask

In this task, you will set up the data and scoring method to evaluate the models.

1. You will evaluate the models' responses against each other by using Pairwise question answering quality. Note the user input fields in curly braces in the metric's prompt template, which are required to evaluate this metric. You will use the Gemini Pro responses as your baseline model responses and the Gemini Flash responses as your candidate responses.

2. Prepare a Pandas DataFrame with the fields needed for evaluation.

3. Create an EvalTask(), passing in the dataset, identifying MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY as the metric you would like to be calculated, and defining an experiment name of indie-film-planning.

4. Save the notebook.

In [None]:
import pandas as pd
from inspect import cleandoc
from IPython.display import display, Markdown

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.evaluation import (
    MetricPromptTemplateExamples,
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

# Initialize Vertex AI
# Replace 'your-project-id' and 'your-region' with your actual GCP project ID and region
# Example: vertexai.init(project="my-gcp-project-12345", location="us-central1")
vertexai.init(project="qwiklabs-gcp-00-f92ddae5cd8d", location="us-central1")

# Define context components
hourly_rates = cleandoc("""
    Screenwriter: $40
    Actor: $25
    Director: $30
    Camera Operator: $35
    Sound Engineer: $20
    Editor: $30
    """)

planning_notes = cleandoc("""
    Phases of Production:
    Writing:
    The Screenwriter will write the script.
    They need 72 hours to do so.


    Pre-Production:
    The Director needs time to analyze the script.
    They will work on it for 36 hours.
    The Camera Operator will join the director for 24 hours of planning.


    Production Phase 1
    The first three days of filming will require the director, 4 actors, the camera operator, and the sound engineer


    Production Phase 2
    The next three days of filming will require the director, 8 actors, the camera operator, and the sound engineer


    Post-Production
    The editor will take 64 hours to edit the film.
    The director will work with the editor for 24 hours during this phase.
""")

context_text = hourly_rates + "\n\n" + planning_notes

# Define the tasks
tasks = [
    """What is the cost of each phase of production?
    If days are mentioned, assume an 8 hour work day.""",

    """How many days will each phase require? Assume an
    8 hour work day. If multiple people are working in parallel,
    do not add those times together, but only use the longest time.
    Also include a count of the total number of days of the entire
    project.""",

    """Prepare a text schedule for all phases of the film starting
    on Feb 3, 2025. The whole crew should be off Saturdays
    and Sundays."""
]

# Prepare the 'question' field by combining context and task
questions_for_eval = [f"{context_text}\n\n{task}" for task in tasks]

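# Optional: configuration for generation (e.g., to make responses more consistent)
# Note: a thinking model such as gemini-2.5-pro can spend the whole output budget on
# thinking tokens; raise or remove max_output_tokens if responses come back empty.
generation_config = GenerationConfig(
    temperature=0.1,  # Low temperature for less creativity, more deterministic output
    max_output_tokens=1024,  # Adjust based on expected response length
)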

# --- Generate baseline responses from 'gemini-pro' ---
print("Generating baseline responses from 'gemini-pro' model...")
# baseline_model = GenerativeModel("gemini-pro")
baseline_responses = []

for i, question in enumerate(questions_for_eval):
    print(f"  Generating baseline response for Task {i+1}...")
    try:
        response = llm_pro.generate_content(
            question,
            generation_config=generation_config
        )
        baseline_responses.append(response.text)
        print(f"  Baseline response for Task {i+1} generated successfully.")
    except Exception as e:
        print(f"  Error generating baseline response for Task {i+1}: {e}")
        baseline_responses.append(f"ERROR: Could not generate response from gemini-pro: {e}") # Append an error message

print("\nBaseline response generation complete.")

# --- Generate candidate responses from 'gemini-2.0-flash-001' ---
print("\nGenerating candidate responses from 'gemini-2.0-flash-001' model...")
# candidate_model = GenerativeModel("gemini-2.0-flash-001")
candidate_responses = []

for i, question in enumerate(questions_for_eval):
    print(f"  Generating candidate response for Task {i+1}...")
    try:
        response = llm_flash.generate_content(
            question,
            generation_config=generation_config
        )
        candidate_responses.append(response.text)
        print(f"  Candidate response for Task {i+1} generated successfully.")
    except Exception as e:
        print(f"  Error generating candidate response for Task {i+1}: {e}")
        candidate_responses.append(f"ERROR: Could not generate response from gemini-2.0-flash-001: {e}") # Append an error message

print("\nCandidate response generation complete.")


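# 2) Prepare a Pandas DataFrame with the fields needed for evaluation.
#    Column names must match the placeholders in the pairwise metric prompt:
#    {prompt}, {baseline_model_response} and {response}.
data = {
    'prompt': questions_for_eval,
    'baseline_model_response': baseline_responses,
    'response': candidate_responses,
}
df_eval = pd.DataFrame(data)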

print("\nDataFrame for Evaluation:")
display(df_eval)

# 3) Create an EvalTask(), passing in the dataset, identifying
#    MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY as the metric you would like to be calculated,
#    and defining an experiment name of indie-film-planning.

try:
    eval_task = EvalTask(
        dataset=df_eval,
        metrics=[MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY],
        experiment="indie-film-planning",
    )
    print()
    print(cleandoc(f"""
        EvalTask created successfully with:
        Dataset shape: {df_eval.shape}
        Metrics: {[str(metric) for metric in eval_task.metrics]}
        Experiment Name: {eval_task.experiment}
    """))

    # To actually run the evaluation, you would uncomment the line below.
    # eval_task.run()
    # print("\nEvaluation task initiated. You can monitor its progress in the GCP console under Generative AI > Model Evaluation.")

except Exception as e:
    print(f"\nAn error occurred during EvalTask creation: {e}")
    print("Please ensure your GCP project ID and region are correctly set and you have the necessary permissions.")
    print("Also, confirm that the 'dataset' columns match the requirements of the chosen metric.")

Generating baseline responses from 'gemini-pro' model...
  Generating baseline response for Task 1...
  Error generating baseline response for Task 1: Cannot get the response text.
Cannot get the Candidate text.
Response candidate content has no parts (and thus no text). The candidate is likely blocked by the safety filters.
Content:
{
  "role": "model"
}
Candidate:
{
  "content": {
    "role": "model"
  },
  "finish_reason": "MAX_TOKENS"
}
Response:
{
  "candidates": [
    {
      "content": {
        "role": "model"
      },
      "finish_reason": "MAX_TOKENS"
    }
  ],
  "usage_metadata": {
    "prompt_token_count": 228,
    "total_token_count": 1251,
    "prompt_tokens_details": [
      {
        "modality": "TEXT",
        "token_count": 228
      }
    ],
    "thoughts_token_count": 1023
  },
  "model_version": "gemini-2.5-pro",
  "create_time": "2025-06-28T00:35:23.574150Z",
  "response_id": "yzhfaMaFI-G9qsMP1oSv2QU"
}
  Generating baseline response for Task 2...
  Error genera
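
Although the error message mentions safety filters, the logged finish_reason is MAX_TOKENS and thoughts_token_count is 1023, which suggests the 1024-token output cap was used up by thinking tokens before any answer text was written. The sketch below reads the token counts from the log to make that visible; it assumes thinking tokens count against the output budget, and the helper name `visible_output_tokens` is hypothetical.

```python
# Token counts copied from usage_metadata in the error output above.
usage = {
    "prompt_token_count": 228,
    "total_token_count": 1251,
    "thoughts_token_count": 1023,
}
finish_reason = "MAX_TOKENS"


def visible_output_tokens(usage: dict) -> int:
    """Tokens left for answer text after the prompt and thinking tokens."""
    return (
        usage["total_token_count"]
        - usage["prompt_token_count"]
        - usage.get("thoughts_token_count", 0)
    )


remaining = visible_output_tokens(usage)
if finish_reason == "MAX_TOKENS" and remaining == 0:
    print("Output budget exhausted before any answer text was produced.")
else:
    print(f"Answer tokens produced: {remaining}")
```

Under that reading, raising or removing max_output_tokens for the Pro model is the likely remedy, rather than changing safety settings.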

In [15]:
# # === 2. Construct evaluation examples ===
# examples = []
# for task in tasks:
#     input_prompt = prompt_template.format(task=task, context=context)
#     examples.append({
#         "input": input_prompt
#     })

# # === 3. Create EvalTask ===
# eval_task = EvalTask(
#     # name="film_schedule_pairwise_eval",
#     # metric=PairwiseQuestionAnsweringQuality(),
#     metrics=[
#         MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY
#     ],
#     dataset=examples,
#     experiment="indie-film-planning",
#     # input_data=examples,
#     model="gemini-2.0-flash-001",
#     baseline_model="gemini-2.5-pro"
# )

In [16]:
metric_prompt_template = """
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answers the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""
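
The placeholders in curly braces determine which columns the evaluation dataset needs. As a rough sketch, they can be listed with the standard library and compared against a dataset's column names; the template excerpt is shortened, and `placeholder_names` and the sample column names are illustrative assumptions.

```python
from string import Formatter

# Shortened stand-in for metric_prompt_template; only the placeholders matter.
template_excerpt = """
### Prompt
{prompt}

### Response A
{baseline_model_response}

### Response B
{response}
"""


def placeholder_names(template: str) -> list:
    """Return the {field} names in the order they appear in the template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


required_columns = placeholder_names(template_excerpt)

# Example column names that do not match the template's placeholders.
dataset_columns = ["question", "baseline_response", "candidate_response"]
missing = [name for name in required_columns if name not in dataset_columns]

print("Required:", required_columns)
print("Missing:", missing)
```

Any name reported as missing would need a matching column in the DataFrame before the metric can be computed.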

In [None]:
instruction = "Summarize the following article"

context = [
    "To make a classic spaghetti carbonara, start by bringing a large pot of salted water to a boil. While the water is heating up, cook pancetta or guanciale in a skillet with olive oil over medium heat until it's crispy and golden brown. Once the pancetta is done, remove it from the skillet and set it aside. In the same skillet, whisk together eggs, grated Parmesan cheese, and black pepper to make the sauce. When the pasta is cooked al dente, drain it and immediately toss it in the skillet with the egg mixture, adding a splash of the pasta cooking water to create a creamy sauce.",
    "Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite.",
    "For a flavorful grilled steak, start by choosing a well-marbled cut of beef like ribeye or New York strip. Season the steak generously with kosher salt and freshly ground black pepper on both sides, pressing the seasoning into the meat. Preheat a grill to high heat and brush the grates with oil to prevent sticking. Place the seasoned steak on the grill and cook for about 4-5 minutes on each side for medium-rare, or adjust the cooking time to your desired level of doneness. Let the steak rest for a few minutes before slicing against the grain and serving.",
    "Creating a creamy homemade tomato soup is a comforting and simple process. Begin by heating olive oil in a large pot over medium heat. Add diced onions and minced garlic to the pot and cook until they're soft and fragrant. Next, add chopped fresh tomatoes, chicken or vegetable broth, and a sprig of fresh basil to the pot. Simmer the soup for about 20-30 minutes, or until the tomatoes are tender and falling apart. Remove the basil sprig and use an immersion blender to puree the soup until smooth. Season with salt and pepper to taste before serving.",
    "To bake a decadent chocolate cake from scratch, start by preheating your oven to 350°F (175°C) and greasing and flouring two 9-inch round cake pans. In a large mixing bowl, cream together softened butter and granulated sugar until light and fluffy. Beat in eggs one at a time, making sure each egg is fully incorporated before adding the next. In a separate bowl, sift together all-purpose flour, cocoa powder, baking powder, baking soda, and salt. Divide the batter evenly between the prepared cake pans and bake for 25-30 minutes, or until a toothpick inserted into the center comes out clean.",
]

reference = [
    "The process of making spaghetti carbonara involves boiling pasta, crisping pancetta or guanciale, whisking together eggs and Parmesan cheese, and tossing everything together to create a creamy sauce.",
    "Preparing risotto entails sautéing onions and garlic, toasting Arborio rice, adding wine and broth gradually, and stirring until creamy and tender.",
    "Grilling a flavorful steak involves seasoning generously, preheating the grill, cooking to desired doneness, and letting it rest before slicing.",
    "Creating homemade tomato soup includes sautéing onions and garlic, simmering with tomatoes and broth, pureeing until smooth, and seasoning to taste.",
    "Baking a decadent chocolate cake requires creaming butter and sugar, beating in eggs and alternating dry ingredients with buttermilk before baking until done.",
]

eval_dataset = pd.DataFrame(
    {
        "context": context,
        "reference": reference,
        "instruction": [instruction] * len(context),
    }
)

In [None]:
experiment_name = "indie-film-planning"

# Pass the metric object itself, not its name as a string.
metrics = [MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY]

summarization_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment=experiment_name,
)