In [1]:
%pip install --upgrade --quiet google-genai nest-asyncio==1.5.9

In [2]:
import pandas as pd
from inspect import cleandoc
from IPython.display import display, Markdown

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.evaluation import (
    MetricPromptTemplateExamples,
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

pd.set_option("display.max_colwidth", None)

In [3]:
# Initialize Vertex AI with vertexai.init().
# Note: the lab instruction calls for us-west1; this run uses us-central1.
PROJECT_ID = "qwiklabs-gcp-03-d6b4e0cf43df"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)

## Task 2. Explore example data and generate a document

In this task, you will set up sample data for a film production, including crew rates and a shooting schedule, and then define questions for a large language model to answer.

1. Run the following code in a new cell to instantiate some example data. The calls to cleandoc() help remove the indents and extra lines used for making the multi-line strings readable in the code.

In [4]:
hourly_rates = cleandoc("""
  Screenwriter: $40
  Actor: $25
  Director: $30
  Camera Operator: $35
  Sound Engineer: $20
  Editor: $30
  """)

planning_notes = cleandoc("""
 Phases of Production:
   Writing:
   The Screenwriter will write the script.
   They need 72 hours to do so.


   Pre-Production:
   The Director needs time to analyze the script.
   They will work on it for 36 hours.
   The Camera Operator will join the director for 24 hours of planning.


   Production Phase 1
   The first three days of filming will require the director, 4 actors, the camera operator, and the sound engineer


   Production Phase 2
   The next three days of filming will require the director, 8 actors, the camera operator, and the sound engineer


   Post-Production
   The editor will take 64 hours to edit the film.
   The director will work with the editor for 24 hours during this phase.
""")
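
As a minimal sketch of what cleandoc() does to a string like the ones above: it drops leading and trailing blank lines and removes the indentation shared by all lines, while keeping any extra indentation relative to that margin. The sample text here is illustrative only and is not part of the lab data.

```python
from inspect import cleandoc

# Illustrative sample text; the second line is indented two spaces deeper.
raw = """
    Director: $30
      (on set every day)
    Editor: $30
    """

cleaned = cleandoc(raw)
print(cleaned)
```

The common four-space margin is removed, the surrounding blank lines disappear, and the deeper line keeps its two extra spaces.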

In [5]:
tasks = [
    """What is the cost of each phase of production?
    If days are mentioned, assume an 8 hour work day.""",

    """How many days will each phase require? Assume an
    8 hour work day. If multiple people are working in parallel,
    do not add those times together, but only use the longest time.
    Also include a count of the total number of days of the entire
    project.""",

    """Prepare a text schedule for all phases of the film starting
    on Feb 3, 2025. The whole crew should be off Saturdays
    and Sundays."""
]
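
To judge the models' answers to the first task, it can help to have a reference calculation on hand. The following is a minimal sketch, assuming that every listed person is paid for every hour they work in a phase and that "three days" means 3 × 8 hours, as the task text states. The `rates` and `phases` structures are hand-transcribed from hourly_rates and planning_notes; they are not produced by the lab code.

```python
HOURS_PER_DAY = 8  # stated in the task text

# Hourly rates in dollars, transcribed from hourly_rates.
rates = {
    "Screenwriter": 40,
    "Actor": 25,
    "Director": 30,
    "Camera Operator": 35,
    "Sound Engineer": 20,
    "Editor": 30,
}

filming_hours = 3 * HOURS_PER_DAY  # each production phase lasts three days

# (role, headcount, hours) per phase, transcribed from planning_notes.
phases = {
    "Writing": [("Screenwriter", 1, 72)],
    "Pre-Production": [("Director", 1, 36), ("Camera Operator", 1, 24)],
    "Production Phase 1": [
        ("Director", 1, filming_hours),
        ("Actor", 4, filming_hours),
        ("Camera Operator", 1, filming_hours),
        ("Sound Engineer", 1, filming_hours),
    ],
    "Production Phase 2": [
        ("Director", 1, filming_hours),
        ("Actor", 8, filming_hours),
        ("Camera Operator", 1, filming_hours),
        ("Sound Engineer", 1, filming_hours),
    ],
    "Post-Production": [("Editor", 1, 64), ("Director", 1, 24)],
}

# Cost of a phase = sum over roles of rate * headcount * hours.
phase_costs = {
    phase: sum(rates[role] * count * hours for role, count, hours in crew)
    for phase, crew in phases.items()
}
total_cost = sum(phase_costs.values())

for phase, cost in phase_costs.items():
    print(f"{phase}: ${cost:,}")
print(f"Total: ${total_cost:,}")
```

Under these assumptions the phases cost $2,880, $1,920, $4,440, $6,840, and $2,640, for a total of $18,720.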

In [6]:
prompt_template = cleandoc("""
  <instructions>
  Prepare a document to fulfill the task based on the context provided.
  </instructions>
<task>
  {task}
  </task>
<context>
  {context}
  </context>
  """)
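
One detail worth knowing about the template above: cleandoc() only strips indentation that is common to every line. Because the `<task>` and `<context>` tags start at column 0, the common margin is zero and the two-space indents on the other lines are kept. The sketch below uses a shortened, illustrative template (not the lab's exact text) to show the difference and how str.format() fills the placeholder.

```python
from inspect import cleandoc

# One tag at column 0 makes the common margin zero, so nothing is stripped.
ragged = cleandoc("""
  <instructions>
  Answer the task.
  </instructions>
<task>
  {task}
  </task>
""")

# With every line indented equally, the shared margin is removed.
aligned = cleandoc("""
  <instructions>
  Answer the task.
  </instructions>
  <task>
  {task}
  </task>
""")

# Fill the placeholder the same way the notebook does with prompt_template.
prompt = aligned.format(task="How many days will each phase require?")
print(prompt)
```

The leftover indentation is harmless for the model, but aligning the tags keeps the assembled prompt tidy when it is logged or inspected.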

In [7]:
llm_pro = GenerativeModel(
  "gemini-2.5-pro",
  generation_config={
      "temperature": 0,
      "top_p": 0.4,
  },
)

llm_flash = GenerativeModel(
  "gemini-2.0-flash-001",
  generation_config={
      "temperature": 0,
      "top_p": 0.4,
  },
)

In [11]:
context = hourly_rates + "\n\n" + planning_notes
prompt = prompt_template.format(context=context, task=str(tasks[1]))

In [12]:
# Check results for "llm_pro" model:
Markdown(llm_pro.generate_content(prompt).text)

Based on the context provided, here is the breakdown of the number of days required for each phase and the total for the project.

All calculations assume an 8-hour workday. For parallel work, the longest duration is used.

### **Phase Durations**

*   **Writing:**
    The Screenwriter requires 72 hours.
    *Calculation: 72 hours / 8 hours per day = 9 days*
    **Days Required: 9**

*   **Pre-Production:**
    The Director requires 36 hours and the Camera Operator requires 24 hours. As this work is done in parallel, the longest time is used.
    *Calculation: 36 hours / 8 hours per day = 4.5 days*
    **Days Required: 4.5**

*   **Production Phase 1:**
    The duration for this phase is explicitly stated.
    **Days Required: 3**

*   **Production Phase 2:**
    The duration for this phase is explicitly stated.
    **Days Required: 3**

*   **Post-Production:**
    The Editor requires 64 hours and the Director requires 24 hours. As this work is done in parallel, the longest time is used.
    *Calculation: 64 hours / 8 hours per day = 8 days*
    **Days Required: 8**

---

### **Total Project Duration**

The total number of days for the project is the sum of all phases.

*   **Writing:** 9 days
*   **Pre-Production:** 4.5 days
*   **Production Phase 1:** 3 days
*   **Production Phase 2:** 3 days
*   **Post-Production:** 8 days

**Total Project Days: 27.5**

In [13]:
# Check results for "llm_flash" model:
Markdown(llm_flash.generate_content(prompt).text)

**Project Timeline Breakdown**

Here's a breakdown of the project timeline, assuming an 8-hour workday:

**Phase Breakdown:**

*   **Writing:**
    *   Screenwriter: 72 hours
    *   Days Required: 72 hours / 8 hours/day = 9 days

*   **Pre-Production:**
    *   Director: 36 hours
    *   Camera Operator: 24 hours
    *   Since the director and camera operator are working in parallel, we take the longest time.
    *   Days Required: 36 hours / 8 hours/day = 4.5 days

*   **Production Phase 1:**
    *   Director, 4 Actors, Camera Operator, Sound Engineer: 3 days
    *   Days Required: 3 days

*   **Production Phase 2:**
    *   Director, 8 Actors, Camera Operator, Sound Engineer: 3 days
    *   Days Required: 3 days

*   **Post-Production:**
    *   Editor: 64 hours
    *   Director: 24 hours
    *   Since the editor and director are working in parallel, we take the longest time.
    *   Days Required: 64 hours / 8 hours/day = 8 days

**Total Project Days:**

9 days (Writing) + 4.5 days (Pre-Production) + 3 days (Production Phase 1) + 3 days (Production Phase 2) + 8 days (Post-Production) = **27.5 days**
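
Both models report 27.5 days. A quick cross-check of that arithmetic, as a sketch that follows the task's rules (8-hour work day, parallel work counted by the longest contributor only); the `phase_hours` mapping is hand-transcribed from planning_notes:

```python
HOURS_PER_DAY = 8

# Hours per contributor in each phase; production phases are three 8-hour days.
phase_hours = {
    "Writing": [72],
    "Pre-Production": [36, 24],
    "Production Phase 1": [3 * HOURS_PER_DAY],
    "Production Phase 2": [3 * HOURS_PER_DAY],
    "Post-Production": [64, 24],
}

# Parallel work is not summed: only the longest contributor sets the duration.
phase_days = {
    phase: max(hours) / HOURS_PER_DAY for phase, hours in phase_hours.items()
}
total_days = sum(phase_days.values())

for phase, days in phase_days.items():
    print(f"{phase}: {days:g} days")
print(f"Total: {total_days:g} days")
```

The per-phase values (9, 4.5, 3, 3, 8) and the 27.5-day total match both model responses.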


## Task 3. Prepare the Evaluation Dataset and EvalTask
In this task, you will set up the data and scoring method to evaluate the models.

1. You will evaluate the models' responses against each other by using the Pairwise question answering quality metric. Note the input fields in curly braces in its prompt template; these fields are required to evaluate the metric. You will use the Gemini Pro responses as your **baseline model response** and the Gemini Flash responses as your **responses**.

2. Prepare a Pandas DataFrame with the fields needed for evaluation.

In [30]:
# Build one row per task; each row pairs the shared context with a task.
full_prompts = pd.DataFrame({
    "context": [context] * len(tasks),
    "task": [str(task) for task in tasks],
})

print(len(full_prompts))

3


In [31]:
# Assuming you have a dataset loaded into a pandas DataFrame.
# The dataset should contain columns relevant to the pairwise evaluation,
# such as 'prompt', 'response', and 'baseline_model_response'.

# eval_dataset = pd.DataFrame({
#     "prompt": ["What is the capital of France?", "Who wrote 'Romeo and Juliet'?", "What is the formula for water?"],
#     "response": ["Paris is the capital of France.", "William Shakespeare wrote 'Romeo and Juliet'.", "The formula for water is H2O."],
#     "baseline_model_response": ["France's capital is Paris.", "'Romeo and Juliet' was written by Shakespeare.", "Water's chemical formula is H2O."]
# })


eval_dataset = pd.DataFrame({
    "context": [context, context, context],
    "task": [str(tasks[0]), str(tasks[1]), str(tasks[2])]
})

# Create the EvalTask with the dataset and the specified pairwise metric.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY
    ],
    experiment="indie-film-planning",
)

## Task 4. Run the evaluations and examine results
In this task, you will ask a model to choose a preferred response for each task from the two large language models llm_pro and llm_flash.

1. Run the evaluation of the EvalTask you configured above.

**Note:** This may take 2 minutes to run.

2. Print the summary_metrics of the evaluation results.

3. Display the evaluation response's metrics_table. Do you see a clear preference for either model, according to the evaluation service?

4. To simplify reading the results, display the column from the metrics_table that includes the evaluation service's preferred response for this example.

5. Next, display the column from the metrics_table that contains the model's explanations of its choices. Read some of the results to understand the evaluation service's preferences.
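
For steps 4 and 5, one way to find the relevant columns without hard-coding their names is to filter the metrics_table columns by the metric's prefix. The sketch below is plain Python and runs offline; the `/pairwise_choice` and `/explanation` suffixes and the `CANDIDATE` / `BASELINE` values are assumptions about how the evaluation service labels its output, so check them against the actual `eval_result.metrics_table.columns`.

```python
from collections import Counter

# Metric name as the evaluation service reports it.
METRIC = "pairwise_question_answering_quality"


def metric_columns(columns, metric=METRIC):
    """Return the column names that belong to one metric."""
    prefix = f"{metric}/"
    return [name for name in columns if name.startswith(prefix)]


def tally_choices(choices):
    """Count how often each choice label appears."""
    return Counter(choices)


# Hypothetical column names and choice values, for demonstration only.
demo_columns = [
    "prompt",
    "response",
    "baseline_model_response",
    f"{METRIC}/pairwise_choice",
    f"{METRIC}/explanation",
]
demo_choices = ["CANDIDATE", "BASELINE", "CANDIDATE"]

print(metric_columns(demo_columns))
print(tally_choices(demo_choices))
```

In the notebook, `metric_columns(eval_result.metrics_table.columns)` would list the real column names, which can then be used to select and display those columns.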

In [32]:
import datetime

run_ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# run the evaluation for model 'llm_pro' using eval_task.evaluate()
eval_result = eval_task.evaluate(
    model=llm_pro,
    prompt_template=prompt_template,
    experiment_run_name=f"indie-film-planning-{run_ts}"
)

INFO:vertexai.evaluation.eval_task:Logging Eval Experiment metadata: {'prompt_template': '  <instructions>\n  Prepare a document to fulfill the task based on the context provided.\n  </instructions>\n<task>\n  {task}\n  </task>\n<context>\n  {context}\n  </context>\n  ', 'model_name': 'publishers/google/models/gemini-2.5-pro', 'temperature': 0, 'top_p': 0.4}
INFO:vertexai.evaluation._evaluation:Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
INFO:vertexai.evaluation._evaluation:Generating a total of 3 responses from Gemini model gemini-2.5-pro.
100%|██████████| 3/3 [00:23<00:00,  7.81s/it]
INFO:vertexai.evaluation._evaluation:All 3 responses are successfully generated from Gemini model gemini-2.5-pro.
INFO:vertexai.evaluation._evaluation:Multithreaded Batch Inference took: 23.449814522999986 seconds.


ValueError: Cannot find the `baseline_model_response` column in the evaluation dataset to fill the metric prompt template for `pairwise_question_answering_quality` metric. Please check if the column is present in the evaluation dataset, or provide a key-value pair in `metric_column_mapping` parameter of `EvalTask` to map it to a different column name. The evaluation dataset columns are ['context', 'task', 'prompt', 'response'].
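
The ValueError above arises because the pairwise metric needs a `baseline_model_response` column, and the dataset only had `context` and `task` (the service added `prompt` and `response` itself). A possible fix, sketched under the assumption that the baseline responses are generated up front from the same prompt template: `build_pairwise_columns` and its `generate_baseline` parameter are hypothetical helpers, not part of the Vertex AI SDK, and the demo uses a stand-in generator so that it runs offline.

```python
def build_pairwise_columns(tasks, context, template, generate_baseline):
    """Return dataset columns, including `baseline_model_response`.

    `generate_baseline` is any callable that maps a prompt string to a
    response string (hypothetical parameter, supplied by the caller).
    """
    prompts = [template.format(context=context, task=task) for task in tasks]
    return {
        "context": [context] * len(tasks),
        "task": list(tasks),
        "baseline_model_response": [generate_baseline(p) for p in prompts],
    }


# Stand-in generator and inputs so the sketch runs without a model call.
demo_columns = build_pairwise_columns(
    tasks=["task A", "task B"],
    context="some context",
    template="<task>{task}</task><context>{context}</context>",
    generate_baseline=lambda prompt: f"baseline answer for: {prompt}",
)
print(sorted(demo_columns))
```

In the notebook, `generate_baseline` could be `lambda p: llm_pro.generate_content(p).text`, with `tasks`, `context`, and `prompt_template` passed in as defined earlier. Wrapping the result in `pd.DataFrame(...)`, rebuilding the EvalTask with it, and then evaluating with `model=llm_flash` would match the setup described in Task 3, where Gemini Pro is the baseline and Gemini Flash supplies the responses.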

In [None]:
# Run the evaluation for model 'llm_flash' using eval_task.evaluate().
# Use a distinct run name so this run does not collide with the llm_pro run above.
eval_result = eval_task.evaluate(
    model=llm_flash,
    prompt_template=prompt_template,
    experiment_run_name=f"indie-film-planning-flash-{run_ts}"
)