# Inspect Rich Documents with Gemini Multimodality and Multimodal RAG Challenge Lab

__Note__: If you encounter an authentication error when running the cells in the notebook, go to __Vertex AI__ > __Dashboard__ and click __Enable All Recommended APIs__. Then re-run the failed cell and continue the lab.  

## Setup and requirements

### Install Vertex AI SDK for Python and other dependencies

Run the four cells below before you start Task 1. Be sure to add your current project ID in the cell titled __Define Google Cloud project information__. 

In [None]:
# "RUN THIS CELL AS IS"

# Install the required Python packages and other dependencies
!pip3 install --upgrade --user google-cloud-aiplatform pymupdf

### Restart current runtime

You must restart the runtime in order to use the newly installed packages in this Jupyter runtime. You can do this by running the cell below, which will restart the current kernel.


In [None]:
# "RUN THIS CELL AS IS"

import IPython

# Restart the kernel after libraries are loaded.

app = IPython.Application.instance()
app.kernel.do_shutdown(True)


<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>

### Define Google Cloud project information

In [None]:
# We import the sys module to check our environment and possibly load additional modules.
import sys

# Explanation for Beginners:
# 1) PROJECT_ID: The ID of your Google Cloud project. If you are using Vertex AI, 
#    all your resources (models, endpoints, datasets) will be billed to this project.
# 2) LOCATION: The region where you plan to run Vertex AI. 
#    Common choices are "us-central1" or "us-east1" (among others).
# 3) We try to detect if we’re NOT running in Google Colab. If so, we attempt 
#    to retrieve PROJECT_ID and LOCATION from the local gcloud configuration.

# Define project information and update the location if it differs from the one 
# specified in the lab instructions.
# Replace "[your project]" and "[your location]" with your actual values if you wish.
PROJECT_ID = "[your project]"  
LOCATION = "[your location]"          

# Try to get the PROJECT_ID and LOCATION automatically (if not running in Colab).
if "google.colab" not in sys.modules:
    import subprocess

    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

    LOCATION = subprocess.check_output(
        ["gcloud", "config", "get-value", "compute/region"], text=True
    ).strip()

print(f"Your project ID is: {PROJECT_ID}")
print(f"Your location is: {LOCATION}")
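
The cell above can leave the placeholder strings in place (for example in Colab, where the gcloud lookup is skipped), and gcloud may return an empty string when no region is configured. A minimal sketch of a guard follows; `is_configured` and `require_setting` are hypothetical helpers and are not part of the lab.

```python
def is_configured(value: str) -> bool:
    """Return True if value looks like a real setting rather than a placeholder."""
    stripped = value.strip()
    # Empty output (for example, an unset gcloud property) counts as not configured.
    if not stripped:
        return False
    # The lab's placeholders look like "[your project]" or "[your location]".
    if stripped.startswith("[") and stripped.endswith("]"):
        return False
    return True


def require_setting(name: str, value: str) -> str:
    """Return the stripped value, or raise if it is empty or still a placeholder."""
    if not is_configured(value):
        raise ValueError(f"{name} is not set; edit the cell above and re-run it.")
    return value.strip()
```

Calling `require_setting("PROJECT_ID", PROJECT_ID)` before `vertexai.init` would surface a missing value early instead of as a later authentication or region error.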


### Initialize Vertex AI

Initialize the Vertex AI SDK for Python for your project:

In [2]:
# "RUN THIS CELL AS IS"

# We initialize the Vertex AI library for the specified Google Cloud project and region.
# 
# Explanation for Beginners:
# 1) `vertexai`: This library allows interaction with Vertex AI services, such as 
#    text, image, and multimodal models.
# 2) `vertexai.init(...)`: We specify which GCP project and region we’re connecting to.
#    This means any calls to Vertex AI (e.g., generating text, training a model) will be 
#    billed to the indicated project and will operate in the specified region.
# 3) `PROJECT_ID` and `LOCATION`: These should be set in the previous cell. 
#    If you haven’t replaced them with your own values, the code may rely on 
#    auto-detection via gcloud configuration (depending on your environment).

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)



## Task 1. Generating Multimodal Insights with the Gemini Pro Vision model

Gemini 1.0 Pro Vision (gemini-1.0-pro-vision) is a multimodal model that supports multimodal prompts. You can include text, image(s), and video in your prompt requests and get text or code responses.

To complete Task 1, follow the instructions at the top of each notebook cell:
* Run the cells with the comment "RUN THIS CELL AS IS".
* Complete and run the cells with the comment "COMPLETE THE MISSING PART AND RUN THIS CELL".

__Note__: Ensure you can see the weather related data in the response that is printed.


### Setup and requirements for Task 1

#### Import libraries

In [3]:
# "RUN THIS CELL AS IS"

# We import several classes from the Vertex AI generative models library:
# 1) GenerationConfig: Allows us to configure parameters for generative operations,
#    such as temperature, max_output_tokens, etc.
# 2) GenerativeModel: The main class for calling text, image, or multimodal generative models.
# 3) Image: A specialized class for handling image-based inputs/outputs in generative tasks.
# 4) Part: Useful for structuring multi-part content (e.g., video files, audio files).

from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    Image,
    Part,
)


#### Load Gemini 1.0 Pro Vision model

In [4]:
# "RUN THIS CELL AS IS"

# We create an instance of a specific generative model called "gemini-1.0-pro-vision."
# This model can handle multimodal inputs (e.g., text and images).
#
# Explanation for Beginners:
# 1) The string "gemini-1.0-pro-vision" specifies which Gemini model version 
#    we're loading from Vertex AI. It's designed for vision tasks in addition to text.
# 2) By assigning this model to `multimodal_model`, we can later call methods 
#    like `generate_content` to send multimodal prompts (text + images or video)
#    and receive text responses.

multimodal_model = GenerativeModel("gemini-1.0-pro-vision")


#### Define helper functions

In [6]:
# "RUN THIS CELL AS IS"
# 
# Explanation for Beginners:
# 1) display_images: Displays Vertex AI Image objects in a Jupyter environment, converting 
#    them to RGB mode and optionally resizing them if they exceed certain dimensions.
# 2) get_image_bytes_from_url: Fetches the raw bytes of an image directly from a web URL.
# 3) load_image_from_url: Uses the bytes fetched from a URL to create a Vertex AI Image object.
# 4) display_content_as_image: Checks if content is an image and, if so, displays it inline.
# 5) display_content_as_video: Checks if content is a "Part" representing a video file, 
#    builds the correct URL, and displays it inline.
# 6) print_multimodal_prompt: Iterates through a list of textual or image/video content, 
#    printing text or displaying media inline to show what is being sent to the model.

import http.client
import typing
import urllib.request

import IPython.display
from PIL import Image as PIL_Image
from PIL import ImageOps as PIL_ImageOps


def display_images(
    images: typing.Iterable[Image],
    max_width: int = 600,
    max_height: int = 350,
) -> None:
    for image in images:
        pil_image = typing.cast(PIL_Image.Image, image._pil_image)
        if pil_image.mode != "RGB":
            # RGB is supported by all Jupyter environments (e.g. RGBA is not yet)
            pil_image = pil_image.convert("RGB")
        image_width, image_height = pil_image.size
        if max_width < image_width or max_height < image_height:
            # Resize to display a smaller notebook image
            pil_image = PIL_ImageOps.contain(pil_image, (max_width, max_height))
        IPython.display.display(pil_image)


def get_image_bytes_from_url(image_url: str) -> bytes:
    with urllib.request.urlopen(image_url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        image_bytes = response.read()
    return image_bytes


def load_image_from_url(image_url: str) -> Image:
    image_bytes = get_image_bytes_from_url(image_url)
    return Image.from_bytes(image_bytes)


def display_content_as_image(content: str | Image | Part) -> bool:
    if not isinstance(content, Image):
        return False
    display_images([content])
    return True


def display_content_as_video(content: str | Image | Part) -> bool:
    if not isinstance(content, Part):
        return False
    part = typing.cast(Part, content)
    file_path = part.file_data.file_uri.removeprefix("gs://")
    video_url = f"https://storage.googleapis.com/{file_path}"
    IPython.display.display(IPython.display.Video(video_url, width=600))
    return True


def print_multimodal_prompt(contents: list[str | Image | Part]):
    """
    Given contents that would be sent to Gemini,
    output the full multimodal prompt for ease of readability.
    """
    for content in contents:
        if display_content_as_image(content):
            continue
        if display_content_as_video(content):
            continue
        print(content)
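
`display_content_as_video` above turns a `gs://` URI into an HTTPS URL by swapping the scheme for the `storage.googleapis.com` host. The same step can be sketched in isolation; `gcs_uri_to_https_url` is a hypothetical name, and the resulting URL can only be fetched when the object is publicly readable.

```python
def gcs_uri_to_https_url(gcs_uri: str) -> str:
    """Map gs://bucket/object to https://storage.googleapis.com/bucket/object."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri!r}")
    # str.removeprefix requires Python 3.9 or later.
    object_path = gcs_uri.removeprefix("gs://")
    return f"https://storage.googleapis.com/{object_path}"


print(gcs_uri_to_https_url("gs://spls/gsp520/google-pixel-8-pro.mp4"))
```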


### Task 1.1. Image understanding across multiple images

In [7]:
# "RUN THIS CELL AS IS"

# You're going to work with provided variables in this task. 
# First, review and describe the content/purpose of each variable below. 

# We define two URLs pointing to images named "Ask_first_1.png" and "Dont_do_this_1.png".
# Explanation for Beginners:
# 1) These images may contain some textual or graphical information 
#    that we want to process or analyze.
# 2) We use our utility function `load_image_from_url` to convert each URL into a
#    Vertex AI Image object that can be passed to the model.

image_ask_first_1_url = "https://storage.googleapis.com/spls/gsp520/Google_Branding/Ask_first_1.png"
image_dont_do_this_1_url = "https://storage.googleapis.com/spls/gsp520/Google_Branding/Dont_do_this_1.png"

# We load each image from its respective URL using our helper function.
image_ask_first_1 = load_image_from_url(image_ask_first_1_url)
image_dont_do_this_1 = load_image_from_url(image_dont_do_this_1_url)

# We define a simple set of instructions and prompts that will guide our model.
# Explanation for Beginners:
# 1) instructions: A simple string to provide context for the images.
# 2) prompt1: A short prompt that asks, "What is the title of this image?"
# 3) prompt2: A more detailed list of steps we want the model to follow, such as
#    identifying the title, describing the image, extracting text, and describing sentiment.

instructions = "Instructions: Consider the following image that contains text:"
prompt1 = "What is the title of this image?"
prompt2 = """
Answer the question through these steps:
Step 1: Identify the title of each image by using the filename of each image.
Step 2: Describe the image.
Step 3: For each image, describe the actions that a user is expected to take.
Step 4: Extract the text from each image as a full sentence.
Step 5: Describe the sentiment for each image with an explanation.

Answer and describe the steps taken:
"""


#### Create an input for the multimodal model

In [8]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to create an input for your multimodal model. Create your contents list using the variables above. Ensure the structure matches the format expected by the multimodal model.

# We define a list called 'contents' that combines both text and images in the order
# we want the model to process them. This list will be passed to our multimodal model,
# allowing it to read the text prompts as well as analyze the images.

# Explanation for Beginners:
# 1) instructions: A short prompt indicating that these images contain text.
# 2) image_ask_first_1: The first image (a Vertex AI Image object).
# 3) prompt1: A short textual query about the image's title.
# 4) image_dont_do_this_1: The second image (another Vertex AI Image object).
# 5) prompt2: A more detailed prompt describing the steps the model should take,
#    such as identifying the title from the filename, describing the image,
#    listing user actions, extracting text, and describing sentiment.

contents = [
    instructions,
    image_ask_first_1,   # The first image
    prompt1,
    image_dont_do_this_1,  # The second image
    prompt2
]
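
Because the order of items in `contents` determines what the model reads and when, it can help to check the interleaving before sending the request. The sketch below uses a hypothetical `FakeImage` stand-in so that it runs without the Vertex AI SDK; with the real list, the non-text entries would be reported by their actual class name instead.

```python
from dataclasses import dataclass


@dataclass
class FakeImage:
    """Hypothetical stand-in for a Vertex AI Image object."""
    name: str


def summarize_contents(contents: list) -> list[str]:
    """Label each prompt item as "text" or by the class name of the object."""
    summary = []
    for item in contents:
        if isinstance(item, str):
            summary.append("text")
        else:
            summary.append(type(item).__name__)
    return summary


example_contents = [
    "Instructions: Consider the following image that contains text:",
    FakeImage("Ask_first_1.png"),
    "What is the title of this image?",
    FakeImage("Dont_do_this_1.png"),
    "Answer the question through these steps:",
]
print(summarize_contents(example_contents))
```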




#### Generate responses from the multimodal model

In [9]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the next part of this task, you're going to generate responses from the multimodal model. Capture the output of the model in the "responses" variable by using your "contents" list.


# We use the `generate_content` method of our multimodal_model (Gemini 1.0 Pro Vision)
# to generate a response from the combined text + image content list.
# 
# Explanation for Beginners:
# 1) contents: A list of text strings and images we defined earlier, 
#    containing instructions, a short prompt, two images, and a final prompt. 
# 2) stream=True: The model will return partial outputs as they’re generated, 
#    allowing us to display them in real time or collect them incrementally.

responses = multimodal_model.generate_content(contents, stream=True)



#### Display the prompt and responses


In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the last part of this task, you're going to print your contents and responses with the prompt and responses title provided. Use descriptive titles to help organize the output (e.g., "Prompts", "Model Responses") and then display the prompt and responses by using the print() function. 

# Hint: "\n" inserts a newline character for clearer separation between the sections.


# We print the prompts (including text and images) that we've passed to the model,
# then display the model responses as they stream back from Gemini.
#
# Explanation for Beginners:
# 1) print_multimodal_prompt(contents): This function attempts to display images
#    if it encounters an Image object and simply prints out text otherwise.
# 2) We iterate over the responses (which arrive as a stream if stream=True)
#    and print out each chunk of text. By using 'end=""', we avoid extra newlines
#    between each chunk, producing a smoother concatenated output.

print("Prompts\n")
print_multimodal_prompt(contents)

print("\nModel Responses\n")
for response in responses:
    print(response.text, end="")
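
Printing each chunk is enough for the lab, but a streamed response is an iterator and is typically consumed only once, so re-running the loop on the same `responses` object prints nothing. A minimal sketch of collecting the chunks into one string follows; the `SimpleNamespace` objects are stand-ins for response chunks, and only their `.text` attribute is assumed.

```python
from types import SimpleNamespace


def collect_stream_text(chunks) -> str:
    """Concatenate the .text of every chunk in a streamed response."""
    parts = []
    for chunk in chunks:
        parts.append(chunk.text)
    return "".join(parts)


# Stand-in stream; a real one would come from generate_content(..., stream=True).
fake_stream = iter(
    [
        SimpleNamespace(text="The title "),
        SimpleNamespace(text="is 'Ask first'."),
    ]
)
full_text = collect_stream_text(fake_stream)
print(full_text)
```

Keeping `full_text` around lets you print or post-process the answer again without calling the model a second time.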




### To verify your work for Task 1.1, click __Check my progress__ in the lab instructions.

### Task 1.2. Similarity/Differences between images

#### Explore the variables of the task

In [11]:
# "RUN THIS CELL AS IS"

# You're going to work with provided variables in this task. First, review and describe the content/purpose of each variable below. 


# We define URLs for two new images, "Ask_first_3.png" and "Dont_do_this_3.png."
# We then load these URLs into Vertex AI Image objects using our helper function 
# `load_image_from_url`. We'll use these images for additional prompts later.

# Explanation for Beginners:
# 1) image_ask_first_3_url, image_dont_do_this_3_url:
#    These are direct links to images stored in a Google Cloud Storage bucket.
# 2) load_image_from_url(...):
#    Converts each image URL into a Vertex AI Image object. We'll pass these 
#    Image objects to our model for multimodal analysis.

image_ask_first_3_url = "https://storage.googleapis.com/spls/gsp520/Google_Branding/Ask_first_3.png"
image_dont_do_this_3_url = "https://storage.googleapis.com/spls/gsp520/Google_Branding/Dont_do_this_3.png"

image_ask_first_3 = load_image_from_url(image_ask_first_3_url)
image_dont_do_this_3 = load_image_from_url(image_dont_do_this_3_url)

# Next, we define three text prompts. The first two are simple markers 
# indicating "Image 1" and "Image 2," while the third prompt asks specific 
# questions comparing these two images.

# Explanation for Beginners:
# 1) prompt1: Introduces Image 1 context.
# 2) prompt2: Introduces Image 2 context.
# 3) prompt3: Requests details about what's shown in each image, 
#    how they compare, and how they differ in terms of text.

prompt1 = """
Consider the following two images:
Image 1:
"""
prompt2 = """
Image 2:
"""
prompt3 = """
1. What is shown in Image 1 and Image 2?
2. What is similar between the two images?
3. What is the difference between Image 1 and Image 2 in terms of the text?
"""




#### Create an input for the multimodal model

In [12]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to create an input for your multimodal model. Create your contents list using the variables above. Ensure the structure matches the format expected by the multimodal model.

# We assemble a 'contents' list to combine both text prompts and images in the order
# we want the model to see them. The model will process prompt1 (introducing Image 1),
# then see Image 1, then process prompt2 (introducing Image 2), then see Image 2,
# and finally read prompt3 (our detailed questions).

# Explanation for Beginners:
# 1) prompt1: Introduces the concept of Image 1.
# 2) image_ask_first_3: The first Vertex AI Image object.
# 3) prompt2: Introduces Image 2.
# 4) image_dont_do_this_3: The second Vertex AI Image object.
# 5) prompt3: A set of questions prompting the model to compare and contrast the images.

contents = [
    prompt1,            # Text that sets context for Image 1
    image_ask_first_3,  # The first image object
    prompt2,            # Text that sets context for Image 2
    image_dont_do_this_3,  # The second image object
    prompt3             # The final prompt with comparison questions
]


#### Set configuration parameters

In [13]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to set configuration parameters that will influence how the multimodal model generates text. These settings control aspects like the creativity and focus of the responses. Here's how:
# Temperature: Controls randomness. Lower values mean more predictable results, higher values mean more surprising and creative output.
# Top p / Top k: Affects how the model chooses words. Explore different values to see how they change the results.
# Other parameters: Check the model's documentation for additional options you might want to adjust.

# Store your configuration parameters in a generation_config variable. This improves reusability, allowing you to easily apply the same settings across tasks and make adjustments as needed.

from vertexai.generative_models import GenerationConfig

generation_config = GenerationConfig(
    temperature=0.0,    # Lower temperature => more deterministic responses
    top_p=0.8,          # Restrict to cumulative probability p for next word selection
    top_k=40,           # Restrict to top k likely tokens
    candidate_count=1,  # Number of candidate responses to generate
    max_output_tokens=2048  # Maximum tokens in the generated response
)
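
To build intuition for `top_k` and `top_p`, the sketch below applies both filters to a toy next-token distribution. It is a simplified illustration of the general sampling technique and not the service's actual implementation; the function name, the toy probabilities, and the order of the two filters are assumptions.

```python
def filter_top_k_top_p(
    probs: dict[str, float], top_k: int, top_p: float
) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Top-k: drop everything after the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep tokens until their cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy_probs = {"cat": 0.5, "dog": 0.25, "fish": 0.125, "bird": 0.125}
print(filter_top_k_top_p(toy_probs, top_k=40, top_p=0.8))
print(filter_top_k_top_p(toy_probs, top_k=2, top_p=0.8))
```

In the first call `top_p` is the binding constraint (three tokens survive); in the second, `top_k=2` cuts the candidate set first. With `temperature=0.0`, as in the cell above, decoding is close to always picking the single most likely token, so these filters matter less.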



#### Generate responses from the multimodal model


In [17]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the next part of this task, you're going to generate responses from a multimodal model. Capture the output of the model in the "responses" variable by using your "contents" list and the configuration settings.


responses = multimodal_model.generate_content(
    contents,
    generation_config=generation_config,
    stream=True
)



#### Display the prompt and responses

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the last part of this task, you're going to print your contents and responses with the prompt and responses title provided. Use descriptive titles to help organize the output (e.g., "Prompts", "Model Responses") and then display the prompt and responses by using the print() function. 

# Hint: "\n" inserts a newline character for clearer separation between the sections.

print("Prompts\n")
print_multimodal_prompt(contents)

print("\nModel Responses\n")
for response in responses:
    print(response.text, end="")


### To verify your work for Task 1.2, click __Check my progress__ in the lab instructions.

### Task 1.3. Generate a video description

#### Explore the variables of the task


In [19]:
# "RUN THIS CELL AS IS"

# You're going to work with provided variables in this task. 
# First, review and describe the content/purpose of each variable below. 


prompt = """
What is the product shown in this video?
What specific product features are highlighted in the video?
Where was this video filmed? Which cities around the world could potentially serve as the background in the video?
What is the sentiment of the video?
"""
video = Part.from_uri(
    uri="gs://spls/gsp520/google-pixel-8-pro.mp4",
    mime_type="video/mp4",
)

#### Create an input for the multimodal model

In [20]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to create an input for your multimodal model. Create your contents list using the variables above. Ensure the structure matches the format expected by the multimodal model.

contents = [
    prompt,
    video
]


#### Generate responses from the multimodal model

In [21]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the next part of this task, you're going to generate responses from a multimodal model. Capture the output of the model in the "responses" variable by using your "contents" list.


responses = multimodal_model.generate_content(contents, stream=True)


#### Display the prompt and responses

**Note:** If you encounter an authentication error when running the cell below, go to the **Navigation menu**, click **Vertex AI > Dashboard**, then click **Enable all Recommended APIs**. Then go back to cell 16 and run cells 16 and 17, followed by the cell below.

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the last part of this task, you're going to print your contents and responses with the prompt and responses title provided. Use descriptive titles to help organize the output (e.g., "Prompts", "Model Responses") and then display the prompt and responses by using the print() function. 

# Hint: "\n" inserts a newline character for clearer separation between the sections.

print("Prompts\n")
print_multimodal_prompt(contents)

print("\nModel Responses\n")
for response in responses:
    print(response.text, end="")


Proceed to Task 1.4 below (no progress check for Task 1.3 in lab instructions). 

### Task 1.4. Extract tags of objects throughout the video

#### Explore the variables of the task

In [23]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# You're going to work with provided variables in this task. First, review and describe the content/purpose of each variable below. 

prompt = """
Answer the following questions using the video only:

Which particular sport is highlighted in the video?
What values or beliefs does the advertisement communicate about the brand?
What emotions or feelings does the advertisement evoke in the audience?
Which tags associated with objects featured throughout the video could be extracted?
"""
video = Part.from_uri(
    uri="gs://spls/gsp520/google-pixel-8-pro.mp4",
    mime_type="video/mp4",
)

#### Create an input for the multimodal model

In [24]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to create an input for your multimodal model. Create your contents list using the variables above. Ensure the structure matches the format expected by the multimodal model.

contents = [
    prompt,
    video
]


#### Generate responses from the multimodal model

In [25]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the next part of this task, you're going to generate responses from a multimodal model. Capture the output of the model in the "responses" variable by using your "contents" list and video input.


responses = multimodal_model.generate_content(contents, stream=True)


#### Display the prompt and responses

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the last part of this task, you're going to print your contents and responses with the prompt and responses title provided. Use descriptive titles to help organize the output (e.g., "Prompts", "Model Responses") and then display the prompt and responses by using the print() function. 

# Hint: "\n" inserts a newline character for clearer separation between the sections.

print("Prompts\n")
print_multimodal_prompt(contents)

print("\nModel Responses\n")
for response in responses:
    print(response.text, end="")


Proceed to Task 1.5 below (no progress check for Task 1.4 in lab instructions). 

### Task 1.5. Ask more questions about a video

**Note:** Although this video contains audio, Gemini 1.0 Pro Vision does not currently support audio input and will answer based only on the visual content of the video.

#### Explore the variables of the task

In [27]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# You're going to work with provided variables in this task. 
# First, review and describe the content/purpose of each variable below. 


prompt = """
Answer the following questions using the video only:

How does the advertisement portray the use of technology, specifically AI, in capturing and preserving memories?
What visual cues or storytelling elements contribute to the nostalgic atmosphere of the advertisement?
How does the advertisement depict the role of friendship and social connections in enhancing experiences and creating memories?
Are there any specific features or functionalities of the phone highlighted in the advertisement, besides its AI capabilities?

Provide the answer in JSON.
"""
video = Part.from_uri(
    uri="gs://spls/gsp520/google-pixel-8-pro.mp4",
    mime_type="video/mp4",
)


#### Create an input for the multimodal model

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to create an input for your multimodal model. Create your contents list using the variables above. Ensure the structure matches the format expected by the multimodal model.

contents = [
    prompt,
    video
]


#### Generate responses from the multimodal model

In [28]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the next part of this task, you're going to generate responses from a multimodal model. Capture the output of the model in the "responses" variable by using your "contents" list and video input.


responses = multimodal_model.generate_content(contents, stream=True)


#### Display the prompt and responses

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the last part of this task, you're going to print your contents and responses with the prompt and responses title provided. Use descriptive titles to help organize the output (e.g., "Prompts", "Model Responses") and then display the prompt and responses by using the print() function. 

# Hint: "\n" inserts a newline character for clearer separation between the sections.

print("Prompts\n")
print_multimodal_prompt(contents)

print("\nModel Responses\n")
for response in responses:
    print(response.text, end="")
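
The prompt in this task asks for a JSON answer. Models often wrap JSON in a Markdown code fence, so the text may need light cleanup before parsing. The sketch below assumes the full answer has already been joined into one string; `parse_json_answer` and the sample text are hypothetical.

```python
import json

# Markdown code-fence marker, built programmatically to keep this example readable.
FENCE = "`" * 3


def parse_json_answer(raw_text: str):
    """Parse JSON from model output, tolerating an optional Markdown code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if it is present.
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    return json.loads(text)


sample = f'{FENCE}json\n{{"features": ["camera", "AI editing"]}}\n{FENCE}'
print(parse_json_answer(sample))
```

`json.loads` raises `json.JSONDecodeError` when the text is not valid JSON, which is a useful signal to tighten the prompt or retry.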


Proceed to Task 1.6 below (no progress check for Task 1.5 in lab instructions). 

### Task 1.6. Retrieve extra information beyond the video

#### Explore the variables of the task

In [30]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# You're going to work with provided variables in this task. 
# First, review and describe the content/purpose of each variable below. 


prompt = """
Answer the following questions using the video only:

How does the advertisement appeal to its target audience through its messaging and imagery?
What overall message or takeaway does the advertisement convey about the brand and its products?
Are there any symbolic elements or motifs used throughout the advertisement to reinforce its central themes?
What is the best hashtag for this video based on the description?

"""
video = Part.from_uri(
    uri="gs://spls/gsp520/google-pixel-8-pro.mp4",
    mime_type="video/mp4",
)

#### Create an input for the multimodal model

In [31]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Now, you're going to create an input for your multimodal model. Create your contents list using the variables above. Ensure the structure matches the format expected by the multimodal model.

contents = [
    prompt,
    video
]


#### Generate responses from the multimodal model

In [32]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the next part of this task, you're going to generate responses from a multimodal model. Capture the output of the model in the "responses" variable by using your "contents" list and video input.


responses = multimodal_model.generate_content(contents, stream=True)


#### Display the prompt and responses

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# In the last part of this task, you're going to print your contents and responses with the prompt and responses title provided. Use descriptive titles to help organize the output (e.g., "Prompts", "Model Responses") and then display the prompt and responses by using the print() function. 

# Hint: "\n" inserts a newline character for clearer separation between the sections.

print("Prompts\n")
print_multimodal_prompt(contents)

print("\nModel Responses\n")
for response in responses:
    print(response.text, end="")


### To verify your work for Task 1.6, click __Check my progress__ in the lab instructions.

## Task 2. Retrieving and integrating knowledge with multimodal retrieval augmented generation (RAG)

To complete Task 2, follow the instructions at the top of each notebook cell:
* Run the cells with the comment "RUN THIS CELL AS IS".
* Complete and run the cells with the comment "COMPLETE THE MISSING PART AND RUN THIS CELL".

For additional information about the available data and helper functions for Task 2, review the section named __Available data and helper functions for Task 2__ in the lab instructions.

### Setup and requirements for Task 2

#### Import libraries

In [34]:
# "RUN THIS CELL AS IS"

# Import necessary packages and libraries.

# We import various classes from the Vertex AI generative models library, as well as
# Markdown and display from IPython, which allows us to render text and objects
# nicely within a notebook environment.
#
# Explanation for Beginners (inline comments follow):
# 1) Markdown, display: Functions from IPython.display that help format and show
#    text output in a more readable way (e.g., as Markdown).
# 2) Content, GenerationConfig, GenerationResponse, GenerativeModel: Classes and
#    tools for configuring and handling responses from Vertex AI generative models.
# 3) HarmCategory, HarmBlockThreshold: Used to configure or interpret the model’s
#    safety checks, deciding whether certain content gets blocked or allowed.
# 4) Image, Part: Classes for handling images or multi-part content in a multimodal context.

from IPython.display import Markdown, display
from vertexai.generative_models import (
    Content,
    GenerationConfig,
    GenerationResponse,
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
    Image,
    Part,
)



#### Load the Gemini 1.0 Pro and Gemini 1.0 Pro Vision models

In [35]:
# "RUN THIS CELL AS IS"

# Load the Gemini 1.0 Pro and Gemini 1.0 Pro Vision models.

# We create two GenerativeModel instances from Vertex AI:
# 1) text_model: A text-focused Gemini model named "gemini-1.0-pro," primarily used for
#    text generation tasks (e.g., code, dialogue).
# 2) multimodal_model: A vision-capable Gemini model named "gemini-1.0-pro-vision," which
#    can handle text + image inputs for multimodal tasks.

# Explanation for Beginners:
# 1) "gemini-1.0-pro": This version is optimized for text generation (like Q&A or code).
# 2) "gemini-1.0-pro-vision": Extends capabilities to also process image and video data,
#    allowing you to pass images or video in the prompt; its responses are text.
# 3) By creating two separate models, you can choose the one that best fits
#    your current task (pure text vs. multimodal).
# 4) These models can still share parameters or underlying technology but are
#    configured for different types of input/output.

text_model = GenerativeModel("gemini-1.0-pro")
multimodal_model = GenerativeModel("gemini-1.0-pro-vision")


#### Download custom Python modules and utilities 

The cell below will download some helper functions needed for this notebook, to improve readability. You can also view the code (`intro_multimodal_rag_utils.py`) directly on [Github](https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retrieval-augmented-generation/utils/intro_multimodal_rag_utils.py).

In [36]:
# "RUN THIS CELL AS IS"

# We import modules needed for file and URL handling, as well as checking the Python environment.
# Explanation for Beginners:
# 1) os, sys: Python modules for interacting with the operating system and environment.
#    - os lets us work with paths, files, and directories.
#    - sys can provide info about the runtime environment (e.g., whether we’re in Colab).
# 2) urllib.request: A module that helps us download files from the internet.
#
# The code below:
# 1) Checks if a folder called "utils" exists. If not, it creates the folder.
# 2) Defines a base URL (url_prefix) pointing to a repository with our helper scripts.
# 3) Loops through a list of file names ("intro_multimodal_rag_utils.py") to download them
#    from the URL and save them locally in the "utils" folder.

import os
import urllib.request
import sys

# If the "utils" folder doesn't exist, we create it.
os.makedirs("utils", exist_ok=True)

# This is the base URL from which we'll download helper scripts.
url_prefix = (
    "https://raw.githubusercontent.com/"
    "GoogleCloudPlatform/generative-ai/main/"
    "gemini/use-cases/retrieval-augmented-generation/utils/"
)

# List of filenames to fetch from the specified URL.
files = ["intro_multimodal_rag_utils.py"]

# We iterate over each file and download it to our "utils" directory.
for fname in files:
    urllib.request.urlretrieve(
        f"{url_prefix}{fname}",   # Full URL of the file (url_prefix already ends with "/")
        filename=f"utils/{fname}" # Local path where the file will be saved
    )



#### Get documents and images from Cloud Storage

In [None]:
# "RUN THIS CELL AS IS"

# Download documents and images used in this notebook.

!gsutil -m rsync -r gs://spls/gsp520 .
print("Download completed")

### Task 2.1. Build metadata of documents containing text and images

__Note__: These steps may take a few minutes to complete.

#### Import helper functions to build metadata

In [38]:
# "RUN THIS CELL AS IS"

# Import helper functions from utils.
from utils.intro_multimodal_rag_utils import get_document_metadata

#### Explore the variables of the task

In [39]:
# "RUN THIS CELL AS IS"

# You're going to work with provided variables in this task. 
# First, review and describe the content/purpose of each variable below. 


# Specify the path to the folder that contains the PDF file(s) to process.

pdf_folder_path = "Google_Branding/"  # if running in Vertex AI Workbench.

# Specify the image description prompt (you can change it for other documents).
image_description_prompt = """Explain what is going on in the image.
If it's a table, extract all elements of the table.
If it's a graph, explain the findings in the graph.
Do not include any numbers that are not mentioned in the image.
"""


#### Extract and store metadata of text and images from a document

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Call the "get_document_metadata" function from the utils file to extract text and image metadata from the PDF document. Store the results in two different DataFrames: "text_metadata_df" and "image_metadata_df".  
# text_metadata_df: This will contain extracted text snippets, their corresponding page numbers, and potentially other relevant information.
# image_metadata_df: This will contain descriptions of the images found in the PDF (if any), along with their location within the document.

# We call the get_document_metadata function to process PDFs, extracting text and images, 
# then generating embeddings for both. The function returns two DataFrames: 
# 1) text_metadata_df: Contains page-by-page and chunk-level text info and embeddings.
# 2) image_metadata_df: Contains extracted image info and embeddings.

# Explanation for Beginners:
# 1) get_document_metadata: A function (likely from our "intro_multimodal_rag_utils.py") 
#    that handles reading PDF documents, extracting text chunks, images, 
#    generating textual and image embeddings, and storing all this in DataFrames.
# 2) multimodal_model: This is an instance of a Vertex AI GenerativeModel 
#    that can handle text + image data. It is the "gemini-1.0-pro-vision" model 
#    loaded earlier in this notebook.
# 3) pdf_folder_path: The local folder path containing our PDFs to process.
# 4) image_save_dir="images": Extracted images will be saved in a local "images" folder.
# 5) image_description_prompt: A text prompt that the model uses to describe each extracted image.
# 6) embedding_size=1408: The dimension of embeddings for text and image data. 
#    Higher dimensional embeddings can capture more nuanced information.
# 7) add_sleep_after_page=True: Tells the script to pause after processing each page 
#    to avoid hitting API rate limits or quotas.
# 8) sleep_time_after_page=5: Defines how many seconds to wait after each page 
#    before proceeding to the next one.

text_metadata_df, image_metadata_df = get_document_metadata(
    multimodal_model,             # The Gemini 1.0 Pro Vision model loaded earlier
    pdf_folder_path,
    image_save_dir="images",
    image_description_prompt=image_description_prompt,
    embedding_size=1408,
    add_sleep_after_page=True,    # Pause after each page to manage quotas
    sleep_time_after_page=5,      # Number of seconds to pause after each page
)


print("\n\n --- Completed processing. ---")

# NOTE: This can take a few minutes to complete
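
The two pause-related arguments are easier to picture with a small stand-alone sketch. The helper's actual implementation is not shown in this notebook, so `process_pages`, `handle_page`, and `sleep_fn` below are hypothetical names used only for illustration. Only the parameter names `add_sleep_after_page` and `sleep_time_after_page` come from the call above.

```python
import time
from typing import Callable, Iterable, List


def process_pages(
    pages: Iterable[str],
    handle_page: Callable[[str], str],
    add_sleep_after_page: bool = False,
    sleep_time_after_page: float = 2.0,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process pages one by one, optionally pausing after each page.

    Hypothetical illustration; not the code from intro_multimodal_rag_utils.py.
    """
    results = []
    for page in pages:
        results.append(handle_page(page))
        if add_sleep_after_page:
            # Space out consecutive model calls to stay under rate limits.
            sleep_fn(sleep_time_after_page)
    return results


# Record the requested pauses instead of really sleeping, so the demo is instant.
pauses = []
summaries = process_pages(
    ["page 1 text", "page 2 text", "page 3 text"],
    handle_page=str.upper,        # stand-in for a per-page model call
    add_sleep_after_page=True,
    sleep_time_after_page=5,
    sleep_fn=pauses.append,
)
print(summaries)
print(pauses)
```

With real sleeping enabled, a document with many pages spends five seconds per page waiting, which is one reason this step can take a few minutes.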


#### Inspect the processed text metadata

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Explore the text_metadata_df dataframe by displaying the first few rows of the dataframe.

text_metadata_df.head()

#### Import the helper functions to implement RAG

In [42]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Import helper functions from utils.

from utils.intro_multimodal_rag_utils import (
    get_similar_text_from_query,
    print_text_to_text_citation,
    get_similar_image_from_query,
    print_text_to_image_citation,
    get_gemini_response,
    display_images,
)

Proceed to Task 2.2 below (no progress check for Task 2.1 in lab instructions). 

### Task 2.2. Create a user query

#### Explore the variables of the task

In [43]:
# "RUN THIS CELL AS IS"

# You're going to work with provided variables in this task. 
# First, review and describe the content/purpose of each variable below. 

query = """Questions:
 - What are the key expectations that users can have from Google regarding the provision and development of its services?
- What specific rules and guidelines are established for users when using Google services?
- How does Google handle intellectual property rights related to the content found within its services, including content owned by users, Google, and third parties? 
- What legal rights and remedies are available to users in case of problems or disagreements with Google?
- How do the service-specific additional terms interact with these general Terms of Service, and which terms take precedence in case of any conflicts?
 """

Proceed to Task 2.3 below (no progress check for Task 2.2 in lab instructions).

### Task 2.3. Get all relevant text chunks

#### Retrieve relevant chunks of text based on the query

In [44]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Call the "get_similar_text_from_query" function from the utils file to retrieve relevant chunks of text based on the query. Store the results in a dictionary called "matching_results_chunks_data".  
# matching_results_chunks_data: This dictionary will contain file_name, page_num, cosine_score, chunk_number and chunk_text for each match. Each entry represents a search result for the query against text_metadata_df.

# We retrieve the top 3 most relevant text chunks from the text_metadata_df DataFrame 
# by matching them against our query. We use the "text_embedding_chunk" column, which
# stores the embeddings for each text chunk.

# Explanation for Beginners:
# 1) get_similar_text_from_query: A function that takes in a query string, a text DataFrame,
#    and compares the embeddings of the query with the embeddings of each text chunk.
# 2) column_name="text_embedding_chunk": Points to the column in text_metadata_df that holds
#    the chunk-level text embeddings.
# 3) top_n=3: Only returns the top 3 matches to our query for easier inspection.
# 4) chunk_text=True: Tells the function to return the actual chunk of text so we can see 
#    what's in it, instead of returning just page-level text or metadata.

matching_results_chunks_data = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=3,
    chunk_text=True,
)
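
To make the idea of "comparing embeddings" concrete, here is a minimal sketch of top-n retrieval by cosine similarity. It is not the helper's actual code: the function names `cosine_similarity` and `top_n_chunks`, the two-dimensional toy vectors, and the chunk texts are all made up for illustration. Real embeddings have hundreds of dimensions and come from an embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_chunks(
    query_embedding: Sequence[float],
    chunks: List[dict],
    top_n: int = 3,
) -> Dict[int, dict]:
    """Rank chunks by similarity to the query and keep the best top_n.

    Returns a dictionary keyed by rank (0 = best match), similar in shape
    to the results dictionary used in this notebook.
    """
    scored = [
        {
            "chunk_text": chunk["chunk_text"],
            "cosine_score": cosine_similarity(query_embedding, chunk["embedding"]),
        }
        for chunk in chunks
    ]
    scored.sort(key=lambda item: item["cosine_score"], reverse=True)
    return dict(enumerate(scored[:top_n]))


# Toy data: invented chunk texts with hand-written 2-D "embeddings".
toy_chunks = [
    {"chunk_text": "Chunk about topic A", "embedding": [1.0, 0.0]},
    {"chunk_text": "Chunk about topic B", "embedding": [0.0, 1.0]},
    {"chunk_text": "Chunk about topics A and B", "embedding": [1.0, 1.0]},
]
toy_query_embedding = [1.0, 0.2]  # points mostly toward topic A

toy_results = top_n_chunks(toy_query_embedding, toy_chunks, top_n=2)
for rank, match in toy_results.items():
    print(rank, match["chunk_text"], round(match["cosine_score"], 3))
```

Because the query vector points mostly along the first axis, the topic A chunk ranks first and the mixed chunk second, while the topic B chunk is dropped by `top_n=2`.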




#### Display the first item of the text chunk dictionary

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Explore the first item in your matching_results_chunks_data dictionary by displaying the first item.

matching_results_chunks_data[0]


Proceed to Task 2.4 below (no progress check for Task 2.3 in lab instructions).

### Task 2.4. Create context_text

#### Create a list to store the combined chunks of text

In [46]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"


# Create an empty list named "context_text". This list will be used to store the combined chunks of text.
context_text = []



#### Iterate through each item in the text chunks dictionary

In [48]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Create a for loop to iterate through each item in the matching_results_chunks_data dictionary in order to combine all the selected relevant text chunks

# We iterate over the matching_results_chunks_data dictionary, which stores 
# information about each text chunk that was found relevant to our query.
#
# Explanation for Beginners:
# 1) matching_results_chunks_data is a dictionary where each key is an index
#    for a matched chunk, and each value is another dictionary containing metadata
#    like file_name, page_num, and chunk_text.
# 2) context_text is a list that we'll later combine or pass to a model 
#    so it can reference all the relevant chunks when generating an answer.
# 3) value["chunk_text"] is the actual text content of the chunk. We add it 
#    to the context_text list so we have a consolidated set of relevant text excerpts.

for key, value in matching_results_chunks_data.items():
    context_text.append(value["chunk_text"])


#### Join all the text chunks and store them in a single string

In [49]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Take all of the individual text chunks stored in the context_text list and join them together into a single string named final_context_text. The "\n" separator inserts a newline character between each chunk, effectively creating separate lines or paragraphs.

final_context_text = "\n".join(context_text)
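
The collect-then-join pattern from the last three cells can be tried on toy data without running the retrieval step. The dictionary below is a hand-written stand-in for the results dictionary. Its file names, page numbers, and texts are invented, and the `toy_` variable names are used so nothing in the notebook is overwritten.

```python
# Hand-written stand-in for the retrieval results (values are invented).
toy_matching_results = {
    0: {"file_name": "example.pdf", "page_num": 1, "chunk_text": "First relevant chunk."},
    1: {"file_name": "example.pdf", "page_num": 4, "chunk_text": "Second relevant chunk."},
}

# Collect only the text of each match, in rank order.
toy_context_text = [value["chunk_text"] for value in toy_matching_results.values()]

# Join the chunks into one string, one chunk per line.
toy_final_context_text = "\n".join(toy_context_text)

print(toy_context_text)
print(toy_final_context_text)
```

One practical difference: a list comprehension rebuilds the list every time the cell runs, whereas re-running a cell that calls `append` on an existing list adds the same chunks again. If you re-run the loop cell above, re-run the cell that creates the empty `context_text` list first.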


Proceed to Task 2.5 below (no progress check for Task 2.4 in lab instructions).

### Task 2.5. Pass context to Gemini

#### Explore the variables of the task


In [50]:
# "RUN THIS CELL AS IS"

# You're going to work with provided variables in this task. First, review and describe the content/purpose of each variable below. 

prompt = f""" Instructions: Compare the images and the text provided as Context: to answer multiple Question:
Make sure to think thoroughly before answering the question and put the necessary steps to arrive at the answer in bullet points for easy explainability.
If unsure, respond, "Not enough context to answer".

Context:
 - Text Context:
 {final_context_text}


{query}

Answer:
"""

#### Generate Gemini response with streaming output

In [None]:
# "COMPLETE THE MISSING PART AND RUN THIS CELL"

# Call the "get_gemini_response" function from the utils module to generate a Gemini response with streaming output. This function takes a multimodal Gemini model, a text prompt, and configuration parameters, and instructs the model to generate a response to the prompt. Because streaming is enabled, you receive chunks of the response as they are produced. 
# Format the streamed output using Markdown syntax for easy readability and conversion to HTML.

# We import Markdown from IPython.display to format our model’s response in a more readable way.
# We also import display so we can display rich outputs inline.

from IPython.display import Markdown, display

# Explanation for Beginners:
# 1) Markdown: This class renders a string of text as Markdown, 
#    enabling headings, bullet points, bold/italics, etc.
# 2) display(...): Allows us to show rendered objects (like Markdown) inline in Jupyter.
# 3) get_gemini_response: A custom function (likely from your utils) that calls 
#    the Gemini model, passing along a prompt plus optional config parameters.

# Below, we actually call get_gemini_response on our multimodal_model:
# 1) model_input=[prompt] sends a list containing a single string prompt (the list could also include images).
# 2) stream=True returns the model’s response in chunks (streaming).
# 3) generation_config=GenerationConfig(temperature=1) sets the model’s “creativity” level;
#    a higher temperature allows for more varied and exploratory answers.

display(
    Markdown(
        get_gemini_response(
            multimodal_model,                   # Our Gemini 1.0 Pro Vision (multimodal) model
            model_input=[prompt],               # The prompt we want to generate a response for
            stream=True,                        # Stream partial outputs in real time
            generation_config=GenerationConfig( # Configuration for how the model generates text
                temperature=1                    # Higher temperature => more creative output
            ),
        )
    )
)
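
Streaming means the response arrives as a sequence of pieces rather than one finished string. The sketch below shows how such pieces can be accumulated before being rendered. It makes no API call: `fake_stream` is a hypothetical stand-in for a streamed model response, and `collect_stream` is an illustrative name, not the helper from the utils file.

```python
from typing import Iterable, Iterator, List


def fake_stream() -> Iterator[str]:
    """Stand-in for a streamed response: yields the text piece by piece."""
    yield "- First point"
    yield " of the answer.\n"
    yield "- Second point of the answer."


def collect_stream(chunks: Iterable[str]) -> str:
    """Gather streamed pieces in arrival order and join them into one string."""
    parts: List[str] = []
    for chunk in chunks:
        # Each piece is available as soon as it is produced.
        parts.append(chunk)
    return "".join(parts)


full_text = collect_stream(fake_stream())
print(full_text)
```

In a notebook, passing `full_text` to `display(Markdown(...))` would render the two bullet points as a formatted list.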




### To verify your work for Task 2.5, click __Check my progress__ in the lab instructions.