In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Multimodal Retrieval Augmented Generation (RAG) using Gemini API in Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/intro_multimodal_rag.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fretrieval-augmented-generation%2Fintro_multimodal_rag.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/intro_multimodal_rag.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retrieval-augmented-generation/intro_multimodal_rag.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>    
</table>

<div style="clear: both;"></div>


| | |
|-|-|
|Author(s) | [Lavi Nigam](https://github.com/lavinigam-gcp) |

<div class="alert alert-block alert-warning">
<b>⚠️ There is a new version of this notebook with new data and some modifications here:  ⚠️</b>
</div>

[**building_DIY_multimodal_qa_system_with_mRAG.ipynb**](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/qa-ops/building_DIY_multimodal_qa_system_with_mRAG.ipynb)

You can, however, still use this notebook as it is fully functional and has updated Gemini and text-embedding models.

## Overview

Retrieval augmented generation (RAG) has become a popular paradigm for giving LLMs access to external data and for grounding their responses to mitigate hallucinations.

In this notebook, you will learn how to perform multimodal RAG where you will perform Q&A over a financial document filled with both text and images.

### Gemini

Gemini is a family of generative AI models developed by Google DeepMind that is designed for multimodal use cases. This notebook uses the Gemini 1.5 Pro and Gemini 1.5 Flash models through the Gemini API in Vertex AI.

### Comparing text-based and multimodal RAG

Multimodal RAG offers several advantages over text-based RAG:

1. **Enhanced knowledge access:** Multimodal RAG can access and process both textual and visual information, providing a richer and more comprehensive knowledge base for the LLM.
2. **Improved reasoning capabilities:** By incorporating visual cues, multimodal RAG can make better informed inferences across different types of data modalities.

This notebook shows you how to use RAG with Gemini API in Vertex AI, [text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings), and [multimodal embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/multimodal-embeddings), to build a document search engine.

Through hands-on examples, you will discover how to construct a multimedia-rich metadata repository of your document sources, enabling search, comparison, and reasoning across diverse information streams.

### Objectives

This notebook provides a guide to building a document search engine using multimodal retrieval augmented generation (RAG), step by step:

1. Extract and store metadata of documents containing both text and images, and generate embeddings for the documents
2. Search the metadata with text queries to find similar text or images
3. Search the metadata with image queries to find similar images
4. Using a text query as input, search for contextual answers using both text and images

### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Getting Started

### Install Vertex AI SDK for Python and other dependencies

In [None]:
%pip install --upgrade --user google-cloud-aiplatform pymupdf rich colorama

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench) or [Colab Enterprise](https://cloud.google.com/colab/docs).

In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Define Google Cloud project information

In [None]:
# We define project-specific information and demonstrate how to retrieve the
# project ID automatically if we are not running in Google Colab.
#
# Explanation for Beginners:
# 1) PROJECT_ID: This variable holds the ID of your Google Cloud project 
#    (e.g., "my-gcp-project") so that all operations in Vertex AI can be 
#    correctly associated with your account.
# 2) LOCATION: This variable represents the region where your Vertex AI resources 
#    will be created (e.g., "us-central1"). It's important to match this 
#    to where your project and resources are located.
# 3) Checking "google.colab" in sys.modules: This helps us detect whether 
#    our code is running in Google Colab. If it's not, we assume a local 
#    environment, where we attempt to retrieve the project ID from "gcloud" 
#    config automatically. 
# 4) subprocess.check_output(["gcloud", ...]): We execute a gcloud command 
#    in the background to fetch the current gcloud-configured project ID, 
#    then strip() removes extra whitespace or newline characters.

import sys

PROJECT_ID = "[your project here]"  # @param {type:"string"}
LOCATION = "[your location here]"   # @param {type:"string"}

# If this notebook is not running on Google Colab, we attempt to retrieve
# the default project ID from the user's local gcloud settings.
if "google.colab" not in sys.modules:
    import subprocess

    # Run a gcloud command to get the current config's project ID.
    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

# Finally, we print out the project ID that we are using so we can confirm it's correct.
print(f"Your project ID is: {PROJECT_ID}")
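
The cell above leaves `LOCATION` (and, on Colab, `PROJECT_ID`) as a bracketed placeholder until you edit it. As a minimal sketch, a small check like the one below could catch unedited placeholders before `vertexai.init` is called. The function name `validate_gcp_settings` and the example values are hypothetical and not part of the notebook's utilities.

```python
def validate_gcp_settings(project_id: str, location: str) -> None:
    """Raise ValueError if either setting is empty or still a template placeholder."""
    for name, value in (("PROJECT_ID", project_id), ("LOCATION", location)):
        cleaned = value.strip()
        # Placeholders in this notebook look like "[your project here]".
        if not cleaned or (cleaned.startswith("[") and cleaned.endswith("]")):
            raise ValueError(f"{name} is not set: {value!r}")


# Example with hypothetical values; substitute your own project and region.
validate_gcp_settings("my-gcp-project", "us-central1")
```

In the notebook you would call it as `validate_gcp_settings(PROJECT_ID, LOCATION)` right before initializing Vertex AI.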


In [2]:
# We import the sys module, which can be helpful for environment checks, 
# though it is not strictly necessary for initializing Vertex AI.
import sys

# Explanation for Beginners:
# 1) vertexai: This library allows us to interact with Google Cloud's Vertex AI.
#    By calling vertexai.init, we specify which project and location we want 
#    to use for our AI resources (models, endpoints, etc.).
# 2) PROJECT_ID and LOCATION: These variables should be set before running this cell,
#    so we can properly tell Vertex AI which GCP project to bill and which region to host in.
# 3) vertexai.init(...): This call "logs us in" to our project’s Vertex AI environment.

# We import the vertexai library to interact with Vertex AI services.
import vertexai

# We initialize Vertex AI using the project and location defined earlier.
vertexai.init(project=PROJECT_ID, location=LOCATION)


### Import libraries

In [3]:
# We import tools to display rich and Markdown-formatted text within a Jupyter environment.
# For example, 'Markdown' and 'rich_Markdown' let us render stylized text output.
# The 'vertexai.generative_models' library gives us access to:
#   1) GenerationConfig, which configures generative model parameters (like temperature).
#   2) GenerativeModel, the main class for text, code, or multimodal content generation.
#   3) Image, a class for wrapping image data that is passed to a model as input.
from IPython.display import Markdown, display  # Display objects (e.g., Markdown) in a Jupyter notebook
from rich.markdown import Markdown as rich_Markdown  # A rich-text Markdown renderer for console output

# From the Vertex AI generative models library:
from vertexai.generative_models import (
    GenerationConfig,  # Provides options for controlling generation (e.g., temperature, max tokens)
    GenerativeModel,   # The foundational class for working with Vertex AI generative models
    Image              # A class that represents an image supplied as input in a model prompt
)


### Load the Gemini 1.5 Pro and Gemini 1.5 Flash models

In [4]:
# Here, we instantiate three Vertex AI GenerativeModel instances. Each constructor call
# includes the name of a specific model we want to use:
# 
# 1) text_model:      A model configured with "gemini-1.5-pro" for text-only tasks.
# 2) multimodal_model:        Also "gemini-1.5-pro," but we may use it for multimodal tasks
#                             like text, image, or video.
# 3) multimodal_model_flash:  A faster (flash) variant of the 1.5 Gemini model
#                             for reduced latency at potential trade-off in quality.
text_model = GenerativeModel("gemini-1.5-pro")       # For text or code generation
multimodal_model = GenerativeModel("gemini-1.5-pro") # For multimodal tasks (text, images, etc.)
multimodal_model_flash = GenerativeModel("gemini-1.5-flash")  # Faster inference version


### Download custom Python utilities & required files

The cell below downloads the helper functions needed for this notebook, which are kept in a separate file to improve readability. It also downloads the other required files. You can view the code for the utilities (`intro_multimodal_rag_utils.py`) [here](https://storage.googleapis.com/github-repo/rag/intro_multimodal_rag/intro_multimodal_rag_old_version/intro_multimodal_rag_utils.py).

In [None]:
# We download documents and images from a Google Cloud Storage bucket into our local directory.
# We use the 'gsutil -m rsync -r' command to perform a recursive synchronization, 
# ensuring that the local folder matches the contents of the specified bucket directory.

# Explanation for Beginners:
# 1) gsutil is a command-line tool for working with Google Cloud Storage (GCS).
# 2) -m flag enables parallel (multi-threaded) transfer for faster sync.
# 3) rsync -r recursively syncs files and subfolders between the GCS path and our local directory.
!gsutil -m rsync -r gs://github-repo/rag/intro_multimodal_rag/intro_multimodal_rag_old_version .
print("Download completed")


## Building metadata of documents containing text and images

### The data

The source data that you will use in this notebook is a modified version of [Google-10K](https://abc.xyz/assets/investor/static/pdf/20220202_alphabet_10K.pdf), which provides a comprehensive overview of the company's financial performance, business operations, management, and risk factors. Because the original document is rather large, you will use a shortened version with only 14 pages, split into two parts: [Part 1](https://storage.googleapis.com/github-repo/rag/intro_multimodal_rag/intro_multimodal_rag_old_version/data/google-10k-sample-part1.pdf) and [Part 2](https://storage.googleapis.com/github-repo/rag/intro_multimodal_rag/intro_multimodal_rag_old_version/data/google-10k-sample-part2.pdf). Although it is truncated, the sample document still contains text along with images such as tables, charts, and graphs.

### Import helper functions to build metadata

Before building the multimodal RAG system, it's important to have metadata of all the text and images in the document. For reference and citation purposes, the metadata should contain essential elements, including page number, file name, image counter, and so on. As a next step, you will generate embeddings from the metadata, which are required to perform similarity search when querying the data.
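
To make the citation idea concrete, here is a minimal sketch of a record that carries the fields mentioned above (file name, page number, image counter) and formats them as a citation. The class `SourceRef`, its field names, and the page and image numbers are hypothetical illustrations, not the structure used by the notebook's helper functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    """Where a retrieved item came from (hypothetical field names)."""

    file_name: str
    page_num: int
    img_num: Optional[int] = None  # None for text; image counter for images

    def cite(self) -> str:
        """Format the record as a short human-readable citation."""
        base = f"{self.file_name}, page {self.page_num}"
        if self.img_num is None:
            return base
        return f"{base}, image {self.img_num}"


# Hypothetical page and image numbers, used only for illustration.
text_ref = SourceRef("google-10k-sample-part1.pdf", 3)
image_ref = SourceRef("google-10k-sample-part1.pdf", 3, img_num=1)
print(text_ref.cite())
print(image_ref.cite())
```

Keeping these fields next to each embedding is what later lets a retrieved chunk or image be traced back to its place in the source PDF.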

In [6]:
# We import the `get_document_metadata` function from our `intro_multimodal_rag_utils` module.
# This function processes PDFs by extracting both text and images, generating embeddings,
# and returning two DataFrames: one for text metadata, and one for image metadata.
#
# Explanation for Beginners:
# 1) `intro_multimodal_rag_utils`: This is a custom utility module that contains various
#    helper functions for multimodal Retrieval-Augmented Generation (RAG).
# 2) `get_document_metadata`: Specifically handles extracting text and images from PDFs,
#    then embedding them for later similarity searches and generative tasks.

from intro_multimodal_rag_utils import get_document_metadata


### Extract and store metadata of text and images from a document

You just imported a function called `get_document_metadata()`. This function extracts text and image metadata from a document, and returns two dataframes, namely *text_metadata* and *image_metadata*, as outputs. If you want to find out more about how the `get_document_metadata()` function is implemented using Gemini and the embedding models, you can take a look at the [source code](https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retrieval-augmented-generation/utils/intro_multimodal_rag_utils.py) directly.

Both text metadata and image metadata are extracted and stored because either one alone is often not sufficient to produce a relevant answer. For example, the relevant answer could be in visual form within a document, but text-based RAG cannot take the images into consideration. You will explore this example later in this notebook.

In the next step, you will use the function to extract and store metadata of text and images from a document. Please note that the following cell may take a few minutes to complete:

Note:

The current implementation works best:

* if your documents are a combination of text and images.
* if the tables in your documents are available as images.
* if the images in the document don't require too much context.

Additionally,

* If you want to run this on text-only documents, use standard text-based RAG.
* If your documents contain particular domain knowledge, pass that information in the prompt below.

<div class="alert alert-block alert-warning">
<b>⚠️ Do not send more than 50 pages through the logic below. It is not designed for that, and you will run into quota issues. ⚠️</b>
</div>
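
As a hedged sketch, you could check the page count before running the extraction. The helper names below (`check_page_budget`, `count_pdf_pages`) are hypothetical, and the code assumes the 50-page limit applies to the total across all PDFs in the folder, which the warning does not state explicitly. `count_pdf_pages` relies on PyMuPDF (`fitz`), which was installed at the start of this notebook.

```python
from pathlib import Path

MAX_PAGES = 50  # limit quoted in the warning above


def check_page_budget(page_counts: dict[str, int], max_pages: int = MAX_PAGES) -> int:
    """Return the total page count, or raise ValueError if it exceeds max_pages."""
    total = sum(page_counts.values())
    if total > max_pages:
        raise ValueError(f"{total} pages exceeds the {max_pages}-page limit")
    return total


def count_pdf_pages(folder: str) -> dict[str, int]:
    """Map each PDF file name in `folder` to its page count using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the budget check works without it

    counts = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with fitz.open(str(pdf_path)) as doc:
            counts[pdf_path.name] = doc.page_count
    return counts


# Hypothetical per-file counts; in the notebook you could instead pass
# count_pdf_pages("data/") once the files have been downloaded.
print(check_page_budget({"part1.pdf": 7, "part2.pdf": 7}))
```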

In [None]:
# We specify the folder containing our PDF files. You can adjust this path
# depending on your environment:
#  - "/content/data/" if you're running in Google Colab/Colab Enterprise, or
#  - "data/" if running in Vertex AI Workbench.

# Explanation for Beginners:
# 1) pdf_folder_path: Holds the path to the directory where PDFs are stored.
# 2) image_description_prompt: A text prompt used by the Gemini model to
#    describe any images extracted from the PDFs.
#    - If the PDF has tables, the prompt asks to extract their elements.
#    - If it has graphs, it should explain the findings of those graphs.
#    - The prompt also instructs not to invent or guess numbers not actually in the image.

# Once we have these set, we call get_document_metadata with:
#   - The generative model (multimodal_model, i.e., Gemini 1.5 Pro).
#   - The folder path (pdf_folder_path).
#   - A directory to save images (image_save_dir).
#   - A custom prompt (image_description_prompt) that the model will use to describe images.
#   - embedding_size=1408 for more expressive embeddings.
#   - add_sleep_after_page=True to minimize quota issues by waiting after processing each page.
#   - sleep_time_after_page=5 to set the wait duration (in seconds).

# After processing, we get two DataFrames:
#   1) text_metadata_df: Holds extracted text with embeddings.
#   2) image_metadata_df: Holds extracted images with embeddings and descriptions.

# pdf_folder_path = "/content/data/"  # If running in Google Colab/Colab Enterprise
pdf_folder_path = "data/"             # If running in Vertex AI Workbench

image_description_prompt = """Explain what is going on in the image.
If it's a table, extract all elements of the table.
If it's a graph, explain the findings in the graph.
Do not include any numbers that are not mentioned in the image.
"""

# We call get_document_metadata to process each PDF, embedding text and images.
text_metadata_df, image_metadata_df = get_document_metadata(
    multimodal_model,  # Using the Gemini 1.5 Pro model for analysis
    pdf_folder_path,
    image_save_dir="images",
    image_description_prompt=image_description_prompt,
    embedding_size=1408,
    add_sleep_after_page=True,
    sleep_time_after_page=5,
    # generation_config=...,  # Optionally specify custom generation settings
    # safety_settings=...,    # Optionally specify custom safety settings
)

print("\n\n --- Completed processing. ---")


In [None]:
# Below are some optional parameters you can pass to the Gemini model via a GenerationConfig object 
# and safety_settings dictionary. They are currently commented out, but you can uncomment them if needed.

# Explanation for Beginners (inline comments):
# 1) generation_config: Provides fine-grained control over how the model generates responses.
#    - temperature: Higher values produce more diverse responses.
#    - max_output_tokens: Limits the length of the generated response.
# 2) safety_settings: Allows you to specify thresholds for blocking or filtering certain types of content.
#    - BLOCK_NONE means no blocking for that specific harm category (e.g., HARASSMENT, HATE_SPEECH).
# 3) If you encounter "Content has no parts" or "Exception occurred while calling gemini" errors,
#    you might lower or remove some of these thresholds. You can pass these parameters to 
#    the 'get_gemini_response' function or directly to your calls that generate content.

# Reference:
# - Gemini parameters: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini
# - Safety attribute configuration: https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/configure-safety-attributes

# generation_config = GenerationConfig(
#     temperature=0.2,
#     max_output_tokens=2048,
# )

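# HarmCategory and HarmBlockThreshold must be imported before use:
# from vertexai.generative_models import HarmBlockThreshold, HarmCategory
# safety_settings = {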
#     HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
#     HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
#     HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
#     HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
# }

# You can then use them like so:
# response = get_gemini_response(
#     generative_multimodal_model=multimodal_model,  # or your chosen model
#     model_input=some_input_list,
#     generation_config=generation_config,
#     safety_settings=safety_settings,
# )


#### Inspect the processed text metadata


The following cell will produce a metadata table which describes the different parts of text metadata, including:

- **text**: the original text from the page
- **text_embedding_page**: the embedding of the original text from the page
- **chunk_text**: the original text divided into smaller chunks
- **chunk_number**: the index of each text chunk
- **text_embedding_chunk**: the embedding of each text chunk
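
To illustrate how page text might relate to `chunk_text` and `chunk_number`, here is a minimal sketch of splitting text into overlapping character windows. This is an assumption-laden illustration: the function `chunk_text`, its default sizes, and the 1-based numbering are hypothetical and may differ from what `get_document_metadata()` actually does.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> dict[int, str]:
    """Split text into overlapping character windows keyed by 1-based chunk number."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = {}
    for number, start in enumerate(range(0, len(text), step), start=1):
        chunks[number] = text[start : start + chunk_size]
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


page_text = "abcdefghij" * 3  # 30 characters standing in for a page of text
for number, chunk in chunk_text(page_text, chunk_size=12, overlap=2).items():
    print(number, chunk)
```

Overlap keeps a sentence that straddles a boundary present in both neighboring chunks, at the cost of embedding some text twice.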

In [None]:
# We use the .head() method to display the first five rows of our text_metadata_df.
# This DataFrame contains metadata for extracted text from the PDFs, 
# such as file name, page number, chunked text, and embeddings.

text_metadata_df.head()


#### Inspect the processed image metadata

The following cell will produce a metadata table which describes the different parts of image metadata, including:
* **img_desc**: Gemini-generated textual description of the image.
* **mm_embedding_from_text_desc_and_img**: Combined embedding of image and its description, capturing both visual and textual information.
* **mm_embedding_from_img_only**: Image embedding without description, for comparison with description-based analysis.
* **text_embedding_from_image_description**: Separate text embedding of the generated description, enabling textual analysis and comparison.

In [None]:
# We use the .head() method to inspect the first five rows of our image_metadata_df.
# This DataFrame stores metadata about extracted images from the PDFs, including:
#   - 'file_name': Which PDF the image came from
#   - 'page_num': The PDF page number
#   - 'img_num': The image index on that page
#   - 'img_path': The local file path where the extracted image was saved
#   - 'img_desc': The description generated by Gemini
#   - 'mm_embedding_from_img_only': The image embedding without additional text context
#   - 'text_embedding_from_image_description': The text embedding of the generated image description

image_metadata_df.head()


### Import the helper functions to implement RAG

You will be importing the following functions which will be used in the remainder of this notebook to implement RAG:

* **get_similar_text_from_query():** Given a text query, finds relevant text from the document using cosine similarity. It compares the query against the text embeddings stored in the metadata, and the results can be filtered by top score, page/chunk number, or embedding size.
* **print_text_to_text_citation():** Prints the source (citation) and details of the retrieved text from the `get_similar_text_from_query()` function.
* **get_similar_image_from_query():** Given a text query or an image path, finds relevant images from the document. It uses the image embeddings or image-description embeddings from the metadata.
* **print_text_to_image_citation():** Prints the source (citation) and the details of retrieved images from the `get_similar_image_from_query()` function.
* **get_gemini_response():** Interacts with a Gemini model to answer questions based on a combination of text and image inputs.
* **display_images():**  Displays a series of images provided as paths or PIL Image objects.
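
The cosine similarity search mentioned above can be sketched in a few lines. This is a dependency-free illustration, not the implementation in `intro_multimodal_rag_utils`: the names `cosine_similarity` and `top_n_matches` are hypothetical, and the three-dimensional vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_matches(query_embedding, rows, column_name, top_n=3):
    """Score every row against the query and return the top_n (score, row) pairs."""
    scored = [(cosine_similarity(query_embedding, row[column_name]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Toy rows mimicking the metadata layout; real embeddings have hundreds of dimensions.
rows = [
    {"chunk_text": "net income per share", "text_embedding_chunk": [1.0, 0.0, 0.0]},
    {"chunk_text": "risk factors", "text_embedding_chunk": [0.0, 1.0, 0.0]},
    {"chunk_text": "earnings summary", "text_embedding_chunk": [0.8, 0.6, 0.0]},
]
for score, row in top_n_matches([1.0, 0.0, 0.0], rows, "text_embedding_chunk", top_n=2):
    print(round(score, 2), row["chunk_text"])
```

Because cosine similarity ignores vector length, it ranks chunks by direction in embedding space only, which is why the query and the stored embeddings must come from the same model and have the same dimension.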

In [10]:
# We import several helper functions from our 'intro_multimodal_rag_utils' module. 
# Each function serves a specific purpose in our retrieval-augmented workflow:
# 1) display_images:              Displays image files inline in a notebook.
# 2) get_gemini_response:         Generates text outputs using the Gemini model (multimodal support).
# 3) get_similar_image_from_query: Finds images most relevant to a given query (text or image).
# 4) get_similar_text_from_query:  Finds text chunks most relevant to a given text query.
# 5) print_text_to_image_citation: Prints metadata/citations for matched images (e.g., page number).
# 6) print_text_to_text_citation:  Prints metadata/citations for matched text chunks (e.g., chunk number).

from intro_multimodal_rag_utils import (
    display_images,
    get_gemini_response,
    get_similar_image_from_query,
    get_similar_text_from_query,
    print_text_to_image_citation,
    print_text_to_text_citation,
)


Before implementing multimodal RAG, let's take a step back and explore what you can achieve with text or image embeddings alone. This sets the foundation for the multimodal RAG implementation later in the notebook. You can also combine these building blocks in your own multimodal applications to extract meaningful information from documents.

## Text Search

Let's start the search with a simple question and see if the simple text search using text embeddings can answer it. The expected answer is to show the value of basic and diluted net income per share of Google for different share types.

In [11]:
# We create a query string asking for details about basic and diluted net income 
# per share for Class A, B, and C stock of Google (Alphabet).
#
# Explanation for Beginners:
# 1) The variable `query` holds our user’s request in plain English.
# 2) We can later pass this query to functions like `get_similar_text_from_query`
#    to find the PDF pages or text chunks mentioning details about Google's
#    Class A, B, and C shares. 
# 3) This is helpful for locating specific financial information (like net income 
#    per share) within a larger set of documents.

query = "I need details for basic and diluted net income per share of Class A, Class B, and Class C share for google?"


### Search similar text with text query

In [None]:
# We match our user query against the text embeddings stored in the "text_embedding_chunk"
# column of our text_metadata_df DataFrame, and retrieve the top 3 most relevant text chunks.
#
# Explanation for Beginners:
# 1) get_similar_text_from_query: This function calculates similarity scores between the query 
#    embedding and the embeddings of each text chunk in our DataFrame.
# 2) text_metadata_df: A DataFrame that stores the text extracted from the PDFs, 
#    along with chunked embeddings for easier searching.
# 3) column_name="text_embedding_chunk": Specifies that we're comparing the query 
#    to the text embeddings at a chunk level (rather than page or entire doc).
# 4) top_n=3: We only want the 3 best matching chunks for faster inspection.
# 5) print_text_to_text_citation: Prints out each matched chunk's source info 
#    (like file name, page number, and the chunk text itself).

matching_results_text = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=3,
    chunk_text=True,
)

# We print citations for all matched chunks (print_top=False), 
# each chunk's text content, file name, and page number.
print_text_to_text_citation(
    matching_results_text, 
    print_top=False, 
    chunk_text=True
)


The top-scoring match appears to be relevant, but on closer inspection it only says that the information is available in the "following" table. That table is stored as an image rather than as text, so you will miss the information unless you can find a way to process images and their data.

However, let's feed the relevant text chunks into the Gemini 1.5 Pro model and see if it can produce your desired answer by considering all the retrieved chunks together. This is a basic text-based RAG implementation.

In [None]:
# We print a separator to indicate we’re about to generate a final answer based on matched text.
print("\n **** Result: ***** \n")

# Explanation for Beginners:
# 1) 'matching_results_text': This dictionary contains the chunks of text that were deemed most relevant to our query.
# 2) We create a single 'context' string by joining all chunk_text values from the matched results, separated by newlines.
# 3) 'instruction': We build a prompt instructing the model to answer only using the provided context. If the context doesn’t
#    have the necessary info, we tell the model to respond with "not available in the context."
# 4) 'model_input': This is the final prompt we’ll pass to our model.

# Combine the text chunks into a single context string for the model to reference.
context = "\n".join(
    [value["chunk_text"] for key, value in matching_results_text.items()]
)

instruction = f"""Answer the question with the given context.
If the information is not available in the context, just return "not available in the context".
Question: {query}
Context: {context}
Answer:
"""

# 'model_input' is what we'll feed to the 'get_gemini_response' function.
model_input = instruction

# We generate a response with our text model (Gemini 1.5 Pro),
# configured for a relatively deterministic output (temperature=0.2).
# The function 'get_gemini_response' will stream partial outputs in real-time.
get_gemini_response(
    text_model,  # Our text-oriented Gemini 1.5 Pro model
    model_input=model_input,
    stream=True,
    generation_config=GenerationConfig(temperature=0.2),
)


You can see that it returned:

*"The provided context does not include the details for basic and diluted net income per share of Class A, Class B, and Class C share for google."*

This is expected, as discussed previously. None of the three retrieved text chunks contained the information you sought, because it is only available in the images rather than in the text of the document. Next, let's see if you can solve this problem by using the multimodal Gemini 1.5 Pro model and multimodal embeddings.

Note: We handcrafted examples in our document to simulate real-world cases where information is often embedded in charts, tables, graphs, and other image-based elements and unavailable as plain text.

### Search similar images with text query

Since plain text search didn't provide the desired answer and the information may be visually represented in a table or another image format, you will now use the multimodal capability of Gemini 1.5 Pro for the same task. The goal is to find images that are similar to the text query. You can also print the citations to verify.
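
One way a text query can retrieve an image is through the image's generated description. The sketch below shows that idea end to end with a toy word-count "embedding" standing in for a real embedding model. Everything in it is hypothetical: the functions `toy_embed` and `similarity`, the image paths, and the descriptions are illustrations, not output from this notebook.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a text embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical image records with model-generated descriptions.
images = [
    {
        "img_path": "images/page3_img1.png",
        "img_desc": "table of basic and diluted net income per share by class",
    },
    {
        "img_path": "images/page5_img1.png",
        "img_desc": "bar chart of revenue by region",
    },
]

query_vec = toy_embed("basic and diluted net income per share")
ranked = sorted(
    images,
    key=lambda img: similarity(query_vec, toy_embed(img["img_desc"])),
    reverse=True,
)
print(ranked[0]["img_path"])
```

The quality of this kind of retrieval depends on how faithfully the description captures the image, which is why the description prompt matters.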

In [14]:
# We define a query asking for details about basic and diluted net income per share
# for Google’s Class A, B, and C stock.
#
# Explanation for Beginners:
# 1) The variable `query` stores a user-requested question in plain English.
# 2) In subsequent cells, we can pass `query` to functions like 
#    `get_similar_text_from_query` for text-based matching 
#    or generate a model response with `get_gemini_response`.
# 3) If our dataset includes information on Google's financial statements, 
#    these references can help us locate relevant text or images in the PDFs.

query = "I need details for basic and diluted net income per share of Class A, Class B, and Class C share for google?"


In [None]:
# We look for images that are relevant to our user’s query about net income per share
# for Google’s Class A, B, and C. We do this by comparing the query text embedding 
# with the text embedding of each image’s description (i.e., the "text_embedding_from_image_description" column).
#
# Explanation for Beginners:
# 1) get_similar_image_from_query: Takes our text query and searches for matching images 
#    by comparing the query text embedding to each image's description embedding.
# 2) text_metadata_df, image_metadata_df: Contain all our extracted text and image metadata, respectively.
# 3) image_emb=False: We’re using a text-based query, not an image-based query.
# 4) top_n=3: Return the top 3 most relevant images for inspection.
# 5) embedding_size=1408: Matches the embedding dimension used earlier for our images.

matching_results_image = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,
    column_name="text_embedding_from_image_description",  # Use image descriptions’ embeddings
    image_emb=False,
    top_n=3,
    embedding_size=1408,
)

# We could optionally print citations for the matched images 
# with print_text_to_image_citation, but it's commented out here:
# Markdown(print_text_to_image_citation(matching_results_image, print_top=True))

print("\n **** Result: ***** \n")

# We display the top matching image inline in our environment. 
# "image_object" is typically a Vertex AI Image object or PIL image 
# that can be displayed in notebooks.
display(matching_results_image[0]["image_object"])


Bingo! It found exactly what you were looking for. You wanted the details on the basic and diluted net income per share for Google's Class A, B, and C shares, and this image fits the bill, thanks to the description that Gemini generated for it.

You can also send the image and its description to Gemini and get the answer as JSON:

In [None]:
print("\n **** Result: ***** \n")

# Explanation for Beginners:
# 1) We build a prompt from the top matching image and its description,
#    both taken from our query-based image search.
# 2) 'context' is an f-string, so the image object is inserted as its string
#    representation; the image description carries the usable content here.
#    To send the actual image, the image object would need to be passed as
#    its own item in a list-style 'model_input'.
# 3) 'instruction': We ask for a JSON-formatted answer, containing only the final response's value
#    without extra text or commentary.
# 4) 'model_input': This is the final prompt we'll feed to our multimodal model (Gemini 1.5 Flash).
# 5) We wrap the final model response in 'Markdown(...)' to render it in a rich text format.

context = f"""Image: {matching_results_image[0]['image_object']}
Description: {matching_results_image[0]['image_description']}
"""

instruction = f"""Answer the question in JSON format with the given context of Image and its Description. Only include value.
Question: {query}
Context: {context}
Answer:
"""

model_input = instruction

# We make a call to our Gemini 1.5 Flash model with streaming enabled.
# The response is rendered as Markdown for improved readability.
Markdown(
    get_gemini_response(
        multimodal_model_flash,    # Gemini 1.5 Flash, the faster model variant
        model_input=model_input,
        stream=True,               # Stream partial text responses in real time
        generation_config=GenerationConfig(
            temperature=1          # Higher temperature for more creative/less deterministic output
        ),
    )
)
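
When you ask for JSON, models sometimes wrap the answer in a Markdown code fence or add surrounding text, so it can help to parse the response defensively. Below is a minimal sketch of one way to do that; `parse_json_answer` and the sample response string are hypothetical and not part of the notebook's utilities.

```python
import json
import re

# A Markdown code fence, built this way so the example contains no literal fence.
FENCE = "`" * 3


def parse_json_answer(response_text):
    """Extract and parse the first JSON object found in a model response."""
    text = response_text.strip()
    # Drop a leading fence (optionally followed by a language tag) and a trailing fence.
    if text.startswith(FENCE):
        text = re.sub(r"^`{3}[a-zA-Z]*\s*", "", text)
        text = re.sub(r"\s*`{3}$", "", text)
    # Fall back to the outermost braces in case extra prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start : end + 1])


# Made-up response text with placeholder values, used only to exercise the parser.
sample_response = FENCE + 'json\n{"basic": "<value>", "diluted": "<value>"}\n' + FENCE
print(parse_json_answer(sample_response))
```

Collecting the streamed text into a single string before parsing keeps this step independent of how the response is delivered.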


In [None]:
# We display the citation details for the top matching image, including the file name,
# page number, and the Gemini-generated image description. By looking at the
# "image description," we can see how it matched our text query.

# Explanation for Beginners:
# 1) The function `print_text_to_image_citation` prints out metadata for images
#    that were matched to our query, such as where in the PDF the image came from
#    and the text description that was generated.
# 2) Setting `print_top=True` means we'll see only the top-matching image.

Markdown(
    print_text_to_image_citation(
        matching_results_image,
        print_top=True
    )
)


## Image Search

### Search similar image with image query

Imagine searching for images, but instead of typing words, you use an actual image as the clue. You have a table with numbers about the cost of revenue for two years, and you want to find other images that look like it, from the same document or across multiple documents.

Think of it like searching with a mini-map instead of a written address. It's a different way to ask, "Show me more stuff like this." So, instead of typing "cost of revenue 2020 2021 table", you show a picture of that table and say, "Find me more like this."

For demonstration purposes, the example below searches only a single document for images that show the cost of revenue or similar values. However, you can scale this design pattern to find relevant images across multiple documents.
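
To illustrate what scaling the pattern across documents might look like, the sketch below merges per-document image indexes into one ranked list and tags each hit with its source document. It assumes the embeddings are already normalized to unit length, so a dot product equals cosine similarity. The document names, image names, and vectors are invented for the example, and `search_across_documents` is a hypothetical helper.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search_across_documents(query_embedding, indexes, top_n=3):
    """Rank images from several documents against one query embedding.

    `indexes` maps a document name to {image_name: unit-length embedding}.
    """
    hits = (
        (dot(query_embedding, embedding), doc_name, image_name)
        for doc_name, images in indexes.items()
        for image_name, embedding in images.items()
    )
    # nlargest keeps only the best top_n hits without sorting everything.
    return heapq.nlargest(top_n, hits, key=lambda hit: hit[0])


indexes = {
    "report_2020.pdf": {"cost_table": [1.0, 0.0], "logo": [0.0, 1.0]},
    "report_2021.pdf": {"cost_table": [0.8, 0.6], "chart": [0.6, 0.8]},
}
query_embedding = [1.0, 0.0]  # stand-in for the embedded query image

ranked = search_across_documents(query_embedding, indexes)
for score, doc_name, image_name in ranked:
    print(f"{score:.2f}  {doc_name}  {image_name}")
```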

In [None]:
# We specify a local file path to an image called "tac_table_revenue.png".
# The goal is to see if this image (e.g., a table) can be matched against 
# images in our metadata to find similar tables.

# Explanation for Beginners:
# 1) image_query_path: Holds the path to our user-provided image file.
# 2) The image might be a table screenshot or an excerpt from the same PDF source.
#    We'll attempt to find similar images in our existing metadata.
# 3) Image.load_from_file(...): This function from Vertex AI’s Image class allows us 
#    to load the image so we can display it or embed it for similarity queries.

image_query_path = "tac_table_revenue.png"

# We print a short message indicating we’re about to show the user’s input image.
print("***Input image from user:***")

# Finally, we display the input image inline (in a Jupyter notebook, for instance).
# This gives us a visual reference of the user’s query image before searching 
# for similar images in the metadata.
Image.load_from_file(image_query_path)


You expect to find tables (as images) that are similar in terms of "Other/Total cost of revenues."

In [None]:
# We attempt to find images within our existing metadata that resemble a user-provided image
# (image_query_path). We do this by comparing the embeddings of our query image to the 
# "mm_embedding_from_img_only" embeddings stored in image_metadata_df.
#
# Explanation for Beginners:
# 1) text_metadata_df, image_metadata_df: DataFrames where we store text and image data, respectively.
# 2) query (optional): If we want to combine text filtering with image similarity, we can keep this 
#    parameter. Otherwise, it can be left empty or be a generic text.
# 3) image_emb=True: Informs the function that we are using an image-based query, not a text-based query.
# 4) image_query_path=image_query_path: The local path to the user’s input image.
# 5) column_name="mm_embedding_from_img_only": This column stores embeddings that we generated for
#    each image without any text context. We compare them to the query image’s embedding.
# 6) top_n=3: We only want the top 3 matches to our query image.
# 7) embedding_size=1408: This is the size (dimensionality) of the embeddings we generated for the 
#    images during preprocessing.

matching_results_image = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,  # Optionally use the query text for filtering
    column_name="mm_embedding_from_img_only",  # Compare embeddings in this column to our query image
    image_emb=True,  # We are performing an image-based query
    image_query_path=image_query_path,  # The path to the input image from the user
    top_n=3,  # Return up to 3 most similar images
    embedding_size=1408,  # Size of the embeddings used for images
)

print("\n **** Result: ***** \n")

# We now display the top matching image. "image_object" typically holds a 
# PIL.Image object or a format compatible with IPython’s display methods.
display(
    matching_results_image[0]["image_object"]
)


It found a similar-looking image (a table) that breaks down revenues, expenses, income, and a few other figures. More importantly, both tables show numbers related to the "cost of revenue."

You can also print the citation to see what it has matched.

In [None]:
# We display citation details for the top matching image found by our image query.
# This reveals where in the original PDFs (file name, page number) the similar image 
# was discovered, along with a brief description if available.
#
# Explanation for Beginners:
# 1) print_text_to_image_citation: 
#    A function that prints out metadata for each matched image,
#    including file paths, page numbers, similarity scores, etc.
# 2) print_top=True: 
#    Instructs the function to only show the details of the single highest-scoring match.

print_text_to_image_citation(
    matching_results_image,
    print_top=True
)


In [None]:
# We display the top two matched images (index 0 and 1) returned by our image query.
# The 'display_images' function will show each image inline, resizing them to 50% 
# of their original dimensions for easier viewing.
#
# Explanation for Beginners:
# 1) matching_results_image is a list (or dict) of matched images, 
#    each containing metadata like file path, similarity scores, etc.
# 2) matching_results_image[0]["img_path"] is the file path for the top matched image.
# 3) The second image is matching_results_image[1]["img_path"].
# 4) 'resize_ratio=0.5' scales the images down by half, making them more manageable 
#    in a notebook or console environment.

print("---------------Matched Images------------------\n")
display_images(
    [
        matching_results_image[0]["img_path"],  # File path for the top image
        matching_results_image[1]["img_path"],  # File path for the second best match
    ],
    resize_ratio=0.5,  # Resize images to 50% of original size
)


The ability to identify similar text and images based on user input, using Gemini and embeddings, forms a crucial foundation for developing multimodal RAG systems, which you will explore later in this notebook.

### Comparative reasoning

Next, let's apply what you have done so far to comparative reasoning.

For this example:

Step 1: Search all the images for a specific query.

Step 2: Send those images to Gemini and ask multiple questions that require it to compare the images and provide answers.
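
One way to hand both images and text to a multimodal model is to interleave labels, image objects, and descriptions in a single list, rather than formatting everything into one string (interpolating an image object into an f-string inserts only its string representation). The sketch below builds such a list. `build_comparison_input` is a hypothetical helper, the `FakeImage` class merely stands in for real image objects, and whether a given client accepts this exact list shape is an assumption to check against its documentation.

```python
class FakeImage:
    """Placeholder for a real image object; used only to keep this sketch runnable."""

    def __init__(self, name):
        self.name = name


def build_comparison_input(matches, questions, instructions):
    """Interleave instructions, labeled images, descriptions, and questions."""
    parts = [instructions, "Context:"]
    for position, match in enumerate(matches, start=1):
        parts.extend(
            [
                f"Image_{position}:",
                match["image_object"],  # passed as an object, not as text
                f"gemini_extracted_text_{position}:",
                match["image_description"],
            ]
        )
    parts.append("Question:\n" + "\n".join(f" - {q}" for q in questions))
    return parts


matches = [
    {"image_object": FakeImage("graph_1"), "image_description": "first graph"},
    {"image_object": FakeImage("graph_2"), "image_description": "second graph"},
]
model_input = build_comparison_input(
    matches,
    questions=["What are the critical differences between the graphs?"],
    instructions="Compare the images and the extracted text to answer the questions.",
)
image_count = sum(isinstance(part, FakeImage) for part in model_input)
print(len(model_input), "parts, of which", image_count, "are images")
```

Using `enumerate` for the labels keeps each image paired with its own description, which avoids index mismatches when the list of matches grows.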

In [22]:
# We search for images displaying a "Google Class A cumulative 5-year total return" graph.
# This time, we compare the user's text query to image descriptions rather than performing
# an image-based query.

# Explanation for Beginners:
# 1) get_similar_image_from_query: Our function that identifies images most relevant 
#    to a given query.
# 2) text_metadata_df, image_metadata_df: DataFrames containing embedded text and image data.
# 3) query: The user’s text query describing the type of image they seek.
# 4) column_name="text_embedding_from_image_description": We match the query against 
#    each image’s text description embedding.
# 5) image_emb=False: Tells the function we are using text as our query, not an image.
# 6) top_n=3: Returns the top 3 images most relevant to the query.
# 7) embedding_size=1408: The dimensionality of the embeddings we used for images.

matching_results_image_query_1 = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query="Show me all the graphs that shows Google Class A cumulative 5-year total return",
    column_name="text_embedding_from_image_description",
    image_emb=False,
    top_n=3,
    embedding_size=1408,
)


In [None]:
# We display the matched images that are most relevant to our text query about 
# "graphs showing Google Class A cumulative 5-year total return."
#
# Explanation for Beginners:
# 1) matching_results_image_query_1 is a dictionary or list of images 
#    that our system found most relevant to the query.
# 2) The function display_images takes a list of image paths (or PIL Image objects)
#    and renders them inline, optionally resizing them to a fraction (50%) 
#    of their original size for better viewing in a notebook.
# 3) matching_results_image_query_1[0]["img_path"] is the file path of the top 
#    matching image. Similarly, matching_results_image_query_1[1]["img_path"] 
#    is the second-best match.

print("---------------Matched Images------------------\n")
display_images(
    [
        matching_results_image_query_1[0]["img_path"],
        matching_results_image_query_1[1]["img_path"],
    ],
    resize_ratio=0.5,  # Resize images to 50% of their original dimensions
)


In [None]:
# We build a prompt that instructs the Gemini model (Gemini 1.5 Pro) to:
# 1) Compare two images labeled as Image_1 and Image_2.
# 2) Use the extracted text (extracted by Gemini for each image) as additional context.
# 3) Answer specific questions about Class A shares, differences between graphs, 
#    and how they compare to certain indices like S&P 500.
#
# Explanation for Beginners:
# 1) matching_results_image_query_1: This dictionary holds the top matches for our 
#    text-based image query. Each matched image has its own "image_object" and "image_description."
# 2) We create 'prompt': This variable combines both images and their descriptions 
#    under "Context," then asks a set of questions. The instructions ask the model to 
#    carefully think through the steps in bullet points, providing an explainable answer.
# 3) Finally, we feed this 'prompt' as the input list to get_gemini_response, along with:
#    - The model instance (multimodal_model).
#    - A GenerationConfig object specifying temperature=1 for a more creative, flexible output.
# 4) 'rich_Markdown' is used to format the model’s streamed response in Markdown for better readability.

prompt = f""" Instructions: Compare the images and the Gemini extracted text provided as Context: to answer Question:
Make sure to think thoroughly before answering the question and put the necessary steps to arrive at the answer in bullet points for easy explainability.

Context:
Image_1: {matching_results_image_query_1[0]["image_object"]}
gemini_extracted_text_1: {matching_results_image_query_1[0]['image_description']}
Image_2: {matching_results_image_query_1[1]["image_object"]}
gemini_extracted_text_2: {matching_results_image_query_1[1]['image_description']}

Question:
 - Key findings of Class A share?
 - What are the critical differences between the graphs for Class A Share?
 - What are the key findings of Class A shares concerning the S&P 500?
 - Which index best matches Class A share performance closely where Google is not already a part? Explain the reasoning.
 - Identify key chart patterns in both graphs.
"""

rich_Markdown(
    get_gemini_response(
        multimodal_model,  # Our Gemini 1.5 Pro model for text + image analysis
        model_input=[prompt],
        stream=True,  # Stream partial text responses in real-time
        generation_config=GenerationConfig(
            temperature=1  # Higher temperature for a more creative, expansive response
        ),
    )
)


<div class="alert alert-block alert-warning">
<b>⚠️ Disclaimer: This is not real investment advice and should not be taken seriously!! ⚠️</b>
</div>

## Multimodal retrieval augmented generation (RAG)

Let's bring everything together to implement multimodal RAG, using all the elements that you've explored in the previous sections. These are the steps:

* **Step 1:** The user gives a query in text format where the expected information is available in the document and is embedded in images and text.
* **Step 2:** Find all relevant text chunks from the pages in the documents using a method similar to the one you explored in `Text Search`.
* **Step 3:** Find all relevant images from the pages by matching the user query against `image_description`, using a method similar to the one you explored in `Image Search`.
* **Step 4:** Combine the text and images found in steps 2 and 3 as `context_text` and `context_images`.
* **Step 5:** Pass the user query, along with the text and image context from steps 2 and 3, to Gemini. You can also add specific instructions for the model to follow while answering the user query.
* **Step 6:** Gemini produces the answer, and you can print the citations to check all relevant text and images used to address the query.
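
The steps above can be summarized as a small orchestration function. The following is a sketch with stub retrievers and a stub generator so that it runs on its own. `multimodal_rag`, `fake_text_search`, `fake_image_search`, and `fake_generate` are hypothetical names; in the notebook, those roles are played by `get_similar_text_from_query`, `get_similar_image_from_query`, and `get_gemini_response`.

```python
def multimodal_rag(query, text_search, image_search, generate, top_n=10):
    """Run retrieval and generation for one text query (Step 1 is the query itself)."""
    # Step 2: retrieve relevant text chunks.
    text_hits = text_search(query, top_n)
    # Step 3: retrieve relevant images via their descriptions.
    image_hits = image_search(query, top_n)

    # Step 4: assemble the text and image context.
    context_text = "\n".join(hit["chunk_text"] for hit in text_hits.values())
    context_images = []
    for hit in image_hits.values():
        context_images.extend(
            ["Image: ", hit["image_object"], "Caption: ", hit["image_description"]]
        )

    # Step 5: send instructions, context, and query to the model.
    model_input = [
        'Answer using only the context. If unsure, respond, "Not enough context to answer".',
        "Text Context:",
        context_text,
        "Image Context:",
        *context_images,
        query,
    ]
    answer = generate(model_input)

    # Step 6: return the answer with the retrieved items so they can be cited.
    return {"answer": answer, "text_hits": text_hits, "image_hits": image_hits}


# Stubs standing in for the real retrieval and generation helpers.
def fake_text_search(query, top_n):
    return {0: {"chunk_text": "chunk about revenue"}}


def fake_image_search(query, top_n):
    return {0: {"image_object": object(), "image_description": "a revenue table"}}


def fake_generate(model_input):
    return f"received {len(model_input)} input parts"


result = multimodal_rag(
    "What drove revenue?", fake_text_search, fake_image_search, fake_generate
)
print(result["answer"])
```

Passing the retrievers and the generator in as arguments keeps the flow easy to test with stubs before wiring in the real helpers.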

### Step 1: User query

In [26]:
# We define a text query containing various questions related to Google’s Class A shares,
# financial metrics, and the impact of Covid. No images are being passed this time.
#
# Explanation for Beginners:
# 1) 'query' is a variable that holds our user’s complex set of questions, which we can
#    pass to functions like 'get_similar_text_from_query' to retrieve relevant information
#    from our text metadata.
# 2) These questions focus on financial and operational aspects such as revenue, 
#    operating expenses, net income, and the effect of Covid in 2020.
# 3) The user might also inquire about financial definitions (e.g., deferred income taxes),
#    or numerical changes in data (like a 41% increase in revenue).

query = """Questions:
 - What are the critical difference between various graphs for Class A Share?
 - Which index best matches Class A share performance closely where Google is not already a part? Explain the reasoning.
 - Identify key chart patterns for Google Class A shares.
 - What is cost of revenues, operating expenses and net income for 2020. Do mention the percentage change
 - What was the effect of Covid in the 2020 financial year?
 - What are the total revenues for APAC and USA for 2021?
 - What is deferred income taxes?
 - How do you compute net income per share?
 - What drove percentage change in the consolidated revenue and cost of revenue for the year 2021 and was there any effect of Covid?
 - What is the cause of 41% increase in revenue from 2020 to 2021 and how much is dollar change?
 """


### Step 2: Get all relevant text chunks

In [27]:
# We retrieve the top 10 text chunks that are most relevant to our new query,
# which contains multiple questions regarding Google Class A shares, revenues,
# operating expenses, the impact of COVID, etc.
#
# Explanation for Beginners:
# 1) get_similar_text_from_query: This function compares our 'query' 
#    with all the text chunks in 'text_metadata_df' using embedded vector similarity.
# 2) text_metadata_df: DataFrame containing extracted and embedded PDF text.
# 3) column_name="text_embedding_chunk": Specifies we are searching against
#    chunk-level text embeddings.
# 4) top_n=10: Returns the top 10 chunks that best match the query.
# 5) chunk_text=True: Returns individual chunk text instead of entire page text.

matching_results_chunks_data = get_similar_text_from_query(
    query,
    text_metadata_df,
    column_name="text_embedding_chunk",
    top_n=10,
    chunk_text=True,
)


### Step 3: Get all relevant images

In [28]:
# We search for images that might also be relevant to our text-based query by 
# matching the query against the image descriptions (their text embeddings).
#
# Explanation for Beginners:
# 1) get_similar_image_from_query: Searches for images whose descriptions (or embedded text) 
#    best match the user's query.
# 2) text_metadata_df, image_metadata_df: Our DataFrames containing extracted text and image metadata, respectively.
# 3) query: The user's textual query about finances, Class A shares, Covid impact, etc.
# 4) column_name="text_embedding_from_image_description": We compare the query’s embeddings 
#    to each image’s description embeddings.
# 5) image_emb=False: Indicates we're using a text query, not an image-based query.
# 6) top_n=10: Return the top 10 most relevant images.
# 7) embedding_size=1408: The size of the embeddings we used for images.

matching_results_image_fromdescription_data = get_similar_image_from_query(
    text_metadata_df,
    image_metadata_df,
    query=query,
    column_name="text_embedding_from_image_description",
    image_emb=False,
    top_n=10,
    embedding_size=1408,
)


### Step 4: Create context_text and context_images

In [30]:
# We gather the text chunks and images deemed relevant by our similarity searches.
# Explanation for Beginners:
# 1) matching_results_chunks_data and matching_results_image_fromdescription_data:
#    These dictionaries contain the most relevant text chunks and images, respectively,
#    based on our user query.
# 2) context_text: We collect the actual chunk text from each matching result
#    and then join them together (separated by newlines).
# 3) context_images: We combine each relevant image and its Gemini-generated caption
#    into a single list that could be passed to a model or used for further analysis.

# Combine all the selected relevant text chunks.
context_text = []
for key, value in matching_results_chunks_data.items():
    context_text.append(value["chunk_text"])

# Join chunk texts with a newline separator so we have a coherent context block.
final_context_text = "\n".join(context_text)

# Combine all the relevant images and their descriptions generated by Gemini.
context_images = []
for key, value in matching_results_image_fromdescription_data.items():
    # We extend our list with an identifier ("Image:"), the image object,
    # a label for the caption ("Caption:"), and the text description.
    context_images.extend([
        "Image: ",
        value["image_object"],
        "Caption: ",
        value["image_description"]
    ])


### Step 5: Pass context to Gemini

In [None]:
# We build a prompt that instructs the Gemini model to:
# 1) Analyze both text (final_context_text) and images (context_images) together.
# 2) Answer a list of questions (contained in 'query') in a thorough, step-by-step manner.
# 3) Return "Not enough context to answer" if the context doesn't provide the necessary information.
#
# Explanation for Beginners:
# 1) 'prompt': We embed our text context and image context, along with instructions 
#    on how to respond. This ensures the model has comprehensive information 
#    when generating its answer.
# 2) final_context_text: A joined string of relevant text chunks.
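# 3) context_images: A list combining image objects and their captions.
#    Note that interpolating it into the f-string below inserts the list's
#    string representation, so the captions carry the usable image context.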
# 4) 'model_input': We pass this single prompt into 'get_gemini_response' as a list.
# 5) 'generation_config=GenerationConfig(temperature=1)': 
#    A higher temperature often produces more creative or exploratory answers.

prompt = f""" Instructions: Compare the images and the text provided as Context: to answer multiple Question:
Make sure to think thoroughly before answering the question and put the necessary steps to arrive at the answer in bullet points for easy explainability.
If unsure, respond, "Not enough context to answer".

Context:
 - Text Context:
 {final_context_text}
 - Image Context:
 {context_images}

{query}

Answer:
"""

# We call 'get_gemini_response' with our prompt (as the sole item in model_input) 
# and enable streaming so we can receive the response in real time. 
# Then we wrap it in `rich_Markdown` for nicer formatting in a Jupyter environment.
rich_Markdown(
    get_gemini_response(
        multimodal_model,
        model_input=[prompt],
        stream=True,
        generation_config=GenerationConfig(temperature=1),
    )
)


### Step 6: Print citations and references

In [None]:
# We display the first four images that were deemed relevant by our query-to-description 
# matching. Each image might provide context that supports the model’s answers.
#
# Explanation for Beginners:
# 1) matching_results_image_fromdescription_data is a collection of images (and 
#    their metadata) identified as being closely related to our text query.
# 2) display_images takes a list of paths (or PIL Image objects) 
#    and displays them inline in a Jupyter environment.
# 3) resize_ratio=0.5 scales the images to half of their original size, 
#    making them more manageable for viewing.

print("---------------Matched Images------------------\n")

display_images(
    [
        matching_results_image_fromdescription_data[0]["img_path"],  # Path to the top-matching image
        matching_results_image_fromdescription_data[1]["img_path"],  # 2nd most relevant
        matching_results_image_fromdescription_data[2]["img_path"],  # 3rd most relevant
        matching_results_image_fromdescription_data[3]["img_path"],  # 4th most relevant
    ],
    resize_ratio=0.5,  # Scale images to 50% of their original size
)


In [None]:
# We print out citations for all matched images (print_top=False), which provides
# detailed metadata about each image, such as:
#   - The file name of the PDF
#   - The page number where the image was extracted
#   - The path to the image file
#   - A short description generated by Gemini (the "image_description")
#
# Explanation for Beginners:
# 1) print_text_to_image_citation: This function displays relevant metadata
#    for each image, helping us see where the images originated and why
#    they are considered relevant.
# 2) print_top=False: We show citations for all matched images instead
#    of just the highest-scoring match.

print_text_to_image_citation(
    matching_results_image_fromdescription_data,
    print_top=False
)


In [None]:
# We print citations for the matched text chunks, providing information about:
#   - File names
#   - Page numbers
#   - The actual text chunks
#
# Explanation for Beginners:
# 1) print_text_to_text_citation: Displays helpful context about each matched text chunk, 
#    such as which PDF file and page it came from.
# 2) print_top=False: Means we show all matches rather than just the best one.
# 3) chunk_text=True: We want to see the actual text chunk in addition to the metadata.

print_text_to_text_citation(
    matching_results_chunks_data,
    print_top=False,
    chunk_text=True,
)
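
If you want citations in a custom layout, you could format them yourself from the results dictionary. This sketch assumes that each match carries `file_name`, `page_num`, and `chunk_text` keys. Only `chunk_text` appears in the code above, so treat the other two key names, the sample record, and the helper `format_text_citations` as hypothetical and check them against your own metadata.

```python
def format_text_citations(matches, max_chars=80):
    """Build one citation line per matched chunk, in ranked order."""
    lines = []
    for rank, match in sorted(matches.items()):
        snippet = " ".join(match["chunk_text"].split())  # collapse whitespace
        if len(snippet) > max_chars:
            snippet = snippet[: max_chars - 3] + "..."
        lines.append(
            f"[{rank + 1}] {match['file_name']}, page {match['page_num']}: {snippet}"
        )
    return "\n".join(lines)


# Invented sample data shaped like the assumed results dictionary.
sample_matches = {
    0: {
        "file_name": "example.pdf",
        "page_num": 3,
        "chunk_text": "Revenues  increased\nyear over year.",
    },
}
citations = format_text_citations(sample_matches)
print(citations)
```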


## Conclusions

Congratulations on making it through this multimodal RAG notebook!

While multimodal RAG can be quite powerful, it has some limitations:

* **Data dependency:** Needs high-quality paired text and visuals.
* **Computationally demanding:** Processing multimodal data is resource-intensive.
* **Domain specific:** Models trained on general data may not shine in specialized fields like medicine.
* **Black box:** Understanding how these models work can be tricky, hindering trust and adoption.


Despite these challenges, multimodal RAG represents a significant step towards search and retrieval systems that can handle diverse, multimodal data.