##### Copyright 2025 Patrick Loeber

In [None]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Workshop: Build with Gemini (Part 2)

<a target="_blank" href="https://colab.sandbox.google.com/github/patrickloeber/workshop-build-with-gemini/blob/main/notebooks/part-2-multimodal-understanding.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This workshop teaches how to build with Gemini using the Gemini API and Python SDK.

Course outline:

- **[Part 1: Quickstart + Text prompting](https://github.com/patrickloeber/workshop-build-with-gemini/blob/main/cookbooks/part-1-text-prompting.ipynb)**

- **Part 2 (this notebook): Multimodal understanding (image, video, audio, docs, code)**
  - Image
  - Video
  - Audio
  - Documents (PDFs)
  - Code
  - Final exercise: Analyze supermarket invoice

- **[Part 3: Thinking models + agentic capabilities (tool usage)](https://github.com/patrickloeber/workshop-build-with-gemini/blob/main/cookbooks/part-3-thinking-and-tools.ipynb)**

## 0. Use Google AI Studio as a playground

Explore and play with all models in [Google AI Studio](https://aistudio.google.com/apikey).

## 1. Setup

Get a free API key in [Google AI Studio](https://aistudio.google.com/apikey) and set up the [Google Gen AI Python SDK](https://github.com/googleapis/python-genai).

In [None]:
%pip install -U -q google-genai

In [None]:
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

In [None]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

In [None]:
MODEL = "gemini-2.0-flash"

## Image understanding

Gemini models are able to process and understand images, e.g., you can use Gemini to describe, caption, and answer questions about images, and you can even use it for object detection.

In [None]:
!curl -o image.jpg "https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg"

In [None]:
from PIL import Image
image = Image.open("image.jpg")
print(image.size)
image

If the total image payload is under 20 MB, we recommend either sending base64-encoded images or passing locally stored image files directly.

You can use a Pillow image in your prompt:

In [None]:
# TODO
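
One possible way to fill in the cell above, as a minimal sketch: the SDK accepts a Pillow image directly in `contents`, next to a text prompt. The helper name `describe_image` and the default prompt are illustrative and not part of the workshop; `client` and `MODEL` are the objects from the Setup section.

```python
def describe_image(client, model, image, prompt="Describe this image in one sentence."):
    """Send a Pillow image and a text prompt; return the model's text reply.

    `describe_image` and the default prompt are hypothetical names for this sketch.
    """
    response = client.models.generate_content(
        model=model,
        contents=[image, prompt],  # the image and the prompt are separate parts
    )
    return response.text
```

In the notebook this could be called as `print(describe_image(client, MODEL, image))`.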

Or you can use base64-encoded images:

In [None]:
import requests

res = requests.get("https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg")

# TODO

You can use the File API for large payloads (>20MB).

The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. It is available at no cost in all regions where the Gemini API is available.

In [None]:
# TODO
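
A sketch of how the cell above might be completed. It uploads a local file with `client.files.upload` and then passes the returned file object in `contents`. The helper names and `INLINE_LIMIT_BYTES` are hypothetical, and reading "20MB" as 20 MiB is an assumption made only for this example.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # assumption: "20MB" interpreted as 20 MiB


def needs_file_api(total_payload_bytes, limit=INLINE_LIMIT_BYTES):
    """Return True when the payload is too large to send inline."""
    return total_payload_bytes > limit


def describe_uploaded_image(client, model, path, prompt="Describe this image."):
    """Upload a local file through the File API, then reference it in a prompt."""
    uploaded = client.files.upload(file=path)
    response = client.models.generate_content(
        model=model,
        contents=[uploaded, prompt],
    )
    return response.text
```

For the cupcake picture this could be called as `describe_uploaded_image(client, MODEL, "image.jpg")`; the same pattern applies to any file type the File API accepts.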

#### Bounding box

Gemini models are trained to return bounding box coordinates.

**Important**: Gemini returns bounding box coordinates in this format:

- `[y_min, x_min, y_max, x_max]`
- and normalized to `[0,1000]`

**Tip**: Ask Gemini to return JSON format and configure `config={'response_mime_type': 'application/json'}`:

In [None]:
# TODO
bboxes = ...
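
One way the cell above could be filled in, as a sketch under stated assumptions. The prompt wording, `BBOX_PROMPT`, and `detect_boxes` are hypothetical; the keys `box_2d` and `label` are the ones the drawing helper in this notebook reads. The JSON response type is requested with the `config` shown in the tip.

```python
import json

# Hypothetical prompt; adjust the object name to whatever you want to detect.
BBOX_PROMPT = (
    "Detect the 2d bounding boxes of the cupcakes in the image. "
    'Return a JSON list; each entry has "box_2d" as [y_min, x_min, y_max, x_max] '
    'normalized to 0-1000 and a "label".'
)


def detect_boxes(client, model, image, prompt=BBOX_PROMPT):
    """Ask for bounding boxes as JSON and return them as a list of dicts."""
    response = client.models.generate_content(
        model=model,
        contents=[image, prompt],
        config={"response_mime_type": "application/json"},
    )
    boxes = json.loads(response.text)
    # Keep only entries that carry both keys the drawing helper expects.
    return [b for b in boxes if isinstance(b, dict) and "box_2d" in b and "label" in b]
```

With this helper, the placeholder becomes `bboxes = detect_boxes(client, MODEL, image)`.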

Create a helper function to denormalize and draw the bounding boxes:



In [None]:
from PIL import ImageDraw, ImageFont

line_width = 4
font = ImageFont.load_default(size=16)

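def draw_bounding_boxes(image, bounding_boxes):
    img = image.copy()
    width, height = img.size

    draw = ImageDraw.Draw(img)

    colors = ['blue','red','green','yellow','orange','pink','purple']
    # Build the label list from the boxes passed in (not from a global),
    # so every label maps to one stable color.
    labels = sorted(set(box['label'] for box in bounding_boxes))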

    for box in bounding_boxes:
        y_min, x_min, y_max, x_max = box['box_2d']
        label = box['label']

        # Convert normalized coordinates to absolute coordinates
        y_min = int(y_min/1000 * height)
        x_min = int(x_min/1000 * width)
        y_max = int(y_max/1000 * height)
        x_max = int(x_max/1000 * width)

        color = colors[labels.index(label) % len(colors)]
        draw.rectangle([(x_min, y_min), (x_max, y_max)], outline=color, width=line_width)

        draw.text((x_min+line_width, y_min), label, fill=color, font=font)

    display(img)

draw_bounding_boxes(image, bboxes)

## Video

Gemini models are able to process videos. The 1M-token context window supports up to approximately an hour of video data.

For technical details about supported video formats, see [the docs](https://ai.google.dev/gemini-api/docs/vision#technical-details-video).

In [None]:
!wget https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4 -O Post_its.mp4 -q

Use the File API to upload a video. Here we also check the processing state:

In [None]:
import time

def upload_video(video_file_name):
    video_file = client.files.upload(file=video_file_name)

    # Poll until the File API has finished processing the upload
    while video_file.state == "PROCESSING":
        print('Waiting for video to be processed.')
        time.sleep(10)
        video_file = client.files.get(name=video_file.name)

    if video_file.state == "FAILED":
        raise ValueError(f'Video processing failed: {video_file.state}')

    print(f'Video processing complete: {video_file.uri}')
    return video_file

post_its_video = upload_video('Post_its.mp4')

Now you can use the uploaded file in your prompt:

In [None]:
# TODO

#### YouTube video support

The Gemini API and AI Studio support YouTube URLs as a file data Part. You can include a YouTube URL with a prompt asking the model to summarize, translate, or otherwise interact with the video content.

In [None]:
# TODO
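
A sketch of how a YouTube URL might be passed as a file data part. `summarize_youtube` and its default prompt are hypothetical; `client` and `types` are the objects imported in the Setup section and are passed in explicitly so the helper has no hidden globals. No specific video URL is assumed here.

```python
def summarize_youtube(client, types, model, youtube_url,
                      prompt="Summarize this video in three sentences."):
    """Reference a YouTube video by URL and return the model's text reply."""
    # The URL travels as file data; the question is a separate text part.
    video_part = types.Part(file_data=types.FileData(file_uri=youtube_url))
    response = client.models.generate_content(
        model=model,
        contents=types.Content(parts=[video_part, types.Part(text=prompt)]),
    )
    return response.text
```

Call it as `summarize_youtube(client, types, MODEL, youtube_url)`, where `youtube_url` is the link to a public video of your choice.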

## Audio

You can use Gemini to process audio files. For example, you can use it to generate a transcript of an audio file or to summarize the content of an audio file.

Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.
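
The arithmetic above can be wrapped in a small helper to estimate input tokens before sending a file. This is only a back-of-the-envelope sketch based on the 32 tokens per second figure; `audio_input_tokens` is a hypothetical name.

```python
AUDIO_TOKENS_PER_SECOND = 32  # figure quoted in the text above


def audio_input_tokens(duration_seconds):
    """Estimate how many input tokens an audio clip of this length uses."""
    return round(duration_seconds * AUDIO_TOKENS_PER_SECOND)


one_minute = audio_input_tokens(60)      # 1920
one_hour = audio_input_tokens(60 * 60)   # 115200
```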

For more info about technical details and supported formats, see [the docs](https://ai.google.dev/gemini-api/docs/audio#supported-formats).

In [None]:
URL = "https://storage.googleapis.com/generativeai-downloads/data/jeff-dean-presentation.mp3"
!wget -q $URL -O sample.mp3

In [None]:
import IPython
IPython.display.Audio("sample.mp3")

In [None]:
# TODO

As a rule of thumb, 1 minute of speech is ~130 words or ~170 output tokens. With an output limit of 8192 tokens, that gives 8192 / 170 = ~48 minutes of transcript per response.
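
The same estimate as a sketch in code, together with a helper that plans chunk boundaries in milliseconds (the unit `pydub` slices use). The function names are hypothetical, and the 170 tokens per minute figure is the rough estimate from above, so leaving some safety margin is probably wise.

```python
def max_transcript_minutes(output_token_limit=8192, tokens_per_minute=170):
    """Whole minutes of speech whose transcript fits in one response."""
    return output_token_limit // tokens_per_minute


def chunk_ranges_ms(total_ms, chunk_minutes):
    """Return (start_ms, end_ms) pairs that cover the whole recording."""
    chunk_ms = chunk_minutes * 60 * 1000
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]
```

Each pair can be used to slice a `pydub` segment, for example `audio[start:end]`, so that every chunk is transcribed in its own request.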

You can use Gemini for transcribing, but be aware of the output token limit.

We can use `pydub` to split the audio file:

In [None]:
%pip install pydub

In [None]:
from pydub import AudioSegment
audio = AudioSegment.from_mp3("sample.mp3")
duration = 60 * 1000  # pydub works in milliseconds
audio_clip = audio[:duration]

In [None]:
audio_clip

In [None]:
import io
buffer = io.BytesIO()
audio_clip.export(buffer, format="mp3")

audio_bytes = buffer.read()

For files below 20 MB, you can provide the audio file directly as inline data in your request.

To do this, use `types.Part.from_bytes` and add it to the `contents` argument when calling `generate_content()`:

In [None]:
# TODO
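
A minimal sketch of the inline approach described above. `transcribe_audio_bytes` and its default prompt are hypothetical; `client` and `types` come from the Setup section and are passed in explicitly. The MIME type `audio/mp3` is assumed to match the exported clip.

```python
def transcribe_audio_bytes(client, types, model, audio_bytes,
                           mime_type="audio/mp3",
                           prompt="Generate a transcript of the speech."):
    """Send audio bytes as inline data and return the model's text reply."""
    audio_part = types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)
    response = client.models.generate_content(
        model=model,
        contents=[prompt, audio_part],
    )
    return response.text
```

With the one-minute clip from the previous cells, this could be called as `print(transcribe_audio_bytes(client, types, MODEL, audio_bytes))`.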

Let's use a format that's easier to understand:

In [None]:
# TODO

Another useful prompt you can try with audio files:
- Refer to timestamps: `Provide a transcript of the speech from 02:30 to 03:29.`

## PDFs

PDFs can be passed to Gemini in the same way as images, video, and audio:

In [None]:
URL = "https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/invoice.pdf"
!wget -q $URL -O invoice.pdf

In [None]:
# TODO

In [None]:
# TODO, count tokens
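
A sketch of how the two cells above might look: upload the PDF, ask a question about it, and count the input tokens with `client.models.count_tokens`. The helper names and the example prompt are hypothetical.

```python
def ask_about_pdf(client, model, pdf_file, prompt):
    """Ask a question about a PDF already uploaded through the File API."""
    response = client.models.generate_content(
        model=model,
        contents=[pdf_file, prompt],
    )
    return response.text


def pdf_token_count(client, model, pdf_file, prompt):
    """Return the number of input tokens for the PDF plus the prompt."""
    result = client.models.count_tokens(model=model, contents=[pdf_file, prompt])
    return result.total_tokens
```

For example, after `invoice_pdf = client.files.upload(file="invoice.pdf")`, calling `pdf_token_count(client, MODEL, invoice_pdf, "Summarize this invoice.")` gives a sense of how much of the context window the document uses.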

**Next step**: A cool feature I recommend is to combine it with structured outputs using Pydantic.

In [None]:
# TODO
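
One way the structured-output cell could look, as a sketch. The `Invoice` fields below are hypothetical placeholders and should be adapted to what the PDF actually contains. `extract_invoice` is likewise an illustrative name, and `invoice_pdf` is assumed to be the file object returned by `client.files.upload`.

```python
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    """Hypothetical schema; rename or extend the fields to match your PDF."""
    invoice_number: str = Field(description="Invoice number as printed on the document")
    date: str = Field(description="Invoice date")
    total: float = Field(description="Total amount due")


def extract_invoice(client, model, pdf_file, schema=Invoice):
    """Request JSON that follows `schema`; the SDK exposes it as `response.parsed`."""
    response = client.models.generate_content(
        model=model,
        contents=[pdf_file, "Extract the structured data from this invoice."],
        config={
            "response_mime_type": "application/json",
            "response_schema": schema,
        },
    )
    return response
```

With `response = extract_invoice(client, MODEL, invoice_pdf)`, the next cell's `response.parsed.model_dump()` returns the extracted fields as a plain dictionary.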

In [None]:
response.parsed.model_dump()

## Code

Gemini is good at understanding and generating code.

Let's use [gitingest](https://github.com/cyclotruc/gitingest) to chat with a GitHub repo:

In [None]:
%pip install gitingest

In [None]:
from gitingest import ingest_async

summary, tree, content = await ingest_async("https://github.com/patrickloeber/snake-ai-pytorch")

In [None]:
print(summary)

In [None]:
print(tree)

In [None]:
# TODO
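
A sketch of how the repository digest might be used in a chat session. `start_repo_chat` and the wording of the first message are hypothetical; `summary`, `tree`, and `content` are the strings returned by `ingest_async` above. Sending the whole digest as the first message is just one simple option.

```python
def start_repo_chat(client, model, summary, tree, content):
    """Open a chat session and send the repository digest as the first message."""
    chat = client.chats.create(model=model)
    context = (
        "You are helping me understand a GitHub repository.\n\n"
        f"Summary:\n{summary}\n\n"
        f"File tree:\n{tree}\n\n"
        f"File contents:\n{content}\n\n"
        "Reply with a short overview of what the project does."
    )
    first_reply = chat.send_message(context)
    return chat, first_reply.text
```

After `chat, overview = start_repo_chat(client, MODEL, summary, tree, content)`, follow-up questions reuse the same context, for example `chat.send_message("Which files should I read first?").text`.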

## Exercise: Analyze supermarket invoice

Task:
- Define a schema for a single item that contains `item_name` and `item_cost`
- Define a schema for the supermarket invoice with `items`, `date`, and `total_cost`
- Use Gemini to extract all info from the supermarket bill into the defined supermarket invoice schema.
- Ask Gemini to list a few healthy recipes based on the items. If you have dietary restrictions, tell Gemini about them!

In [None]:
import requests
url = 'https://raw.githubusercontent.com/patrickloeber/workshop-build-with-gemini/main/data/rewe_invoice.pdf'
res = requests.get(url)
with open("rewe_invoice.pdf", "wb") as f:
    f.write(res.content)

In [None]:
rewe_pdf = client.files.upload(file='rewe_invoice.pdf')

In [None]:
# TODO

## Recap & Next steps

Gemini's multimodal capabilities are powerful, and with the Python SDK you only need a few lines of code to process various media types, including text, audio, images, videos, and PDFs.

For many use cases, it's helpful to constrain Gemini to respond with JSON using structured outputs.

More helpful resources:

- [Audio understanding docs](https://ai.google.dev/gemini-api/docs/audio?lang=python)
- [Vision understanding docs](https://ai.google.dev/gemini-api/docs/vision?lang=python)
- [Philschmid blog post: From PDFs to Insights](https://www.philschmid.de/gemini-pdf-to-data)
- [Structured output docs](https://ai.google.dev/gemini-api/docs/structured-output?lang=python)
- [Video understanding cookbook](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Video_understanding.ipynb)

Next steps:

- **[Part 3: Thinking models + agentic capabilities (tool usage)](https://github.com/patrickloeber/workshop-build-with-gemini/blob/main/cookbooks/part-3-thinking-and-tools.ipynb)**
