In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemini Long Context Window Tutorial - Video Modality
<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/vijaykyr/genai-demos/blob/main/long_context/Gemini_Long_Context_Video.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
</table>

# Overview

Gemini 1.5 Pro supports up to 2 million input tokens. This is roughly equivalent to:
- 2000 pages of text
- 11 hours of audio
- 60 minutes of video (without audio)
- 50 minutes of video (with audio)

In the [previous notebook](Gemini_Long_Context_Text.ipynb) we explored long context windows using the text modality, and introduced the concepts of batching and caching to reduce cost and latency. 

In this notebook we will explore the [video modality](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/video-understanding). 

# Setup

## Install Dependencies (If Needed)

The list `packages` contains tuples of (import name, install name). If the import name cannot be found, the install name is used to install the package quietly for the current user.
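The check described above can be tried on its own, without triggering an install. This is a minimal sketch: the helper name `missing_packages` and the demo package names are hypothetical, and only `importlib.util.find_spec` from the standard library is used.

```python
import importlib.util


def missing_packages(packages):
    """Return the install names of packages whose import name cannot be found."""
    missing = []
    for import_name, install_name in packages:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(import_name) is None:
            missing.append(install_name)
    return missing


# "json" ships with Python; the second import name is made up for the demo
demo = [("json", "json"), ("no_such_module_for_demo", "demo-package")]
print(missing_packages(demo))
```

Only the names returned by the helper would need to be passed to `pip install`.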

In [1]:
# tuples of (import name, install name)
packages = [
    ('vertexai','google-cloud-aiplatform'),
]

import importlib.util  # find_spec lives in the importlib.util submodule
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user

## Restart Kernel (If Installs Occurred)

If the kernel restarts, resume execution from the cell that follows this one.

In [2]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Authenticate

Authenticate to Google Cloud (GCP).

In [3]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()

# If using local jupyter instance, uncomment and run:
# !gcloud auth login

## Config

Update the variables below to specify the GCP project ID and region you want to use.

In [4]:
GCP_PROJECT_ID = "vijay-sandbox-335018"
REGION = "us-central1"
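As an alternative to hard-coding these values, they could be read from environment variables. The sketch below is only an illustration: the helper `resolve_config` is hypothetical, `GOOGLE_CLOUD_PROJECT` is a commonly used variable name, and `GOOGLE_CLOUD_REGION` is an assumed name chosen for this example.

```python
import os


def resolve_config(env=None, default_region="us-central1"):
    """Return (project_id, region) from a mapping of environment variables."""
    env = os.environ if env is None else env
    project = env.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise ValueError("Set GOOGLE_CLOUD_PROJECT to your GCP project ID")
    # GOOGLE_CLOUD_REGION is an assumed variable name, not an official one
    region = env.get("GOOGLE_CLOUD_REGION", default_region)
    return project, region


# Demo with an explicit mapping so the real environment is not touched
print(resolve_config({"GOOGLE_CLOUD_PROJECT": "my-project"}))
```

The returned pair could then be assigned to `GCP_PROJECT_ID` and `REGION`.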

## Import Libraries and Initialize Gemini

In [5]:
import datetime

from IPython.display import Markdown
import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel, Part

vertexai.init(project=GCP_PROJECT_ID, location=REGION)

# Gemini Config
GENERATION_CONFIG=dict(temperature=0)
model = GenerativeModel(
    "gemini-1.5-pro-001", 
    generation_config=GENERATION_CONFIG,
)

# Source Videos

To demonstrate Gemini's long context capabilities in the video modality we will use two videos from I/O 2024, Google's annual developer conference. 

1. The [opening keynote](https://youtu.be/uFroTufv6es). It is 21 minutes long and ~370K tokens. 
2. The [DeepMind keynote](https://www.youtube.com/watch?v=NVwUMyYuLtw). It is 17 minutes long and ~300K tokens.

We will start with some single-video questions, then demonstrate multi-video prompting by including both videos as context, for a total of ~670K tokens.

These videos are publicly available on YouTube. However, since the Gemini API requires video content to be staged in Google Cloud Storage, we store copies of them there.
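The token figures quoted for the two videos can be checked against the 2 million token limit with a little arithmetic. This sketch uses only the approximate numbers stated in this notebook; the variable names are illustrative.

```python
CONTEXT_LIMIT_TOKENS = 2_000_000  # Gemini 1.5 Pro input limit quoted in the overview

# Approximate token counts quoted above for each video
VIDEO_TOKENS = {
    "opening keynote": 370_000,
    "DeepMind keynote": 300_000,
}

total_tokens = sum(VIDEO_TOKENS.values())
headroom = CONTEXT_LIMIT_TOKENS - total_tokens
share = total_tokens / CONTEXT_LIMIT_TOKENS

print(f"total: {total_tokens:,} tokens")
print(f"headroom: {headroom:,} tokens")
print(f"share of context window: {share:.1%}")
```

Both videos together use roughly a third of the context window, which leaves room for prompts and responses.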

In [6]:
OPENING_URI = "gs://gen-ai-assets-public/Google_IO_2024_Keynote_Opening.mp4"
DEEPMIND_URI = "gs://gen-ai-assets-public/Google_IO_2024_Keynote_Deepmind.mp4"

# Single Video Prompts

## Cache

When the same long context is reused across several prompts, it is best practice to cache it first. This reduces cost significantly. For a more detailed analysis of the cost savings from caching, see the [previous notebook](Gemini_Long_Context_Text.ipynb).
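A cache created with a TTL stays around until it expires, so it can be worth deleting it explicitly once the prompts are done. The following is a minimal sketch of that pattern, not part of the original notebook. It assumes the cache object exposes a `delete()` method, as `CachedContent` in the preview SDK appears to; check this against your SDK version. `temporary_cache` and `FakeCache` are hypothetical names, and `FakeCache` is a stand-in so the example runs without credentials.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cache(create_cache):
    """Create a cache, hand it to the caller, and delete it afterwards."""
    cache = create_cache()
    try:
        yield cache
    finally:
        # Assumption: the cache object has a delete() method
        cache.delete()


class FakeCache:
    """Stand-in for a real cached-content object, used only for this demo."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


fake = FakeCache()
with temporary_cache(lambda: fake) as cache:
    print("inside block, deleted:", cache.deleted)
print("after block, deleted:", fake.deleted)
```

With the real SDK, `create_cache` would be a function that wraps the `caching.CachedContent.create(...)` call shown in the next cell.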

In [7]:
%%time

system_instruction = """
Here is the opening keynote from Google I/O 2024. Based on the video answer the following questions.
"""

contents = [
    Part.from_uri(OPENING_URI, mime_type="video/mp4"),
]

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction=system_instruction,
    contents=contents,
    ttl=datetime.timedelta(minutes=30),
)
cached_content = caching.CachedContent(cached_content_name=cached_content.name)
model_cached = GenerativeModel.from_cached_content(
    cached_content=cached_content,
    generation_config=GENERATION_CONFIG,
)


CPU times: user 121 ms, sys: 107 ms, total: 228 ms
Wall time: 2min 15s


## Prompts

### Prompt #1

In [8]:
%%time
response = model_cached.generate_content("Describe the setting in which the video takes place")
Markdown(response.text)


CPU times: user 30.7 ms, sys: 25.5 ms, total: 56.2 ms
Wall time: 44.2 s


The video takes place at the Google I/O 2024 keynote, held at the Shoreline Amphitheatre in Mountain View, California. The CEO of Google, Sundar Pichai, is giving the opening keynote speech. The stage has a large screen displaying the Google logo and various presentations. The audience consists of thousands of developers, with millions more joining virtually around the world. 


#### Analysis

This response demonstrates Gemini's use of both audio and visual signals in the video.
- *'The stage has a large screen displaying the Google logo and various presentations'*. This is a purely visual cue
- *'The audience consists of thousands of developers, with millions more joining virtually around the world.'* This is an audio cue: the speaker states it aloud.

### Prompt #2



In [14]:
%%time
response = model_cached.generate_content("Give me the timestamps of all applauses in the video.")
Markdown(response.text)

CPU times: user 31.4 ms, sys: 25.9 ms, total: 57.3 ms
Wall time: 38.1 s


Sure, here are the timestamps of all the applauses in the video:

- 01:31-01:47
- 05:45-05:51
- 06:50-06:54
- 07:43-07:49
- 11:04-11:11
- 11:35-11:41
- 12:08-12:15
- 16:53-16:58

Let me know if you have any other questions. 


#### Analysis

This response demonstrates Gemini's retrieval across the full span of the video, and could be used to streamline video editing.
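To use timestamps like these in an editing workflow, the `MM:SS-MM:SS` ranges in the response text first need to be converted to numbers. The sketch below shows one way to do that with a regular expression. The helper `parse_ranges` is hypothetical and assumes the response keeps the format shown above, which is not guaranteed.

```python
import re

# Matches ranges such as "01:31-01:47"
RANGE_PATTERN = re.compile(r"(\d{1,2}):(\d{2})\s*-\s*(\d{1,2}):(\d{2})")


def parse_ranges(text):
    """Return a list of (start_seconds, end_seconds) tuples found in text."""
    ranges = []
    for m1, s1, m2, s2 in RANGE_PATTERN.findall(text):
        start = int(m1) * 60 + int(s1)
        end = int(m2) * 60 + int(s2)
        ranges.append((start, end))
    return ranges


sample = """- 01:31-01:47
- 05:45-05:51"""
print(parse_ranges(sample))  # [(91, 107), (345, 351)]
```

In the notebook, `parse_ranges(response.text)` would give the applause segments in seconds, ready to hand to a video editing tool.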

### Prompt #3

In [9]:
%%time
response = model_cached.generate_content("Describe the hand gesture the speaker uses most frequently.")
Markdown(response.text)

CPU times: user 60.6 ms, sys: 52.8 ms, total: 113 ms
Wall time: 1min 41s


The speaker frequently uses a gesture where he brings his hands together in front of his chest, with his palms facing each other and fingers loosely interlaced. He often moves his hands slightly up and down while maintaining this gesture. 


#### Analysis

This response illustrates Gemini's attention to subtle visual details.

### Prompt #4

In [13]:
%%time
response = model_cached.generate_content("Who presented the demo?")
Markdown(response.text)

CPU times: user 34.7 ms, sys: 27.1 ms, total: 61.7 ms
Wall time: 43.1 s


The demo was presented by Josh Woodward. 

#### Analysis

In the video, Josh is introduced only by his first name, while his full name appears briefly on a slide. Gemini picks up on this text and associates it with the speaker. It also distinguishes the presenter of the demo from the main speaker (Sundar Pichai).

# Multi Video Prompts

Now we will include multiple videos in the prompt. Gemini currently supports up to 10 videos per prompt.
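Before building a multi-video prompt, it can help to validate the list of video URIs against the limit mentioned above. This is a minimal sketch: `check_video_uris` is a hypothetical helper, the limit of 10 is taken from the sentence above, and the `gs://` check reflects the Cloud Storage staging used in this notebook.

```python
MAX_VIDEOS_PER_PROMPT = 10  # limit quoted above; may change over time


def check_video_uris(uris, max_videos=MAX_VIDEOS_PER_PROMPT):
    """Validate a list of Cloud Storage video URIs and return it as a list."""
    uris = list(uris)
    if len(uris) > max_videos:
        raise ValueError(f"{len(uris)} videos given, at most {max_videos} supported")
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URI: {uri}")
    return uris


video_uris = check_video_uris([
    "gs://gen-ai-assets-public/Google_IO_2024_Keynote_Opening.mp4",
    "gs://gen-ai-assets-public/Google_IO_2024_Keynote_Deepmind.mp4",
])
print(len(video_uris), "videos ready for the prompt")
```

Each validated URI could then be wrapped with `Part.from_uri(uri, mime_type="video/mp4")`, as in the cell below.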


## Cache

In [16]:
%%time

system_instruction = """
Here are two videos from Google IO 2024. The first is the opening keynote and the second is the deepmind keynote. Based on the videos answer the following questions.
"""

contents = [
    Part.from_uri(OPENING_URI, mime_type="video/mp4"),
    Part.from_uri(DEEPMIND_URI, mime_type="video/mp4"),
]

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction=system_instruction,
    contents=contents,
    ttl=datetime.timedelta(minutes=30),
)
cached_content = caching.CachedContent(cached_content_name=cached_content.name)
model_cached = GenerativeModel.from_cached_content(
    cached_content=cached_content,
    generation_config=GENERATION_CONFIG,
)


CPU times: user 79.5 ms, sys: 59.4 ms, total: 139 ms
Wall time: 1min 1s


## Prompts

### Prompt #5

In [17]:
%%time
res = model_cached.generate_content("How do the videos differ?")
Markdown(res.text)


CPU times: user 41.6 ms, sys: 33.5 ms, total: 75.1 ms
Wall time: 43.2 s


The first video is a Google I/O keynote speech by Sundar Pichai, CEO of Google, about the company's advancements in AI, particularly their Gemini model. The second video is a Google DeepMind keynote speech by Demis Hassabis, CEO of Google DeepMind, about the company's advancements in AI, particularly their Gemini model and Project Astra.

The first video focuses on the broad applications of Gemini across various Google products, highlighting its multimodality, long context window, and ability to handle complex tasks. It showcases examples like Ask Photos, which allows users to search their photo library using natural language, and AI agents that can summarize emails, transcribe meetings, and even schedule appointments.

The second video delves deeper into the technical aspects of Gemini, emphasizing its speed, efficiency, and ability to handle complex reasoning tasks. It also introduces Project Astra, an initiative focused on developing universal AI agents that can assist users in everyday life. The video showcases examples of how Gemini can be used for creative tasks like generating music and videos, highlighting its potential to revolutionize various industries.

In essence, the first video provides a high-level overview of Gemini's capabilities and its impact on Google products, while the second video offers a more technical and research-oriented perspective on Gemini and its potential for future AI development.

#### Analysis

This response demonstrates comparative analysis of two videos. It requires first understanding the contents of each video individually, then reasoning about how they differ.

### Prompt #6

In [24]:
%%time
res = model_cached.generate_content("What new features were launched? Format your response as a bulleted list.")
Markdown(res.text)

CPU times: user 50.3 ms, sys: 41.2 ms, total: 91.5 ms
Wall time: 1min 16s


Sure, here are the new features launched based on the video provided:

* **AI Overviews** - A new search experience that allows users to ask longer and more complex questions, even searching with photos.
* **Ask Photos** - A new feature in Google Photos that allows users to search their memories in a deeper way by asking questions about their photos.
* **2 Million Tokens Context Window** - An expansion of the context window in Gemini 1.5 Pro to 2 million tokens, opening up new possibilities for developers.
* **Audio Overviews** - A new feature in NotebookLM that allows users to listen to a lively science discussion personalized for them based on the text material they provide.
* **Gemini 1.5 Flash** - A lighter-weight model compared to Gemini 1.5 Pro, designed to be fast and cost-efficient to serve at scale while still featuring multimodal reasoning capabilities and breakthrough long context.
* **Project Astra** - A universal AI agent that can be truly helpful in everyday life.
* **Imagen 3** - Google's most capable image generation model yet, featuring stronger evaluations, extensive red teaming, and state-of-the-art watermarking with SynthID.
* **Music AI Sandbox** - A suite of professional music AI tools that can create new instrumental sections from scratch, transfer styles between tracks, and more.
* **Veo** - Google's newest and most capable generative video model, capable of creating high-quality 1080p videos from text, image, and video prompts.

Please note that some of these features are still in development and may not be available to the public yet.


#### Analysis

This response illustrates retrieval across multiple videos.

### Prompt #7

In [18]:
%%time
res = model_cached.generate_content("What technologies are introduced that can help artists?")
Markdown(res.text)

CPU times: user 40 ms, sys: 32.9 ms, total: 72.9 ms
Wall time: 46.6 s


The video shows two technologies that can help artists:

1. **Music AI Sandbox:** This is a suite of professional music AI tools that can create new instrumental sections from scratch, transfer styles between tracks, and more.
2. **Veo:** This is a generative video model that can create high-quality 1080p videos from text, image, and video prompts. It can capture the details of your instructions in different visual and cinematic styles. You can prompt for things like aerial shots of a landscape or a timelapse and further edit your videos using additional prompts.

Both of these technologies are powered by Google's Gemini AI model.


#### Analysis

The artist collaborations are shown in the second video only. Gemini is able to isolate this video and pick out the relevant technologies mentioned.

# Conclusion

We have shown how Gemini's long context and multimodal capabilities can be combined to analyze videos of considerable length. Across both single-video and multi-video prompts, Gemini showed competence on retrieval, description, and reasoning tasks.