# Multimodality

First, let's copy a sample image from a public Cloud Storage bucket:

In [None]:
!gsutil cp gs://cloud-samples-data/generative-ai/image/boats.jpeg .

We can send the image to the model as a base64-encoded data URL and ask for a description:

In [None]:
import base64
from langchain_google_vertexai import ChatVertexAI
from langchain_core.messages import HumanMessage


llm = ChatVertexAI(model_name="gemini-2.0-flash-001")

# Read the image and base64-encode it so it can be embedded in a data URL.
with open("boats.jpeg", "rb") as image_file:
    image_bytes = image_file.read()
    base64_bytes = base64.b64encode(image_bytes).decode("utf-8")

prompt = [
    {"type": "text", "text": "Describe the image: "},
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_bytes}"}},
]

response = llm.invoke([HumanMessage(content=prompt)])
print(response.content)

The image shows a calm body of water with two boats anchored in the foreground and a city skyline in the background. The first boat is a pontoon boat, with a dark blue stripe, a white deck, and a green Bimini top. The second boat is a smaller, open motorboat. In the background, there's a large stone bridge with multiple arches, leading to a cityscape with a mix of modern high-rise buildings and older structures, including a building with two distinctive domes. The sky is overcast, contributing to the subdued lighting in the scene.
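
The encoding step above can be wrapped in a small helper so that it works for other image formats too. This is only a sketch: `image_part` is a hypothetical name (not part of LangChain), and falling back to `image/jpeg` when the extension is unknown is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(path: str) -> dict:
    """Build an `image_url` content part from a local image file."""
    # Guess the MIME type from the file extension.
    # Falling back to JPEG is an assumption made for this sketch.
    mime_type = mimetypes.guess_type(path)[0] or "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }
```

With this helper, the prompt could be written as `[{"type": "text", "text": "Describe the image: "}, image_part("boats.jpeg")]`.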


The same works for videos, which we can reference directly by their Cloud Storage URI:

In [3]:
video_uri = "gs://cloud-samples-data/generative-ai/video/animals.mp4"
prompt = [
    {"type": "text", "text": "Describe the video in a few sentences."},
    {"type": "media", "file_uri": video_uri, "mime_type": "video/mp4"},
]

response = llm.invoke([HumanMessage(content=prompt)])
print(response.content)

The video shows the Los Angeles Zoo with zoo employees setting up different enclosures for animals to interact with Google Photos cameras in order to give people unique perspectives of the animals that they've never seen before. The animals in the video include giraffes, elephants, tigers, and otters. It also showcases how Google Photos automatically backs up photos and lets users share them via social media.


We can also restrict the model to a segment of the video by specifying start and end offsets:

In [4]:
offset_hint = {
    "start_offset": {"seconds": 10},
    "end_offset": {"seconds": 20},
}
prompt = [
    {"type": "text", "text": "Describe the video in a few sentences."},
    {"type": "media", "file_uri": video_uri, "mime_type": "video/mp4", "video_metadata": offset_hint},
]

response = llm.invoke([HumanMessage(content=prompt)])
print(response.content)

The video is about what would happen if animals in the real world were given technology like the characters in Zootopia. The video shows Nick Wilde taking a selfie, followed by a shot of the Los Angeles Zoo and an interview with Rico Farmer, Brand and Experiential Marketing Manager. He says they’re at the Los Angeles Zoo, letting animals talk.
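
If several segments of the same video need to be queried, the content part can be built by a small function. The sketch below only assembles the dictionary in the same shape as the cell above; `video_part` is a hypothetical helper name, and the check that the end offset comes after the start offset is an added assumption, not something the API is known to require in this form.

```python
from typing import Optional


def video_part(
    file_uri: str,
    start_seconds: Optional[int] = None,
    end_seconds: Optional[int] = None,
    mime_type: str = "video/mp4",
) -> dict:
    """Build a `media` content part, optionally limited to a time segment."""
    if (
        start_seconds is not None
        and end_seconds is not None
        and end_seconds <= start_seconds
    ):
        raise ValueError("end_seconds must be greater than start_seconds")

    part = {"type": "media", "file_uri": file_uri, "mime_type": mime_type}

    # Only attach video_metadata when at least one offset was given.
    metadata = {}
    if start_seconds is not None:
        metadata["start_offset"] = {"seconds": start_seconds}
    if end_seconds is not None:
        metadata["end_offset"] = {"seconds": end_seconds}
    if metadata:
        part["video_metadata"] = metadata
    return part
```

For example, `video_part(video_uri, 10, 20)` produces the same `media` part as the cell above.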


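We can also use a prompt template and substitute the base64-encoded image into it at invocation time. In the example below, a placeholder string stands in for the encoded bytes: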

In [5]:
image_uri = "gs://cloud-samples-data/generative-ai/image/boats.jpeg"

In [6]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "user",
            [
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/jpeg;base64,{image_bytes_str}"},
                }
            ],
        )
    ]
)
prompt.invoke({"image_bytes_str": "test-url"})

ChatPromptValue(messages=[HumanMessage(content=[{'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,test-url'}}], additional_kwargs={}, response_metadata={})])
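
To fill the template with a real image instead of a placeholder, the raw bytes first need to be base64-encoded into a string. A minimal sketch, where `template_inputs` is a hypothetical helper whose only job is to produce the `image_bytes_str` variable the template expects:

```python
import base64


def template_inputs(image_bytes: bytes) -> dict:
    """Return the input variables for the image prompt template."""
    # The template already contains the "data:image/jpeg;base64," prefix,
    # so only the encoded payload is passed in.
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {"image_bytes_str": encoded}
```

Assuming `image_bytes` still holds the contents of `boats.jpeg` from the first example, `prompt.invoke(template_inputs(image_bytes))` would then produce a message carrying the actual image.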