# Multimodal Understanding with Amazon Nova Models
This notebook shows how the Amazon Nova models can understand and answer questions about videos and images.

You can upload a video or image and ask the model questions about it. Let's try this below:

In [None]:
%pip install -q -r requirements.txt

In [5]:
from IPython.display import Video

Video("the-sea.mp4")

We can read the file as bytes, which can then be passed to the `Converse` API.

In [None]:
with open("./the-sea.mp4", "rb") as file:
    media_bytes = file.read()

With the data in bytes, we can pass it straight into the `Converse` API.

In [None]:
import boto3

client = boto3.client("bedrock-runtime")

messages = [
    {
        "role": "user",
        "content": [
            {"video": {"format": "mp4", "source": {"bytes": media_bytes}}},
            {"text": "What is happening in this video?"},
        ],
    }
]

response = client.converse(modelId="us.amazon.nova-lite-v1:0", messages=messages)
print(response["output"]["message"]["content"][0]["text"])
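
A Converse response can contain more than one content block, and it also reports token usage under the `usage` key. As a minimal sketch, the hypothetical helper below (`extract_text_and_usage` is not part of this notebook's `utils`) joins every text block and returns the input and output token counts if they are present:

```python
def extract_text_and_usage(response):
    """Join all text blocks of a Converse response and return token usage.

    Hypothetical helper for illustration; it only reads keys from the
    response dictionary and does not call any AWS API itself.
    """
    blocks = response["output"]["message"]["content"]
    # Some blocks may not be text, so only collect those with a "text" key.
    text = "".join(block["text"] for block in blocks if "text" in block)
    usage = response.get("usage", {})
    return text, usage.get("inputTokens"), usage.get("outputTokens")
```

Calling `extract_text_and_usage(response)` on the response above gives the answer text together with the number of tokens the video and prompt consumed, which is useful when comparing input methods later in this notebook.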

The video may not always be at an appropriate resolution to use as input for the models. In that case, we can downscale it using ffmpeg-python. The `resize_video` helper below does this and uploads the resized file to S3, where the model can read it.

In [None]:
import sagemaker
from utils import resize_video

bucket_name = sagemaker.session.Session().default_bucket()

input_s3_uri = resize_video(media_bytes, bucket_name)
input_s3_uri

With an S3 URI, we can pass this into the API and see that we get a similar output.

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {"s3Location": {"uri": input_s3_uri}},
                }
            },
            {"text": "What is happening in this video?"},
        ],
    }
]

response = client.converse(modelId="us.amazon.nova-lite-v1:0", messages=messages)
response["output"]["message"]["content"][0]["text"]
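
The two requests above differ only in the `source` field of the video block. A minimal sketch of a helper that picks between them is shown below. `build_video_block` is a hypothetical name, and the 25 MB inline threshold is an assumption that should be checked against the current Amazon Nova documentation before relying on it.

```python
# Assumed limit for inline video bytes; verify against the current docs.
MAX_INLINE_VIDEO_BYTES = 25 * 1024 * 1024


def build_video_block(media_bytes=None, s3_uri=None, video_format="mp4"):
    """Hypothetical helper: build a Converse video content block.

    Prefers inline bytes when they fit under the assumed limit,
    otherwise falls back to an S3 location if one was supplied.
    """
    if media_bytes is not None and len(media_bytes) <= MAX_INLINE_VIDEO_BYTES:
        source = {"bytes": media_bytes}
    elif s3_uri is not None:
        source = {"s3Location": {"uri": s3_uri}}
    else:
        raise ValueError(
            "Provide media_bytes within the inline limit or an s3_uri"
        )
    return {"video": {"format": video_format, "source": source}}
```

The returned dictionary can be placed in the `content` list of a user message in the same position as the hand-written video blocks above.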

These models sample a number of frames from the video to use as input alongside our prompt.

We can complete a similar task ourselves by sampling frames and sending them as images. However, the Converse API supports only up to 20 images, so results may vary.

Using a helper library, this may look like:

In [None]:
import matplotlib.pyplot as plt
from utils import resample_video_to_frames, convert_frames_to_converse_format

frames = resample_video_to_frames(media_bytes)
print(f"The video now has {len(frames)} frames")
plt.imshow(frames[0])
plt.show()
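
If a resampling step returns more frames than the 20-image limit mentioned above, one option is to keep an evenly spaced subset. The sketch below is a hypothetical helper, not part of this notebook's `utils`, and it assumes `frames` is an indexable sequence such as a list:

```python
def select_evenly_spaced(items, max_items=20):
    """Return at most max_items elements, spread evenly across items.

    Hypothetical helper: the first and last elements are always kept
    when the input is longer than max_items (and max_items > 1).
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    count = len(items)
    if count <= max_items:
        return list(items)
    if max_items == 1:
        return [items[0]]
    # Integer arithmetic avoids rounding surprises and keeps indices distinct.
    indices = [i * (count - 1) // (max_items - 1) for i in range(max_items)]
    return [items[i] for i in indices]
```

For example, `select_evenly_spaced(frames)` would leave a short list untouched and thin a longer one down to 20 frames that still span the whole video.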

To be valid for the Converse API, each frame must be converted to a bytes representation before it is passed in to invoke the model.

In [None]:
converted_frames = convert_frames_to_converse_format(frames)

content = [
    {"image": {"format": "jpeg", "source": {"bytes": frame_bytes}}}
    for frame_bytes in converted_frames
]
content.append({"text": "What is happening in the video?"})

# Send the sampled frames (not the original video) to the model
messages = [{"role": "user", "content": content}]

response = client.converse(modelId="us.amazon.nova-lite-v1:0", messages=messages)
response["output"]["message"]["content"][0]["text"]

Since the model samples frames from our video, we may want to understand what this looks like for a video of arbitrary length, or estimate how many tokens our video might use.

We can plot these values using a function like the one below.

In [None]:
from utils import plot_samples_fps_tokens

plot_samples_fps_tokens()

We can see that beyond a certain duration, the model samples at most 960 frames, which is equivalent to 276,480 input tokens, and the sampled FPS decreases to keep the frame count at that cap.

With these numbers, we can roughly estimate the sampled FPS, frame count and token count for our video.
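
The arithmetic behind these figures can be sketched as follows. The 960-frame cap and the 276,480-token total come from the text above, which implies 288 tokens per frame. The 1 FPS base sampling rate is an assumption, and the `estimate_*` functions are hypothetical stand-ins rather than the implementation of the `utils` helpers:

```python
MAX_SAMPLED_FRAMES = 960      # cap stated above
MAX_INPUT_TOKENS = 276_480    # token count stated above for 960 frames
TOKENS_PER_FRAME = MAX_INPUT_TOKENS // MAX_SAMPLED_FRAMES  # 288
BASE_FPS = 1.0                # assumption: 1 frame per second until the cap


def estimate_frame_count(duration_seconds):
    """Rough number of frames sampled for a video of the given length."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return min(MAX_SAMPLED_FRAMES, max(1, int(duration_seconds * BASE_FPS)))


def estimate_fps(duration_seconds):
    """Effective sampling rate once the frame cap is taken into account."""
    return estimate_frame_count(duration_seconds) / duration_seconds


def estimate_tokens(duration_seconds):
    """Rough input token count contributed by the sampled frames."""
    return estimate_frame_count(duration_seconds) * TOKENS_PER_FRAME
```

Under these assumptions, a 240-second video would be sampled as 240 frames at 1 FPS for about 69,120 tokens, while a 1,920-second video would hit the 960-frame cap at an effective 0.5 FPS.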

In [None]:
from utils import get_sampled_fps, get_sampled_frame_count, get_sampled_tokens

duration_in_seconds = 240

print(
    f"Our video, which is {duration_in_seconds} seconds long, will be sampled at a rate of {get_sampled_fps(duration_in_seconds)} FPS for a frame count of {get_sampled_frame_count(duration_in_seconds)}, which will use around {get_sampled_tokens(duration_in_seconds)} tokens"
)