# Multimodal inference using Gemma 3n via pipeline

In this notebook, we present the different ways to run multimodal inference with the Gemma 3n model family through the `pipeline` API.

Possible combinations (using text, image, video, and audio):

- Image + text -> text
- Video + text -> text
- Audio + text -> text
- Video + audio -> text
- Video + audio + text -> text
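Every combination above is expressed as a single chat message whose `content` list mixes entries of type `text`, `image`, and `audio`; video is passed as a handful of sampled frames, that is, as several `image` entries. The sketch below shows that shape. `build_messages` is a hypothetical helper introduced only for illustration (it is not part of `transformers`), and the audio, images, text ordering is just one of the orderings used later in this notebook.

```python
def build_messages(text=None, images=(), audio=None):
    """Build a single-turn chat message list (hypothetical helper)."""
    content = []
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    for image in images:
        # Each sampled video frame becomes its own image entry
        content.append({"type": "image", "image": image})
    if text is not None:
        content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


# Video (as two frames) + audio + text
messages = build_messages(
    text="Describe this video",
    images=["image_0.0.png", "image_0.73.png"],
    audio="audios/sample_0.wav",
)
print([entry["type"] for entry in messages[0]["content"]])
```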

## Install dependencies and login

In [6]:
!pip install -U -q timm transformers datasets av

In [1]:
from huggingface_hub import login

login()

## Load videos and adapt them

In [24]:
import shutil
from huggingface_hub import hf_hub_download
import os

os.makedirs("videos", exist_ok=True)

dataset_id = "sergiopaniego/sample_videos"
video_filenames = ["sample_0.mp4", "sample_1.mp4"]

for filename in video_filenames:
    local_path = hf_hub_download(
        repo_id=dataset_id,
        repo_type="dataset",
        filename=filename
    )

    final_path = os.path.join("videos", filename)
    shutil.copy(local_path, final_path)

Extract the audio track of each video into a separate WAV file.

In [4]:
import subprocess

video_dir = "videos"
audio_dir = "audios"
os.makedirs(audio_dir, exist_ok=True)

for filename in os.listdir(video_dir):
    if not filename.endswith(".mp4"):
        continue

    idx = filename.split("_")[1].split(".")[0]
    video_path = os.path.join(video_dir, filename)
    audio_path = os.path.join(audio_dir, f"sample_{idx}.wav")

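    # Extract only the audio stream ("-map a"). "-y" (overwrite existing output)
    # is placed before the output path so ffmpeg does not treat it as a trailing option.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-q:a", "0", "-map", "a",
        audio_path
    ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)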
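
Because ffmpeg's output is discarded in the cell above, a failed extraction would go unnoticed. A minimal sanity check, assuming ffmpeg wrote a standard PCM WAV file (its usual default for `.wav`), is to open the result with Python's built-in `wave` module. `wav_info` is a hypothetical helper, not something this notebook defines elsewhere.

```python
import wave


def wav_info(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        n_frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": n_frames / rate,
        }


# Example usage once the extraction has run:
# print(wav_info("audios/sample_0.wav"))
```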

## Load pipeline configuration

In [1]:
import torch
from transformers import pipeline

pipe = pipeline(
   "image-text-to-text",
   model="google/gemma-3n-E4B-it", # "google/gemma-3n-E4B-it"
   device="cuda",
   torch_dtype=torch.bfloat16
)

Device set to use cuda


## Inference using pipeline

### Image + text -> text

In [2]:
messages = [
   {
       "role": "user",
       "content": [
           {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
           {"type": "text", "text": "Describe this image"}
       ]
   }
]

In [3]:
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])



An eye-level, low-angle, vertical shot features a sleek, futuristic aircraft soaring through a sky filled with clouds. The aircraft is predominantly light blue and white, with a distinctive, almost manta ray-like shape. It has a pointed nose and a swept-back wing design. 

The main body of the plane is elongated and tapers towards the tail. Along the side, there are several small, circular windows. The tail section is particularly striking, with two prominent vertical stabilizers angled outward. The word "GLOO" is printed in a stylish, sans-serif font on the side of the tail.

The background is a soft, hazy mix of pale blue sky and fluffy white clouds. Below the aircraft, the landscape is partially visible through the clouds, revealing hints of brown and grey terrain. 

The overall impression is one of modern technology and flight, with a sense of upward movement and freedom. The lighting suggests a bright, sunny day, casting subtle shadows on the
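
The description above stops mid-sentence, most likely because `max_new_tokens=200` caps the length of the generation. The reply itself is read with `output[0]["generated_text"][-1]["content"]`: for chat-style input, the indexing used in this notebook implies that `generated_text` holds the conversation with the assistant's reply as its last message. The mock below only illustrates that assumed layout; the reply string is a placeholder and `last_reply` is a hypothetical helper.

```python
# Mock of the assumed output layout for a chat-style call (not real model output)
mock_output = [
    {
        "generated_text": [
            {"role": "user", "content": [{"type": "text", "text": "Describe this image"}]},
            {"role": "assistant", "content": "<model reply>"},
        ]
    }
]


def last_reply(output):
    """Return the content of the final message of the first result."""
    return output[0]["generated_text"][-1]["content"]


print(last_reply(mock_output))
```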


### Video + text -> text

In [4]:
messages = [
   {
       "role": "user",
       "content": [
           {"type": "text", "text": "Describe this video"}
       ]
   }
]

In [5]:
import cv2
from PIL import Image
import numpy as np

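def downsample_video(video_path, num_frames=6):
    """Return `num_frames` evenly spaced (PIL image, timestamp in seconds) pairs."""
    vidcap = cv2.VideoCapture(video_path)
    total_frames = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = vidcap.get(cv2.CAP_PROP_FPS)

    frames = []
    # Evenly spaced frame indices from the first to the last frame
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)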

    for i in frame_indices:
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)
        success, image = vidcap.read()
        if success:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # Convert from BGR to RGB
            pil_image = Image.fromarray(image)
            timestamp = round(i / fps, 2)
            frames.append((pil_image, timestamp))

    vidcap.release()
    return frames
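
To make the sampling concrete, the sketch below reproduces the index and timestamp arithmetic of `downsample_video` in plain Python, using integer division in place of `np.linspace(..., dtype=int)` (both truncate toward zero here). The values `total_frames=113` and `fps=30` are assumptions: they were not read from the video, only chosen because they reproduce the timestamps that appear in the frame filenames later in this notebook.

```python
def sample_frames(total_frames, fps, num_frames=6):
    """Evenly spaced frame indices and their timestamps in seconds."""
    last = total_frames - 1
    # Exact integer version of: first index 0, last index total_frames - 1, even steps
    indices = [(i * last) // (num_frames - 1) for i in range(num_frames)]
    timestamps = [round(i / fps, 2) for i in indices]
    return indices, timestamps


# Assumed clip properties (113 frames at 30 fps), for illustration only
indices, timestamps = sample_frames(total_frames=113, fps=30)
print(indices)     # [0, 22, 44, 67, 89, 112]
print(timestamps)  # [0.0, 0.73, 1.47, 2.23, 2.97, 3.73]
```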

In [25]:
frames = downsample_video("videos/sample_0.mp4")

In [26]:
for frame in frames:
    image, timestamp = frame
    image.save(f"image_{timestamp}.png")
    messages[0]["content"].append({"type": "image", "image": f"image_{timestamp}.png"})
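
The same save-and-append loop is repeated in the later video sections. One possible way to factor it out, sketched under the assumption that each frame is a `(image, timestamp)` pair whose image has a PIL-style `.save(path)` method, is shown below. `frames_to_content` and the `frames` output directory are hypothetical names chosen for this example.

```python
import os


def frames_to_content(frames, out_dir="frames"):
    """Save (image, timestamp) pairs under out_dir and return image content entries."""
    os.makedirs(out_dir, exist_ok=True)
    entries = []
    for image, timestamp in frames:
        path = os.path.join(out_dir, f"image_{timestamp}.png")
        image.save(path)  # works for any object with a PIL-style .save(path)
        entries.append({"type": "image", "image": path})
    return entries


# Example usage with the variables defined in this notebook:
# messages[0]["content"].extend(frames_to_content(frames))
```

Writing the frames into a dedicated directory keeps them out of the working directory, at the cost of slightly longer paths in `messages`.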

In [27]:
messages

[{'role': 'user',
  'content': [{'type': 'text', 'text': 'Describe this video'},
   {'type': 'image', 'image': 'image_0.0.png'},
   {'type': 'image', 'image': 'image_0.73.png'},
   {'type': 'image', 'image': 'image_1.47.png'},
   {'type': 'image', 'image': 'image_2.23.png'},
   {'type': 'image', 'image': 'image_2.97.png'},
   {'type': 'image', 'image': 'image_3.73.png'}]}]

In [28]:
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

Here's a description of the video:

The video shows a person holding and displaying a **bright pink folding hand fan**. The fan is made of paper and features decorative **floral designs**, predominantly roses in shades of orange and red, along with green leaves. The fan has vertical ribs and a slightly scalloped edge. 

Throughout the video, the person rotates and unfolds the fan, showcasing the details of the artwork. The fan appears to be held indoors, with a **beige tiled wall** visible in the background. The person is wearing a **black wristband with white detailing**.

The video seems to be a simple presentation of the fan, possibly to highlight its beauty or for a product showcase.


### Audio + text -> text

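We reload the pipeline with `torch_dtype=torch.float32` so that inference with audio inputs works. If you're on Colab, restart the session first to avoid running out of memory (OOM). After a restart, re-run the cells that define `downsample_video` and `frames` before moving on to the video sections below.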
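
A rough estimate shows why memory becomes tight. When this notebook was first run, the four checkpoint shards were reported as 3.08, 2.66, 4.97, and 4.99 GB. Assuming the checkpoint is stored in bfloat16 (2 bytes per parameter; this is an assumption, not something the notebook states), loading it in float32 (4 bytes per parameter) roughly doubles the weight memory. The sketch ignores activations and any other runtime overhead.

```python
# Shard sizes in GB as reported by the download log of the original run
shard_sizes_gb = [3.08, 2.66, 4.97, 4.99]
checkpoint_gb = sum(shard_sizes_gb)

BYTES_BF16 = 2  # assumed storage dtype of the checkpoint
BYTES_FP32 = 4

# Same number of parameters, twice the bytes per parameter
fp32_gb = checkpoint_gb * BYTES_FP32 / BYTES_BF16

print(f"bfloat16 weights: ~{checkpoint_gb:.1f} GB")
print(f"float32 weights:  ~{fp32_gb:.1f} GB")
```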

In [3]:
import torch
from transformers import pipeline

pipe = pipeline(
   "image-text-to-text",
   model="google/gemma-3n-E4B-it", # "google/gemma-3n-E4B-it"
   device="cuda",
   torch_dtype=torch.float32
)

Device set to use cuda


In [4]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ]
    }
]

In [5]:
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])



Send a text to Mike. I'll be home late tomorrow.


### Video + audio -> text

In [16]:
messages = [
   {
       "role": "user",
       "content": [
       ]
   }
]

In [17]:
for frame in frames:
    image, timestamp = frame
    image.save(f"image_{timestamp}.png")
    messages[0]["content"].append({"type": "image", "image": f"image_{timestamp}.png"})

In [18]:
messages[0]["content"].append({"type": "audio", "audio": "audios/sample_0.wav"})

In [19]:
messages

[{'role': 'user',
  'content': [{'type': 'image', 'image': 'image_0.0.png'},
   {'type': 'image', 'image': 'image_0.73.png'},
   {'type': 'image', 'image': 'image_1.47.png'},
   {'type': 'image', 'image': 'image_2.23.png'},
   {'type': 'image', 'image': 'image_2.97.png'},
   {'type': 'image', 'image': 'image_3.73.png'},
   {'type': 'audio', 'audio': 'audios/sample_0.wav'}]}]

In [20]:
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

This is a **hand fan**. It appears to be made of paper or thin cardstock and has a pink surface decorated with floral designs. It is likely used for cooling oneself or as a decorative item.


### Video + audio + text -> text



In [11]:
messages = [
   {
       "role": "user",
       "content": [
           {"type": "audio", "audio": "audios/sample_0.wav"},
       ]
   }
]

In [12]:
for frame in frames:
    image, timestamp = frame
    image.save(f"image_{timestamp}.png")
    messages[0]["content"].append({"type": "image", "image": f"image_{timestamp}.png"})

In [13]:
messages[0]["content"].append({"type": "text", "text": "Answer to the question in the audio with the video."})

In [14]:
messages

[{'role': 'user',
  'content': [{'type': 'audio', 'audio': 'audios/sample_0.wav'},
   {'type': 'image', 'image': 'image_0.0.png'},
   {'type': 'image', 'image': 'image_0.73.png'},
   {'type': 'image', 'image': 'image_1.47.png'},
   {'type': 'image', 'image': 'image_2.23.png'},
   {'type': 'image', 'image': 'image_2.97.png'},
   {'type': 'image', 'image': 'image_3.73.png'},
   {'type': 'text',
    'text': 'Answer to the question in the audio with the video.'}]}]

In [15]:
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

Here is the answer to the question in the audio with the video: 

It's a **folding fan**. It appears to be made of paper or similar lightweight material and has a pink color with floral decorations.
