# Video Understanding with Qwen3-VL (Together AI)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Video_Understanding.ipynb)


## Introduction

In this notebook, we'll explore Qwen3-VL's video understanding capabilities using Together AI's API. We'll cover:

1. Video description and summarization
2. Temporal event localization with timestamps
3. Video Q&A

**Note:** Together AI currently accepts video input only as a URL. Passing a list of frames with a custom FPS is not supported through the API.
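
Since only URLs are accepted, it can help to reject local paths before sending a request. The following is a minimal sketch using only the standard library. `is_video_url` is a hypothetical helper name, not part of the Together SDK, and the check is purely syntactic: it does not confirm that the URL is reachable or that it points to a video.

```python
from urllib.parse import urlparse


def is_video_url(value: str) -> bool:
    """Return True if value looks like an http(s) URL with a host part."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# A remote URL passes; a local file path does not.
print(is_video_url("http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"))
print(is_video_url("/local/path/video.mp4"))
```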


### Install required libraries


In [None]:
!pip install together pydantic


In [6]:
import os
import json
import together
from pydantic import BaseModel, Field
from IPython.display import Markdown, display

# Together AI configuration: Together() reads the API key from the TOGETHER_API_KEY environment variable
client = together.Together()

MODEL_ID = "Qwen/Qwen3-VL-32B-Instruct"

print(f"Using model: {MODEL_ID}")
print(f"API Key configured: {bool(os.environ.get('TOGETHER_API_KEY'))}")


Using model: Qwen/Qwen3-VL-32B-Instruct
API Key configured: True
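
If the key is missing, the cell above still runs and prints `False`, so the problem only surfaces on the first request. One option is to fail early instead. This is a sketch under that assumption: `require_api_key` is a hypothetical helper, and the `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "TOGETHER_API_KEY") -> str:
    """Return the API key from env, or raise a clear error if it is unset or empty."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return key


# Example with a stand-in mapping rather than the real environment.
print(require_api_key({"TOGETHER_API_KEY": "example-key"}))
```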


In [8]:
# Define Pydantic schema for temporal event localization
class TemporalEvent(BaseModel):
    start_time: str = Field(description="Start time of the event in mm:ss format")
    end_time: str = Field(description="End time of the event in mm:ss format")
    description: str = Field(description="Brief description of the event")

class VideoEvents(BaseModel):
    events: list[TemporalEvent] = Field(description="A series of localized activity events in the video")

def inference_with_video(video_url, prompt, max_tokens=16000):
    """Run inference with a video URL using Together AI API."""
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}},
            ],
        }],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

def inference_with_video_json_schema(video_url, prompt, schema: type[BaseModel], max_tokens=16000):
    """Run inference with a video URL using Together AI API and a JSON schema."""
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[
            {
                "role": "system",
                "content": f"Analyze the video and respond only in JSON following this schema: {json.dumps(schema.model_json_schema())}",
            },
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        max_tokens=max_tokens,
        response_format={
            "type": "json_schema",
            "schema": schema.model_json_schema(),
        },
    )
    return json.loads(response.choices[0].message.content)
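
`inference_with_video_json_schema` returns a plain `dict`. To get typed objects and catch responses that do not match the schema, the parsed dictionary can be validated against the same Pydantic model. The sketch below assumes Pydantic v2 (which the `model_json_schema()` calls above already imply). It redefines the two models so that it runs on its own, and `parse_events` is a hypothetical helper name.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class TemporalEvent(BaseModel):
    start_time: str = Field(description="Start time of the event in mm:ss format")
    end_time: str = Field(description="End time of the event in mm:ss format")
    description: str = Field(description="Brief description of the event")


class VideoEvents(BaseModel):
    events: list[TemporalEvent] = Field(description="A series of localized activity events in the video")


def parse_events(raw: dict) -> Optional[VideoEvents]:
    """Validate a parsed JSON response; return None if it does not match the schema."""
    try:
        return VideoEvents.model_validate(raw)
    except ValidationError as err:
        print(f"Response did not match the schema: {err.error_count()} error(s)")
        return None


# Stand-in for the dictionary returned by inference_with_video_json_schema.
sample = {
    "events": [
        {
            "start_time": "00:00",
            "end_time": "00:02",
            "description": "A hand plugs a Chromecast device into the HDMI port of a TV.",
        }
    ]
}
parsed = parse_events(sample)
print(parsed.events[0].start_time, parsed.events[0].end_time)
```

Returning `None` on failure keeps the notebook cell from raising. In a pipeline you may prefer to let `ValidationError` propagate.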


## 1. Video Description

Describe the content of a video.


In [9]:
# Example: Describe what's happening in a video
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
prompt = "What's happening in this video? Describe the content in detail."

response = inference_with_video(video_url, prompt)
display(Markdown(response))


This video is a promotional advertisement for **Google Chromecast**, showcasing its ability to stream content from personal devices to a TV. The video follows a series of diverse, relatable scenarios where people use Chromecast to enhance their entertainment experiences.

---

### **Opening Scene (0:00–0:02)**
The video opens with a close-up of a hand plugging a small black device labeled **“chrome”** into an **HDMI port** on the back of a TV. This is the Chromecast device. The shot emphasizes its compact size and ease of setup.

---

### **Scene 1: Family Bonding (0:02–0:05)**
A father and daughter are sitting on a couch. The daughter is holding a tablet, and they appear to be watching something together. The scene transitions to a **tablet screen** showing a cartoon penguin, which then appears **mirrored on the TV**. This demonstrates Chromecast’s ability to stream content from a tablet to a larger screen.

---

### **Scene 2: Roommates Enjoying Media (0:05–0:10)**
Two young men are in a bedroom. One is using a laptop to watch a video of a robot or mech character. The video is then shown playing on the TV, which is placed on a media stand with speakers. This highlights Chromecast’s compatibility with laptops and its use in shared living spaces.

---

### **Scene 3: Sports and Social Viewing (0:10–0:16)**
A person is using a laptop to watch a basketball game. The video is then mirrored on a TV in a living room, where a group of friends are gathered, watching and reacting to the game. This scene emphasizes Chromecast’s utility for social viewing and sports fans.

---

### **Scene 4: Family Movie Night (0:16–0:21)**
A man and a woman are lying on a bed, watching a movie on a tablet. The scene transitions to the same movie playing on a TV. The couple is then shown laughing and enjoying the movie together. This illustrates Chromecast’s role in enhancing family entertainment.

---

### **Scene 5: Music and Gaming (0:21–0:27)**
A man is playing a video game on a laptop, and the game is mirrored on the TV. The scene cuts to a group of friends watching a car racing game on the TV, cheering and laughing. This showcases Chromecast’s ability to stream gaming content.

---

### **Scene 6: Photo Sharing and Nostalgia (0:27–0:32)**
A TV displays a photo slideshow of children and babies. The scene cuts to two men sitting on a couch, one with a laptop, both smiling as they look at the photos. This highlights Chromecast’s use for sharing personal memories.

---

### **Scene 7: Music and Fun (0:32–0:38)**
A man is playing a drum set, and the performance is shown on a TV. The scene cuts to two young men in a bedroom, one holding a guitar, the other on a laptop, both smiling and enjoying the music. This scene shows Chromecast’s use for music streaming and creative expression.

---

### **Scene 8: Dance Party (0:38–0:43)**
A group of people are dancing in a living room, with a TV showing a dance video. The scene cuts to a dog lying on a couch, seemingly “watching” the TV. This adds humor and shows Chromecast’s ability to bring entertainment to the whole household.

---

### **Scene 9: Baby and Parent (0:43–0:48)**
A woman is holding a baby while watching a cartoon on a tablet. The cartoon is mirrored on the TV, where a man and woman are laughing together. This scene emphasizes Chromecast’s family-friendly use.

---

### **Scene 10: Creative and Social Use (0:48–0:53)**
A man is using a tablet to take a photo of a plate of food. The photo is then displayed on a TV, where a group of people are gathered, laughing and enjoying the moment. This demonstrates Chromecast’s use for sharing photos and moments with others.

---

### **Closing Scene (0:53–0:59)**
The video ends with a hand unplugging the Chromecast from the TV. A white screen appears with the text:  
> **“Everything you love, now on your TV.”**  
> **“For $35.”**  
> **“For everyone.”**  
> **“chromecast”**  
> **“google.com/chromecast”**

This reinforces the product’s affordability, simplicity, and universal appeal.

---

### **Overall Theme**
The video effectively communicates that **Google Chromecast** is a simple, affordable device that allows users to stream content from their personal devices (tablets, laptops, smartphones) to their TV, enhancing family time, social gatherings, and personal entertainment. It emphasizes **ease of use, versatility, and accessibility** for people of all ages and lifestyles.

The tone is warm, relatable, and upbeat, focusing on **shared experiences** and **joyful moments** made possible by technology.

## 2. Temporal Event Localization

Localize events in a video with timestamps.


In [14]:
# Example: Temporal event localization
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
prompt = "Extract time and description of every detail of events in the video. For each event, provide the result in JSON format with the start and end timestamp (mm:ss) and a description in the events array."

result = inference_with_video_json_schema(video_url, prompt, VideoEvents)
print(json.dumps(result, indent=2))


{
  "events": [
    {
      "start_time": "00:00",
      "end_time": "00:02",
      "description": "A hand plugs a Chromecast device into the HDMI port of a TV."
    },
    {
      "start_time": "00:02",
      "end_time": "00:04",
      "description": "A father and daughter are sitting on a couch, with the daughter playfully jumping on the father's lap."
    },
    {
      "start_time": "00:04",
      "end_time": "00:06",
      "description": "Two young men are in a bedroom; one is playing a guitar while the other is using a laptop."
    },
    {
      "start_time": "00:06",
      "end_time": "00:10",
      "description": "A person is casting a video of a penguin from a tablet to a TV, with the video appearing on the TV screen."
    },
    {
      "start_time": "00:10",
      "end_time": "00:12",
      "description": "A person is using a laptop to cast a video to a TV, which is shown on the TV screen."
    },
    {
      "start_time": "00:12",
      "end_time": "00:17",
      "descript
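
The `mm:ss` strings are easy to read but awkward to compute with. A small sketch of converting them to seconds and deriving each event's duration is shown below. `mmss_to_seconds` and `event_durations` are hypothetical helper names, and the sample list copies the first two events from the output above.

```python
def mmss_to_seconds(timestamp: str) -> int:
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def event_durations(events: list[dict]) -> list[int]:
    """Return the duration in seconds of each event dictionary."""
    return [
        mmss_to_seconds(event["end_time"]) - mmss_to_seconds(event["start_time"])
        for event in events
    ]


events = [
    {"start_time": "00:00", "end_time": "00:02",
     "description": "A hand plugs a Chromecast device into the HDMI port of a TV."},
    {"start_time": "00:02", "end_time": "00:04",
     "description": "A father and daughter are sitting on a couch, with the daughter playfully jumping on the father's lap."},
]
print(event_durations(events))  # prints [2, 2]
```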

## 3. Video Q&A

Answer questions about video content.


In [15]:
# Example: Video Q&A
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
prompt = "What characters appear in this video? What are they doing?"

response = inference_with_video(video_url, prompt)
display(Markdown(response))


The video features several animated characters in a vibrant, colorful forest setting. The main character is a large, white, fluffy rabbit named Big Buck Bunny, who is initially seen sleeping in a grassy burrow. He wakes up, stretches, and begins to enjoy the morning by sniffing flowers and chasing a butterfly.

As the story progresses, Big Buck Bunny encounters three smaller animals: a squirrel, a fox, and a mouse. These characters appear mischievous and are seen plotting something behind a tree. The squirrel, in particular, becomes the focus of the conflict, as he steals the butterfly that Big Buck Bunny was admiring. This leads to a chase sequence where Big Buck Bunny tries to retrieve the butterfly.

The squirrel then attempts to escape by using a makeshift glider made from leaves and sticks, which he launches from a tree. He flies through the air, narrowly avoiding obstacles like a fence made of pencils, and lands safely on a branch. Big Buck Bunny watches this with a mix of annoyance and amusement.

Throughout the video, the characters interact in a playful and humorous manner, with exaggerated expressions and movements typical of animated shorts. The background is lush and detailed, with green grass, trees, flowers, and a bright blue sky, creating a cheerful and whimsical atmosphere.

The video concludes with the credits rolling, showing the names of the production team and contributors, while the squirrel and the butterfly are seen flying together in the sky, suggesting a resolution to their earlier conflict.

Overall, the video is a lighthearted and entertaining animated short that showcases the adventures of Big Buck Bunny and his forest friends, emphasizing themes of playfulness, mischief, and ultimately, harmony.
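
To ask several questions about the same video, one simple approach is to loop over them and collect the answers. In the sketch below, `ask_questions` is a hypothetical helper that takes the inference function as a parameter, so `inference_with_video` can be passed in directly. A stub stands in for it here so that the example runs without an API call. Each question is sent as a separate request, so the video is processed once per question.

```python
from typing import Callable


def ask_questions(video_url: str,
                  questions: list[str],
                  ask: Callable[[str, str], str]) -> dict[str, str]:
    """Call ask(video_url, question) for each question and map question to answer."""
    return {question: ask(video_url, question) for question in questions}


def stub_ask(video_url: str, prompt: str) -> str:
    """Stand-in for inference_with_video; returns a placeholder instead of calling the API."""
    return f"(answer to: {prompt})"


answers = ask_questions(
    "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
    ["What characters appear in this video?", "Where does the story take place?"],
    stub_ask,
)
for question, answer in answers.items():
    print(question, "->", answer)
```

In the notebook, replacing `stub_ask` with `inference_with_video` would send the real requests.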