# Video Understanding with Qwen3-VL (Together AI)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Video_Understanding.ipynb)


## Introduction

In this notebook, we'll explore Qwen3-VL's video understanding capabilities using Together AI's API. We'll cover:

1. Video description and summarization
2. Temporal event localization with timestamps
3. Video Q&A

**Note:** Together AI currently supports video URLs only. Frame list input with custom FPS is not supported through the API.


### Install required libraries


In [None]:
!pip install openai


In [None]:
import os
import openai
from IPython.display import Markdown, display

# Together AI Configuration
client = openai.OpenAI(
    api_key=os.environ.get("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)

MODEL_ID = "Qwen/Qwen3-VL-32B-Instruct"

print(f"Using model: {MODEL_ID}")
print(f"API Key configured: {bool(os.environ.get('TOGETHER_API_KEY'))}")


In [None]:
def inference_with_video(video_url, prompt, max_tokens=4096):
    """Run inference with a video URL using Together AI API."""
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}},
            ],
        }],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
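
Because the API accepts video URLs only, it can help to check an input before sending a request. The following is a minimal sketch, not part of the Together AI API: `is_remote_video_url` is a hypothetical helper that only tests whether a string is an absolute `http` or `https` URL. It does not verify that the URL is reachable or points to a video.

```python
from urllib.parse import urlparse


def is_remote_video_url(url):
    """Return True if `url` is an absolute http(s) URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Local paths have no scheme; file:// URLs have a scheme we do not accept.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

One way to use it is to call `is_remote_video_url(video_url)` before `inference_with_video` and raise a `ValueError` when it returns `False`, so that a local file path fails early with a clear message.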


## 1. Video Description

Ask the model to describe what happens in a video.


In [None]:
# Example: Describe what's happening in a video
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
prompt = "What's happening in this video? Describe the content in detail."

response = inference_with_video(video_url, prompt)
display(Markdown(response))


## 2. Temporal Event Localization

Localize events in a video with timestamps.


In [None]:
# Example: Temporal event localization
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
prompt = """Localize a series of activity events in the video, output the start and end timestamp for each event, 
and describe each event with sentences. Provide the result in JSON format with 'mm:ss' format for time depiction."""

response = inference_with_video(video_url, prompt)
display(Markdown(response))
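
The prompt above asks for JSON with `mm:ss` timestamps, so the reply can be parsed for downstream use. The sketch below is illustrative and rests on assumptions: the model may wrap its JSON in a Markdown code fence, and the key names it chooses (for example `start_time` and `end_time`) are not guaranteed, so inspect the actual output first. `extract_json` and `mmss_to_seconds` are hypothetical helper names.

```python
import json
import re

# Built programmatically so the triple backtick never appears literally here.
FENCE = "`" * 3


def extract_json(text):
    """Parse the first JSON array or object found in a model reply."""
    # Prefer the contents of a fenced block if the model used one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Take the outermost bracketed span and hand it to the JSON parser.
    match = re.search(r"(\[.*\]|\{.*\})", candidate, re.DOTALL)
    if match is None:
        raise ValueError("no JSON found in model output")
    return json.loads(match.group(1))


def mmss_to_seconds(stamp):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = stamp.strip().split(":")
    return int(minutes) * 60 + int(seconds)
```

For example, `events = extract_json(response)` yields Python objects, and `mmss_to_seconds` turns each timestamp into seconds for sorting or computing durations. `json.loads` raises `json.JSONDecodeError` if the reply is not valid JSON, so wrap the call in a `try` block where that matters.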


## 3. Video Q&A

Ask the model specific questions about a video's content.


In [None]:
# Example: Video Q&A
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
prompt = "What characters appear in this video? What are they doing?"

response = inference_with_video(video_url, prompt)
display(Markdown(response))
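
To ask several questions about the same video, the single call above can be wrapped in a loop. This is a small sketch: `ask_many` is a hypothetical helper, and it takes the inference function as a parameter so that it can be exercised without network access. Each question is sent as a separate request, so the video is processed once per question.

```python
def ask_many(video_url, questions, ask):
    """Return a dict mapping each question to the answer from `ask`.

    `ask` is any callable taking (video_url, prompt) and returning text,
    for example the inference_with_video function defined earlier.
    """
    return {question: ask(video_url, question) for question in questions}
```

A call such as `ask_many(video_url, ["Who appears?", "Where is it set?"], inference_with_video)` returns the answers keyed by question.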
