<a href="https://colab.research.google.com/github/MehediAhamed/vlmrun-cookbook/blob/artifact-autoreload-error-fixed-video-understanding/notebooks/12_orion_video_understanding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Video Understanding, Reasoning and Execution

This cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) video understanding, reasoning, and execution capabilities. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

This notebook uses the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface that exposes visual intelligence through the familiar chat-completions format.

We'll cover the following topics:
 1. Video Uploads (load videos from URLs or local files)
 2. Video Captioning & Summarization (generate descriptions, summaries, and chapters)
 3. Video Frame Sampling (extract frames at specific timestamps or intervals)
 4. Video Trimming (extract specific segments from videos)
 5. Video Highlight Extraction (identify and clip key moments)
 6. Video Duration & Metadata (query video length and quality)
 7. Video Generation (generate a video from a text prompt and an image)

## Prerequisites

- Python 3.10+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
- VLM Run Python Client with OpenAI extra `vlmrun[openai]`

## Setup

First, install the required packages and configure the environment.

In [1]:
# Install required packages
!pip install vlmrun[openai] --upgrade --quiet
!pip install pillow requests numpy --quiet


In [2]:
import os
import getpass
import json
from typing import List, Any
from functools import cached_property

import numpy as np
from PIL import Image
from pydantic import BaseModel, Field

VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass("Enter your VLM Run API key: ")

Enter your VLM Run API key: ··········


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.

In [3]:
from vlmrun.client import VLMRun

client = VLMRun(
    api_key=VLMRUN_API_KEY, base_url="https://agent.vlm.run/v1"
)
print("VLM Run client initialized successfully!")
print(f"Base URL: https://agent.vlm.run/v1")
print(f"Model: vlmrun-orion-1")

VLM Run client initialized successfully!
Base URL: https://agent.vlm.run/v1
Model: vlmrun-orion-1


## Response Models (dtypes)

We define Pydantic models for structured outputs. The frame-sampling models include **cached properties** that download each frame from its URL on first access and return it as a PIL image.

In [4]:
from vlmrun.common.utils import download_image


class VideoUrlResponse(BaseModel):
    """Response model for video URL operations."""
    url: str = Field(..., description="Pre-signed URL to the video")

class VideoUrlListResponse(BaseModel):
    """Response model for video URL list operations."""
    urls: List[VideoUrlResponse] = Field(..., description="List of pre-signed URLs to the videos")


class ParsedVideoResponse(BaseModel):
    """Response model for parsed video content."""

    class VideoChapter(BaseModel):
        """A chapter/segment of a video with timestamps."""
        start_time: str = Field(..., description="Start time of the chapter in HH:MM:SS format")
        end_time: str = Field(..., description="End time of the chapter in HH:MM:SS format")
        description: str = Field(..., description="Description of the chapter content")

    topic: str = Field(..., description="Main topic of the video")
    summary: str = Field(..., description="Summary of the video content")
    chapters: List[VideoChapter] = Field(default_factory=list, description="List of video chapters with timestamps and descriptions")


class VideoFramesResponse(BaseModel):
    """Response model for video frame sampling."""

    class VideoFrame(BaseModel):
        """A single frame extracted from a video."""
        url: str = Field(..., description="URL of the video frame.")
        timestamp: str = Field(..., description="Timestamp of the frame in HH:MM:SS.MS format")

        @cached_property
        def image(self) -> Image.Image | None:
            """Download and return the frame as a PIL Image."""
            return download_image(self.url)

    frames: List[VideoFrame] = Field(..., description="List of extracted frames")

    @cached_property
    def images(self) -> List[Image.Image]:
        """Download and return all frames as PIL Images."""
        return [frame.image for frame in self.frames if frame.image is not None]


class VideoTrimResponse(BaseModel):
    """Response model for video trimming operations."""
    url: str = Field(..., description="URL of the trimmed video")
    start_time: str = Field(..., description="Start time of the trimmed segment")
    end_time: str = Field(..., description="End time of the trimmed segment")


class VideoHighlightsResponse(BaseModel):
    """Response model for video highlight extraction."""

    class VideoHighlight(BaseModel):
        """A highlight segment from a video."""
        start_time: str = Field(..., description="Start time of the highlight in HH:MM:SS.MS format")
        end_time: str = Field(..., description="End time of the highlight in HH:MM:SS.MS format")
        url: str = Field(..., description="URL of the extracted highlight video")
        description: str = Field(default="", description="Description of the highlight")

    highlights: List[VideoHighlight] = Field(..., description="List of extracted highlights")


print("Response models defined successfully!")
print("Models include cached properties for automatic video/image downloading.")

Response models defined successfully!
Models include cached properties for automatic video/image downloading.
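
The caching behaviour these models rely on can be seen in isolation with a plain class. The sketch below is illustrative only: `Frame` and `fake_fetch` are hypothetical stand-ins (not part of the VLM Run SDK), with `fake_fetch` playing the role of `download_image` so the example runs offline.

```python
from functools import cached_property


class Frame:
    """Hypothetical stand-in for a frame model; `fetch` replaces a real downloader."""

    def __init__(self, url, fetch):
        self.url = url
        self._fetch = fetch

    @cached_property
    def image(self):
        # Runs once per instance; the result is then stored on the instance.
        return self._fetch(self.url)


calls = []


def fake_fetch(url):
    # Records each "download" so we can count how often it happens.
    calls.append(url)
    return f"image-bytes-for:{url}"


frame = Frame("https://example.com/frame_0.jpg", fake_fetch)
first = frame.image   # triggers the fetch
second = frame.image  # served from the cached value
print(len(calls))
```

Because the value is cached per instance, a comprehension such as the one in `VideoFramesResponse.images`, which reads `frame.image` twice per frame, should only trigger one download per frame.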


## Helper Functions

We create helper functions to simplify making chat completion requests with structured outputs.

In [5]:
import hashlib
import cachetools
from typing import Type, TypeVar
from pathlib import Path
from IPython.display import HTML, display
from vlmrun.common.image import encode_image


T = TypeVar('T', bound=BaseModel)


def display_videos(urls: str | list[str], texts: str | list[str] | None = None, width: int = 600) -> None:
    """Display one or more videos from URLs in the notebook, with optional captions."""
    if isinstance(urls, str):
        urls = [urls]
    if texts is None:
        texts = [None] * len(urls)
    elif isinstance(texts, str):
        texts = [texts]
    elif len(texts) != len(urls):
        raise ValueError("`texts` must be a list of the same length as `urls`")
    html = ""
    for url, text in zip(urls, texts):
        html += f"<div style='display:inline-block; margin:5px; text-align:center'>"
        html += f"<video width='{width}' controls>"
        html += f"<source src='{url}' type='video/mp4'>"
        html += "Your browser does not support the video tag."
        html += "</video>"
        if text:
            html += f"<div style='font-size:12px; color:#f0f0f0; margin-top:5px'>{text}</div>"
        html += "</div>"
    return display(HTML(f"<div style='display:flex; flex-wrap:wrap'>{html}</div>"))


def display_images(images: Image.Image | list[Image.Image], texts: list[str] | None = None, width: int = 300):
    """Display images with optional captions."""
    if isinstance(images, Image.Image):
        images = [images]
    if texts is None:
        texts = [None] * len(images)
    elif isinstance(texts, str):
        texts = [texts]
    elif len(texts) != len(images):
        raise ValueError("`texts` must be a list of the same length as `images`")

    imgs_html = ""
    for image, text in zip(images, texts):
        W, H = image.size
        if W > width:
            H = int(H * width / W)
            W = width
            image = image.resize((W, H))
        im_bytes = encode_image(image, format="JPEG")
        imgs_html += f"<div style='display:inline-block; margin:5px; text-align:center'>"
        imgs_html += f"<img src='{im_bytes}' style='width:{width}px; border-radius:6px'>"
        if text:
            imgs_html += f"<div style='font-size:12px; color:#f0f0f0; margin-top:5px'>{text}</div>"
        imgs_html += f"</div>"
    return display(HTML(f"<div style='display:flex; flex-wrap:wrap'>{imgs_html}</div>"))


def custom_key(prompt: str, images: list[str] | list[Image.Image] | None = None, videos: list[str] | None = None, response_model: Type[T] | None = None, model: str = "vlmrun-orion-1:auto"):
    """Custom key for caching chat_completion."""
    image_keys = []
    for image in images or []:  # tolerate images=None (the default)
        if isinstance(image, Image.Image):
            thumb = image.copy()
            thumb.thumbnail((128, 128))
            encoded = encode_image(thumb, format="JPEG")
            image_keys.append(encoded)
        elif isinstance(image, str):
            image_keys.append(image)
    image_keys = tuple(image_keys)
    video_keys = tuple(videos) if videos else ()
    response_key = hashlib.sha256(json.dumps(response_model.model_json_schema(), sort_keys=True).encode()).hexdigest() if response_model else ""
    return (prompt, image_keys, video_keys, response_key, model)
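
`custom_key` is defined above but not wired to a cache in this notebook. As a minimal sketch of how such a key could be used, the example below memoizes a request function in a plain dictionary. Everything here is an assumption made for illustration: `make_key`, `memoize_requests`, and `fake_completion` are hypothetical names, and the real helper would call the API instead of returning a canned string.

```python
import hashlib
import json


def make_key(prompt, videos=None, schema=None, model="vlmrun-orion-1:auto"):
    """Build a hashable cache key from the request inputs (hypothetical helper)."""
    video_keys = tuple(videos or ())
    schema_key = (
        hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()
        if schema
        else ""
    )
    return (prompt, video_keys, schema_key, model)


def memoize_requests(fn):
    """Cache results of `fn` in a plain dict keyed by `make_key`."""
    cache = {}

    def wrapper(prompt, videos=None, schema=None, model="vlmrun-orion-1:auto"):
        key = make_key(prompt, videos, schema, model)
        if key not in cache:
            cache[key] = fn(prompt, videos=videos, schema=schema, model=model)
        return cache[key]

    wrapper.cache = cache
    return wrapper


calls = []


@memoize_requests
def fake_completion(prompt, videos=None, schema=None, model="vlmrun-orion-1:auto"):
    # Stand-in for a real API call so the sketch runs offline.
    calls.append(prompt)
    return f"response to: {prompt}"


a = fake_completion("Describe the video", videos=["https://example.com/a.mp4"])
b = fake_completion("Describe the video", videos=["https://example.com/a.mp4"])
c = fake_completion("Describe the video", videos=["https://example.com/b.mp4"])
print(len(calls))  # the repeated request is served from the cache
```

Hashing the JSON schema keeps the key small and hashable while still distinguishing requests that differ only in their response model. The `cachetools.cached` decorator also accepts a `key` function, which is presumably what `custom_key` was written for.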


In [6]:
from typing import Type, TypeVar, Any
from pathlib import Path
from PIL import Image
import uuid

T = TypeVar("T", bound=BaseModel)  # response models must be Pydantic models


# ---------- Artifact Resolver (RUNTIME SAFE) ----------

def is_artifact_ref(obj) -> bool:
    """Return True if `obj` is a string that looks like an artifact reference."""
    return isinstance(obj, str) and obj.startswith(("img_", "vid_", "url_"))


def resolve_artifacts(obj, client, session_id):
    """Recursively replace artifact references in `obj` with the fetched artifacts."""
    if is_artifact_ref(obj):
        return client.artifacts.get(
            session_id=session_id,
            object_id=obj
        )

    if isinstance(obj, list):
        return [resolve_artifacts(x, client, session_id) for x in obj]

    if hasattr(obj, "__dict__"):
        for field, value in obj.__dict__.items():
            setattr(obj, field, resolve_artifacts(value, client, session_id))
        return obj

    return obj


# ---------- Chat Completion Helper ----------

def chat_completion(
    prompt: str,
    images: list[str] | list[Image.Image] | None = None,
    videos: list[str] | list[Path] | None = None,
    response_model: Type[T] | None = None,
    model: str = "vlmrun-orion-1:auto"
) -> Any:
    """Send a chat completion request; parse the reply into `response_model` if one is given."""

    session_id = str(uuid.uuid4())

    content = [{"type": "text", "text": prompt}]

    if images:
        for image in images:
            if isinstance(image, Image.Image):
                encoded = encode_image(image, format="JPEG")
                content.append({"type": "image", "image": encoded})
            elif isinstance(image, str):
                content.append({"type": "image_url", "image_url": {"url": image}})
            else:
                raise ValueError(f"Invalid image type: {type(image)}")

    if videos:
        for video in videos:
            if isinstance(video, Path):
                file = client.files.upload(file=video, purpose="assistants")
                content.append({"type": "input_file", "file_id": file.id})
            elif isinstance(video, str):
                content.append({"type": "video_url", "video_url": {"url": video}})
            else:
                raise ValueError(f"Invalid video type: {type(video)}")

    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "extra_body": {"session_id": session_id}
    }

    if response_model:
        kwargs["response_format"] = {
            "type": "json_schema",
            "schema": response_model.model_json_schema()
        }

    response = client.agent.completions.create(**kwargs)

    response_text = response.choices[0].message.content

    if not response_model:
        return response_text

    parsed = response_model.model_validate_json(response_text)

    # Resolve artifact references (e.g. "vid_...") into the actual artifacts
    parsed = resolve_artifacts(
        parsed,
        client=client,
        session_id=response.session_id
    )

    return parsed

print("Helper functions defined!")

Helper functions defined!
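
To make the artifact-resolution step easier to follow, here is a small offline sketch of the same recursive idea. It is a variant rather than the helper above: it walks plain dicts and lists (for example the output of `model_dump()`), and `lookup` is a hypothetical callable standing in for `client.artifacts.get(session_id=..., object_id=...)`.

```python
ARTIFACT_PREFIXES = ("img_", "vid_", "url_")


def is_ref(value):
    """True for strings that look like artifact references."""
    return isinstance(value, str) and value.startswith(ARTIFACT_PREFIXES)


def resolve(value, lookup):
    """Recursively replace artifact references using `lookup(ref)`."""
    if is_ref(value):
        return lookup(value)
    if isinstance(value, list):
        return [resolve(item, lookup) for item in value]
    if isinstance(value, dict):
        return {key: resolve(item, lookup) for key, item in value.items()}
    return value


# Stub store standing in for the artifacts endpoint (assumption for this sketch).
fake_store = {"vid_b0a964": "https://example.com/vid_b0a964.mp4"}
payload = {
    "url": "vid_b0a964",
    "start_time": "00:30",
    "clips": ["vid_b0a964", "plain text"],
}
resolved = resolve(payload, fake_store.__getitem__)
print(resolved["url"])
```

Note that a prefix test is a heuristic: any ordinary string that happens to start with `img_`, `vid_`, or `url_` would also be treated as a reference.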


## Video Understanding, Reasoning, and Execution Capabilities

VLM Run agents can perform a wide range of video processing tasks including captioning, summarization, frame extraction, trimming, and more.

### 1. Video Uploads

With the VLM Run Agent API, you can pass videos to chat completions either by URL or by uploading a local file.

The `chat_completion` helper above handles both cases with logic along these lines:

```python
for video_url in videos:
    if isinstance(video_url, Path):
        file = client.files.upload(file=video_url, purpose="assistants")
        content.append({"type": "input_file", "file_id": file.id})
    elif isinstance(video_url, str):
        assert video_url.startswith("http"), "Video URLs must start with http or https"
        content.append({"type": "video_url", "video_url": {"url": video_url}})
```
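
Since the helper branches on the Python type of each entry, a bare string such as `"clip.mp4"` would be sent as a URL rather than uploaded. The sketch below shows one way to validate inputs before building the request. `classify_video_source` is a hypothetical name introduced for this example and is not part of the SDK.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_video_source(video):
    """Return "upload" for local paths and "url" for http(s) URLs."""
    if isinstance(video, Path):
        return "upload"
    if isinstance(video, str):
        parsed = urlparse(video)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return "url"
        raise ValueError(f"Not an http(s) URL: {video!r}")
    raise TypeError(f"Unsupported video type: {type(video).__name__}")


print(classify_video_source(Path("clip.mp4")))
print(classify_video_source("https://example.com/clip.mp4"))
```

Wrapping local file names in `Path(...)` keeps the two cases unambiguous.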


Let's look at a simple video below:

In [7]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"

print(">> VIDEO")
display_videos(VIDEO_URL, width=600)

>> VIDEO


### 2. Video Captioning & Summarization

Generate detailed captions, summaries, and chapter breakdowns for videos. The agent analyzes both visual and audio content to provide comprehensive descriptions.

### 2a. Simple Video Description

Get a quick, natural language description of a video without structured output.

In [8]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"

result = chat_completion(
    prompt="Describe what happens in this video in 2-3 sentences.",
    images=[],
    videos=[VIDEO_URL],
)

print(">> RESPONSE")
print(result)

print("\n>> VIDEO")
display_videos(VIDEO_URL, texts=result, width=600)

>> RESPONSE
The video tells the story of a family bakery, tracing its history through generations with archival photos and interviews. It highlights the bakery's connection to its community, the challenges it faced like a fire and economic downturn, and the family's determination to continue their legacy. The narrative emphasizes the deep roots and passion involved in running the business.

>> VIDEO


### 2b. Structured Video Understanding

Parse a video and get a detailed summary with topic, summary, and chapter breakdowns.

In [9]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"

result = chat_completion(
    prompt="Parse this video and provide a detailed summary with topic, summary, and chapter breakdowns.",
    images=[],
    videos=[VIDEO_URL],
    response_model=ParsedVideoResponse
)

print(">> RESPONSE")
print(result.model_dump_json(indent=2))

md_str = ""
md_str += f"Topic: {result.topic}\n"
md_str += f"\nSummary: {result.summary}\n"
md_str += f"\nChapters ({len(result.chapters)} total):\n"
for i, chapter in enumerate(result.chapters, 1):
    md_str += f"  {i:02d}. [{chapter.start_time} - {chapter.end_time}] {chapter.description}\n"

print("\n>> VIDEO")
display_videos(VIDEO_URL, width=600)
print(md_str)

>> RESPONSE
{
  "topic": "Jenny Lee Bakery: A Family Legacy",
  "summary": "The video chronicles the history, challenges, and resilience of the Jenny Lee Bakery, a multi-generational family business in McKees Rocks, Pennsylvania. It highlights the family's deep connection to the community, the impact of significant setbacks like a fire and economic recession, and their efforts to adapt and continue the legacy. The narrative covers the bakery's origins in 1941, personal anecdotes from family members, the devastating fire that impacted the business, and the subsequent economic recession. It concludes with Scott Baker's strategic shift to focus on in-house production and direct sales to stores for the future of the family business.",
  "chapters": [
    {
      "start_time": "00:00",
      "end_time": "00:06",
      "description": "Establishing shot of McKees Rocks, Pennsylvania, featuring a prominent bridge and the town's landscape."
    },
    {
      "start_time": "00:06",
      "end_t

Topic: Jenny Lee Bakery: A Family Legacy

Summary: The video chronicles the history, challenges, and resilience of the Jenny Lee Bakery, a multi-generational family business in McKees Rocks, Pennsylvania. It highlights the family's deep connection to the community, the impact of significant setbacks like a fire and economic recession, and their efforts to adapt and continue the legacy. The narrative covers the bakery's origins in 1941, personal anecdotes from family members, the devastating fire that impacted the business, and the subsequent economic recession. It concludes with Scott Baker's strategic shift to focus on in-house production and direct sales to stores for the future of the family business.

Chapters (17 total):
  01. [00:00 - 00:06] Establishing shot of McKees Rocks, Pennsylvania, featuring a prominent bridge and the town's landscape.
  02. [00:06 - 00:15] Scott Baker introduces himself and his family's generational connection to the community. He mentions his grandfathe
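
The chapter timestamps in this run came back as `MM:SS` even though the field description asks for `HH:MM:SS`, so downstream code may want a tolerant parser. The following is a minimal sketch under that assumption. `timestamp_to_seconds` is a hypothetical helper, and the two chapters are copied from the output above.

```python
def timestamp_to_seconds(ts: str) -> float:
    """Convert "SS", "MM:SS", or "HH:MM:SS(.mmm)" into seconds."""
    parts = ts.strip().split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unrecognized timestamp: {ts!r}")
    seconds = 0.0
    for part in parts:
        # Each colon-separated field is worth 60x the one after it.
        seconds = seconds * 60 + float(part)
    return seconds


chapters = [("00:00", "00:06"), ("00:06", "00:15")]
durations = [
    timestamp_to_seconds(end) - timestamp_to_seconds(start)
    for start, end in chapters
]
print(durations)  # → [6.0, 9.0]
```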

### 3. Video Frame Sampling

Extract frames from videos at specific timestamps or regular intervals. This is useful for thumbnail generation, video analysis, and content indexing.

In [11]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"

result = chat_completion(
    prompt=f"Given the chapter details from the video, sample a frame from every 4 chapters and return the frame URLs with timestamps. <summary>{md_str}</summary>.",
    images=[],
    videos=[VIDEO_URL],
    response_model=VideoFramesResponse
)

print(">> RESPONSE")
print(f"Extracted {len(result.frames)} frames:")
for frame in result.frames:
    print(f"  - ts={frame.timestamp}, image: {frame.image}...")

print("\n>> FRAMES")
display_images(result.images, texts=[f"ts={f.timestamp}" for f in result.frames], width=250)

>> RESPONSE
Extracted 4 frames:
  - ts=00:00:19.000, image: <PIL.Image.Image image mode=RGB size=1280x720 at 0x7B2BB3233590>...
  - ts=00:01:01.000, image: <PIL.Image.Image image mode=RGB size=1280x720 at 0x7B2BB21E82C0>...
  - ts=00:01:37.000, image: <PIL.Image.Image image mode=RGB size=1280x720 at 0x7B2BB2176180>...
  - ts=00:02:10.000, image: <PIL.Image.Image image mode=RGB size=1280x720 at 0x7B2BB1F5E0F0>...

>> FRAMES
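
For sampling at regular intervals rather than by chapter, the timestamps can be generated locally and embedded in the prompt. The sketch below assumes the `HH:MM:SS.mmm` format seen in the response above. `format_timestamp` and `interval_timestamps` are hypothetical helpers, and whether the agent samples exactly at the requested times is not guaranteed here.

```python
def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def interval_timestamps(duration: float, step: float) -> list:
    """Timestamps every `step` seconds from 0 up to, but excluding, `duration`."""
    if step <= 0:
        raise ValueError("step must be positive")
    stamps = []
    i = 0
    while i * step < duration:
        stamps.append(format_timestamp(i * step))
        i += 1
    return stamps


stamps = interval_timestamps(duration=19, step=5)
prompt = "Sample one frame at each of these timestamps: " + ", ".join(stamps)
print(prompt)
```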


### 4. Video Trimming

Extract specific segments from videos by specifying start and end times. This is useful for creating clips, building highlights, or removing unwanted portions.

In [12]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"

result = chat_completion(
    prompt="Trim this video from 00:30 to 00:45 seconds and return the trimmed pre-signed video URL.",
    images=[],
    videos=[VIDEO_URL],
    response_model=VideoTrimResponse
)

print(">> RESPONSE")
print(f"Trimmed video URL: {result.url}")
print(f"Start time: {result.start_time}")
print(f"End time: {result.end_time}")

print("\n>> TRIMMED VIDEO")
display_videos(result.url, width=600)

>> RESPONSE
Trimmed video URL: https://storage.googleapis.com/vlm-userdata-prod/agents/artifacts/cf4592b6-e159-45e4-b51e-ce338d93f666/de6bed07-18ed-48cf-92fa-b4f380d314e7/vid_b0a964.mp4?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T150311Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=1db24a8507f61256a0253bf53583dc4d0783fe2caf804d3d9351e15782c3db6f0abe35e7fdd265dcfa90c7e1928835980f5112b94951fc44ef228a1814d0631eff37c8185b814a43b90b07527e708c3db793d66a27142ebab7d236e9bbbd2f5687883d38d84b5e2e0930ff1d5a758edcdf7cbf57ea8599510cf795bd7b5215ffe277226d60cfacc6db1701c6ff5a5b1d2dec4c3ac9171752f42aaf3924c68b1965d77d64f76a7472f0d38dc831b9b96865e7981d8da42a87afa8f2a3b9ef9c81c5186de7e1ab7e330b1ae5cce833c8407b236db249167e29ec02fb06724cbcf46c7b6580f63696f2de31c0c5b32089e500b57618b1b22cc141273bd348c526a1
Start time: 00:30
End time: 00:45

>> TRIMMED VIDEO
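
The returned link is a signed Google Cloud Storage URL, so it stops working once its lifetime ends. `X-Goog-Date` holds the signing time and `X-Goog-Expires` the lifetime in seconds (604800 seconds is 7 days). A minimal sketch for computing the expiry follows. `signed_url_expiry` is a hypothetical helper, and the example URL is shortened to the two relevant parameters.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url: str) -> Optional[datetime]:
    """Return the UTC expiry of a signed URL, or None if it is not signed."""
    params = parse_qs(urlparse(url).query)
    if "X-Goog-Date" not in params or "X-Goog-Expires" not in params:
        return None
    signed_at = datetime.strptime(params["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(params["X-Goog-Expires"][0]))


example = (
    "https://example.com/clip.mp4"
    "?X-Goog-Date=20251219T150311Z&X-Goog-Expires=604800"
)
expiry = signed_url_expiry(example)
print(expiry.isoformat())  # → 2025-12-26T15:03:11+00:00
```

If a clip needs to outlive that window, download it and store it yourself rather than keeping the link.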


### 5. Video Highlight Extraction

Automatically identify and extract the most interesting or important moments from a video. The agent analyzes the content to find key scenes.

In [13]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4"

result = chat_completion(
    prompt="Extract the 3 best/most interesting moments from this video as separate clips with timestamps and descriptions.",
    images=[],
    videos=[VIDEO_URL],
    response_model=VideoHighlightsResponse
)

print(">> RESPONSE")
print(f"Extracted {len(result.highlights)} highlights:")
for i, highlight in enumerate(result.highlights, 1):
    print(f"  {i:02d}. [{highlight.start_time} - {highlight.end_time}] {highlight.description or ''}")

>> RESPONSE
Extracted 3 highlights:
  01. [01:37 - 01:42] This clip features dramatic footage of heavy machinery demolishing the old brick building that once housed the bakery. The excavator tears into the structure, sending bricks and dust flying, visually symbolizing the end of an era and the physical destruction of a long-standing establishment.
  02. [01:11 - 01:28] The video shows a newspaper headline announcing "Fire damages bakery complex," followed by Scott Baker recounting the devastating fire and the look of despair on his father's face. This moment captures a significant turning point and the emotional toll of a major setback for the family business.
  03. [01:58 - 02:05] This segment presents a series of historical black and white photographs. It showcases early bakery operations, including horse-drawn delivery wagons with "M.A. Baker's Sons Holsom Bread" branding, and a group portrait of early family members. This montage effectively conveys the deep roots and multi-genera
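
The highlights above are not returned in chronological order. If playback order matters, they can be sorted locally by start time. This is a minimal sketch using plain dicts with the timestamps from the output above, and `to_seconds` is a hypothetical helper.

```python
def to_seconds(ts: str) -> float:
    """Convert "MM:SS" or "HH:MM:SS(.mmm)" into seconds."""
    seconds = 0.0
    for part in ts.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


highlights = [
    {"start_time": "01:37", "end_time": "01:42"},
    {"start_time": "01:11", "end_time": "01:28"},
    {"start_time": "01:58", "end_time": "02:05"},
]

# Order clips by when they start, then total up their running time.
ordered = sorted(highlights, key=lambda h: to_seconds(h["start_time"]))
total = sum(to_seconds(h["end_time"]) - to_seconds(h["start_time"]) for h in ordered)

print([h["start_time"] for h in ordered])  # → ['01:11', '01:37', '01:58']
print(total)  # → 29.0
```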

### 6. Video Duration & Metadata

Get information about video duration and other metadata.

In [14]:
VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.agent/soccer_ball_juggling.mp4"

result = chat_completion(
    prompt="How long is this video in minutes and seconds? Also describe the video resolution and quality if you can determine it.",
    images=[],
    videos=[VIDEO_URL],
)

display_videos(VIDEO_URL, width=600)
print(">> RESPONSE")
print(result)

>> RESPONSE
The video is 19 seconds long. The resolution cannot be determined from the provided information, but the video appears to be of good quality, with good sharpness, detail, and natural colors.


### 7. Video Generation

Generate videos from text descriptions and image inputs. The agent can create short video clips based on your prompts.

In [15]:
result = chat_completion(
    prompt="Generate a powerful paint explosion video effect of this logo in an empty room, spreading its colors outwards onto the white walls. Return the proper presigned URL with https at starting.",
    images=["https://raw.githubusercontent.com/vlm-run/.github/main/profile/assets/vlm-blue.png"],
    videos=None,
    response_model=VideoUrlListResponse
)

print(">> RESPONSE")
print("Generated video URLs")
print(result.model_dump_json(indent=2))

print("\n>> GENERATED VIDEO")
display_videos([f.url for f in result.urls], width=600)

>> RESPONSE
Generated video URLs
{
  "urls": [
    {
      "url": "https://storage.googleapis.com/vlm-userdata-prod/agents/cache/511aee28-63ea-45e3-9364-5ca5f4d5ca63/8425069056913061644/sample_0.mp4?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=vlm-deployments%40vlm-infra-prod.iam.gserviceaccount.com%2F20251219%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20251219T150555Z&X-Goog-Expires=604800&X-Goog-SignedHeaders=host&X-Goog-Signature=5ec1123798364fe78f9f3702e072196eeb59f98e6768294cd979a7836dd4e3e88083d8188641edaaef8a0ef33dd94e23798c7be6308168cbcbe5c10730a41d5154ee86728971b2eace9746aeafb78b0d3d6e0edd1762c72802a1eaf11d0228c4d1e0df3c6d7539a1af6714887211e4dd27d46e96b1cec8602c5903d823f6ef1aef2e334e98c009274d832d69eac878a66f266ffbc45640463addb3b0fe2287e7f21a88e73542628a61686eb598c75be796bf77c4b520b86d2b35e802bc05fa6f1208abe1385e4f07fd852a849036f1b87f22ef2ba6ec8b2ec210edd253896327f4760f93340bd562c868719fd460557aa9db14bb2814ed5b9c2e159718dc8591"
    }
  ]
}

>> GENERATED VIDEO


---

## Conclusion

This cookbook demonstrated the comprehensive video understanding capabilities of the **VLM Run Orion Agent API**.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a Pydantic model as the `response_model` argument of the `chat_completion` helper to get validated, typed responses parsed from the model's JSON output.
3. **Video Processing**: Support for video loading, captioning, summarization, frame extraction, trimming, and highlight detection.
4. **Video Generation**: Create short videos from text prompts, optionally guided by an input image.
5. **Streaming Support**: For long-running tasks, streaming lets you receive partial results as they become available. Streaming is not exercised in this notebook's cells.
6. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.
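
Streaming is not exercised in the cells above. As a rough sketch of how a streamed reply could be consumed, the helper below joins text deltas from OpenAI-style chunks. It assumes the endpoint yields chunks shaped like `chunk.choices[0].delta.content` when streaming is enabled, which has not been verified here. `collect_stream` and `fake_chunk` are hypothetical names, and the stream is simulated so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_delta=None):
    """Join text deltas from OpenAI-style streaming chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_delta:
                on_delta(delta)  # e.g. print partial text as it arrives
    return "".join(parts)


def fake_chunk(text):
    # Mimics the chunk.choices[0].delta.content shape of OpenAI's streaming API.
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


stream = [fake_chunk("The video "), fake_chunk(None), fake_chunk("is 19 seconds long.")]
seen = []
text = collect_stream(stream, on_delta=seen.append)
print(text)  # → The video is 19 seconds long.
```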

### Video Capabilities Summary

| Capability | Description |
|------------|-------------|
| **Captioning** | Generate detailed captions and summaries with chapter breakdowns |
| **Frame Sampling** | Extract frames at specific timestamps or intervals |
| **Trimming** | Cut videos to specific time ranges |
| **Highlight Extraction** | Automatically identify and extract key moments |
| **Video Generation** | Create videos from text descriptions |
| **Watermarking (coming soon)** | Add overlays and watermarks to videos |
| **YouTube Support (coming soon)** | Load and analyze YouTube videos directly |

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Check out the [Video Capabilities Guide](https://docs.vlm.run/agents/capabilities/video) for advanced features
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)

Happy building!