<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a> | <a href="https://chat.vlm.run"><b>Chat</b></a>
</p>
</div>

# VLM Run Orion - Video Understanding, Reasoning and Execution (Node.js)

This comprehensive cookbook demonstrates [VLM Run Orion's](https://vlm.run/orion) video understanding, reasoning and execution capabilities using **Node.js/TypeScript**. For more details on the API, see the [Agent API docs](https://docs.vlm.run/agents/introduction).

This notebook uses the **VLM Run Agent Chat Completions API**, an OpenAI-compatible interface for building visual intelligence applications with the familiar chat-completions format.

We'll cover the following topics:
 1. Video Uploads (pass videos by URL or upload local files)
 2. Video Captioning & Summarization (generate detailed captions, summaries, and chapters)
 3. Video Frame Sampling (extract frames at specific timestamps or intervals)
 4. Video Trimming (extract specific segments from videos)
 5. Video Highlight Extraction (identify and extract key moments)
 6. Video Duration & Metadata (query duration, resolution, and quality)
 7. Video Generation (generate videos from a text prompt and an input image)
 8. Streaming Responses (for long-running video tasks)

## Prerequisites

- Node.js 18+
- VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))
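- Deno or tslab kernel for running TypeScript in Jupyter. The cells below are written for the Deno kernel (`npm:` import specifiers and `Deno.env.get`); under tslab, import from the bare package names (for example `"vlmrun"`) and read the key from `process.env.VLMRUN_API_KEY` instead.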


## Setup

First, install the required packages and configure the environment.


// Install the VLM Run SDK
// npm install vlmrun zod zod-to-json-schema

// If using Deno kernel, install dependencies via npm specifiers
// For tslab, run: npm install vlmrun zod zod-to-json-schema in your project directory


In [None]:
// Import the VLM Run SDK and dependencies
import { VlmRun } from "npm:vlmrun";
import { z } from "npm:zod";
import { zodToJsonSchema } from "npm:zod-to-json-schema";


In [None]:
// Get API key from environment variable
const VLMRUN_API_KEY = Deno.env.get("VLMRUN_API_KEY");

if (!VLMRUN_API_KEY) {
    throw new Error("Please set the VLMRUN_API_KEY environment variable");
}

console.log("✓ API Key loaded successfully");


## Initialize the VLM Run Client

We use the OpenAI-compatible chat completions interface through the VLM Run SDK.


In [None]:
// Initialize the VLM Run client using the SDK
const client = new VlmRun({
    apiKey: VLMRUN_API_KEY,
    baseURL: "https://agent.vlm.run/v1"  // Use the agent API endpoint
});

console.log("✓ VLM Run client initialized successfully!");
console.log("Base URL: https://agent.vlm.run/v1");
console.log("Model: vlmrun-orion-1:auto");


## Response Models (Schemas)

We define Zod schemas for structured outputs. These schemas provide type-safe, validated responses.


In [None]:
// Video URL Response Schema
const VideoUrlResponseSchema = z.object({
    url: z.string().describe("Pre-signed URL to the video")
});

type VideoUrlResponse = z.infer<typeof VideoUrlResponseSchema>;

// Video URL List Response Schema
const VideoUrlListResponseSchema = z.object({
    urls: z.array(VideoUrlResponseSchema).describe("List of pre-signed URLs to the videos")
});

type VideoUrlListResponse = z.infer<typeof VideoUrlListResponseSchema>;

// Video Chapter Schema
const VideoChapterSchema = z.object({
    start_time: z.string().describe("Start time of the chapter in HH:MM:SS format"),
    end_time: z.string().describe("End time of the chapter in HH:MM:SS format"),
    description: z.string().describe("Description of the chapter content")
});

// Parsed Video Response Schema
const ParsedVideoResponseSchema = z.object({
    topic: z.string().describe("Main topic of the video"),
    summary: z.string().describe("Summary of the video content"),
    chapters: z.array(VideoChapterSchema).default([]).describe("List of video chapters with timestamps and descriptions")
});

type ParsedVideoResponse = z.infer<typeof ParsedVideoResponseSchema>;

// Video Frame Schema
const VideoFrameSchema = z.object({
    url: z.string().describe("URL of the video frame"),
    timestamp: z.string().describe("Timestamp of the frame in HH:MM:SS.MS format")
});

// Video Frames Response Schema
const VideoFramesResponseSchema = z.object({
    frames: z.array(VideoFrameSchema).describe("List of extracted frames")
});

type VideoFramesResponse = z.infer<typeof VideoFramesResponseSchema>;

// Video Trim Response Schema
const VideoTrimResponseSchema = z.object({
    url: z.string().describe("URL of the trimmed video"),
    start_time: z.string().describe("Start time of the trimmed segment"),
    end_time: z.string().describe("End time of the trimmed segment")
});

type VideoTrimResponse = z.infer<typeof VideoTrimResponseSchema>;

// Video Highlight Schema
const VideoHighlightSchema = z.object({
    start_time: z.string().describe("Start time of the highlight in HH:MM:SS.MS format"),
    end_time: z.string().describe("End time of the highlight in HH:MM:SS.MS format"),
    url: z.string().describe("URL of the extracted highlight video"),
    description: z.string().default("").describe("Description of the highlight")
});

// Video Highlights Response Schema
const VideoHighlightsResponseSchema = z.object({
    highlights: z.array(VideoHighlightSchema).describe("List of extracted highlights")
});

type VideoHighlightsResponse = z.infer<typeof VideoHighlightsResponseSchema>;

console.log("✓ Response schemas defined successfully!");
console.log("Schemas include type-safe validation for structured outputs.");
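The schemas above return timestamps as strings (`HH:MM:SS` or `HH:MM:SS.MS`). To sort or compare them, it can help to convert them to seconds first. The following is a minimal sketch: `timestampToSeconds` is a hypothetical helper, not part of the VLM Run SDK, and it assumes the agent follows the formats named in the schema descriptions (it also tolerates `MM:SS`).

```typescript
// Hypothetical helper (not part of the vlmrun SDK): convert "HH:MM:SS",
// "HH:MM:SS.MS", or "MM:SS" into a number of seconds.
function timestampToSeconds(timestamp: string): number {
    const parts = timestamp.trim().split(":");
    if (parts.length < 2 || parts.length > 3) {
        throw new Error(`Unexpected timestamp format: "${timestamp}"`);
    }
    let seconds = 0;
    for (const part of parts) {
        const value = Number(part);
        // Number("") is 0, so empty components are rejected explicitly.
        if (part === "" || Number.isNaN(value) || value < 0) {
            throw new Error(`Unexpected timestamp format: "${timestamp}"`);
        }
        // Each component is worth 60x the one to its right.
        seconds = seconds * 60 + value;
    }
    return seconds;
}
```

For example, `timestampToSeconds("00:01:30")` returns `90`, and `timestampToSeconds("00:00:05.5")` returns `5.5`.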


In [None]:
/**
 * Make a chat completion request with optional videos and structured output.
 * 
 * @param prompt - The text prompt/instruction
 * @param videos - Optional list of videos to process (URLs or file paths)
 * @param images - Optional list of images to process (URLs)
 * @param responseSchema - Optional Zod schema for structured output
 * @param model - Model to use (default: vlmrun-orion-1:auto)
 * @returns Parsed response if responseSchema provided, else raw response text
 */
async function chatCompletion<T>(
    prompt: string,
    videos?: string[],
    images?: string[],
    responseSchema?: z.ZodSchema<T>,
    model: string = "vlmrun-orion-1:auto"
): Promise<T | string> {
    const content: any[] = [];
    content.push({ type: "text", text: prompt });

    // Add images if provided
    if (images) {
        for (const image of images) {
            if (typeof image === "string") {
                if (!image.startsWith("http")) {
                    throw new Error("Image URLs must start with http or https");
                }
                content.push({
                    type: "image_url",
                    image_url: { url: image, detail: "auto" }
                });
            }
        }
    }

    // Add videos if provided
    if (videos) {
        for (const video of videos) {
            if (typeof video === "string") {
                if (video.startsWith("http")) {
                    // Video URL
                    content.push({
                        type: "video_url",
                        video_url: { url: video }
                    });
                } else {
                    // Local file path - upload first
                    const file = await client.files.upload({
                        filePath: video,
                        purpose: "assistants"
                    });
                    content.push({
                        type: "input_file",
                        file_id: file.id
                    });
                }
            }
        }
    }

    const kwargs: any = {
        model: model,
        messages: [{ role: "user", content: content }]
    };

    if (responseSchema) {
        kwargs.response_format = {
            type: "json_schema",
            schema: zodToJsonSchema(responseSchema)
        } as any;
    }

    const response = await client.agent.completions.create(kwargs);
    const responseText = response.choices[0].message.content || "";

    if (responseSchema) {
        const parsed = JSON.parse(responseText);
        return responseSchema.parse(parsed) as T;
    }

    return responseText;
}

console.log("✓ Helper function defined!");
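The helper above calls `JSON.parse` directly on the message content, which throws if the content is not bare JSON. As a defensive measure, the sketch below strips an optional Markdown code fence before parsing. This rests on an assumption that is not confirmed by the source: that a response may occasionally arrive wrapped in a fence. `extractJson` is a hypothetical helper, not part of the SDK.

```typescript
// Three backticks, built programmatically so the marker never appears literally.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(`^${FENCE}(?:json)?\\s*\\n([\\s\\S]*?)\\n?${FENCE}$`);

// Hypothetical helper: parse JSON that is either bare or wrapped in one code fence.
function extractJson(text: string): unknown {
    const trimmed = text.trim();
    const fenced = trimmed.match(FENCED_JSON);
    const payload = fenced ? fenced[1] : trimmed;
    try {
        return JSON.parse(payload);
    } catch (err) {
        throw new Error(`Response is not valid JSON: ${(err as Error).message}`);
    }
}
```

Inside `chatCompletion`, the result of `extractJson(responseText)` could be passed to `responseSchema.parse(...)` in place of `JSON.parse(responseText)`, so that Zod validation still runs on the extracted object.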


### 1. Video Uploads

With the VLM Run Agent API, you can pass a video to chat completions either by URL or by uploading a local file first.

The `chatCompletion` helper above handles both cases: URLs are passed directly as `video_url` content parts, while local paths are uploaded and then referenced by file ID:

```typescript
if (videos) {
    for (const video of videos) {
        if (typeof video === "string") {
            if (video.startsWith("http")) {
                // Video URL
                content.push({
                    type: "video_url",
                    video_url: { url: video }
                });
            } else {
                // Local file path - upload first
                const file = await client.files.upload({
                    filePath: video,
                    purpose: "assistants"
                });
                content.push({
                    type: "input_file",
                    file_id: file.id
                });
            }
        }
    }
}
```
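The `startsWith("http")` check is deliberately simple. If you want a stricter test, one option is to parse the input with the standard `URL` class and accept only the `http:` and `https:` schemes. This is a minimal sketch: `classifyVideoSource` and the `VideoSource` type are hypothetical names used for illustration.

```typescript
type VideoSource =
    | { kind: "url"; url: string }
    | { kind: "file"; path: string };

// Hypothetical helper: decide whether an input is a remote URL or a local path.
// Anything that is not an absolute http(s) URL is treated as a local path,
// matching the fallback behaviour of the helper above.
function classifyVideoSource(input: string): VideoSource {
    try {
        const parsed = new URL(input);
        if (parsed.protocol === "http:" || parsed.protocol === "https:") {
            return { kind: "url", url: input };
        }
    } catch {
        // Not an absolute URL; fall through and treat it as a file path.
    }
    return { kind: "file", path: input };
}
```

With this in place, the upload branch would run only when `classifyVideoSource(video).kind === "file"`.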


Let's look at a simple video below:


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

console.log(">> VIDEO URL:", VIDEO_URL);
console.log("Note: In a Jupyter notebook, you can display videos using HTML or markdown cells");


### 2. Video Captioning & Summarization

Generate detailed captions, summaries, and chapter breakdowns for videos. The agent analyzes both visual and audio content to provide comprehensive descriptions.


### 2a. Simple Video Description

Get a quick, natural language description of a video without structured output.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

const result = await chatCompletion(
    "Describe what happens in this video in 2-3 sentences.",
    [VIDEO_URL]
);

console.log(">> RESPONSE");
console.log(result);
console.log("\n>> VIDEO URL:", VIDEO_URL);


### 2b. Structured Video Understanding

Parse a video and get a detailed summary with topic, summary, and chapter breakdowns.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

const result = await chatCompletion(
    "Parse this video and provide a detailed summary with topic, summary, and chapter breakdowns.",
    [VIDEO_URL],
    undefined,
    ParsedVideoResponseSchema
) as ParsedVideoResponse;

console.log(">> RESPONSE");
console.log(JSON.stringify(result, null, 2));

let mdStr = "";
mdStr += `Topic: ${result.topic}\n`;
mdStr += `\nSummary: ${result.summary}\n`;
mdStr += `\nChapters (${result.chapters.length} total):\n`;
result.chapters.forEach((chapter, i) => {
    mdStr += `  ${String(i + 1).padStart(2, "0")}. [${chapter.start_time} - ${chapter.end_time}] ${chapter.description}\n`;
});

console.log("\n>> VIDEO URL:", VIDEO_URL);
console.log("\n>> FORMATTED SUMMARY:");
console.log(mdStr);


### 3. Video Frame Sampling

Extract frames from videos at specific timestamps or regular intervals. This is useful for thumbnail generation, video analysis, and content indexing.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

// First, get the video summary to use for frame sampling
const summaryResult = await chatCompletion(
    "Parse this video and provide a detailed summary with topic, summary, and chapter breakdowns.",
    [VIDEO_URL],
    undefined,
    ParsedVideoResponseSchema
) as ParsedVideoResponse;

// Now sample frames based on chapters
const result = await chatCompletion(
    `Given the chapter details from the video, sample one frame from every fourth chapter and return the frame URLs with timestamps. Summary: ${JSON.stringify(summaryResult)}`,
    [VIDEO_URL],
    undefined,
    VideoFramesResponseSchema
) as VideoFramesResponse;

console.log(">> RESPONSE");
console.log(`Extracted ${result.frames.length} frames:`);
result.frames.forEach((frame, i) => {
    console.log(`  ${i + 1}. ts=${frame.timestamp}, url: ${frame.url}`);
});
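The prompt above leaves the chapter selection to the agent. If you prefer a deterministic selection, you can pick the chapters client-side and pass explicit timestamps instead. The sketch below illustrates this under stated assumptions: `everyNth` and `buildFramePrompt` are hypothetical helpers, and the prompt wording is only an example.

```typescript
interface Chapter {
    start_time: string;
    end_time: string;
    description: string;
}

// Hypothetical helper: keep the items at indices 0, n, 2n, ...
function everyNth<T>(items: T[], n: number): T[] {
    if (!Number.isInteger(n) || n < 1) {
        throw new Error("n must be a positive integer");
    }
    return items.filter((_, index) => index % n === 0);
}

// Hypothetical helper: build a prompt that lists explicit chapter start times.
function buildFramePrompt(chapters: Chapter[], n: number): string {
    const timestamps = everyNth(chapters, n).map((chapter) => chapter.start_time);
    return `Sample one frame at each of these timestamps and return the frame URLs with timestamps: ${timestamps.join(", ")}`;
}
```

Calling `buildFramePrompt(summaryResult.chapters, 4)` would produce a prompt string that can be passed to `chatCompletion` together with `VideoFramesResponseSchema`.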


### 4. Video Trimming

Extract specific segments from videos by specifying start and end times. Perfect for creating clips, highlights, or removing unwanted portions.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

const result = await chatCompletion(
    "Trim this video from 00:30 to 00:45 (MM:SS) and return the trimmed video URL.",
    [VIDEO_URL],
    undefined,
    VideoTrimResponseSchema
) as VideoTrimResponse;

console.log(">> RESPONSE");
console.log(`Trimmed video URL: ${result.url}`);
console.log(`Start time: ${result.start_time}`);
console.log(`End time: ${result.end_time}`);
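A timestamp written as `00:30` can be read as either `MM:SS` or `HH:MM`. When the trim range comes from code rather than from a person, one way to avoid that ambiguity is to format it as full `HH:MM:SS` before building the prompt. This is a minimal sketch: `formatTimestamp` and `buildTrimPrompt` are hypothetical helpers, and the prompt wording mirrors the example above.

```typescript
// Hypothetical helper: format a number of seconds as "HH:MM:SS" (fractions are dropped).
function formatTimestamp(totalSeconds: number): string {
    if (!Number.isFinite(totalSeconds) || totalSeconds < 0) {
        throw new Error("totalSeconds must be a non-negative number");
    }
    const whole = Math.floor(totalSeconds);
    const hours = Math.floor(whole / 3600);
    const minutes = Math.floor((whole % 3600) / 60);
    const seconds = whole % 60;
    return [hours, minutes, seconds]
        .map((value) => String(value).padStart(2, "0"))
        .join(":");
}

// Hypothetical helper: validate the range and build the trim prompt.
function buildTrimPrompt(startSeconds: number, endSeconds: number): string {
    if (endSeconds <= startSeconds) {
        throw new Error("endSeconds must be greater than startSeconds");
    }
    return `Trim this video from ${formatTimestamp(startSeconds)} to ${formatTimestamp(endSeconds)} and return the trimmed video URL.`;
}
```

For example, `buildTrimPrompt(30, 45)` returns `"Trim this video from 00:00:30 to 00:00:45 and return the trimmed video URL."`.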


### 5. Video Highlight Extraction

Automatically identify and extract the most interesting or important moments from a video. The agent analyzes the content to find key scenes.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

const result = await chatCompletion(
    "Extract the 3 best/most interesting moments from this video as separate clips with timestamps and descriptions.",
    [VIDEO_URL],
    undefined,
    VideoHighlightsResponseSchema
) as VideoHighlightsResponse;

console.log(">> RESPONSE");
console.log(`Extracted ${result.highlights.length} highlights:`);
result.highlights.forEach((highlight, i) => {
    console.log(`  ${String(i + 1).padStart(2, "0")}. [${highlight.start_time} - ${highlight.end_time}] ${highlight.description || ""}`);
});


### 6. Video Duration & Metadata

Get information about video duration and other metadata.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.agent/soccer_ball_juggling.mp4";

const result = await chatCompletion(
    "How long is this video in minutes and seconds? Also describe the video resolution and quality if you can determine it.",
    [VIDEO_URL]
);

console.log(">> VIDEO URL:", VIDEO_URL);
console.log("\n>> RESPONSE");
console.log(result);


### 7. Video Generation

Generate videos from text descriptions, optionally combined with image inputs. The agent can create short video clips based on your prompts.


In [None]:
const result = await chatCompletion(
    "Generate a powerful paint explosion video effect of this logo in an empty room, spreading its colors outwards onto the white walls. Return the pre-signed URL to the video.",
    undefined,
    ["https://raw.githubusercontent.com/vlm-run/.github/main/profile/assets/vlm-blue.png"],
    VideoUrlListResponseSchema
) as VideoUrlListResponse;

console.log(">> RESPONSE");
console.log("Generated video URLs");
console.log(JSON.stringify(result, null, 2));

console.log(`\n>> Generated ${result.urls.length} video(s)`);
result.urls.forEach((url, i) => {
    console.log(`  Video ${i + 1}: ${url.url}`);
});


### 8. Streaming Responses

For long-running tasks, you can use streaming to get partial results as they become available.


In [None]:
const VIDEO_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/video.transcription/bakery.mp4";

const stream = await client.agent.completions.create({
    model: "vlmrun-orion-1:auto",
    messages: [{
        role: "user",
        content: [
            { type: "text", text: "Describe this video in detail" },
            { type: "video_url", video_url: { url: VIDEO_URL } }
        ]
    }],
    stream: true
});

console.log("Streaming response:");
console.log("----------------------------------------");
let fullResponse = "";
for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
        fullResponse += content;
        // In a real notebook, you might want to display this incrementally
        process.stdout.write(content);
    }
}
console.log("\n----------------------------------------");
console.log("\n>> Full response length:", fullResponse.length, "characters");
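The loop above writes to `process.stdout`, which is a Node.js API and may not be available in every kernel. One way to keep the accumulation logic independent of the runtime is to move it into a small helper that takes a callback for display. The sketch below is an illustration only: `collectStream`, `StreamChunk`, and `fakeStream` are hypothetical names, and the chunk shape includes just the fields read in the loop above.

```typescript
// Minimal chunk shape, mirroring the fields read in the streaming loop.
interface StreamChunk {
    choices: Array<{ delta?: { content?: string | null } }>;
}

// Hypothetical helper: drain a stream, forwarding each text delta to a callback.
async function collectStream(
    stream: AsyncIterable<StreamChunk>,
    onDelta: (text: string) => void = () => {}
): Promise<string> {
    let fullResponse = "";
    for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
            fullResponse += content;
            onDelta(content);
        }
    }
    return fullResponse;
}

// Stand-in stream for trying the helper without calling the API.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
    yield { choices: [{ delta: { content: "Hello" } }] };
    yield { choices: [{ delta: {} }] };
    yield { choices: [{ delta: { content: ", world" } }] };
}
```

With the real stream, `await collectStream(stream, (text) => process.stdout.write(text))` would behave like the loop above, and the callback can be swapped for whatever display mechanism your kernel provides.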


---

## Conclusion

This cookbook demonstrated the comprehensive video understanding capabilities of the **VLM Run Orion Agent API** using Node.js/TypeScript.

### Key Takeaways

1. **OpenAI-Compatible Interface**: The API follows the OpenAI chat completions format, making it easy to integrate with existing workflows and tools.
2. **Structured Outputs**: Pass a JSON Schema generated from a Zod schema via the `response_format` parameter, then validate the response with the same Zod schema to get type-safe results.
3. **Type Safety**: TypeScript and Zod provide compile-time and runtime type checking for a better developer experience.
4. **Video Processing**: Support for video loading, captioning, summarization, frame extraction, trimming, and highlight detection.
5. **Video Generation**: Create videos from text descriptions and image inputs.
6. **Streaming Support**: For long-running tasks, enable streaming to receive partial results as they become available, improving user experience.
7. **Flexible Prompting**: Natural language prompts allow you to combine multiple operations in a single request, reducing API calls and latency.

### Video Capabilities Summary

| Capability | Description |
|------------|-------------|
| **Captioning** | Generate detailed captions and summaries with chapter breakdowns |
| **Frame Sampling** | Extract frames at specific timestamps or intervals |
| **Trimming** | Cut videos to specific time ranges |
| **Highlight Extraction** | Automatically identify and extract key moments |
| **Video Generation** | Create videos from text descriptions |
| **Watermarking (coming soon)** | Add overlays and watermarks to videos |
| **YouTube Support (coming soon)** | Load and analyze YouTube videos directly |

### Next Steps

- Explore the [VLM Run Documentation](https://docs.vlm.run) for more details
- Check out the [Video Capabilities Guide](https://docs.vlm.run/agents/capabilities/video) for advanced features
- Join our [Discord community](https://discord.gg/AMApC2UzVY) for support
- Check out more examples in the [VLM Run Cookbook](https://github.com/vlm-run/vlmrun-cookbook)
- Review the [VLM Run Node.js SDK](https://github.com/vlm-run/vlmrun-node-sdk) documentation

Happy building!
