<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a>
</p>
<p align="center">
<a href="https://discord.gg/AMApC2UzVY"><img alt="Discord" src="https://img.shields.io/badge/discord-chat-purple?color=%235765F2&label=discord&logo=discord"></a>
<a href="https://twitter.com/vlmrun"><img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/vlmrun.svg?style=social&logo=twitter"></a>
</p>
</div>

Welcome to **[VLM Run Cookbooks](https://github.com/vlm-run/vlmrun-cookbook)**, a comprehensive collection of examples and notebooks demonstrating the power of structured visual understanding using the [VLM Run Platform](https://app.vlm.run). 

## Breaking the 8000 Token Barrier: Long-form visual transcription with VLM Run

VLM Run provides an API for video understanding at scale that can process long-form content, such as keynotes or films, in a single request without partitioning. Output is not capped at the 8192-token limit found in many APIs, which allows for comprehensive visual transcription that covers both the audio and the visual elements of the video.

This notebook demonstrates how to extract both audio transcripts and visual scene descriptions from video content using VLM Run's advanced video transcription capabilities.

### Environment Setup

To get started, install the VLM Run Python SDK and sign up for an API key on the [VLM Run App](https://app.vlm.run).
- Store the VLM Run API key under the `VLMRUN_API_KEY` environment variable (the name the configuration cell in this notebook reads).

### Prerequisites

* Python 3.9+
* VLM Run API key (get one at [app.vlm.run](https://app.vlm.run))

## Setup

First, let's install the required packages:

In [1]:
! pip install "vlmrun[all]" --quiet
! pip install yt-dlp --quiet

## Configure VLM Run

In [2]:
import os
import getpass

VLMRUN_BASE_URL = os.getenv("VLMRUN_BASE_URL", "https://dev.vlm.run/v1")
VLMRUN_API_KEY = os.getenv("VLMRUN_API_KEY", None)
if VLMRUN_API_KEY is None:
    VLMRUN_API_KEY = getpass.getpass()

 ········


In [3]:
from vlmrun.client import VLMRun

client = VLMRun(base_url=VLMRUN_BASE_URL, api_key=VLMRUN_API_KEY)

### Download sample YouTube video

For this example, we're going to be using a sample YouTube video.


In [4]:
# Download sample youtube video for transcription purposes
import yt_dlp
from vlmrun.constants import VLMRUN_TMP_DIR

URL = "https://www.youtube.com/watch?v=KxjPgGLVJSg"

height = 720
options = {
    "outtmpl": str(VLMRUN_TMP_DIR / "%(id)s.%(ext)s"),
    "format": f"bestvideo[height<={height}][ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best",
    "keepvideo": True,
}
with yt_dlp.YoutubeDL(options) as ydl:
    info = ydl.extract_info(URL, download=True)
    path = VLMRUN_TMP_DIR / f"{info['id']}.mp4"
print(f"Downloaded video [path={path.name}, size={path.stat().st_size / 1024 / 1024:.2f} MB]")

[youtube] Extracting URL: https://www.youtube.com/watch?v=KxjPgGLVJSg
[youtube] KxjPgGLVJSg: Downloading webpage
[youtube] KxjPgGLVJSg: Downloading tv client config
[youtube] KxjPgGLVJSg: Downloading player 74e4bb46
[youtube] KxjPgGLVJSg: Downloading tv player API JSON
[youtube] KxjPgGLVJSg: Downloading ios player API JSON
[youtube] KxjPgGLVJSg: Downloading m3u8 information
[info] KxjPgGLVJSg: Downloading 1 format(s): 398+140
[download] /Users/kaushikbokka/.vlmrun/tmp/KxjPgGLVJSg.mp4 has already been downloaded
Downloaded video [path=KxjPgGLVJSg.mp4, size=24.85 MB]


### Visualize the video

In [5]:
from IPython.display import HTML, display

_, yt_id = URL.split("?v=")
IFRAME_STR = f'<iframe width="560" height="315" src="https://www.youtube.com/embed/{yt_id}?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>'

display(HTML(IFRAME_STR))
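
The `URL.split("?v=")` call above works for this particular link, but it would fail on a URL that carries extra query parameters or uses the `youtu.be` short form. A minimal sketch of a more tolerant parser, using only the standard library, could look like this. The helper name `youtube_id` is hypothetical and is not part of the VLM Run SDK or `yt_dlp`.

```python
from urllib.parse import parse_qs, urlparse


def youtube_id(url: str) -> str:
    """Return the video ID from a youtube.com/watch or youtu.be URL (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/")
    # Standard links carry the ID in the `v` query parameter.
    values = parse_qs(parsed.query).get("v")
    if not values:
        raise ValueError(f"No video ID found in {url!r}")
    return values[0]


print(youtube_id("https://www.youtube.com/watch?v=KxjPgGLVJSg"))
```

The returned ID can be dropped into the same embed URL template used for `IFRAME_STR`.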



### Generate structured data from a long-form video

Let's take this 4-minute video and generate audio and visual transcripts. Both transcripts are segmented into scenes of roughly 20 seconds each.

In [11]:
from vlmrun.client.types import GenerationConfig

# Generate structured data from the video
response = client.video.generate(
    domain="video.transcription",
    file=path,
    batch=True,
    config=GenerationConfig(detail="hi"),
)
print(response.model_dump_json(indent=2))

2025-03-13 08:45:59.568 | DEBUG    | vlmrun.client.predictions:_handle_file_or_url:317 - Uploading file [path=/Users/kaushikbokka/.vlmrun/tmp/KxjPgGLVJSg.mp4, size=24.85 MB] to VLM Run
2025-03-13 08:45:59.571 | DEBUG    | vlmrun.client.files:get_cached_file:56 - Computing md5 hash for file [file=/Users/kaushikbokka/.vlmrun/tmp/KxjPgGLVJSg.mp4]
2025-03-13 08:45:59.634 | DEBUG    | vlmrun.client.files:get_cached_file:62 - Computed md5 hash for file [file=/Users/kaushikbokka/.vlmrun/tmp/KxjPgGLVJSg.mp4, hash=8e8ee35999cc6b6a45a6ed3f9dfac24a]
2025-03-13 08:45:59.635 | DEBUG    | vlmrun.client.files:get_cached_file:65 - Checking if file exists in the database [file=/Users/kaushikbokka/.vlmrun/tmp/KxjPgGLVJSg.mp4, hash=8e8ee35999cc6b6a45a6ed3f9dfac24a]

{
  "id": "25936135-1f6a-4f1c-b22c-6fca51880ec1",
  "created_at": "2025-03-13T03:16:02.401101",
  "completed_at": null,
  "response": null,
  "status": "pending",
  "usage": {
    "elements_processed": null,
    "element_type": null,
    "credits_used": null
  }
}


In [12]:
from vlmrun.client.types import PredictionResponse

# Wait for the prediction to complete
response: PredictionResponse = client.predictions.wait(id=response.id, timeout=1000, sleep=5)
assert isinstance(response, PredictionResponse)

Waiting for prediction to complete:   5%|▊               | 48/1000 [04:28<1:28:51,  5.60s/it]
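
Conceptually, waiting on a batch prediction is a poll-until-done loop. The sketch below illustrates that pattern in plain Python. It is not the SDK's implementation of `client.predictions.wait`; the names `poll_until_complete` and `fetch_status` are hypothetical, and the `"completed"` status string is an assumption (this notebook's output only shows `"pending"`).

```python
import time
from typing import Callable


def poll_until_complete(
    fetch_status: Callable[[], str],
    timeout: float = 1000.0,
    sleep: float = 5.0,
    done_status: str = "completed",  # assumed name of the terminal status
) -> str:
    """Call `fetch_status` every `sleep` seconds until it returns `done_status`."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == done_status:
            return status
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + sleep > deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        time.sleep(sleep)


# Usage with a stand-in status source instead of a real API call.
statuses = iter(["pending", "pending", "completed"])
print(poll_until_complete(lambda: next(statuses), timeout=1.0, sleep=0.01))
```

Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments during a long wait.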


### Analyzing the Transcription Results

The transcription result contains rich structured data with both audio and visual information for each segment. Let's explore different ways to visualize and work with this data:

#### 1. Understanding the Response Structure

The response contains:
- `segments`: List of video segments with audio and visual transcriptions
- `metadata`: Overall video information (description, topics, duration)
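
As a rough picture of that shape, the sketch below builds a small dictionary by hand and reads both parts of it. The keys follow the fields visible in this notebook's output; the text values are abbreviated from the first two segments, the empty metadata fields are written as `None`, and a real response may carry additional fields.

```python
# Hand-written stand-in for `response.response`, abbreviated from this notebook's run.
sample = {
    "segments": [
        {
            "start_time": 0.0,
            "end_time": 25.8,
            "audio": {"content": "Like the only way to find these opportunities..."},
            "video": {"content": "Two men are sitting at a table in a brightly lit room..."},
        },
        {
            "start_time": 25.8,
            "end_time": 51.71,
            "audio": {"content": "It's interesting, as we've gotten older..."},
            "video": {"content": "A man in a blue shirt is seated at a table..."},
        },
    ],
    "metadata": {"description": None, "topics": None, "duration": 488.56},
}

segments = sample.get("segments", [])
metadata = sample.get("metadata", {})

print(f"{len(segments)} segments, duration={metadata.get('duration')} s")
for seg in segments:
    # Each segment pairs a time range with an audio and a video description.
    print(f"{seg['start_time']:.2f}-{seg['end_time']:.2f}s | {seg['audio']['content'][:40]}")
```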

In [13]:
import pandas as pd
pd.set_option('display.max_colwidth', 80)

# Print the high-level video transcription
df = pd.json_normalize(response.response)
df.head()

Unnamed: 0,segments,metadata.description,metadata.topics,metadata.duration
0,"[{'start_time': 0.0, 'end_time': 25.8, 'audio': {'content': ' Like the only ...",,,488.56


#### 2. Exploring Segment Details

Each segment contains:
- `start_time` and `end_time`: Temporal boundaries in seconds
- `audio.content`: Text transcription of spoken content
- `video.content`: Description of visual elements in the scene
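
With those three fields, a plain timestamped transcript takes only a few lines to produce. The following is a minimal sketch, assuming each segment is a dictionary shaped like the ones described above; `format_timestamp` and `to_transcript` are hypothetical helper names, and the sample text is abbreviated from this notebook's output.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def to_transcript(segments: list[dict]) -> str:
    """Join the audio content of each segment into one timestamped line per scene."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start_time"])
        end = format_timestamp(seg["end_time"])
        lines.append(f"[{start}-{end}] {seg['audio']['content'].strip()}")
    return "\n".join(lines)


# Abbreviated sample segments for illustration.
sample_segments = [
    {"start_time": 0.0, "end_time": 25.8, "audio": {"content": " Like the only way to find these opportunities..."}},
    {"start_time": 92.67, "end_time": 112.85, "audio": {"content": "And then we were there in the beginning of cloud compute..."}},
]
print(to_transcript(sample_segments))
```

In the notebook itself, the same function could be applied to `response.response.get("segments", [])`.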

In [15]:
pd.set_option('display.max_colwidth', 600)

segments_json = response.response.get("segments", [])
segments_df = pd.json_normalize(segments_json)
segments_df["preview"] = segments_df.apply(
    lambda x: IFRAME_STR.replace("?rel=0", f"?start={int(x['start_time'])}&end={int(x['end_time'])}"), axis=1
)
HTML(segments_df.to_html(escape=False))

Unnamed: 0,start_time,end_time,audio.content,video.content,preview
0,0.0,25.8,"Like the only way to find these opportunities to learn about them is to find weirdos on the internet that are also into this thing. Yes. And they're figuring it out too. And you can kind of compare notes. Yes. And this is how new industries are created. Literally. By weirdos on the internet. Like literally. Literally. This is Dalton, plus Michael, and today we're going to talk about why AI is going to create more successful founders in the world.",Two men are sitting at a table in a brightly lit room with large windows in the background. The man on the left is wearing a light gray button-up shirt and has curly hair. He is gesturing with his hands as he speaks. The man on the right is wearing a blue button-up shirt and glasses. He is smiling and listening attentively. They appear to be engaged in a conversation.,
1,25.8,51.71,"It's interesting, as we've gotten older, we kind of see a new set of tools come into the market and then an explosion in the number of founders who can now create value. And we've seen this before, right? Like, what was the first time you saw this? I certainly noticed when the internet was new, people that knew how to build websites were suddenly able to make lots of money from","A man in a blue shirt is seated at a table, engaged in a conversation with another person whose back is facing the camera. The man in the blue shirt is speaking and gesturing with his hands, while the other person listens attentively. The setting appears to be an indoor office or meeting room with a window in the background. The video includes text overlays that read ""AI Will Create More Successful Founders"" and ""Founder Explosion.""",
2,51.71,71.89,"the skill. And it was like really basic stuff. High school kids were making tons of money. Yep. I remember people that could just figure out how to sell stuff on eBay, where you would go buy something cheap but then listed on eBay and arbitrage. Yep. Basically, you would see people that kind of understood the new tooling that came out and would like do a hustle and make ungodly amounts of money.","Two men are engaged in a conversation at a table. The man on the left, wearing a light gray button-up shirt, is gesturing with his hands as he speaks. The man on the right, dressed in a blue jacket over a black shirt, listens attentively with his arms crossed. The background features a large window with a view of a cityscape, suggesting an urban setting. The conversation appears to be casual and focused, possibly discussing business or personal matters.",
3,72.01,92.67,"Yeah. And it was just because they understood the new tools. And I already wasn't even a hustle. Like it was a good business. Like it was, they saw that tools enabled new businesses. You know, we saw this, you know, tail end of the open source world where like we could build all of Justin TV with free software. Yep.","A man with curly hair is speaking animatedly to another man who is sitting and listening attentively. The speaker gestures with his hands as he talks, emphasizing his points. The listener remains seated, occasionally nodding and responding to the speaker. The setting appears to be an office or meeting room with a window in the background.",
4,92.67,112.85,"And then we were there in the beginning of cloud compute where we didn't have to rack servers anymore. Any kid could sign up for an Amazon account, put a couple bucks down, and get access to a server. And so what's interesting is that we might, I think we feel pretty good about saying this, we might be on","A man with a bald head and a gray beard is sitting at a table, wearing a blue shirt over a black t-shirt. He is engaged in a conversation with another person whose back is facing the camera. The man is speaking and gesturing with his hands, occasionally clapping them together. The setting appears to be an indoor environment, possibly an office or a meeting room, with a neutral background.",
5,112.85,135.69,"the cusp of the next one of these. And that means there are maybe a whole bunch of new opportunities for successful businesses to be created. Yeah, starting now. Yeah, I mean, here's another metaphor. When the iPhone came out, who would have thought that Flappy Bird would have been created? And I think I read that that guy made like 20 million in cash.","Two men are engaged in a conversation at a table. The man on the left, wearing a light gray shirt, is speaking animatedly, gesturing with his hands as he talks. The man on the right, dressed in a blue jacket over a black shirt, listens attentively, occasionally nodding and responding. The setting appears to be an office or conference room with large windows in the background, allowing natural light to fill the space.",
6,135.87,159.75,"Boom. In like two months and then shut it down. And so if you watch, okay, iPhone, Steve Jobs on stage, some guy in Southeast Asia building Flappy Bird. That's like wild. Never would have guessed. And so, again, to be very direct, what we're arguing is that when brand new technologies come out that are powerful, the people that are on the cusp of understanding them and that quickly","Two men are sitting at a table in a modern office setting with large windows in the background. The man on the left is wearing a light gray button-up shirt and has curly hair. He is gesturing with his hands as he speaks, occasionally raising them to emphasize points. The man on the right is wearing a blue jacket and glasses, with a beard. He is listening attentively, nodding his head and smiling. The conversation appears to be casual and friendly.",
7,159.75,180.77,"build businesses or build useful things using those tools have a very unique view of creating businesses and wealth. And again, to be on the nose for AI, it seems like you can do things that would require way more headcount than you would otherwise. Yes. And so, you know, we're not even saying we know the ideas.","A man with curly hair is speaking animatedly to another man who is seated and listening attentively. The speaker uses expressive hand gestures as he explains something, while the listener remains still, occasionally nodding his head. The setting appears to be an office or a casual meeting room with large windows in the background, allowing natural light to fill the space.",
8,180.97,201.72,"No. We're just saying if you're watching this and you're interested in being a founder or maybe not working at a company. Yeah. And you just pay attention to every new thing that comes out and try to find these opportunities or, I don't't know arbitrage is the right word, but no, just you know, new opportunities. New opportunities using these cutting edge tools","Two men are sitting at a table in an office setting. The man on the left is wearing a light gray shirt and has curly hair. He is gesturing with his hands as he speaks. The man on the right is wearing a blue jacket over a black shirt and has a beard. He is listening attentively with his hands clasped together. The background features large windows with a view of a body of water, and the room has a modern, minimalist design.",
9,201.72,221.8,"and you're on the bleeding edge, you're not competing with anyone. No. It's green field. I think what's cool is any time one of these technologies shifts happens, the cost of starting a business, some set of businesses, reduces by up to like 10x. Yep. And so suddenly, businesses that either wouldn't have made sense","Two men are sitting at a table in a modern office setting. The man on the left is wearing a light gray button-up shirt and has curly hair. He is gesturing with his hands as he speaks. The man on the right is wearing a blue jacket over a black shirt and has a beard. He is listening attentively and occasionally responds with hand gestures. The background features large windows with a view of the outside, and the room has a clean, minimalist design.",


As you can see, the video has been segmented into ~20s scenes, each with a detailed audio transcription and a corresponding visual caption. This lets developers inspect and query the video content scene by scene.
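
One simple use of this granularity is locating where a topic comes up in the video. The sketch below searches both the audio and the visual text of each segment for a keyword and returns the matching time ranges. `find_segments` is a hypothetical helper, and the sample data is abbreviated from three of the segments shown above.

```python
def find_segments(segments: list[dict], keyword: str) -> list[tuple[float, float]]:
    """Return (start_time, end_time) for every segment mentioning `keyword`."""
    needle = keyword.lower()
    hits = []
    for seg in segments:
        audio = seg.get("audio", {}).get("content", "")
        video = seg.get("video", {}).get("content", "")
        # Case-insensitive match against either the spoken or the visual text.
        if needle in audio.lower() or needle in video.lower():
            hits.append((seg["start_time"], seg["end_time"]))
    return hits


# Abbreviated sample segments for illustration.
sample_segments = [
    {
        "start_time": 92.67,
        "end_time": 112.85,
        "audio": {"content": "Any kid could sign up for an Amazon account..."},
        "video": {"content": "A man with a bald head and a gray beard is sitting at a table..."},
    },
    {
        "start_time": 112.85,
        "end_time": 135.69,
        "audio": {"content": "who would have thought that Flappy Bird would have been created?"},
        "video": {"content": "Two men are engaged in a conversation at a table."},
    },
    {
        "start_time": 135.87,
        "end_time": 159.75,
        "audio": {"content": "some guy in Southeast Asia building Flappy Bird."},
        "video": {"content": "Two men are sitting at a table in a modern office setting."},
    },
]

print(find_segments(sample_segments, "flappy bird"))
# [(112.85, 135.69), (135.87, 159.75)]
```

The returned time ranges can be passed to the same `?start=...&end=...` embed parameters used for the `preview` column.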

### Thanks for following along!

Head over to the [VLM Run App](https://app.vlm.run) to try out the [VLM Run](https://vlm.run) API for yourself!