<div align="center">
<p align="center" style="width: 100%;">
    <img src="https://raw.githubusercontent.com/vlm-run/.github/refs/heads/main/profile/assets/vlm-black.svg" alt="VLM Run Logo" width="80" style="margin-bottom: -5px; color: #2e3138; vertical-align: middle; padding-right: 5px;"><br>
</p>
<h2>Cookbook</h2>
<p align="center"><a href="https://docs.vlm.run"><b>Website</b></a> | <a href="https://docs.vlm.run/"><b>API Docs</b></a> | <a href="https://docs.vlm.run/blog"><b>Blog</b></a> | <a href="https://discord.gg/AMApC2UzVY"><b>Discord</b></a>
</p>
<p align="center">
<a href="https://discord.gg/AMApC2UzVY"><img alt="Discord" src="https://img.shields.io/badge/discord-chat-purple?color=%235765F2&label=discord&logo=discord"></a>
<a href="https://twitter.com/vlmrun"><img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/vlmrun.svg?style=social&logo=twitter"></a>
</p>
</div>

Welcome to the [VLM Run](https://vlm.run) Colab Cookbook! This notebook is a worked example showing developers how to use Vision Language Models (VLMs) for visual ETL.


### Environment Setup

To get started, install the VLM Run Python SDK and sign up for an API key on the [VLM Run App](https://app.vlm.run).
- Store the VLM Run API key under the `VLM_RUN_API_KEY` environment variable.
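
Before creating the client, it can help to confirm that the key is actually visible to the notebook process. The snippet below is a minimal sketch under that assumption: `require_api_key` is a hypothetical helper written for this notebook, not an SDK function, and the only name taken from the setup step above is the `VLM_RUN_API_KEY` environment variable.

```python
import os


def require_api_key(env=None, name="VLM_RUN_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for this notebook; it is not part of the VLM Run SDK.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; create a key in the VLM Run App and export it first.")
    return key


# Demo with an explicit mapping so the snippet runs without a real key.
print(require_api_key({"VLM_RUN_API_KEY": "example-key"}))
```

In a real session, call `require_api_key()` with no arguments so it reads `os.environ`.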

### Install Dependencies

In [None]:
%pip install "vlmrun[all]"
%pip install yt-dlp

### Initialize the VLM Run Client

In [None]:
from vlmrun.client import VLMRun


client = VLMRun()
client

In [3]:
# Let's check if the API is online
client.healthcheck()

True

### Download sample YouTube video

For this example, we'll use a sample YouTube video.


In [None]:
# Download a sample YouTube video to transcribe
import yt_dlp
from vlmrun.constants import VLMRUN_TMP_DIR

URL = "https://www.youtube.com/watch?v=KxjPgGLVJSg"

height = 720
options = {
    "outtmpl": str(VLMRUN_TMP_DIR / "%(id)s.%(ext)s"),
    "format": f"bestvideo[height<={height}][ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best",
    "keepvideo": True,
}
with yt_dlp.YoutubeDL(options) as ydl:
    info = ydl.extract_info(URL, download=True)
    path = VLMRUN_TMP_DIR / f"{info['id']}.mp4"
print(f"Downloaded video [path={path.name}, size={path.stat().st_size / 1024 / 1024:.2f} MB]")

### Visualize the video

In [42]:
from IPython.display import HTML, display

_, yt_id = URL.split("?v=")
IFRAME_STR = f'<iframe width="560" height="315" src="https://www.youtube.com/embed/{yt_id}?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>'

display(HTML(IFRAME_STR))
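
The `URL.split("?v=")` call above works for this exact URL, but it would keep any trailing query parameters (for example `&t=10s`) as part of the ID. As a hedged alternative, the sketch below parses the URL with the standard library and builds an embed link with optional `start` and `end` offsets. The helper names `youtube_id` and `embed_url` are hypothetical and only cover the common `watch?v=` and `youtu.be/` URL shapes.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def youtube_id(url):
    """Extract the video ID from a `watch?v=` or `youtu.be/` style URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    ids = parse_qs(parsed.query).get("v")
    if not ids:
        raise ValueError(f"No video ID found in {url!r}")
    return ids[0]


def embed_url(url, start=None, end=None):
    """Build an embed URL, optionally limited to a [start, end] window in seconds."""
    params = {"rel": 0}
    if start is not None:
        params["start"] = int(start)  # the embed player expects whole seconds
    if end is not None:
        params["end"] = int(end)
    return f"https://www.youtube.com/embed/{youtube_id(url)}?{urlencode(params)}"


print(embed_url("https://www.youtube.com/watch?v=KxjPgGLVJSg", start=25.8, end=51.71))
```

The resulting URL can be dropped into the `src` attribute of the iframe shown above.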

### Generate structured data from a long-form video

Let's take this long-form video and generate audio and visual transcripts. Both transcripts are then segmented into ~20s scenes.

In [29]:
from vlmrun.client.types import GenerationConfig

# Generate structured data from the video
response = client.video.generate(
    domain="video.transcription",
    file=path,
    batch=True,
    config=GenerationConfig(detail="hi"),
)
print(response.model_dump_json(indent=2))

2025-03-11 21:44:20.193 | DEBUG    | vlmrun.client.predictions:_handle_file_or_url:317 - Uploading file [path=/Users/sudeep/.vlmrun/tmp/KxjPgGLVJSg.mp4, size=24.85 MB] to VLM Run
2025-03-11 21:44:20.193 | DEBUG    | vlmrun.client.files:get_cached_file:56 - Computing md5 hash for file [file=/Users/sudeep/.vlmrun/tmp/KxjPgGLVJSg.mp4]
2025-03-11 21:44:20.242 | DEBUG    | vlmrun.client.files:get_cached_file:62 - Computed md5 hash for file [file=/Users/sudeep/.vlmrun/tmp/KxjPgGLVJSg.mp4, hash=8e8ee35999cc6b6a45a6ed3f9dfac24a]
2025-03-11 21:44:20.243 | DEBUG    | vlmrun.client.files:get_cached_file:65 - Checking if file exists in the database [file=/Users/sudeep/.vlmrun/tmp/KxjPgGLVJSg.mp4, hash=8e8ee35999cc6b6a45a6ed3f9dfac24a]

{
  "id": "6ad94f2d-173f-4635-a294-0daf6be63779",
  "created_at": "2025-03-12T04:44:21.947459",
  "completed_at": null,
  "response": null,
  "status": "pending",
  "usage": {
    "elements_processed": null,
    "element_type": null,
    "credits_used": null
  }
}


In [34]:
from vlmrun.client.types import PredictionResponse

# Wait for the prediction to complete
response: PredictionResponse = client.predictions.wait(id=response.id, timeout=600, sleep=5)
assert isinstance(response, PredictionResponse)

Waiting for prediction to complete:   0%|          | 0/600 [00:01<?, ?it/s]
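
Because the batch request returns with `"status": "pending"`, the result has to be polled until it finishes. The sketch below illustrates the general poll-with-timeout pattern behind a call like `client.predictions.wait(...)`. It is not the SDK's implementation: `wait_for` is a hypothetical helper, and the `"completed"` status string in the demo is an assumption, since the output above only shows `"pending"`.

```python
import time


def wait_for(fetch, is_done, timeout=600.0, sleep=5.0):
    """Call `fetch()` repeatedly until `is_done(result)` is true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Not done after {timeout} seconds")
        time.sleep(sleep)


# Demo with a fake status source instead of a real API call.
fake_statuses = iter(["pending", "pending", "completed"])  # "completed" is an assumed status value
final = wait_for(lambda: next(fake_statuses), lambda s: s == "completed", timeout=10, sleep=0)
print(final)
```

With the real client, `fetch` would be a function that retrieves the prediction by ID and `is_done` would inspect its `status` field.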


In [43]:
import pandas as pd
pd.set_option('display.max_colwidth', 80)

# Print the high-level video transcription
df = pd.json_normalize(response.response)
df.head()

Unnamed: 0,segments,metadata.language,metadata.content,metadata.topics,metadata.duration
0,"[{'start_time': 0.0, 'end_time': 25.8, 'audio': {'content': ' Like the only ...",,,,488.56


In [40]:
pd.set_option('display.max_colwidth', 600)

segments_json = response.response.get("segments", [])
segments_df = pd.json_normalize(segments_json)
segments_df["preview"] = segments_df.apply(
    lambda x: IFRAME_STR.replace("?rel=0", f"?start={int(x['start_time'])}&end={int(x['end_time'])}"), axis=1
)
HTML(segments_df.to_html(escape=False))

Unnamed: 0,start_time,end_time,audio.content,video.content,preview
0,0.0,25.8,"Like the only way to find these opportunities to learn about them is to find weirdos on the internet that are also into this thing. Yes. And they're figuring it out too. And you can kind of compare notes. Yes. And this is how new industries are created. Literally. By weirdos on the internet. Like literally. Literally. This is Dalton, plus Michael, and today we're going to talk about why AI is going to create more successful founders in the world.","Two men are engaged in a conversation at a table. The man on the left, wearing a light gray shirt, is gesturing with his hands as he speaks. The man on the right, dressed in a blue shirt, listens attentively and occasionally responds with hand gestures. They appear to be in a professional setting, possibly an office or conference room, with large windows in the background allowing natural light to fill the space.",
1,25.8,51.71,"It's interesting, as we've gotten older, we kind of see a new set of tools come into the market and then an explosion in the number of founders who can now create value. And we've seen this before, right? Like, what was the first time you saw this? I certainly noticed when the internet was new, people that knew how to build websites were suddenly able to make lots of money from","The video features two individuals engaged in a conversation at a table. The person on the left, wearing a light gray shirt, is facing the person on the right, who is dressed in a blue jacket over a black shirt. The background is minimalistic, with a plain wall and a window allowing natural light to enter. The text overlay on the left side of the screen reads ""AI Will Create More Successful Founders"" and ""Founder Explosion."" On the right side, there is a list titled ""Founder Explosion"" with various items such as ""On The Cusp,"" ""Cost Of Business,"" ""Get In Early,"" ""Whatnot,"" ""Endless Opportunity,"" and ""Internet Weirdos."" The conversation appears to be focused on the impact of artificial intelligence on business and entrepreneurship.",
2,51.71,71.89,"the skill. And it was like really basic stuff. High school kids were making tons of money. Yep. I remember people that could just figure out how to sell stuff on eBay, where you would go buy something cheap but then listed on eBay and arbitrage. Yep. Basically, you would see people that kind of understood the new tooling that came out and would like do a hustle and make ungodly amounts of money.","The video features two men engaged in a conversation in an office setting. The man on the left, wearing a light gray button-up shirt, is actively speaking and gesturing with his hands, while the man on the right, dressed in a blue jacket over a black shirt, listens attentively with his arms crossed. The background includes a large window with blinds partially drawn, allowing natural light to filter into the room. The conversation appears to be focused on business-related topics, as indicated by the text on the right side of the screen, which lists various themes such as 'On The Cusp,' 'Cost Of Business,' 'Get In Early,' 'Whatnot,' 'Endless Opportunity,' and 'Internet Weirdos.' The overall atmosphere suggests a professional discussion.",
3,72.01,92.67,"Yeah. And it was just because they understood the new tools. And I already wasn't even a hustle. Like it was a good business. Like it was, they saw that tools enabled new businesses. You know, we saw this, you know, tail end of the open source world where like we could build all of Justin TV with free software. Yep.","The video features two men engaged in a conversation in an office setting. The man on the left, wearing a light gray shirt, is gesturing animatedly with his hands as he speaks, indicating an active discussion. The man on the right, dressed in a blue jacket over a black shirt, listens attentively with his hands clasped together on the table. The background includes a window with blinds partially open, allowing natural light to filter into the room. On the right side of the screen, there is a vertical list titled ""Founder Explosion"" with various topics such as ""On The Cusp,"" ""Cost Of Business,"" ""Get In Early,"" ""Whatnot,"" ""Endless Opportunity,"" and ""Internet Weirdos.""",
4,92.67,112.85,"And then we were there in the beginning of cloud compute where we didn't have to rack servers anymore. Any kid could sign up for an Amazon account, put a couple bucks down, and get access to a server. And so what's interesting is that we might, I think we feel pretty good about saying this, we might be on","The video features a conversation between two men seated at a table in a modern office setting. The man on the right, wearing a blue shirt and glasses, is speaking animatedly, gesturing with his hands as he discusses various topics related to entrepreneurship and business. The man on the left, dressed in a light-colored shirt, listens attentively, occasionally nodding and responding. The background includes a white wall and a window, suggesting a professional environment. On the right side of the screen, there is a list of topics being discussed, such as 'Founder Explosion,' 'On The Cusp,' 'Cost Of Business,' 'Get In Early,' 'Whatnot,' 'Endless Opportunity,' and 'Internet Weirdos.'",
5,112.85,135.69,"the cusp of the next one of these. And that means there are maybe a whole bunch of new opportunities for successful businesses to be created. Yeah, starting now. Yeah, I mean, here's another metaphor. When the iPhone came out, who would have thought that Flappy Bird would have been created? And I think I read that that guy made like 20 million in cash.","The video features two men engaged in a conversation at a table. The man on the left, wearing a light gray shirt, is speaking animatedly, gesturing with his hands as he talks. The man on the right, dressed in a blue jacket over a black shirt, listens attentively, occasionally nodding and responding. The setting appears to be an office or meeting room with large windows in the background, allowing natural light to fill the space. The overall atmosphere suggests a professional discussion or interview.",
6,135.87,159.75,"Boom. In like two months and then shut it down. And so if you watch, okay, iPhone, Steve Jobs on stage, some guy in Southeast Asia building Flappy Bird. That's like wild. Never would have guessed. And so, again, to be very direct, what we're arguing is that when brand new technologies come out that are powerful, the people that are on the cusp of understanding them and that quickly","Two men are engaged in a conversation at a table. The man on the left, wearing a light gray shirt, is gesturing animatedly with his hands as he speaks. The man on the right, dressed in a blue jacket, listens attentively, occasionally nodding and smiling. The background features a large window with a view of a body of water, suggesting an indoor setting with natural light.",
7,159.75,180.77,"build businesses or build useful things using those tools have a very unique view of creating businesses and wealth. And again, to be on the nose for AI, it seems like you can do things that would require way more headcount than you would otherwise. Yes. And so, you know, we're not even saying we know the ideas.","The video features two men engaged in a conversation in an indoor setting. The man on the left, wearing a light gray button-up shirt, is actively gesturing with his hands as he speaks, indicating an animated discussion. The man on the right, dressed in a dark blue shirt, listens attentively, occasionally nodding and responding. The background includes a window with blinds, suggesting a modern office or studio environment. The video also displays a sidebar with various topics such as 'On The Cusp,' 'Cost Of Business,' 'Get In Early,' 'Whatnot,' 'Endless Opportunity,' 'Internet Weirdos,' and 'New Is The Time,' which likely relate to the conversation's themes.",
8,180.97,201.72,"No. We're just saying if you're watching this and you're interested in being a founder or maybe not working at a company. Yeah. And you just pay attention to every new thing that comes out and try to find these opportunities or, I don't't know arbitrage is the right word, but no, just you know, new opportunities. New opportunities using these cutting edge tools","The video depicts a conversation between two men seated at a table in an office setting. The man on the left, wearing a light gray shirt, is gesturing animatedly with his hands as he speaks, while the man on the right, dressed in a blue jacket over a black shirt, listens attentively with his hands clasped together. The background features large windows with a view of a cityscape, and the room has a modern, minimalist design with white walls and a light-colored floor. The conversation appears to be focused and engaged, with both individuals actively participating in the dialogue.",
9,201.72,221.8,"and you're on the bleeding edge, you're not competing with anyone. No. It's green field. I think what's cool is any time one of these technologies shifts happens, the cost of starting a business, some set of businesses, reduces by up to like 10x. Yep. And so suddenly, businesses that either wouldn't have made sense","The video features two men engaged in a conversation at a table. The man on the left, wearing a light gray shirt, is gesturing with his hands as he speaks, while the man on the right, dressed in a blue jacket over a black shirt, listens attentively. The setting appears to be an office or meeting room with large windows in the background, allowing natural light to fill the space. The conversation seems to revolve around business topics, as indicated by the text on the right side of the screen, which includes phrases like 'Cost Of Business,' 'Get In Early,' 'Whatnot,' 'Endless Opportunity,' 'Internet Weirdos,' and 'Now Is The Time.' The overall atmosphere suggests a professional discussion.",


As you can see, the video has been segmented into ~20s scenes, each with a detailed audio transcription and a corresponding visual caption. This lets developers inspect and analyze the video content scene by scene.
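
One way to use scene-level output like this is a simple keyword search that returns the time ranges where a term appears. The sketch below assumes each segment is a dict with `start_time`, `end_time`, and nested `audio.content` / `video.content` fields, as in the output above. The helpers `fmt_ts` and `find_segments` are hypothetical, and the sample data is abbreviated excerpts from two of the segments shown.

```python
def fmt_ts(seconds):
    """Format a number of seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def find_segments(segments, keyword):
    """Return (start, end) timestamps of segments whose audio or video text mentions `keyword`."""
    needle = keyword.lower()
    hits = []
    for seg in segments:
        audio = (seg.get("audio") or {}).get("content") or ""
        video = (seg.get("video") or {}).get("content") or ""
        if needle in f"{audio} {video}".lower():
            hits.append((fmt_ts(seg["start_time"]), fmt_ts(seg["end_time"])))
    return hits


# Abbreviated excerpts from the segments above, used here as sample data.
sample_segments = [
    {
        "start_time": 0.0,
        "end_time": 25.8,
        "audio": {"content": "And this is how new industries are created."},
        "video": {"content": "Two men are engaged in a conversation at a table."},
    },
    {
        "start_time": 112.85,
        "end_time": 135.69,
        "audio": {"content": "When the iPhone came out, who would have thought that Flappy Bird would have been created?"},
        "video": {"content": "Two men are engaged in a conversation at a table."},
    },
]

print(find_segments(sample_segments, "flappy bird"))
```

In the notebook itself, `response.response.get("segments", [])` could be passed in place of `sample_segments`.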

### Thanks for following along!

Head over to the [VLM Run App](https://app.vlm.run) to try out the [VLM Run](https://vlm.run) API for yourself!