Give your AI agent eyes and ears for any social video.
```
watch https://twitter.com/anyone/status/12345
```
You get back: a video file, evenly-spaced frames as JPGs, and the full audio transcript. Your agent reads them and "watches" the video. Works on YouTube, X, LinkedIn, TikTok, Reddit, Vimeo, and Facebook; login-walled posts (LinkedIn, X, FB) fall back to your browser cookies automatically.
Large language models can't watch video natively — they read text and look at still images. Modern multimodal APIs will analyze a full video for you, but they're slow and expensive. The trick: you almost never need them.
A video is just frames + audio. Each piece has a fast, near-free tool already:
- `yt-dlp` downloads from any social platform
- `ffmpeg` extracts evenly-spaced frames
- An ASR model transcribes the audio
- A multimodal LLM hears tone, music, SFX, language, mood
Compose them and your agent has video understanding — without burning a multimodal LLM on every frame.
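The frame step is mostly arithmetic: pick N timestamps spread evenly across the duration, then have ffmpeg seek to each. A minimal sketch using the numbers from the example run below (218 s, 8 frames); the real extract-frames script may pick its timestamps differently:

```shell
# Evenly-spaced frame timestamps: centre frame i in its slice of the video.
# Only awk is needed; 218 s / 8 frames match the example output.
duration=218
count=8
i=1
while [ "$i" -le "$count" ]; do
    t=$(awk -v d="$duration" -v c="$count" -v i="$i" \
        'BEGIN { printf "%.1f", (i - 0.5) * d / c }')
    # One way to grab the frame: ffmpeg -ss "$t" -i video.mp4 -frames:v 1 out.jpg
    echo "frame_$(printf '%02d' "$i") at ${t}s"
    i=$((i + 1))
done
```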
```
$ watch https://www.linkedin.com/posts/some-talk_activity-12345
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
…
TRANSCRIPT:
Today I want to talk about how decomposition unlocks 10× cost reduction in
multimodal pipelines …
```
Your agent reads the JPGs and the transcript. That's the whole watch.
After noticing how much we burned on multimodal calls, we measured:
| Approach | Cost / 1-hour video | Time |
|---|---|---|
| Multimodal LLM on full video | ~$5 | 30–60s |
| watch-cli (Kyma audio) | < $0.10 | ~10–15s |
| Ratio | ~50× cheaper | ~5× faster |
The savings compound when an agent watches dozens of videos in a session.
```
curl -fsSL https://raw.githubusercontent.com/sonpiaz/watch-cli/main/install.sh | bash
```
Or from a clone:
```
git clone https://github.com/sonpiaz/watch-cli ~/.watch-cli
cd ~/.watch-cli && ./install.sh
```
The installer checks for yt-dlp, ffmpeg, jq, curl, and python3, and
symlinks the commands into ~/.local/bin. On macOS:
```
brew install yt-dlp ffmpeg jq
```
On Debian/Ubuntu:
```
sudo apt install yt-dlp ffmpeg jq python3 curl
```
Then set your key:
```
export KYMA_API_KEY=kyma-xxxxxxxx
```
watch-cli runs on Kyma for the audio side: one key that opens
every gate (text, image, video, audio, and what comes next). Free credit
at signup covers hundreds of videos.
Get a Kyma key at kymaapi.com.
Prefer bring-your-own-keys? Uncomment GROQ_API_KEY and GOOGLE_AI_KEY
in .env.example and watch-cli falls back to direct provider calls.
`watch <url> [frame-count] [--cookies <file>]`
Orchestrator. Downloads, extracts frames, transcribes; one block out.

`dl-video <url> [out-dir] [--cookies <file>]`
Just download the video. Returns the local mp4 path.

`extract-frames <video> [count] [out-dir]`
Pull N evenly-spaced JPG frames. Default 8.

`transcribe <audio-or-video> [language]`
Speech-to-text. Auto-extracts audio from video first.

`audio-q <audio-or-video> "<question>"`
Audio scene Q&A: tone, music, SFX, language, emotion. Beyond pure transcription.

`models [--all]`
List audio models available on Kyma (live, no hardcoded list). `--all` shows every Kyma SKU (text + image + video + audio).
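The orchestrator is just the other primitives chained. A runnable sketch with the commands stubbed out (the real ones shell out to yt-dlp, ffmpeg, and Kyma; the paths echo the example output above):

```shell
# Stubs stand in for the real dl-video / extract-frames / transcribe
# commands so the composition itself can run anywhere.
dl_video()       { echo "/tmp/dl-video/abc123.mp4"; }
extract_frames() { echo "/tmp/frames_abc123/frame_01.jpg"; }
transcribe()     { echo "Today I want to talk about decomposition."; }

video=$(dl_video "https://example.com/v/1")
frames=$(extract_frames "$video" 8)
text=$(transcribe "$video")
printf 'VIDEO: %s\nFRAMES:\n%s\nTRANSCRIPT:\n%s\n' "$video" "$frames" "$text"
```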
The scripts call Kyma using the `transcribe` and `audio-understand` aliases,
not raw model IDs. When Kyma swaps the underlying model (Whisper v4,
Voxtral, a faster ASR), watch-cli keeps working without an update; the
alias points to whichever model is current. Run `watch-cli models` any time
to see what's behind the alias today.
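Concretely, a transcription call puts the alias in the model field of an OpenAI-style multipart request. The sketch below only builds the command so the shape is visible; the base URL and field names are assumptions, and only the alias name and endpoint path come from this README:

```shell
# Build (don't run) the request so its shape is visible. KYMA_BASE and the
# -F field names are illustrative; check the Kyma docs for the exact form.
KYMA_BASE="${KYMA_BASE:-https://api.kymaapi.com}"
build_transcribe_cmd() {
    printf 'curl -s %s/v1/audio/transcriptions -H "Authorization: Bearer $KYMA_API_KEY" -F model=transcribe -F file=@%s\n' \
        "$KYMA_BASE" "$1"
}
cmd=$(build_transcribe_cmd audio.mp3)
echo "$cmd"
```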
Most YouTube / TikTok / Reddit / Vimeo / public X work without setup. LinkedIn, private X posts, and Facebook need a session.
watch-cli auto-detects cookies from any signed-in browser (Chrome → Firefox → Safari → Edge → Brave → Chromium). Just sign in normally and re-run.
For servers / CI without browsers, pass a manual cookies file:
```
watch <url> --cookies ~/cookies.txt
```
Full setup walkthrough: docs/cookies.md.
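If a manual cookies file is rejected, the usual culprit is format: yt-dlp expects the Netscape cookie-file layout, a header line plus tab-separated 7-field rows. A quick sanity check, with a dummy file standing in for a real browser export:

```shell
# Dummy cookies file in Netscape format (7 tab-separated fields per row);
# a real export from your browser replaces this.
cookies=/tmp/cookies-demo.txt
printf '# Netscape HTTP Cookie File\n.linkedin.com\tTRUE\t/\tTRUE\t0\tli_at\tXXXX\n' > "$cookies"

header=$(head -n 1 "$cookies")
badrows=$(awk -F '\t' '!/^#/ && NF != 7 { n++ } END { print n + 0 }' "$cookies")
echo "header: $header / malformed rows: $badrows"
```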
```
You have access to a `watch` command that takes a URL and returns
a video, 8 frames, and the transcript. Read the frames as images and
the transcript as text — that's enough to "watch" any social video.
```
The output block is structured so an agent can parse it without help:
VIDEO: line, FRAMES: block (one path per line), TRANSCRIPT: block.
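Because the block is line-oriented, a few lines of standard tools can split it without a real parser. A sketch against a sample block shaped like the example output above:

```shell
# Sample block shaped like watch's output (trimmed to two frames).
sample='VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
TRANSCRIPT:
Today I want to talk about decomposition.'

video=$(printf '%s\n' "$sample" | awk '/^VIDEO:/ { print $2 }')
nframes=$(printf '%s\n' "$sample" | \
    awk '/^FRAMES:/ { f = 1; next } /^TRANSCRIPT:/ { f = 0 } f { n++ } END { print n + 0 }')
transcript=$(printf '%s\n' "$sample" | sed -n '/^TRANSCRIPT:/,$p' | tail -n +2)
echo "$video ($nframes frames)"
```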
```
URL ──▶ yt-dlp ──▶ video.mp4 ──┬──▶ ffmpeg ──▶ frames/*.jpg
                               │
                               └──▶ ffmpeg ──▶ audio.mp3 ──┬──▶ Kyma /v1/audio/transcriptions
                                                           │    (Whisper Large v3 Turbo, 228× realtime)
                                                           │
                                                           └──▶ Kyma /v1/audio/understand
                                                                (Gemini 3 Flash audio — tone/music/SFX)
```
Each step is a primitive. None of them needs a vision LLM.
MIT. © 2026 Son Piaz.