watch-cli

Give your AI agent eyes and ears for any social video.

watch https://twitter.com/anyone/status/12345

You get back: a video file, evenly-spaced frames as JPGs, and the full audio transcript. Your agent reads them and "watches" the video. Works on YouTube, X, LinkedIn, TikTok, Reddit, Vimeo, and Facebook; login-walled posts (LinkedIn, X, FB) fall back to your browser cookies automatically.


Why this exists

Large language models can't watch video natively — they read text and look at still images. Modern multimodal APIs will analyze a full video for you, but they're slow and expensive. The trick: you almost never need them.

A video is just frames + audio. Each piece has a fast, near-free tool already:

  • yt-dlp downloads from any social platform
  • ffmpeg extracts evenly-spaced frames
  • An ASR model transcribes the audio
  • A multimodal LLM hears tone, music, SFX, language, mood

Compose them and your agent has video understanding — without burning a multimodal LLM on every frame.
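Here is a minimal hand-rolled sketch of that composition, using the same primitives watch-cli wraps. Paths and the 30-second frame interval are illustrative; watch-cli spaces a fixed frame count evenly across the duration instead:

# download the video, remuxed to mp4 so the path below is predictable
yt-dlp --remux-video mp4 -o "video.%(ext)s" "<url>"
# one JPG every 30 seconds, high quality
ffmpeg -i video.mp4 -vf "fps=1/30" -q:v 2 frames/frame_%02d.jpg
# drop the video stream, keep the audio for the ASR model
ffmpeg -i video.mp4 -vn -q:a 2 audio.mp3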


What it looks like

$ watch https://www.linkedin.com/posts/some-talk_activity-12345

VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
  /tmp/frames_abc123/frame_01.jpg
  /tmp/frames_abc123/frame_02.jpg
  …
TRANSCRIPT:
  Today I want to talk about how decomposition unlocks 10× cost reduction in
  multimodal pipelines …

Your agent reads the JPGs and the transcript. That's the whole watch.


Benchmark

After noticing how much we burned on multimodal calls, we measured:

Approach                       Cost / 1-hour video   Time
Multimodal LLM on full video   ~$5                   30–60s
watch-cli (Kyma audio)         < $0.10               ~10–15s
Ratio                          ~50× cheaper          ~5× faster

The savings compound when an agent watches dozens of videos in a session: at the rates above, 30 videos cost roughly $3 instead of $150.


Install

curl -fsSL https://raw.githubusercontent.com/sonpiaz/watch-cli/main/install.sh | bash

Or from a clone:

git clone https://github.com/sonpiaz/watch-cli ~/.watch-cli
cd ~/.watch-cli && ./install.sh

The installer checks for yt-dlp, ffmpeg, jq, curl, and python3, then symlinks the commands into ~/.local/bin. On macOS:

brew install yt-dlp ffmpeg jq

On Debian/Ubuntu:

sudo apt install yt-dlp ffmpeg jq python3 curl
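If the commands aren't found after install, ~/.local/bin is likely missing from your PATH (a shell default, not a watch-cli quirk):

export PATH="$HOME/.local/bin:$PATH"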

Setup

export KYMA_API_KEY=kyma-xxxxxxxx

watch-cli runs on Kyma for the audio side — one key that opens every gate (text, image, video, audio, and what comes next). Free credit at signup covers hundreds of videos.

Get a Kyma key at kymaapi.com.

Prefer to bring your own keys? Uncomment GROQ_API_KEY and GOOGLE_AI_KEY in .env.example and watch-cli falls back to direct provider calls.
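A hypothetical .env for that fallback mode (variable names as in .env.example; the values and comments here are placeholders):

KYMA_API_KEY=kyma-xxxxxxxx      # default path: Kyma handles all audio calls
# GROQ_API_KEY=gsk_xxxxxxxx     # uncomment for direct transcription calls
# GOOGLE_AI_KEY=AIzaxxxxxxxx    # uncomment for direct audio Q&A calls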


Commands

watch <url> [frame-count] [--cookies <file>]
  Orchestrator. Downloads, extracts frames, transcribes — one block out.

dl-video <url> [out-dir] [--cookies <file>]
  Just download the video. Returns the local mp4 path.

extract-frames <video> [count] [out-dir]
  Pull N evenly-spaced JPG frames. Default 8.

transcribe <audio-or-video> [language]
  Speech-to-text. Auto-extracts audio from video first.

audio-q <audio-or-video> "<question>"
  Audio scene Q&A — tone, music, SFX, language, emotion.
  Beyond pure transcription.

models [--all]
  List audio models available on Kyma (live, no hardcoded list).
  --all to see every Kyma SKU (text + image + video + audio).
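An illustrative session composing the primitives instead of the orchestrator (paths are made up; arguments follow the signatures above):

dl-video <url> /tmp/dl-video
extract-frames /tmp/dl-video/abc123.mp4 12
transcribe /tmp/dl-video/abc123.mp4 en
audio-q /tmp/dl-video/abc123.mp4 "Does the speaker sound confident?"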

How transcribe and audio-q stay current

The scripts call Kyma using the transcribe and audio-understand aliases, not raw model IDs. When Kyma swaps the underlying model (Whisper v4, Voxtral, a faster ASR), watch-cli keeps working without an update — the alias points to whichever model is current. Run watch-cli models any time to see what's behind the alias today.
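For reference, a hedged sketch of what such an alias call looks like over HTTP. The endpoint path appears in the pipeline diagram under "How it works"; the host and the multipart shape are assumptions patterned after the common transcription-API convention, not documented Kyma API:

curl -s https://<kyma-host>/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F model=transcribe \
  -F file=@audio.mp3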


Login-walled videos

Most YouTube / TikTok / Reddit / Vimeo / public X videos work without setup. LinkedIn, private X posts, and Facebook need a session.

watch-cli auto-detects cookies from any signed-in browser (Chrome → Firefox → Safari → Edge → Brave → Chromium). Just sign in normally and re-run.

For servers / CI without browsers, pass a manual cookies file:

watch <url> --cookies ~/cookies.txt
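If you need to produce that file, yt-dlp itself can export a signed-in browser's jar in the Netscape format the flag expects:

yt-dlp --cookies-from-browser chrome --cookies ~/cookies.txt --skip-download <url>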

Full setup walkthrough: docs/cookies.md.


Use with Claude Code (or any agent)

You have access to a `watch` command that takes a URL and returns
a video, 8 frames, and the transcript. Read the frames as images and
the transcript as text — that's enough to "watch" any social video.

The output block is structured so an agent can parse it without help: a VIDEO: line, a FRAMES: block (one path per line), and a TRANSCRIPT: block.
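A minimal parsing sketch in awk (the label names match the block format above; the printed shape is illustrative):

watch <url> | awk '
  /^VIDEO:/      { print "video", $2 }
  /^FRAMES:/     { mode = "frames"; next }
  /^TRANSCRIPT:/ { mode = "transcript"; next }
  mode == "frames" && NF { print "frame", $1 }
'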


How it works

URL ──▶ yt-dlp ──▶ video.mp4 ──┬──▶ ffmpeg ──▶ frames/*.jpg
                                │
                                └──▶ ffmpeg ──▶ audio.mp3 ──┬──▶ Kyma /v1/audio/transcriptions
                                                            │     (Whisper Large v3 Turbo, 228× realtime)
                                                            │
                                                            └──▶ Kyma /v1/audio/understand
                                                                  (Gemini 3 Flash audio — tone/music/SFX)

Each step is a primitive. None of them needs a vision LLM.


License

MIT. © 2026 Son Piaz.
