Give an AI agent a local video timeline it can query.
Vidsight extracts speech, selected frame captions, and on-screen text into one SQLite store,
then exposes it over MCP.
Status: pre-release / in active development. The core pipeline, SQLite timeline, local provider stack, CLI, and MCP server are working today. Distribution is GitHub-only for now (not yet on PyPI). See
docs/DESIGN.md for the full architecture and roadmap.
Vidsight is built around a durable timeline instead of a one-off transcript dump:
- More than transcripts. Speech (ASR), selected frame captions, and on-screen text (OCR) are aligned on the same timeline, so an agent can retrieve what was said, shown, or written.
- Local-first. The core store and query path are local. URL ingest uses yt-dlp, and model weights may download on first use when you install the optional local stack.
- Lean by default. The default install is vidsight[mcp]. Heavy ASR/OCR/VLM dependencies are opt-in with --local or vidsight[local,mcp].
Input can be a local file or a URL. URLs require the local extra because they use yt-dlp.
Process once into a durable store, then search, zoom into time ranges, or assemble cited context.
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash
This is safe for an agent to run. The default install is lightweight: it installs vidsight[mcp],
creates a self-contained environment under ~/.vidsight/, and registers the MCP server for detected
agent environments. Supported targets today:
- Claude Code
- Codex
- OpenCode
Installer source: install.sh.
Manual target selection:
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --target claude --target codex
Install the local provider stack (ASR/OCR/VLM) up front:
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --local
From a local clone, the same options work:
git clone https://github.com/szepix/vidsight && cd vidsight
bash install.sh --target claude --local
Runtime files live under ~/.vidsight/: a dedicated venv/, model cache/, downloaded video
data/, and the vidsight.db timeline. Agent integrations are written to the relevant config
directories. For Claude Code that includes the vidsight skill and /watch slash command; for
Codex/OpenCode it includes MCP config and a matching skill.
Restart/reconnect your agent after install.
The default installer is intentionally small. Use --local only on machines that should process video
themselves.
| Mode | What gets installed | Typical first install | Disk before videos/model cache | Runtime expectations |
|---|---|---|---|---|
| default | Vidsight CLI + MCP server | about 1-3 minutes | hundreds of MB | light CPU/RAM; can query existing timelines |
| --local | default + yt-dlp, ASR, OCR, VLM stack, Torch | about 5-20 minutes | about 1.1 GB on a clean macOS test before model weights | CPU works; GPU/MPS helps vision; 8 GB RAM minimum, 16 GB recommended |
First video ingest can take longer than installation because it may download ASR/VLM model weights and
transcribe the whole video. Disk usage then includes the downloaded source video, extracted audio, the
SQLite timeline, and model caches. Speech-only ingest is the cheapest path; VLM captioning is the heaviest
stage. The default local ingest favors complete context; use --scan speech for a fast transcript-only pass.
Uninstall removes all of the above, including the venv/cache/data/db folder:
curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/uninstall.sh | bash
Uninstaller source: uninstall.sh.
To only unregister integrations while keeping ~/.vidsight, add --keep-data.
git clone https://github.com/szepix/vidsight && cd vidsight
pip install ".[local,mcp]" # or ".[mcp]" for a tiny install without the local model stack
Extras:
| Extra | Pulls in |
|---|---|
| local | local provider stack: yt-dlp, faster-whisper, transformers (<5), torch, accelerate, Pillow, rapidocr-onnxruntime |
| mcp | the MCP server (mcp[cli]) |
| dev | pytest, ruff, mypy |
(mlx Apple acceleration and cloud providers are on the roadmap, not yet shipped.)
System requirement: ffmpeg (brew install ffmpeg · apt install ffmpeg · winget install ffmpeg).
Python: 3.11+.
Windows note: the shell installer is for macOS/Linux/WSL. Native Windows should use the manual
pip install path and the MCP config shown below until a .ps1 installer exists.
After installing with --local, just talk to the agent (the bundled skill triggers on video questions):
- "What is this video about? https://www.youtube.com/watch?v=…"
- "Ingest ~/talks/keynote.mp4, then tell me every moment they show a code example and what the code says."
Or use the slash command in Claude Code: /watch <youtube-url | path | vid_id> [question].
With the lightweight default install, the MCP server can start and query an existing timeline, but ingesting
new videos needs a provider such as faster-whisper. Install the local stack with --local when you want
the agent to process videos on this machine.
When local providers are installed, agents should prefer the high-level watch tool/CLI command.
It ingests the video, collects metadata, stage status, segment counts, transcript text, timeline
samples, and cited retrieval context in one response. Lower-level tools remain available for
diagnostics and follow-up searches.
A plain ingest uses the auto scan preset. For the balanced and quality profiles, auto resolves to a
full scan; with profile="tiny" it resolves to screen, so the ingest never reports VLM coverage while the
profile has visual captions disabled. Agents should choose a cheaper scan when the request is narrow:
| Scan preset | Runs | Best for |
|---|---|---|
| auto / full | ASR + OCR + VLM | broad summaries, "what happens", reading visible text, mixed questions |
| screen | ASR + OCR | talks, slides, demos, trailers with important on-screen text |
| speech | ASR only | fast transcript-only summaries or long videos where the user wants a cheap first pass |
For long videos, batches, or unclear cost expectations, the agent should ask before running auto/full.
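
The preset rules above can be sketched as a small lookup. This is an illustration only: `resolve_scan`, `STAGES_BY_SCAN`, and the stage names are hypothetical and are not Vidsight's internal API; only the preset-to-stage mapping comes from the table.

```python
# Hypothetical helper: names are illustrative, not Vidsight's API.
STAGES_BY_SCAN = {
    "full": ("asr", "ocr", "vlm"),
    "screen": ("asr", "ocr"),
    "speech": ("asr",),
}


def resolve_scan(scan: str = "auto", profile: str = "balanced") -> tuple[str, ...]:
    """Map a scan preset and profile to the stages that would run."""
    if scan == "auto":
        # The tiny profile disables visual captions, so auto falls back to screen.
        scan = "screen" if profile == "tiny" else "full"
    if scan not in STAGES_BY_SCAN:
        raise ValueError(f"unknown scan preset: {scan!r}")
    return STAGES_BY_SCAN[scan]


print(resolve_scan())                # ('asr', 'ocr', 'vlm')
print(resolve_scan("auto", "tiny"))  # ('asr', 'ocr')
```

An agent could use the same mapping to estimate cost before asking the user whether a full scan is worth it.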
Optional dependencies. If an optional visual capability needs a package that is not installed, that
stage is skipped and reported in ingest warnings. The agent can then, with your consent, call
install_dependency(...) to install into the server's own venv. For a clean local setup, prefer reinstalling
with --local.
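
One way to picture the skip-and-report behaviour is a dependency check that runs before each optional stage. The sketch below is not Vidsight's code: `plan_optional_stages` and the stage-to-module mapping are assumptions made for illustration.

```python
import importlib.util

# Assumed mapping from optional stage to the importable module it needs.
STAGE_MODULES = {"ocr": "rapidocr_onnxruntime", "vlm": "transformers"}


def plan_optional_stages(requested, stage_modules=STAGE_MODULES):
    """Split requested optional stages into runnable ones and skip warnings."""
    runnable, warnings = [], []
    for stage in requested:
        module = stage_modules[stage]
        # find_spec checks availability without importing the heavy package.
        if importlib.util.find_spec(module) is None:
            warnings.append(f"{stage} skipped: missing dependency {module!r}")
        else:
            runnable.append(stage)
    return runnable, warnings


runnable, warnings = plan_optional_stages(["ocr", "vlm"])
print(runnable, warnings)
```

Because the check never raises, a missing package degrades the ingest instead of failing it, which matches the warning-based flow described above.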
For a manual MCP registration in another client, the default database is ~/.vidsight/vidsight.db:
{
  "mcpServers": {
    "vidsight": {
      "command": "vidsight",
      "args": ["mcp"]
    }
  }
}
Vidsight turns one video source into one searchable timeline:
| Stage | Output |
|---|---|
| Resolve + probe | Local file, metadata, stable vid_... id |
| Scene detection | Time ranges for visual sampling |
| Audio extraction → ASR | speech segments |
| Frame sampling → VLM | caption segments |
| Frame sampling → OCR | ocr segments |
| Embedding + storage | SQLite timeline with baseline text search |
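
As a rough illustration of how scene time ranges can drive visual sampling, the sketch below picks one timestamp per scene. Using the midpoint is an assumption made here; the README does not say how Vidsight picks the frame, and `representative_times` is a hypothetical name.

```python
def representative_times(scenes):
    """Return one sample timestamp (the midpoint) per (start, end) scene, in seconds."""
    # Zero-length or inverted ranges are ignored rather than sampled.
    return [round((start + end) / 2, 3) for start, end in scenes if end > start]


scenes = [(0.0, 12.5), (12.5, 40.0), (40.0, 41.0)]
print(representative_times(scenes))  # [6.25, 26.25, 40.5]
```

Three scenes produce three frames for captioning and OCR, however long the video is in seconds.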
- Ingest: resolve a local path or URL (yt-dlp) to a normalized file; hash it (sha256, so the video id is content-based).
- Probe & scene-detect: read duration/fps; find shot boundaries with ffmpeg-backed sampling.
- Scene-aware sampling: caption + OCR run on a representative frame per scene (not every second). This keeps even a slow CPU vision model cheap, so it works on modest hardware.
- Extract: audio goes to ASR; sampled frames go to the vision model (caption) and OCR (on-screen text).
- Fuse: every signal becomes a time-stamped row in one SQLite timeline (kind = speech | caption | ocr | audio_event), with dependency-free text vectors for baseline search. Provenance is recorded.
- Query: your agent searches, zooms into a time window, or asks a question.

Processing is idempotent and resumable: re-ingesting the same video skips finished stages, and an optional stage (VLM/OCR) that fails a dependency check is skipped, reported, and retried on the next ingest; it is never fatal.
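
A content-based id and resumable stages can be sketched as follows. The id length, the `done` status string, and both function names are assumptions; the README states only that the id is derived from a sha256 hash and starts with `vid_`.

```python
import hashlib


def content_video_id(path, prefix_len=12, chunk_size=1 << 20):
    """Hash the file in chunks and derive an id from the digest prefix."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # prefix_len is an illustrative choice, not Vidsight's actual id length.
    return "vid_" + digest.hexdigest()[:prefix_len]


def pending_stages(all_stages, stage_status):
    """Stages that still need to run: anything not recorded as 'done'."""
    return [s for s in all_stages if stage_status.get(s) != "done"]


# A skipped optional stage stays pending, so it is retried on the next ingest.
print(pending_stages(["asr", "ocr", "vlm"], {"asr": "done", "vlm": "skipped"}))
# ['ocr', 'vlm']
```

Because the id depends only on file bytes, re-ingesting the same file under a different name maps to the same timeline.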
Vidsight retrieves and grounds; your agent's own model does the reasoning. No chat LLM is bundled.
| Tool | What it does |
|---|---|
| watch(source, question?, profile?, scan?, providers?, k?) | One-call agent workflow: ingest, then return video metadata, segment counts, stage states, transcript text, timeline excerpt, warnings, and cited context for the question. |
| ingest(source, profile?, scan?, providers?) | Lower-level ingest into the timeline store; streams progress and returns {video_id, status, warnings}. scan="auto" chooses the local preset from the profile unless providers is supplied. |
| list_videos() | List processed videos and their status. |
| search(query, video_id?, kinds?, k?) | Baseline text-vector search across speech / captions / OCR; returns time-stamped hits. |
| get_timeline(video_id, t_start, t_end) | Everything in a time window, e.g. "what's on screen at 2:00–2:30". |
| get_transcript(video_id, t_start?, t_end?) | Plain transcript, optionally for a range. |
| ask(question, video_id?, kinds?, k?) | Retrieval-only: returns assembled, cited context for your agent to answer from. |
| get_segment(segment_id) | A single segment by id. |
| install_dependency(package) | Self-healing deps: install a package into the server's own venv, then reset its module cache so the new capability works on the next call, with no restart. Used (with user consent) when an optional stage reports a missing dependency. |
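
To make "assembled, cited context" concrete, here is a minimal sketch that renders time-stamped hits as numbered lines an agent can cite. The hit fields (`kind`, `t_start`, `t_end`, `text`), the output layout, and the sample hits are assumptions for illustration, not the actual `ask` response format.

```python
def fmt_time(seconds):
    """Format seconds as m:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def assemble_context(hits):
    """Render time-stamped hits as numbered, citable lines in timeline order."""
    lines = []
    for i, hit in enumerate(sorted(hits, key=lambda h: h["t_start"]), start=1):
        span = f"{fmt_time(hit['t_start'])}-{fmt_time(hit['t_end'])}"
        lines.append(f"[{i}] {span} ({hit['kind']}): {hit['text']}")
    return "\n".join(lines)


# Made-up sample hits, in the order a search might return them.
hits = [
    {"kind": "ocr", "t_start": 125.0, "t_end": 131.0, "text": "Pricing overview"},
    {"kind": "speech", "t_start": 120.4, "t_end": 127.9, "text": "Let's look at pricing."},
]
print(assemble_context(hits))
# [1] 2:00-2:07 (speech): Let's look at pricing.
# [2] 2:05-2:11 (ocr): Pricing overview
```

The agent's own model then answers from this text and can point back to `[1]` or `[2]` with timestamps.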
Choose scan presets per ingest with scan, or choose exact backends with providers, e.g.
providers={"asr": "faster-whisper", "vlm": "moondream2", "ocr": "rapidocr"}.
CLI (bash / macOS / Linux / WSL)
vidsight watch "https://www.youtube.com/watch?v=..." --question "summarize this video"
Lower-level CLI:
VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4)
vidsight search "pricing slide" --video "$VIDEO_ID"
vidsight timeline "$VIDEO_ID" 120 150
vidsight transcript "$VIDEO_ID" > keynote.txt
vidsight providers # show registered backends
Fast transcript-only pass:
VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4 --scan speech)
CLI (PowerShell)
$videoId = vidsight ingest "$HOME\Videos\keynote.mp4"
vidsight search "pricing slide" --video $videoId
vidsight timeline $videoId 120 150
vidsight transcript $videoId | Out-File keynote.txt
vidsight providers
Agent prompts (once the MCP server is connected)
- "Summarize this lecture and cite the timestamps for each main point."
- "Find where the speaker demos the dashboard and read the text shown on screen."
- "What product names appear visually in the first 5 minutes?"
Layered: a pure Python core engine with thin CLI and MCP frontends — so the engine is testable in isolation and reusable.
core/ ingest · probe · scenes · audio · frames · pipeline · store · query · paths · config
providers/ asr · vlm · ocr · embed (swappable backends, discovered via entry points)
cli/ command-line frontend
mcp/ FastMCP server (thin wrapper over core query + ingest, runs ingest off the event loop)
The timeline is the unifying model: every extracted signal is a time-stamped segment row, so search and
retrieval are uniform regardless of which provider produced it. Heavy/optional dependencies are imported
lazily, so the base install stays tiny and import vidsight pulls in no model framework.
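
The "one row per signal" model can be sketched with the standard-library `sqlite3` module. The table and column names below are hypothetical and the rows are made-up sample data; Vidsight's real schema is defined in its store module and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per extracted signal, whatever produced it.
conn.execute(
    """CREATE TABLE segments (
        id INTEGER PRIMARY KEY,
        video_id TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('speech', 'caption', 'ocr', 'audio_event')),
        t_start REAL NOT NULL,
        t_end REAL NOT NULL,
        text TEXT NOT NULL,
        provider TEXT
    )"""
)
sample_rows = [
    ("vid_demo", "speech", 118.0, 124.0, "Here is the pricing slide.", "faster-whisper"),
    ("vid_demo", "ocr", 121.0, 129.0, "Pricing overview", "rapidocr"),
    ("vid_demo", "caption", 140.0, 150.0, "A dashboard with charts.", "moondream2"),
    ("vid_demo", "speech", 200.0, 210.0, "Thanks for watching.", "faster-whisper"),
]
conn.executemany(
    "INSERT INTO segments (video_id, kind, t_start, t_end, text, provider) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    sample_rows,
)


def window(conn, video_id, t_start, t_end):
    """All segments overlapping [t_start, t_end], regardless of kind."""
    return conn.execute(
        "SELECT kind, t_start, t_end, text FROM segments "
        "WHERE video_id = ? AND t_start < ? AND t_end > ? ORDER BY t_start",
        (video_id, t_end, t_start),
    ).fetchall()


for row in window(conn, "vid_demo", 120, 150):
    print(row)
```

A single overlap query returns speech, OCR, and captions together, which is what lets a time-window lookup stay uniform across providers.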
Providers are pluggable behind ABCs; third parties can ship a provider package with Python entry points
without forking the core. The local extra installs the current local model provider set.
Cloud/BYOK providers are not implemented yet. Today there is no built-in support for OpenAI,
Anthropic, Groq, or other API keys. The future cloud extra is reserved for bring-your-own-key
providers.
| Step | Default (local) | Notes |
|---|---|---|
| Speech (ASR) | faster-whisper base (CTranslate2) | language auto-detected |
| Vision (caption) | moondream2 (transformers) | chooses CUDA, Apple MPS, or CPU; weights download lazily |
| On-screen text (OCR) | RapidOCR (ONNX) | |
| Embeddings | SimpleTextEmbedding (dependency-free) | lexical baseline; model embedders can be added later |
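
For intuition about what a dependency-free lexical baseline can look like, here is a bag-of-words cosine search. It is a stand-in written for this README, not the SimpleTextEmbedding implementation, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Bag-of-words term counts as a sparse vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, segments, k=3):
    """Return up to k segments with a non-zero lexical overlap, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(s["text"])), s) for s in segments]
    scored = [(score, s) for score, s in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


segments = [
    {"kind": "ocr", "text": "Pricing slide for the pro plan"},
    {"kind": "speech", "text": "Thanks for watching"},
    {"kind": "caption", "text": "The dashboard demo"},
]
print(search("pricing slide", segments))
```

A lexical baseline only matches shared words, which is why the table notes that model embedders can be added later for semantic matches.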
Profiles: tiny (CPU, vision off) · balanced (default, vision on) · quality (larger ASR, more frames).
Test coverage currently runs on macOS in development. Linux/macOS/Windows CI is planned.
Note: on Apple Silicon the vision model runs on MPS; the MCP server is registered with
PYTORCH_ENABLE_MPS_FALLBACK=1 so any op not yet implemented on MPS falls back to CPU instead of erroring.
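
Device selection with the MPS fallback enabled might be sketched like this. The CUDA, then MPS, then CPU preference order and the `pick_device` name are assumptions; the README states only that one of the three is chosen.

```python
import os

# Set early, before torch is imported, as the conservative choice.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device():
    """Prefer CUDA, then Apple MPS, then CPU; use CPU if torch is not installed."""
    try:
        import torch  # imported lazily so this module loads without the local stack
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Importing torch inside the function keeps the sketch usable on a default install, where the local stack is absent.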
- Core pipeline + SQLite timeline + local provider stack
- MCP server + CLI
- Agent installer for Claude Code, Codex, and OpenCode
- Claude Code skill and /watch command
- Native PowerShell installer for Windows
- Speaker diarization (who said what)
- Audio-event tagging beyond speech (music / applause / …)
- cloud and mlx provider extras
- Batch mode for many videos
- Timeline browser (web UI)
Issues and PRs welcome. Adding a provider (a new ASR/vision/OCR/embedding backend) is the easiest first
contribution — implement the interface and register an entry point; no core fork needed. See
docs/DESIGN.md.
Source-available under the PolyForm Noncommercial License 1.0.0. Noncommercial use, modification, and sharing are allowed with preserved author credit. Commercial use is not granted by the repo license; it requires a separate paid written license or agreement with Daniel Szepietowski (szepix) that grants consent and defines compensation.