GitHub - szepix/vidsight: Local-first video timelines for AI agents: transcribe, read on-screen text, and query videos over MCP.

Give an AI agent a local video timeline it can query.
Vidsight extracts speech, selected frame captions, and on-screen text into one SQLite store, then exposes it over MCP.

Status: pre-release / in active development. The core pipeline, SQLite timeline, local provider stack, CLI, and MCP server are working today. Distribution is GitHub-only for now (not yet on PyPI). See docs/DESIGN.md for the full architecture and roadmap.

Why Vidsight

Vidsight is built around a durable timeline instead of a one-off transcript dump:

More than transcripts. Speech (ASR), selected frame captions, and on-screen text (OCR) are aligned on the same timeline, so an agent can retrieve what was said, shown, or written.
Local-first. The core store and query path are local. URL ingest uses yt-dlp, and model weights may download on first use when you install the optional local stack.
Lean by default. The default install is vidsight[mcp]. Heavy ASR/OCR/VLM dependencies are opt-in with --local or vidsight[local,mcp].

Input can be a local file or a URL. URLs require the local extra because they use yt-dlp. Process once into a durable store, then search, zoom into time ranges, or assemble cited context.

Install

Option A — agent installer (recommended)

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash

This is safe for an agent to run. The default install is lightweight: it installs vidsight[mcp], creates a self-contained environment under ~/.vidsight/, and registers the MCP server for detected agent environments. Supported targets today:

Claude Code
Codex
OpenCode

Installer source: install.sh.

Manual target selection:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --target claude --target codex

Install the local provider stack (ASR/OCR/VLM) up front:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --local

From a local clone, the same options work:

git clone https://github.com/szepix/vidsight && cd vidsight
bash install.sh --target claude --local

Runtime files live under ~/.vidsight/: a dedicated venv/, model cache/, downloaded video data/, and the vidsight.db timeline. Agent integrations are written to the relevant config directories. For Claude Code that includes the vidsight skill and /watch slash command; for Codex/OpenCode it includes MCP config and a matching skill.

Restart/reconnect your agent after install.

Install size & requirements

The default installer is intentionally small. Use --local only on machines that should process video themselves.

Mode	What gets installed	Typical first install	Disk before videos/model cache	Runtime expectations
default	Vidsight CLI + MCP server	about 1-3 minutes	hundreds of MB	light CPU/RAM; can query existing timelines
`--local`	default + `yt-dlp`, ASR, OCR, VLM stack, Torch	about 5-20 minutes	about 1.1 GB on a clean macOS test before model weights	CPU works; GPU/MPS helps vision; 8 GB RAM minimum, 16 GB recommended

First video ingest can take longer than installation because it may download ASR/VLM model weights and transcribe the whole video. Disk usage then includes the downloaded source video, extracted audio, the SQLite timeline, and model caches. Speech-only ingest is the cheapest path; VLM captioning is the heaviest stage. The default local ingest favors complete context; use --scan speech for a fast transcript-only pass.

Uninstall removes all of the above, including the venv/cache/data/db folder:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/uninstall.sh | bash

Uninstaller source: uninstall.sh. To only unregister integrations while keeping ~/.vidsight, add --keep-data.

Option B — manual (any MCP client / CLI use)

git clone https://github.com/szepix/vidsight && cd vidsight
pip install ".[local,mcp]"     # or ".[mcp]" for a tiny install without the local model stack

Extras:

Extra	Pulls in
`local`	local provider stack: `yt-dlp`, `faster-whisper`, `transformers` (<5), `torch`, `accelerate`, `Pillow`, `rapidocr-onnxruntime`
`mcp`	the MCP server (`mcp[cli]`)
`dev`	`pytest`, `ruff`, `mypy`

(mlx Apple acceleration and cloud providers are on the roadmap, not yet shipped.)

System requirement: ffmpeg (brew install ffmpeg · apt install ffmpeg · winget install ffmpeg). Python: 3.11+.

Windows note: the shell installer is for macOS/Linux/WSL. Native Windows should use the manual pip install path and the MCP config shown below until a .ps1 installer exists.

Use With Your Agent

After installing with --local, just talk to the agent (the bundled skill triggers on video questions):

"What is this video about? https://www.youtube.com/watch?v=…" "Ingest ~/talks/keynote.mp4, then tell me every moment they show a code example and what the code says."

Or use the slash command in Claude Code: /watch <youtube-url | path | vid_id> [question].

With the lightweight default install, the MCP server can start and query an existing timeline, but ingesting new videos needs a provider such as faster-whisper. Install the local stack with --local when you want the agent to process videos on this machine.

When local providers are installed, agents should prefer the high-level watch tool/CLI command. It ingests the video, collects metadata, stage status, segment counts, transcript text, timeline samples, and cited retrieval context in one response. Lower-level tools remain available for diagnostics and follow-up searches.

A plain ingest uses the auto scan preset. auto means a full scan for normal profiles, but with profile="tiny" it resolves to screen so it does not silently claim VLM while the profile disables visual captions. Agents should choose a cheaper scan when the request is narrow:

Scan preset	Runs	Best for
`auto` / `full`	ASR + OCR + VLM	broad summaries, "what happens", reading visible text, mixed questions
`screen`	ASR + OCR	talks, slides, demos, trailers with important on-screen text
`speech`	ASR only	fast transcript-only summaries or long videos where the user wants a cheap first pass

For long videos, batches, or unclear cost expectations, the agent should ask before running auto/full.

Optional dependencies. If an optional visual capability needs a package that is not installed, that stage is skipped and reported in ingest warnings. The agent can then, with your consent, call install_dependency(...) to install into the server's own venv. For a clean local setup, prefer reinstalling with --local.

For a manual MCP registration in another client, the default database is ~/.vidsight/vidsight.db:

{
  "mcpServers": {
    "vidsight": {
      "command": "vidsight",
      "args": ["mcp"]
    }
  }
}

How It Works

Vidsight turns one video source into one searchable timeline:

Stage	Output
Resolve + probe	Local file, metadata, stable `vid_...` id
Scene detection	Time ranges for visual sampling
Audio extraction → ASR	`speech` segments
Frame sampling → VLM	`caption` segments
Frame sampling → OCR	`ocr` segments
Embedding + storage	SQLite timeline with baseline text search

Ingest — resolve a local path or URL (yt-dlp) to a normalized file; hash it (sha256, so the video id is content-based).
Probe & scene-detect — read duration/fps; find shot boundaries with ffmpeg-backed sampling.
Scene-aware sampling — caption + OCR run on a representative frame per scene (not every second). This keeps even a slow CPU vision model cheap, so it works on modest hardware.
Extract — audio goes to ASR; sampled frames go to the vision model (caption) and OCR (on-screen text).
Fuse — every signal becomes a time-stamped row in one SQLite timeline (kind = speech | caption | ocr | audio_event), with dependency-free text vectors for baseline search. Provenance is recorded.
Query — your agent searches, zooms into a time window, or asks a question. Processing is idempotent and resumable: re-ingesting the same video skips finished stages, and an optional stage (VLM/OCR) that fails a dependency check is skipped, reported, and retried on the next ingest — never fatal.

Vidsight retrieves and grounds; your agent's own model does the reasoning. No chat LLM is bundled.

Tools (MCP)

Tool	What it does
`watch(source, question?, profile?, scan?, providers?, k?)`	One-call agent workflow: ingest, then return video metadata, segment counts, stage states, transcript text, timeline excerpt, warnings, and cited context for the question.
`ingest(source, profile?, scan?, providers?)`	Lower-level ingest into the timeline store; streams progress and returns `{video_id, status, warnings}`. `scan="auto"` chooses the local preset from the profile unless `providers` is supplied.
`list_videos()`	List processed videos and their status.
`search(query, video_id?, kinds?, k?)`	Baseline text-vector search across speech / captions / OCR; returns time-stamped hits.
`get_timeline(video_id, t_start, t_end)`	Everything in a time window — "what's on screen at 2:00–2:30".
`get_transcript(video_id, t_start?, t_end?)`	Plain transcript, optionally for a range.
`ask(question, video_id?, kinds?, k?)`	Retrieval-only: returns assembled, cited context for your agent to answer from.
`get_segment(segment_id)`	A single segment by id.
`install_dependency(package)`	Self-healing deps: install a package into the server's own venv, then reset its module cache so the new capability works on the next call — no restart. Used (with user consent) when an optional stage reports a missing dependency.

Choose scan presets per ingest with scan, or choose exact backends with providers, e.g. providers={"asr": "faster-whisper", "vlm": "moondream2", "ocr": "rapidocr"}.

Examples

CLI (bash / macOS / Linux / WSL)

vidsight watch "https://www.youtube.com/watch?v=..." --question "summarize this video"

Lower-level CLI:

VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4)
vidsight search "pricing slide" --video "$VIDEO_ID"
vidsight timeline "$VIDEO_ID" 120 150
vidsight transcript "$VIDEO_ID" > keynote.txt
vidsight providers                             # show registered backends

Fast transcript-only pass:

VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4 --scan speech)

CLI (PowerShell)

$videoId = vidsight ingest "$HOME\Videos\keynote.mp4"
vidsight search "pricing slide" --video $videoId
vidsight timeline $videoId 120 150
vidsight transcript $videoId | Out-File keynote.txt
vidsight providers

Agent prompts (once the MCP server is connected)

"Summarize this lecture and cite the timestamps for each main point."
"Find where the speaker demos the dashboard and read the text shown on screen."
"What product names appear visually in the first 5 minutes?"

Architecture

Layered: a pure Python core engine with thin CLI and MCP frontends — so the engine is testable in isolation and reusable.

core/      ingest · probe · scenes · audio · frames · pipeline · store · query · paths · config
providers/ asr · vlm · ocr · embed     (swappable backends, discovered via entry points)
cli/       command-line frontend
mcp/       FastMCP server (thin wrapper over core query + ingest, runs ingest off the event loop)

The timeline is the unifying model: every extracted signal is a time-stamped segment row, so search and retrieval are uniform regardless of which provider produced it. Heavy/optional dependencies are imported lazily, so the base install stays tiny and import vidsight pulls in no model framework.

Compatibility & Providers

Providers are pluggable behind ABCs; third parties can ship a provider package with Python entry points without forking the core. The local extra installs the current local model provider set.

Cloud/BYOK providers are not implemented yet. Today there is no built-in support for OpenAI, Anthropic, Groq, or other API keys. The future cloud extra is reserved for bring-your-own-key providers.

Step	Default (local)	Notes
Speech (ASR)	faster-whisper `base` (CTranslate2)	language auto-detected
Vision (caption)	moondream2 (transformers)	chooses CUDA, Apple MPS, or CPU; weights download lazily
On-screen text (OCR)	RapidOCR (ONNX)
Embeddings	`SimpleTextEmbedding` (dependency-free)	lexical baseline; model embedders can be added later

Profiles: tiny (CPU, vision off) · balanced (default, vision on) · quality (larger ASR, more frames). Test coverage currently runs on macOS in development. Linux/macOS/Windows CI is planned.

Note: on Apple Silicon the vision model runs on MPS; the MCP server is registered with PYTORCH_ENABLE_MPS_FALLBACK=1 so any op not yet implemented on MPS falls back to CPU instead of erroring.

Roadmap

Contributing

Issues and PRs welcome. Adding a provider (a new ASR/vision/OCR/embedding backend) is the easiest first contribution — implement the interface and register an entry point; no core fork needed. See docs/DESIGN.md.

License

Source-available under the PolyForm Noncommercial License 1.0.0. Noncommercial use, modification, and sharing are allowed with preserved author credit. Commercial use is not granted by the repo license; it requires a separate paid written license or agreement with Daniel Szepietowski (szepix) that grants consent and defines compensation.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
docs/assets		docs/assets
integrations		integrations
src/vidsight		src/vidsight
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
install.sh		install.sh
pyproject.toml		pyproject.toml
uninstall.sh		uninstall.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Why Vidsight

Install

Option A — agent installer (recommended)

Install size & requirements

Option B — manual (any MCP client / CLI use)

Use With Your Agent

How It Works

Tools (MCP)

Examples

Architecture

Compatibility & Providers

Roadmap

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Why Vidsight

Install

Option A — agent installer (recommended)

Install size & requirements

Option B — manual (any MCP client / CLI use)

Use With Your Agent

How It Works

Tools (MCP)

Examples

Architecture

Compatibility & Providers

Roadmap

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages