Skip to content

szepix/vidsight

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

48 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Vidsight — local-first video understanding for AI agents

Give an AI agent a local video timeline it can query.
Vidsight extracts speech, selected frame captions, and on-screen text into one SQLite store, then exposes it over MCP.

License: PolyForm Noncommercial 1.0.0 MCP Local-first Python Status

Status: pre-release / in active development. The core pipeline, SQLite timeline, local provider stack, CLI, and MCP server are working today. Distribution is GitHub-only for now (not yet on PyPI). See docs/DESIGN.md for the full architecture and roadmap.


Why Vidsight

Vidsight is built around a durable timeline instead of a one-off transcript dump:

  • More than transcripts. Speech (ASR), selected frame captions, and on-screen text (OCR) are aligned on the same timeline, so an agent can retrieve what was said, shown, or written.
  • Local-first. The core store and query path are local. URL ingest uses yt-dlp, and model weights may download on first use when you install the optional local stack.
  • Lean by default. The default install is vidsight[mcp]. Heavy ASR/OCR/VLM dependencies are opt-in with --local or vidsight[local,mcp].

Input can be a local file or a URL. URLs require the local extra because they use yt-dlp. Process once into a durable store, then search, zoom into time ranges, or assemble cited context.

Install

Option A — agent installer (recommended)

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash

This is safe for an agent to run. The default install is lightweight: it installs vidsight[mcp], creates a self-contained environment under ~/.vidsight/, and registers the MCP server for detected agent environments. Supported targets today:

  • Claude Code
  • Codex
  • OpenCode

Installer source: install.sh.

Manual target selection:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --target claude --target codex

Install the local provider stack (ASR/OCR/VLM) up front:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/install.sh | bash -s -- --local

From a local clone, the same options work:

git clone https://github.com/szepix/vidsight && cd vidsight
bash install.sh --target claude --local

Runtime files live under ~/.vidsight/: a dedicated venv/, model cache/, downloaded video data/, and the vidsight.db timeline. Agent integrations are written to the relevant config directories. For Claude Code that includes the vidsight skill and /watch slash command; for Codex/OpenCode it includes MCP config and a matching skill.

Restart/reconnect your agent after install.

Install size & requirements

The default installer is intentionally small. Use --local only on machines that should process video themselves.

Mode What gets installed Typical first install Disk before videos/model cache Runtime expectations
default Vidsight CLI + MCP server about 1-3 minutes hundreds of MB light CPU/RAM; can query existing timelines
--local default + yt-dlp, ASR, OCR, VLM stack, Torch about 5-20 minutes about 1.1 GB on a clean macOS test before model weights CPU works; GPU/MPS helps vision; 8 GB RAM minimum, 16 GB recommended

First video ingest can take longer than installation because it may download ASR/VLM model weights and transcribe the whole video. Disk usage then includes the downloaded source video, extracted audio, the SQLite timeline, and model caches. Speech-only ingest is the cheapest path; VLM captioning is the heaviest stage. The default local ingest favors complete context; use --scan speech for a fast transcript-only pass.

Uninstall removes all of the above, including the venv/cache/data/db folder:

curl -fsSL https://raw.githubusercontent.com/szepix/vidsight/main/uninstall.sh | bash

Uninstaller source: uninstall.sh. To only unregister integrations while keeping ~/.vidsight, add --keep-data.

Option B — manual (any MCP client / CLI use)

git clone https://github.com/szepix/vidsight && cd vidsight
pip install ".[local,mcp]"     # or ".[mcp]" for a tiny install without the local model stack

Extras:

Extra Pulls in
local local provider stack: yt-dlp, faster-whisper, transformers (<5), torch, accelerate, Pillow, rapidocr-onnxruntime
mcp the MCP server (mcp[cli])
dev pytest, ruff, mypy

(mlx Apple acceleration and cloud providers are on the roadmap, not yet shipped.)

System requirement: ffmpeg (brew install ffmpeg · apt install ffmpeg · winget install ffmpeg). Python: 3.11+.

Windows note: the shell installer is for macOS/Linux/WSL. Native Windows should use the manual pip install path and the MCP config shown below until a .ps1 installer exists.

Use With Your Agent

After installing with --local, just talk to the agent (the bundled skill triggers on video questions):

"What is this video about? https://www.youtube.com/watch?v=…" "Ingest ~/talks/keynote.mp4, then tell me every moment they show a code example and what the code says."

Or use the slash command in Claude Code: /watch <youtube-url | path | vid_id> [question].

With the lightweight default install, the MCP server can start and query an existing timeline, but ingesting new videos needs a provider such as faster-whisper. Install the local stack with --local when you want the agent to process videos on this machine.

When local providers are installed, agents should prefer the high-level watch tool/CLI command. It ingests the video, collects metadata, stage status, segment counts, transcript text, timeline samples, and cited retrieval context in one response. Lower-level tools remain available for diagnostics and follow-up searches.

A plain ingest uses the auto scan preset. auto means a full scan for normal profiles, but with profile="tiny" it resolves to screen so it does not silently claim VLM while the profile disables visual captions. Agents should choose a cheaper scan when the request is narrow:

Scan preset Runs Best for
auto / full ASR + OCR + VLM broad summaries, "what happens", reading visible text, mixed questions
screen ASR + OCR talks, slides, demos, trailers with important on-screen text
speech ASR only fast transcript-only summaries or long videos where the user wants a cheap first pass

For long videos, batches, or unclear cost expectations, the agent should ask before running auto/full.

Optional dependencies. If an optional visual capability needs a package that is not installed, that stage is skipped and reported in ingest warnings. The agent can then, with your consent, call install_dependency(...) to install into the server's own venv. For a clean local setup, prefer reinstalling with --local.

For a manual MCP registration in another client, the default database is ~/.vidsight/vidsight.db:

{
  "mcpServers": {
    "vidsight": {
      "command": "vidsight",
      "args": ["mcp"]
    }
  }
}

How It Works

Vidsight turns one video source into one searchable timeline:

Stage Output
Resolve + probe Local file, metadata, stable vid_... id
Scene detection Time ranges for visual sampling
Audio extraction → ASR speech segments
Frame sampling → VLM caption segments
Frame sampling → OCR ocr segments
Embedding + storage SQLite timeline with baseline text search
  1. Ingest — resolve a local path or URL (yt-dlp) to a normalized file; hash it (sha256, so the video id is content-based).
  2. Probe & scene-detect — read duration/fps; find shot boundaries with ffmpeg-backed sampling.
  3. Scene-aware sampling — caption + OCR run on a representative frame per scene (not every second). This keeps even a slow CPU vision model cheap, so it works on modest hardware.
  4. Extract — audio goes to ASR; sampled frames go to the vision model (caption) and OCR (on-screen text).
  5. Fuse — every signal becomes a time-stamped row in one SQLite timeline (kind = speech | caption | ocr | audio_event), with dependency-free text vectors for baseline search. Provenance is recorded.
  6. Query — your agent searches, zooms into a time window, or asks a question. Processing is idempotent and resumable: re-ingesting the same video skips finished stages, and an optional stage (VLM/OCR) that fails a dependency check is skipped, reported, and retried on the next ingest — never fatal.

Vidsight retrieves and grounds; your agent's own model does the reasoning. No chat LLM is bundled.

Tools (MCP)

Tool What it does
watch(source, question?, profile?, scan?, providers?, k?) One-call agent workflow: ingest, then return video metadata, segment counts, stage states, transcript text, timeline excerpt, warnings, and cited context for the question.
ingest(source, profile?, scan?, providers?) Lower-level ingest into the timeline store; streams progress and returns {video_id, status, warnings}. scan="auto" chooses the local preset from the profile unless providers is supplied.
list_videos() List processed videos and their status.
search(query, video_id?, kinds?, k?) Baseline text-vector search across speech / captions / OCR; returns time-stamped hits.
get_timeline(video_id, t_start, t_end) Everything in a time window — "what's on screen at 2:00–2:30".
get_transcript(video_id, t_start?, t_end?) Plain transcript, optionally for a range.
ask(question, video_id?, kinds?, k?) Retrieval-only: returns assembled, cited context for your agent to answer from.
get_segment(segment_id) A single segment by id.
install_dependency(package) Self-healing deps: install a package into the server's own venv, then reset its module cache so the new capability works on the next call — no restart. Used (with user consent) when an optional stage reports a missing dependency.

Choose scan presets per ingest with scan, or choose exact backends with providers, e.g. providers={"asr": "faster-whisper", "vlm": "moondream2", "ocr": "rapidocr"}.

Examples

CLI (bash / macOS / Linux / WSL)

vidsight watch "https://www.youtube.com/watch?v=..." --question "summarize this video"

Lower-level CLI:

VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4)
vidsight search "pricing slide" --video "$VIDEO_ID"
vidsight timeline "$VIDEO_ID" 120 150
vidsight transcript "$VIDEO_ID" > keynote.txt
vidsight providers                             # show registered backends

Fast transcript-only pass:

VIDEO_ID=$(vidsight ingest ~/talks/keynote.mp4 --scan speech)

CLI (PowerShell)

$videoId = vidsight ingest "$HOME\Videos\keynote.mp4"
vidsight search "pricing slide" --video $videoId
vidsight timeline $videoId 120 150
vidsight transcript $videoId | Out-File keynote.txt
vidsight providers

Agent prompts (once the MCP server is connected)

  • "Summarize this lecture and cite the timestamps for each main point."
  • "Find where the speaker demos the dashboard and read the text shown on screen."
  • "What product names appear visually in the first 5 minutes?"

Architecture

Layered: a pure Python core engine with thin CLI and MCP frontends — so the engine is testable in isolation and reusable.

core/      ingest · probe · scenes · audio · frames · pipeline · store · query · paths · config
providers/ asr · vlm · ocr · embed     (swappable backends, discovered via entry points)
cli/       command-line frontend
mcp/       FastMCP server (thin wrapper over core query + ingest, runs ingest off the event loop)

The timeline is the unifying model: every extracted signal is a time-stamped segment row, so search and retrieval are uniform regardless of which provider produced it. Heavy/optional dependencies are imported lazily, so the base install stays tiny and import vidsight pulls in no model framework.

Compatibility & Providers

Providers are pluggable behind ABCs; third parties can ship a provider package with Python entry points without forking the core. The local extra installs the current local model provider set.

Cloud/BYOK providers are not implemented yet. Today there is no built-in support for OpenAI, Anthropic, Groq, or other API keys. The future cloud extra is reserved for bring-your-own-key providers.

Step Default (local) Notes
Speech (ASR) faster-whisper base (CTranslate2) language auto-detected
Vision (caption) moondream2 (transformers) chooses CUDA, Apple MPS, or CPU; weights download lazily
On-screen text (OCR) RapidOCR (ONNX)
Embeddings SimpleTextEmbedding (dependency-free) lexical baseline; model embedders can be added later

Profiles: tiny (CPU, vision off) · balanced (default, vision on) · quality (larger ASR, more frames). Test coverage currently runs on macOS in development. Linux/macOS/Windows CI is planned.

Note: on Apple Silicon the vision model runs on MPS; the MCP server is registered with PYTORCH_ENABLE_MPS_FALLBACK=1 so any op not yet implemented on MPS falls back to CPU instead of erroring.

Roadmap

  • Core pipeline + SQLite timeline + local provider stack
  • MCP server + CLI
  • Agent installer for Claude Code, Codex, and OpenCode
  • Claude Code skill and /watch command
  • Native PowerShell installer for Windows
  • Speaker diarization (who said what)
  • Audio-event tagging beyond speech (music / applause / …)
  • cloud and mlx provider extras
  • Batch mode for many videos
  • Timeline browser (web UI)

Contributing

Issues and PRs welcome. Adding a provider (a new ASR/vision/OCR/embedding backend) is the easiest first contribution — implement the interface and register an entry point; no core fork needed. See docs/DESIGN.md.

License

Source-available under the PolyForm Noncommercial License 1.0.0. Noncommercial use, modification, and sharing are allowed with preserved author credit. Commercial use is not granted by the repo license; it requires a separate paid written license or agreement with Daniel Szepietowski (szepix) that grants consent and defines compensation.

About

Local-first video timelines for AI agents: transcribe, read on-screen text, and query videos over MCP.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors