Your AI coding agent can read every file in your repo.
It just can't see what's on your screen.
Screen-context sidecar for AI coding agents. Captures your screen, voice, and annotations, feeds them to any vision model, and produces structured Context Packs your coding agent can act on — via MCP.
```
Screen / Voice / Annotations ──> Vision Model + Whisper ──> Context Pack ──> Coding Agent
                                 (Ollama, Gemini,           (goal, errors,   (Claude Code,
                                  OpenRouter, Hyperbolic)    signals, next)   Codex, Gemini CLI)
```
- **"Debug what I'm looking at"** — Capture your screen + voice-describe the bug. Your agent gets full visual context instead of you copy-pasting error messages.
- **"See all my monitors"** — Agent captures your IDE, browser, and terminal simultaneously. Multi-monitor support: active, specific, or all displays at once.
- **"Annotate the problem"** — Freeze your screen, draw arrows and circles on the exact bug. Agent sees precisely what you mean.
- **"Watch my phone"** — Capture your Android device screen via ADB while developing mobile apps.
- **"Give remote agents eyes"** — SSE server lets cloud agents, CI runners, or SSH dev boxes see your local screen. Bearer token auth included.
- **"Voice-first workflow"** — Hold spacebar, describe what you want while looking at your screen. Agent sees + hears simultaneously.
- **"Monitor dashboards"** — Point it at Grafana, production logs, or any dashboard. Agent captures and analyzes what's on screen.
- **"Context switch instantly"** — Capture your screen state when switching tasks. Agent knows your new context without explanation.
- **"Click what you see"** — OCR finds text elements with coordinates, then click/type/scroll to interact. The full see-locate-act loop for agent automation.
- **"What did I see before?"** — Search past Context Packs by keyword. Agent recalls "last time I saw this error, the fix was X" from your observation history.
```
pip install eye2byte[all]
```

Granular install options:

```
pip install eye2byte             # Core + MCP server (Pillow + fastmcp)
pip install eye2byte[voice]      # + local voice transcription (openai-whisper)
pip install eye2byte[ui]         # + control panel (customtkinter + pystray)
pip install eye2byte[ocr]        # + coordinate-aware OCR (easyocr)
pip install eye2byte[interact]   # + mouse/keyboard automation (pyautogui)
pip install eye2byte[all]        # Everything
```

ffmpeg is required for voice/clips — install via your package manager.
| Provider | Setup | Cost |
|---|---|---|
| Ollama | Install Ollama, `ollama pull qwen3-vl:8b` | Free (local) |
| Gemini | Set `GEMINI_API_KEY` in `.env` | Free tier |
| OpenRouter | Set `OPENROUTER_API_KEY` in `.env` | Free models available |
| Hyperbolic | Set `HYPERBOLIC_API_KEY` in `.env` | Pay per use |
```
# .env file — place in project dir, cwd, or ~/.eye2byte/.env
GEMINI_API_KEY=your-key-here
```

```
eye2byte capture                 # Screenshot + analysis
eye2byte capture --voice         # + voice narration
eye2byte capture --mode window   # Active window only
eye2byte-ui                      # Launch control panel
```

Or run the scripts directly:

```
python eye2byte.py capture
python eye2byte_ui.py
```

Eye2byte sits between your screen and your coding agent:
- Capture — takes a screenshot (full screen, window, region, or all monitors), optionally records voice and annotations
- Process — optimizes the image (~5x smaller, zero quality loss), cleans audio (noise removal + normalization), transcribes speech locally via Whisper
- Analyze — sends everything to your chosen vision model
- Output — produces a structured Context Pack the agent can act on
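The image-optimization half of the Process step can be sketched with Pillow (a core dependency). This is an illustration built on the documented defaults (`image_max_size=1920`, `image_quality=90`), not the actual implementation:

```python
# Sketch: downscale a screenshot to a max dimension and re-encode as JPEG.
# The function name and exact parameters are assumptions; only the default
# values (1920 / quality 90) come from the config table below.
from io import BytesIO
from PIL import Image

def optimize(img: Image.Image, max_size: int = 1920, quality: int = 90) -> bytes:
    """Shrink the longest side to max_size and return compact JPEG bytes."""
    img = img.convert("RGB")
    img.thumbnail((max_size, max_size))  # keeps aspect ratio, never upscales
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

# A synthetic 4K "screenshot" comes out at half the resolution:
shot = Image.new("RGB", (3840, 2160), "white")
small = Image.open(BytesIO(optimize(shot)))
print(small.size)  # (1920, 1080)
```

Re-encoding a full-resolution PNG screenshot as a bounded JPEG is what makes the "~5x smaller" payload practical for vision-model APIs.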
Every analysis produces a markdown document with structured sections:
```
## Goal                 — what the user appears to be doing
## Environment          — OS, editor, repo, branch, language
## Screen State         — visible panels, files, terminal output
## Signals              — verbatim errors, stack traces, warnings
## Likely Situation     — what's probably happening
## Suggested Next Info  — what a coding agent needs next
```

The agent receives this as actionable context — not a raw image dump.
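A hypothetical pack for a failing test run (all contents invented for illustration) might read:

```markdown
## Goal
User appears to be debugging a failing pytest run in VS Code.

## Environment
Windows 11 · VS Code · repo my-app (branch fix/auth) · Python

## Screen State
Editor open on auth.py; integrated terminal shows pytest output.

## Signals
AssertionError: expected 200, got 401 (tests/test_auth.py:57)

## Likely Situation
The login endpoint rejects the test credentials after a recent refactor.

## Suggested Next Info
Contents of auth.py around the credential check; the recent diff on that file.
```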
Eye2byte exposes 12 tools via the Model Context Protocol. Any MCP-compatible agent can use them.
| Tool | What it does | Install |
|---|---|---|
| `capture_and_summarize` | Screenshot + vision analysis (monitor selection, delay, window targeting) | core |
| `capture_with_voice` | Screenshot + voice recording + transcription + analysis | core |
| `record_clip_and_summarize` | Screen clip with keyframe extraction and sequence analysis | core |
| `summarize_screenshot` | Analyze an existing image file | core |
| `transcribe_audio` | Local Whisper transcription of any audio file | core |
| `get_recent_context` | Retrieve recent Context Pack summaries | core |
| `get_screen_elements` | OCR with coordinates — find text elements and their screen positions | `[ocr]` |
| `search_context_history` | Full-text search across all past Context Packs | core |
| `click_element` | Click at screen coordinates (from `get_screen_elements` output) | `[interact]` |
| `type_text` | Type text at current cursor position | `[interact]` |
| `press_key` | Press keyboard key or combo (e.g. `"ctrl+a"`, `"enter"`) | `[interact]` |
| `scroll_screen` | Scroll at a screen position | `[interact]` |
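The see-locate-act tools compose into a loop like this (illustrative pseudocode; the return shape of `get_screen_elements` is an assumption):

```
elements = get_screen_elements()                 # OCR: text + screen coordinates
btn = first element whose text == "Submit"       # hypothetical lookup
click_element(x=btn.x, y=btn.y)
type_text("hello world")
press_key("enter")
capture_and_summarize()                          # verify the result on screen
```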
Eye2byte works with OpenClaw out of the box. Add to your openclaw.json:
```json
{
  "mcpServers": {
    "eye2byte": {
      "command": "python",
      "args": ["eye2byte_mcp.py"]
    }
  }
}
```

Now your OpenClaw can see your screen from any channel — WhatsApp, Telegram, Slack, Discord. An Eye2byte skill is also available on ClawHub.
For agents running on the same machine (Claude Code, Codex CLI, etc.). Add to .mcp.json:
```json
{
  "mcpServers": {
    "eye2byte": {
      "command": "python",
      "args": ["C:/path/to/eye2byte_mcp.py"]
    }
  }
}
```

That's it. The agent auto-starts the server. Use full absolute paths.
For agents on a different machine (cloud VM, SSH dev box, CI runner).
On your local machine (the one with the screen):
```
python eye2byte_mcp.py --sse                          # No auth (LAN only)
python eye2byte_mcp.py --sse --token mysecret123      # Bearer token auth
python eye2byte_mcp.py --sse --port 9000 --token abc  # Custom port + auth
```

On the remote machine (where the agent runs) — add to MCP config:
```json
{
  "mcpServers": {
    "eye2byte": {
      "url": "http://YOUR_LOCAL_IP:8808/sse",
      "headers": {"Authorization": "Bearer mysecret123"}
    }
  }
}
```

Omit `headers` if the server was started without `--token`.
Firewall note (Windows):

```
netsh advfirewall firewall add rule name="Eye2byte MCP" dir=in action=allow protocol=TCP localport=8808
```

Find your local IP: `ipconfig` (Windows) or `ip addr` (Linux/macOS).
```
capture_and_summarize(monitor=0)    # active monitor (default)
capture_and_summarize(monitor=1)    # first monitor
capture_and_summarize(monitor=2)    # second monitor
capture_and_summarize(monitor=-1)   # ALL monitors at once
```

```
eye2byte-ui   # or: python eye2byte_ui.py
```

A small always-on-top floating panel. Drag it anywhere. Global hotkeys work even when the panel isn't focused.
| Hotkey | Action |
|---|---|
| `Ctrl+Shift+1` | Capture screenshot (uses current mode) |
| `Ctrl+Shift+2` | Annotate (freeze screen, open drawing overlay) |
| `Ctrl+Shift+3` | Toggle voice recording |
| `Ctrl+Shift+5` | Grab clipboard image |
All shortcuts are customizable from Settings > Keyboard Shortcuts.
| Control | Action |
|---|---|
| `Space` (hold) | Push-to-talk — hold to record, release to stop |
| Mode selector | Cycle between Full Screen / Window / Region |
| Settings | Provider, model, image quality, shortcuts, cleanup |
| Copy @path | Copy session path for @-mentioning in chat |
| Tab | What you configure |
|---|---|
| Provider | Vision provider, model selection, API keys |
| Media | Image quality, max size, voice cleaning |
| Shortcuts | All keyboard shortcuts with key capture UI |
| Maintenance | Auto-cleanup days, cache management |
Press Ctrl+Shift+2 or click Annotate to freeze the screen and draw on it.
| Key | Tool | How to use |
|---|---|---|
| `X` | Arrow | Click and drag |
| `C` | Circle | Click and drag |
| `V` | Rectangle | Click and drag |
| `B` | Freehand | Click and drag |
| `T` | Text | Click to place, type your text |
| Action | How |
|---|---|
| Save | Enter — commits annotations, sends to vision model |
| Cancel | Escape — discards all annotations |
| Undo | Right-click near an annotation to remove it |
| Newline | Shift+Enter (Enter alone commits) |
| Multi-line | Text box auto-grows up to 6 lines |
Three ways to record:
| Method | How |
|---|---|
| Toggle | Ctrl+Shift+3 to start, press again to stop |
| Push-to-talk | Hold Space while panel is focused |
| Mouse PTT | Hold click on the Record button |
While recording, any captures you take are automatically bundled with the voice note into a single session.
| Platform | Screenshot | Voice | Annotation | Hotkeys |
|---|---|---|---|---|
| Windows | PowerShell .NET | ffmpeg | Pillow | Ctrl+Shift+1-5 |
| macOS | screencapture | ffmpeg | Pillow | — |
| Linux | scrot/maim/flameshot | ffmpeg | Pillow | — |
| Android | ADB (Termux) | Termux:API | — | — |
Config file: ~/.eye2byte/config.json (created on first run or via eye2byte init)
| Setting | Default | Description |
|---|---|---|
| `provider` | `"ollama"` | Vision provider: `ollama`, `gemini`, `openrouter`, `hyperbolic` |
| `model` | `"auto"` | Model name or `"auto"` for auto-detection |
| `voice_clean` | `true` | Noise removal + pause trimming + volume normalization |
| `auto_cleanup_days` | `7` | Delete old captures/summaries after N days (0 = disabled) |
| `image_max_size` | `1920` | Max image dimension before LLM processing |
| `image_quality` | `90` | JPEG quality (1-100) |
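Assembled from the defaults above, a fresh `config.json` would look roughly like this (only the keys documented in the table are shown; any others are unknown):

```json
{
  "provider": "ollama",
  "model": "auto",
  "voice_clean": true,
  "auto_cleanup_days": 7,
  "image_max_size": 1920,
  "image_quality": 90
}
```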
| File | Purpose |
|---|---|
| `eye2byte.py` | Core engine — capture, voice, clip, summarize |
| `eye2byte_ui.py` | Control panel with hotkeys and annotation overlay |
| `eye2byte_mcp.py` | MCP server for coding agent integration |
| `eye2byte_ocr.py` | Coordinate-aware OCR via easyocr |
| `eye2byte_interact.py` | Mouse/keyboard automation via pyautogui |
| `eye2byte_history.py` | Searchable context history via SQLite FTS5 |
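The searchable history can be sketched with SQLite FTS5 in a few lines. This is a minimal, self-contained illustration; the actual schema and queries in `eye2byte_history.py` are assumptions:

```python
# Sketch: full-text keyword search over stored Context Packs using
# SQLite's FTS5 virtual tables (table and column names are invented).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE packs USING fts5(path, summary)")
db.executemany(
    "INSERT INTO packs VALUES (?, ?)",
    [
        ("2024-06-01_1200.md", "Signals: ModuleNotFoundError: No module named requests"),
        ("2024-06-02_0930.md", "Signals: 42 tests passed, build green"),
    ],
)

# MATCH does tokenized full-text search, so a keyword finds the right pack:
rows = db.execute(
    "SELECT path FROM packs WHERE packs MATCH ?", ("ModuleNotFoundError",)
).fetchall()
print(rows)  # [('2024-06-01_1200.md',)]
```

FTS5 gives ranked, indexed search without an external search engine, which suits an append-only log of small markdown documents.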
MIT
