Eye2byte

Your AI coding agent can read every file in your repo.
It just can't see what's on your screen.



Screen-context sidecar for AI coding agents. Captures your screen, voice, and annotations, feeds them to any vision model, and produces structured Context Packs your coding agent can act on — via MCP.

Screen / Voice / Annotations  ──>  Vision Model + Whisper  ──>  Context Pack  ──>  Coding Agent
                                   (Ollama, Gemini,              (goal, errors,    (Claude Code,
                                    OpenRouter, Hyperbolic)       signals, next)     Codex, Gemini CLI)

Use Cases

"Debug what I'm looking at" — Capture your screen + voice-describe the bug. Your agent gets full visual context instead of you copy-pasting error messages.

"See all my monitors" — Agent captures your IDE, browser, and terminal simultaneously. Multi-monitor support: active, specific, or all displays at once.

"Annotate the problem" — Freeze your screen, draw arrows and circles on the exact bug. Agent sees precisely what you mean.

"Watch my phone" — Capture your Android device screen via ADB while developing mobile apps.

"Give remote agents eyes" — SSE server lets cloud agents, CI runners, or SSH dev boxes see your local screen. Bearer token auth included.

"Voice-first workflow" — Hold spacebar, describe what you want while looking at your screen. Agent sees + hears simultaneously.

"Monitor dashboards" — Point it at Grafana, production logs, or any dashboard. Agent captures and analyzes what's on screen.

"Context switch instantly" — Capture your screen state when switching tasks. Agent knows your new context without explanation.

"Click what you see" — OCR finds text elements with coordinates, then click/type/scroll to interact. The full see-locate-act loop for agent automation.

"What did I see before?" — Search past Context Packs by keyword. Agent recalls "last time I saw this error, the fix was X" from your observation history.


Quick Start

1. Install

pip install eye2byte[all]
Granular install options
pip install eye2byte             # Core + MCP server (Pillow + fastmcp)
pip install eye2byte[voice]      # + local voice transcription (openai-whisper)
pip install eye2byte[ui]         # + control panel (customtkinter + pystray)
pip install eye2byte[ocr]        # + coordinate-aware OCR (easyocr)
pip install eye2byte[interact]   # + mouse/keyboard automation (pyautogui)
pip install eye2byte[all]        # Everything

ffmpeg is required for voice/clips — install via your package manager.

2. Configure a vision provider

| Provider | Setup | Cost |
|----------|-------|------|
| Ollama | Install Ollama, `ollama pull qwen3-vl:8b` | Free (local) |
| Gemini | Set `GEMINI_API_KEY` in `.env` | Free tier |
| OpenRouter | Set `OPENROUTER_API_KEY` in `.env` | Free models available |
| Hyperbolic | Set `HYPERBOLIC_API_KEY` in `.env` | Pay per use |
# .env file — place in project dir, cwd, or ~/.eye2byte/.env
GEMINI_API_KEY=your-key-here

3. Run

eye2byte capture                   # Screenshot + analysis
eye2byte capture --voice           # + voice narration
eye2byte capture --mode window     # Active window only
eye2byte-ui                        # Launch control panel

Or run the scripts directly:

python eye2byte.py capture
python eye2byte_ui.py

How It Works

Eye2byte sits between your screen and your coding agent.

  1. Capture — takes a screenshot (full screen, window, region, or all monitors), optionally records voice and annotations
  2. Process — optimizes the image (~5x smaller with no perceptible quality loss), cleans audio (noise removal + normalization), transcribes speech locally via Whisper
  3. Analyze — sends everything to your chosen vision model
  4. Output — produces a structured Context Pack the agent can act on
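The four stages above can be sketched as a simple function chain. The stage functions here are hypothetical stand-ins for illustration, not Eye2byte's actual API:

```python
from typing import Callable

def run_pipeline(
    capture: Callable[[], dict],
    process: Callable[[dict], dict],
    analyze: Callable[[dict], str],
    pack: Callable[[str], str],
) -> str:
    """Illustrative orchestration of the four stages; each stage
    consumes the previous stage's output."""
    raw = capture()            # 1. Capture: screenshot (+ voice, annotations)
    media = process(raw)       # 2. Process: optimize image, clean/transcribe audio
    analysis = analyze(media)  # 3. Analyze: send media to the vision model
    return pack(analysis)      # 4. Output: wrap the analysis as a Context Pack
```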

Context Pack Format

Every analysis produces a markdown document with structured sections:

## Goal           — what the user appears to be doing
## Environment    — OS, editor, repo, branch, language
## Screen State   — visible panels, files, terminal output
## Signals        — verbatim errors, stack traces, warnings
## Likely Situation — what's probably happening
## Suggested Next Info — what a coding agent needs next

The agent receives this as actionable context — not a raw image dump.
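If a downstream tool needs individual sections rather than the whole document, the pack can be split on its `## ` headings. A minimal illustrative parser (not part of the Eye2byte API):

```python
import re

def parse_context_pack(markdown: str) -> dict[str, str]:
    """Split a Context Pack into a {section name: body} dict
    keyed by its '## ' headings."""
    result: dict[str, str] = {}
    current = None
    lines: list[str] = []
    for line in markdown.splitlines():
        m = re.match(r"^## (.+)$", line)
        if m:
            if current is not None:
                result[current] = "\n".join(lines).strip()
            current, lines = m.group(1).strip(), []
        elif current is not None:
            lines.append(line)
    if current is not None:
        result[current] = "\n".join(lines).strip()
    return result
```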


MCP Integration

Eye2byte exposes 12 tools via the Model Context Protocol. Any MCP-compatible agent can use them.

| Tool | What it does | Install |
|------|--------------|---------|
| `capture_and_summarize` | Screenshot + vision analysis (monitor selection, delay, window targeting) | core |
| `capture_with_voice` | Screenshot + voice recording + transcription + analysis | core |
| `record_clip_and_summarize` | Screen clip with keyframe extraction and sequence analysis | core |
| `summarize_screenshot` | Analyze an existing image file | core |
| `transcribe_audio` | Local Whisper transcription of any audio file | core |
| `get_recent_context` | Retrieve recent Context Pack summaries | core |
| `get_screen_elements` | OCR with coordinates — find text elements and their screen positions | `[ocr]` |
| `search_context_history` | Full-text search across all past Context Packs | core |
| `click_element` | Click at screen coordinates (from `get_screen_elements` output) | `[interact]` |
| `type_text` | Type text at current cursor position | `[interact]` |
| `press_key` | Press keyboard key or combo (e.g. "ctrl+a", "enter") | `[interact]` |
| `scroll_screen` | Scroll at a screen position | `[interact]` |

OpenClaw

Eye2byte works with OpenClaw out of the box. Add to your openclaw.json:

{
  "mcpServers": {
    "eye2byte": {
      "command": "python",
      "args": ["eye2byte_mcp.py"]
    }
  }
}

Now your OpenClaw agent can see your screen from any channel — WhatsApp, Telegram, Slack, Discord. An Eye2byte skill is also available on ClawHub.

Local agents (stdio)

For agents running on the same machine (Claude Code, Codex CLI, etc.). Add to .mcp.json:

{
  "mcpServers": {
    "eye2byte": {
      "command": "python",
      "args": ["C:/path/to/eye2byte_mcp.py"]
    }
  }
}

That's it. The agent auto-starts the server. Use full absolute paths.

Remote agents (SSE)

For agents on a different machine (cloud VM, SSH dev box, CI runner).

On your local machine (the one with the screen):

python eye2byte_mcp.py --sse                         # No auth (LAN only)
python eye2byte_mcp.py --sse --token mysecret123     # Bearer token auth
python eye2byte_mcp.py --sse --port 9000 --token abc # Custom port + auth

On the remote machine (where the agent runs) — add to MCP config:

{
  "mcpServers": {
    "eye2byte": {
      "url": "http://YOUR_LOCAL_IP:8808/sse",
      "headers": {"Authorization": "Bearer mysecret123"}
    }
  }
}

Omit headers if the server was started without --token.

Firewall note (Windows)
netsh advfirewall firewall add rule name="Eye2byte MCP" dir=in action=allow protocol=TCP localport=8808

Find your local IP: ipconfig (Windows) or ip addr (Linux/macOS).

Multi-monitor

capture_and_summarize(monitor=0)    # active monitor (default)
capture_and_summarize(monitor=1)    # first monitor
capture_and_summarize(monitor=2)    # second monitor
capture_and_summarize(monitor=-1)   # ALL monitors at once

Control Panel

eye2byte-ui          # or: python eye2byte_ui.py

A small always-on-top floating panel. Drag it anywhere. Global hotkeys work even when the panel isn't focused.

Global Hotkeys (Windows)

| Hotkey | Action |
|--------|--------|
| `Ctrl+Shift+1` | Capture screenshot (uses current mode) |
| `Ctrl+Shift+2` | Annotate (freeze screen, open drawing overlay) |
| `Ctrl+Shift+3` | Toggle voice recording |
| `Ctrl+Shift+5` | Grab clipboard image |

All shortcuts are customizable from Settings > Keyboard Shortcuts.

Panel Controls

| Control | Action |
|---------|--------|
| Space (hold) | Push-to-talk — hold to record, release to stop |
| Mode selector | Cycle between Full Screen / Window / Region |
| Settings | Provider, model, image quality, shortcuts, cleanup |
| Copy @path | Copy session path for @-mentioning in chat |

Settings Tabs

| Tab | What you configure |
|-----|--------------------|
| Provider | Vision provider, model selection, API keys |
| Media | Image quality, max size, voice cleaning |
| Shortcuts | All keyboard shortcuts with key capture UI |
| Maintenance | Auto-cleanup days, cache management |

Features Reference

Annotation Overlay

Press Ctrl+Shift+2 or click Annotate to freeze the screen and draw on it.

| Key | Tool | How to use |
|-----|------|------------|
| X | Arrow | Click and drag |
| C | Circle | Click and drag |
| V | Rectangle | Click and drag |
| B | Freehand | Click and drag |
| T | Text | Click to place, type your text |

| Action | How |
|--------|-----|
| Save | Enter — commits annotations, sends to vision model |
| Cancel | Escape — discards all annotations |
| Undo | Right-click near an annotation to remove it |
| Newline | Shift+Enter (Enter alone commits) |
| Multi-line | Text box auto-grows up to 6 lines |

Voice Recording

Three ways to record:

| Method | How |
|--------|-----|
| Toggle | `Ctrl+Shift+3` to start, press again to stop |
| Push-to-talk | Hold Space while panel is focused |
| Mouse PTT | Hold click on the Record button |

While recording, any captures you take are automatically bundled with the voice note into a single session.

Platforms

| Platform | Screenshot | Voice | Annotation | Hotkeys |
|----------|------------|-------|------------|---------|
| Windows | PowerShell .NET | ffmpeg | Pillow | Ctrl+Shift+1-5 |
| macOS | screencapture | ffmpeg | Pillow | |
| Linux | scrot/maim/flameshot | ffmpeg | Pillow | |
| Android | ADB (Termux) | Termux:API | | |

Configuration

Config file: ~/.eye2byte/config.json (created on first run or via eye2byte init)

| Setting | Default | Description |
|---------|---------|-------------|
| `provider` | `"ollama"` | Vision provider: ollama, gemini, openrouter, hyperbolic |
| `model` | `"auto"` | Model name or "auto" for auto-detection |
| `voice_clean` | `true` | Noise removal + pause trimming + volume normalization |
| `auto_cleanup_days` | `7` | Delete old captures/summaries after N days (0 = disabled) |
| `image_max_size` | `1920` | Max image dimension before LLM processing |
| `image_quality` | `90` | JPEG quality (1-100) |
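A minimal `~/.eye2byte/config.json` using only the documented keys (the real schema may contain additional fields):

```json
{
  "provider": "gemini",
  "model": "auto",
  "voice_clean": true,
  "auto_cleanup_days": 7,
  "image_max_size": 1920,
  "image_quality": 90
}
```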

Files

| File | Purpose |
|------|---------|
| `eye2byte.py` | Core engine — capture, voice, clip, summarize |
| `eye2byte_ui.py` | Control panel with hotkeys and annotation overlay |
| `eye2byte_mcp.py` | MCP server for coding agent integration |
| `eye2byte_ocr.py` | Coordinate-aware OCR via easyocr |
| `eye2byte_interact.py` | Mouse/keyboard automation via pyautogui |
| `eye2byte_history.py` | Searchable context history via SQLite FTS5 |

License

MIT
