Chinese version: README.zh.md
A personal AI agent hub that lives on your Mac: chat, voice, camera, and hands-on control of your iTerm2 Claude Code sessions -- from a laptop, a phone on the same Wi-Fi, or anywhere in the world via an end-to-end encrypted relay.
Built for people who already run a lot of Claude Code sessions and want one place to watch and talk to them.
- macOS -- iTerm2 integration is macOS-only
- Python 3.11+
- iTerm2 with the Python API enabled: iTerm2 → Settings → General → Magic → Enable Python API
- At least one LLM API key (DeepSeek recommended -- cheap and stable with tool calling)
One command for most people:
git clone https://github.com/tyxben/roboot.git && cd roboot
./scripts/setup.sh # installs deps + ffmpeg + prewarms Whisper
# edit config.yaml: add providers.deepseek key, optional telegram.bot_token
python server.py # open http://localhost:8765
The script checks your Python version, installs the telegram extras (bot + voice I/O), brew installs ffmpeg if missing, copies config.example.yaml → config.yaml, and pre-caches the Whisper model so the first voice message is instant. It's idempotent, so it is safe to re-run. Flags: --with=core|telegram|voice|vision|all, --no-prewarm.
If uv is on your PATH the script uses it automatically (much faster resolver, handles numpy-2-vs-numba collisions better than pip). To install uv first:
curl -LsSf https://astral.sh/uv/install.sh | sh
Prefer to install manually:
pip install -e . # core: web console + LAN + relay
cp config.example.yaml config.yaml # then edit and add your API key
python server.py # open http://localhost:8765
That's it. The welcome message will appear when the WebSocket connects.
pyproject.toml defines four extras you can mix and match — pull them with pip install -e '.[<name>,<name>]':
- telegram -- Telegram bot with voice input (mlx-whisper) + voice output (Edge TTS → OGG/Opus)
- voice -- local mic STT + macOS `say` TTS for CLI `--voice` mode (needs `brew install portaudio` first)
- vision -- camera + face recognition (`look` tool, `enroll_face` tool)
- desktop -- pywebview standalone app wrapper
- all -- everything above in one shot
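As a small illustration of mixing extras, the sketch below builds the install command for a chosen set. The helper name `pip_install_command` is hypothetical and not part of Roboot; only the extra names and the `pip install -e '.[...]'` form come from this README.

```python
# Extra names as listed above; "all" is the catch-all.
VALID_EXTRAS = ("telegram", "voice", "vision", "desktop", "all")


def pip_install_command(extras: list[str]) -> str:
    """Build the editable-install command for the chosen extras (hypothetical helper)."""
    unknown = [name for name in extras if name not in VALID_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        # No extras requested: plain core install.
        return "pip install -e ."
    # Drop duplicates while keeping the order the caller gave.
    unique = list(dict.fromkeys(extras))
    return f"pip install -e '.[{','.join(unique)}]'"


print(pip_install_command(["telegram", "vision"]))
# pip install -e '.[telegram,vision]'
```

The single quotes around the spec keep shells such as zsh from treating the square brackets as a glob pattern.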
pip install -e '.[telegram]' # pulls mlx-whisper + SpeechRecognition
brew install ffmpeg # encodes voice replies to OGG/Opus
python -m adapters.stt prewarm # pre-cache the ASR model (~3 GB, one-time)
python -m adapters.telegram_bot # start the bot
The prewarm step is optional but recommended: without it, the first Telegram voice message you send waits ~6 minutes on the model download. Run it once during setup and all future voice messages feel instant.
Inside Telegram you can:
- Send voice → the bot transcribes with Whisper (~96% Chinese accuracy on `large-v3`), the agent replies, and you hear the reply back as a voice bubble in ~3–4s.
- `/voice` -- pick from 10 curated voices (male/female Mandarin + two dialects + English).
- Just say "换成女声" ("switch to a female voice") / "screenshot please" -- the agent owns tools like `switch_tts_voice`, `screenshot`, `list_sessions`, `shell`, so slash commands are optional.
python run.py # Keyboard-only CLI
python run.py --voice # Local mic + macOS `say` TTS (needs `.[voice]`)
chainlit run chainlit_app.py -w # Alternative Chainlit UI
Three ways to reach your Roboot from off-device. See SECURITY.md for the threat model before exposing any of them.
- LAN (zero-config) -- the server binds `0.0.0.0:8765`; a QR code is printed at startup. Scan it from a phone on the same Wi-Fi. Uses a self-signed TLS cert with trust-on-first-use.
- Telegram bot -- set `telegram.bot_token` in `config.yaml`, run `python -m adapters.telegram_bot`. Gate access with `telegram.allowed_users`.
- Relay -- a Cloudflare Worker forwards WebSocket traffic between the daemon and a browser pair page. Traffic is end-to-end encrypted (ECDH P-256 → HKDF → AES-GCM); the relay only sees ciphertext envelopes. Pairing tokens rotate every 30 minutes and can be revoked instantly from the local console.
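To make the middle step of the relay's ECDH P-256 → HKDF → AES-GCM chain concrete, here is a minimal standard-library sketch of HKDF-SHA256 as specified in RFC 5869. It is not Roboot's implementation: the function name, the `info` label, the random salt, and the 32-byte output length are all assumptions for illustration, and the ECDH and AES-GCM steps need a cryptography library, so a random placeholder stands in for the shared secret.

```python
import hashlib
import hmac
import os


def hkdf_sha256(shared_secret: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Derive `length` bytes from a shared secret with HKDF-SHA256 (RFC 5869)."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: condense the input keying material into a pseudorandom key.
    # RFC 5869 uses a zero-filled salt when none is provided.
    prk = hmac.new(salt or b"\x00" * hash_len, shared_secret, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output has been produced.
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Placeholder for the ECDH P-256 output; a real session gets this from the key exchange.
shared_secret = os.urandom(32)
salt = os.urandom(16)  # assumed to be agreed by both peers during the handshake
session_key = hkdf_sha256(shared_secret, salt, info=b"example-session-key")
print(len(session_key))  # 32
```

Both peers must feed identical `salt` and `info` values into the derivation, otherwise they end up with different keys and AES-GCM decryption fails its authentication check.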
server.py (FastAPI) <- Main entry point
├── WebSocket /ws <- Streaming chat (LLM_CHUNK events)
├── REST /api/sessions/* <- Direct iTerm2 session control
├── REST /api/tts <- Edge TTS (text -> mp3)
├── REST /api/relay-* <- Relay status / refresh / revoke / QR
├── REST /api/network-info <- Local IP addresses + QR
└── Static /static/console.html <- Unified web console
relay_client.py <- Connects to the Cloudflare Worker relay
iterm_bridge.py <- Persistent iTerm2 Python API connection
soul.md <- Self-modifiable assistant identity
config.yaml <- API keys + provider config (gitignored)
text_utils.py <- Shared helpers (extract_spoken_text, …)
tools/
├── shell.py <- Terminal command execution
├── claude_code.py <- iTerm2 session list/read/send/create
├── vision.py <- Camera + screenshot + face recognition
├── face_db.py <- Face encoding storage (.faces/)
├── soul.py <- Self-modification + user memory
└── voice_switch.py <- Agent tool: change Telegram TTS voice
adapters/
├── telegram_bot.py <- Remote control via Telegram (voice I/O)
├── voice.py <- Local mic STT + macOS TTS (CLI --voice)
├── voice_prefs.py <- Per-Telegram-user TTS voice store
├── tts_streamer.py <- Edge TTS → parallel OGG/Opus synthesis
├── keyboard.py <- Terminal text input
└── stt/ <- Pluggable speech-to-text backends
├── mlx.py <- mlx-whisper (Apple Silicon, default)
├── google.py <- speech_recognition → Google Web Speech
└── noop.py <- backend: none
relay/ <- Cloudflare Worker relay
├── src/index.ts <- Worker entry, routing, rate limiting
├── src/relay-session.ts <- Durable Object: daemon↔client session mgmt
├── src/pair-page.ts <- Browser pairing page
└── wrangler.toml <- Cloudflare deployment config
Deeper architecture notes (agent framework, TTS conventions, soul system, E2EE handshake, streaming protocol) live in CLAUDE.md.
- Create `tools/my_tool.py`.
- Decorate with `@arcana.tool(when_to_use=..., what_to_expect=...)`.
- Import it and add it to `ALL_TOOLS` in `server.py`.
Arcana handles registration; no other wiring is needed. See the "Adding a New Tool" section in CLAUDE.md for conventions.
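The steps above might look roughly like the sketch below. To keep it runnable without Arcana installed, it swaps in a stand-in decorator that only mimics the call shape `@arcana.tool(when_to_use=..., what_to_expect=...)` quoted in this README; the stand-in, the `word_count` tool, and its return format are hypothetical, and whether real tools should be sync or async is left to the conventions in CLAUDE.md.

```python
from types import SimpleNamespace


def _stand_in_tool(when_to_use: str, what_to_expect: str):
    """Illustrative stand-in for @arcana.tool; NOT Arcana's implementation."""
    def decorate(fn):
        # Attach the two descriptions so the example can be inspected.
        fn.when_to_use = when_to_use
        fn.what_to_expect = what_to_expect
        return fn
    return decorate


# In a real tool file this would be `import arcana` instead.
arcana = SimpleNamespace(tool=_stand_in_tool)


# --- tools/my_tool.py (hypothetical example tool) ---
@arcana.tool(
    when_to_use="When the user asks how many words a piece of text contains.",
    what_to_expect="A short sentence with the word count.",
)
def word_count(text: str) -> str:
    count = len(text.split())
    return f"{count} words"


# --- server.py: add the new tool to the existing list ---
ALL_TOOLS = [word_count]  # the real list already holds the built-in tools

print(word_count("watch every Claude Code session"))  # 5 words
```

The two description strings are what the agent reads when deciding whether to call the tool, so they are worth writing as carefully as the function body.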
Every option is documented inline in config.example.yaml. The assistant can also rewrite parts of its own identity by editing soul.md through the soul tool.
- docs/USAGE.md -- end-user guide: quickstart, config, interfaces, Claude Code integration, memory, auto-upgrade, troubleshooting (Chinese version)
- docs/REMOTE_VS_LOCAL.md -- capability matrix for local / LAN / Telegram / relay, plus a comparison with Claude Code's built-in remote (bilingual in one file)
- SECURITY.md -- threat model, E2EE trust chain, new-feature risks, pairing-leak recovery
- CHANGELOG.md -- release notes (Chinese version)
- CONTRIBUTING.md -- scope, dev setup, PR workflow (Chinese version)
- CLAUDE.md -- contributor notes: architecture, streaming protocol, soul system, adding tools
If you plan to expose Roboot beyond localhost, read SECURITY.md first. It lists what is and isn't protected, known gaps, and how to report vulnerabilities.
MIT -- Copyright (c) 2026 tyxben.
- Arcana -- the agent framework
- DeepSeek -- default LLM provider
- iTerm2 Python API -- terminal integration
- Cloudflare Workers + Durable Objects -- relay infrastructure
- Edge TTS -- neural voice synthesis
