Autoclaw

Ambient AI for macOS — watches your screen, learns your workflows, automates your work.

How it works · Learn mode · Setup · Chrome extension · Roadmap

The Problem

Every AI tool today lives inside an app. You copy context out of one app, paste it into a chatbot, figure out the right prompt, get a result, then manually carry it back.

The human is the integration layer. The human is the bottleneck.

Autoclaw flips this. It operates at the OS level — watching everything you do across every app, understanding your intent, and acting on your behalf. You never switch to it. It's already there.

This is the first implementation of ARIA (Agentic Reality Interface Architecture) — AI as the interface to your computer, not as a chatbot within it.

Built by The Last Prompt.

How It Works

Autoclaw runs as a floating pill widget on your screen. Three modes, each with a specific job:

Ambient Mode — "I'll watch, you work"

Runs continuously in the background. You don't interact with it — it watches and offers help when it spots friction.

What it does every 60 seconds:

Neural Engine (Apple VNFeaturePrint, on-device, zero cost) detects which screen frames had meaningful visual changes
Changed frames are stitched into a grid image (max 6 frames per grid, timestamped)
Grid sent to Haiku with your saved workflow summaries
Haiku analyzes the sequence: what apps, what actions, what intent, does this match a learned workflow?

What it detects:

Cross-app transfers — you copied from Notion and pasted into Sheets. There's an API for that.
File shuttles — you downloaded a CSV from one tool and uploaded it to another. Direct integration exists.
Manual lookups — you switched to Slack, copied a message, switched back. It can pull that inline.
Repetitive navigation — you keep cycling Chrome → Sheets → Slack → Chrome. That's a workflow.
Recognized workflows — "This looks like your 'Weekly Report' workflow. Want me to run it?"

When it spots something: A clean, Cofia-style toast card appears with app icons, a clear description of what you're doing, and a single Automate Now button. No clutter — just the offer and one action.

Learn Mode — "Watch me do this once"

You show autoclaw a workflow by doing it normally. It records everything, then extracts structured steps.

Recording captures (from three sources):

Source	What you get	Quality
OCR + Screenshots	Text near cursor, periodic screen captures, active window crops	Works for all apps. Noisy but gives visual context.
Chrome Extension	Exact CSS selectors, form field names, typed values, URLs, navigation events	Browser only. Precise enough for replay.
Neural Engine	Frame change detection — skips captures when nothing changed	Efficiency gate. Saves API cost.

Extraction: When you stop recording, all three signal sources are merged into a timeline and sent to Sonnet with vision. Sonnet sees the screenshots AND the OCR/DOM data and produces human-readable steps like:

1. Open Gmail and click Compose               [chrome_click]
2. Enter 'sameep@company.com' in To field      [chrome_form_input]
3. Set subject to 'Weekly Status Update'       [chrome_form_input]
4. Write email body with project updates       [chrome_form_input]
5. Click Send                                  [chrome_click]

Not "Clicked 'Co' in unknown app". Actual steps with actual values, including rich details: app name, action type, target element, entered values, URLs, and CSS selectors from the Chrome extension.

After saving: The workflow goes into your library. Next time ambient mode sees you starting the same pattern, it offers to run it.

AI Search Mode — "Here's context, figure it out"

Direct interaction. Capture clipboard entries, screenshots, voice — then ask autoclaw to deduce what you need.

Request modes (cycle with Option+X):

Mode	Behavior
Task	Executes — edits files, runs commands, uses MCP connectors, ships code
Question	Answers — queries meetings, searches web, looks up tasks. Never touches code.
Analyze	Reads deeply — structured assessment, no modifications
Add to Tasks	Creates ClickUp/project management tasks from captured context

Each mode invokes a dedicated Claude Code skill that enforces the right guardrails.

Architecture

┌──────────────────────────────────────────────────────────────┐
│  PERCEPTION                                                  │
│                                                              │
│  ScreenCaptureStream ──→ Neural Engine frame embeddings      │
│  ActiveWindowService ──→ app/window/URL (WebAppResolver)     │
│  ClipboardMonitor ────→ content + source app                 │
│  ScreenOCR ───────────→ cursor-proximate text                │
│  FileActivityMonitor ─→ FSEvents (project dirs only)         │
│  BrowserBridge (WS) ──→ Chrome extension DOM events          │
│  VoiceService ────────→ on-device speech transcription        │
└──────────────────┬───────────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────────┐
│  INTELLIGENCE (ARIA)                                         │
│                                                              │
│  KeyFrameAnalyzer                                            │
│    Neural Engine gate → 60s grid stitch → Haiku vision       │
│                                                              │
│  FrictionDetector                                            │
│    Activity buffer → pattern matching → surface offers       │
│    + workflow recognition from Haiku analysis                 │
│                                                              │
│  WorkflowExtractor                                           │
│    OCR + screenshots + DOM events → Sonnet vision → steps    │
│                                                              │
│  CapabilityMap + CapabilityDiscovery                         │
│    MCP tool registry → friction-to-capability matching       │
└──────────────────┬───────────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────────┐
│  EXECUTION                                                   │
│                                                              │
│  Claude Code CLI (stream-json or -p mode)                    │
│  ├── MCP connectors (ClickUp, GitHub, Sheets, etc.)          │
│  ├── Filesystem access (project-scoped)                      │
│  ├── Web search                                              │
│  └── Claude Code Skills (per-mode behavior enforcement)      │
│                                                              │
│  Three-gate approval:                                        │
│    Gate 0: "I noticed X" → Gate 1: "Here's my plan"          │
│    → Gate 2: "Approve execution" → Execute                   │
└──────────────────────────────────────────────────────────────┘

Why Claude Code, not the API directly?

Autoclaw doesn't call the Anthropic API. It spawns Claude Code CLI sessions (claude -p for single-shot, --output-format stream-json for interactive). This means:

Every MCP server you have configured in ~/.claude/mcp.json is automatically available
File system access, code execution, git operations — all built in
Skills system for per-mode behavior rules
Session continuity for multi-turn workflows
You don't manage API keys for individual tools — Claude Code handles authentication

Why Neural Engine for frame gating?

Sending every screen frame to an API would be expensive and slow. Apple's VNGenerateImageFeaturePrintRequest runs entirely on the Neural Engine — zero CPU, zero memory, zero API cost. It produces a 768-dim embedding per frame. Cosine distance between consecutive embeddings tells you if the screen actually changed. Only frames that pass the gate get sent to Haiku.

This is the same approach used in video2ai for key frame extraction from video files, adapted for live screen capture.

Chrome Extension

The ChromeExtension/ directory contains a Manifest V3 extension that captures structured DOM events during Learn mode.

What it captures vs OCR

Scenario	OCR alone	With extension
Click a button	`Clicked 'Co' in unknown app`	`Clicked button[aria-label="Compose"] on mail.google.com`
Fill a form field	`Clipboard event near 'To'`	`Typed 'sameep@company.com' in input[name="to"]`
Select a dropdown	`Clicked 'High' in Chrome`	`Selected 'High Priority' in select[name="priority"]`
Navigate	`App switch to Chrome`	`Navigated to notion.so/workspace/sprint-board`
Submit a form	`Clicked 'Sub' near button`	`Submitted form on mail.google.com/mail`

How it connects

Chrome Extension ──── WebSocket (ws://127.0.0.1:9849) ────→ BrowserBridge.swift
                                                              │
                                                    ┌─────────┘
                                                    ▼
                                         AppState.browserEventBuffer
                                                    │
                                         WorkflowExtractor (merged with OCR timeline)

The extension is passive until autoclaw tells it to record. Autoclaw sends start_recording / stop_recording commands over the WebSocket. A keepalive ping every 20s prevents the MV3 service worker from sleeping.

Install

Open chrome://extensions
Enable Developer mode (top right)
Click Load unpacked
Select the ChromeExtension/ folder
Badge shows ON when connected to autoclaw, REC when recording

Connectors

Autoclaw inherits every MCP server configured in your Claude Code setup (~/.claude/mcp.json). These are used both for direct execution (when you ask autoclaw to do something) and for capability matching (when ARIA detects friction and checks if a tool exists to solve it).

Connector	What autoclaw uses it for
ClickUp	Create/search/update tasks, time tracking, sprint management
Granola	Query meeting notes, transcripts, action items
Google Sheets	Read/write spreadsheet data
GitHub	Issues, PRs, code search, branch management
Web Search	Real-time information lookup
Filesystem	Project file access (Task mode only, scoped to project dir)

Add any MCP server and autoclaw can use it immediately — no configuration in autoclaw needed.

Setup

git clone https://github.com/thelastprompt/autoclaw.git
cd autoclaw
make run

Requirements

Requirement	Why
macOS 13+	SwiftUI, Vision framework, Neural Engine APIs
Claude Code CLI	`npm install -g @anthropic-ai/claude-code` — autoclaw spawns this for all AI calls
Anthropic API key	Set in autoclaw settings or `ANTHROPIC_API_KEY` env var
Accessibility permission	Global hotkeys, active window detection
Screen Recording permission	Screen capture for key frame analysis

Optional


Chrome extension	Load unpacked from `ChromeExtension/` for richer Learn mode
MCP servers	Configure in `~/.claude/mcp.json` for connectors

Keyboard Shortcuts

Shortcut	Action
Fn	Start session / pause / resume / cycle through states
Left Option ×2	End session
Option + Z	Capture screenshot into current session
Option + X	Cycle request mode (Task → Question → Analyze → Add to Tasks → Learn)
Caps Lock + Option ×2	Toggle voice transcription

Project Structure

Sources/
│
│  Perception
├── ScreenCaptureStream.swift   — screen capture, active window cropping, click detection
├── ActiveWindowService.swift   — app/window/URL tracking + WebAppResolver
├── ClipboardMonitor.swift      — clipboard content polling
├── ScreenOCR.swift             — Vision OCR with cursor proximity ranking
├── FileActivityMonitor.swift   — FSEvents file watcher
├── WorkflowRecorder.swift      — event recording (explicit + passive modes)
│
│  ARIA Intelligence
├── KeyFrameAnalyzer.swift      — Neural Engine embeddings, 60s grid analysis, Haiku vision
├── FrictionDetector.swift      — activity pattern matching, workflow recognition
├── WorkflowExtractor.swift     — Sonnet vision extraction from recordings
├── WorkflowMatcher.swift       — NLEmbedding similarity for workflow matching
├── CapabilityMap.swift         — MCP tool registry
├── CapabilityDiscovery.swift   — web search for new integrations
├── WebAppResolver.swift        — browser URLs → semantic app identities
│
│  Browser Integration
├── BrowserBridge.swift         — WebSocket server for Chrome extension
│
│  Execution
├── ClaudeCodeRunner.swift      — Claude Code CLI (stream-json + single-shot)
├── TaskDeductionService.swift  — ambient task analysis and intent extraction
│
│  Interface
├── PillView.swift              — floating pill widget (5 collapse levels)
├── TaskApprovalView.swift      — toast UI with full session thread
├── MainPanelView.swift         — dashboard (home, threads, settings)
├── ToastWindow.swift           — floating glass panel with drag & drop
├── AppState.swift              — central state, mode routing, ARIA wiring
├── AppDelegate.swift           — window management, hotkey setup
└── ...

ChromeExtension/
├── manifest.json               — Manifest V3
├── content.js                  — DOM event capture
├── background.js               — WebSocket client + recording state
├── popup.html/js               — connection status UI
└── icon*.png

Current Status

This is early alpha. The intelligence pipeline works — perception, analysis, friction detection, workflow extraction. But there are significant gaps in the user-facing product. See PLAN.md for the full roadmap.

What works today:

Ambient friction detection (cross-app transfers, file shuttles, manual lookups, workflow recognition)
Learn mode recording with OCR + screenshots + Chrome extension DOM events
Workflow extraction via Sonnet vision with rich step details (app, action, target, value, selector)
Screenshots passed to extraction model via CLI Read tool for real vision analysis
Cofia-style friction toast UI with app icons and single-action "Automate Now" button
60-second grid analysis via Haiku vision
Resolved web app names in recordings (Gmail not "Google Chrome")
Task deduction from clipboard + voice context
Claude Code execution with MCP connectors
Global hotkeys, voice transcription, multi-project support

What's broken or missing:

Workflow execution engine (step-by-step progress, parameterization)
Thread message persistence (conversations lost on restart)
Several UI elements are cosmetic only (see PLAN.md)
Chrome extension reconnection needs work

Contributing

See PLAN.md for prioritized work items. P0 items block the core product; P3 items are future vision. Pick anything, open a PR.

The codebase is pure Swift (no Xcode project — builds with make via swiftc) plus a small Chrome extension in vanilla JS.

License

MIT — The Last Prompt, 2025.

AI shouldn't live in an app. It should be the interface.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
ChromeExtension		ChromeExtension
References		References
Resources		Resources
Sources		Sources
assets		assets
.gitignore		.gitignore
Autoclaw.entitlements		Autoclaw.entitlements
Info.plist		Info.plist
LICENSE		LICENSE
Makefile		Makefile
PLAN.md		PLAN.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Autoclaw

The Problem

How It Works

Ambient Mode — "I'll watch, you work"

Learn Mode — "Watch me do this once"

AI Search Mode — "Here's context, figure it out"

Architecture

Why Claude Code, not the API directly?

Why Neural Engine for frame gating?

Chrome Extension

What it captures vs OCR

How it connects

Install

Connectors

Setup

Requirements

Optional

Keyboard Shortcuts

Project Structure

Current Status

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Autoclaw

The Problem

How It Works

Ambient Mode — "I'll watch, you work"

Learn Mode — "Watch me do this once"

AI Search Mode — "Here's context, figure it out"

Architecture

Why Claude Code, not the API directly?

Why Neural Engine for frame gating?

Chrome Extension

What it captures vs OCR

How it connects

Install

Connectors

Setup

Requirements

Optional

Keyboard Shortcuts

Project Structure

Current Status

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages