Skip to content

ruchit-p/macwright

MacWright

MacWright is an MCP server for reliable native macOS desktop control from AI agents. It exposes 72 tools for screenshots, mouse, keyboard, scroll, clipboard, window management, native UI automation via the Accessibility API, Safari browser automation, AppleScript, shell commands, and visual UI parsing.

Use it when you want Hermes Agent, Claude Desktop, or another MCP client to operate a Mac like a careful power user: read the UI, choose the most semantic control path, act, and verify the result.

Demo

Macwright.Demo.mp4

The demo shows MacWright control a real macOS desktop through MCP.

The demo shows the core workflow MacWright is built for:

  • Inspecting the live screen and native app state.
  • Choosing actions through semantic UI tools before falling back to coordinates.
  • Driving mouse, keyboard, window, and app interactions on macOS.
  • Verifying changes after actions instead of assuming the desktop responded.

Requirements

  • macOS (tested on Apple Silicon)
  • Node.js >= 18
  • cliclick (brew install cliclick)
  • Optional: OmniParser server on port 8650 for screen_parse

Platform support

MacWright is macOS-only desktop automation. It depends on macOS Accessibility, Screen Recording, AppleScript/JXA, Safari Apple Events, screencapture, pbcopy/pbpaste, and cliclick.

Linux and Windows are not supported targets for the native desktop-control tools. The schema tests may run on other platforms, but real mouse, keyboard, screenshot, window, Safari, AppleScript, and Accessibility behavior requires a configured Mac.

npm test runs portable MCP schema validation. Native desktop smoke coverage is explicit:

npm run test:macos-smoke

How agents should use MacWright

Prefer the highest-level reliable interface before falling back to coordinates:

  1. Safari/web pages: use page_snapshot, find_element, click_element, fill_form, and wait tools.
  2. Native apps with Accessibility support: use read_ui, then ax_click or ax_action.
  3. Visual fallback: use screenshot or screen_parse, then click/type with verification.
  4. Raw input: use mouse, keyboard, and clipboard tools only when semantic tools are unavailable.
  5. Always verify: use wait_for_ui, wait_for_change, screenshot, screenshotAfterClick, or verifyChange after important actions.

Coordinates passed to mouse tools are logical macOS screen coordinates. If you are clicking based on a downscaled screenshot returned by MacWright, pass screenshotCoords: true so MacWright scales them back to logical screen coordinates.

Installation

git clone https://github.com/ruchit-p/macwright.git
cd macwright
npm run setup

npm run setup handles the local bootstrap: installs cliclick via Homebrew when needed, installs npm dependencies, builds the project, checks macOS permissions, opens System Settings if approvals are missing, and prints ready-to-paste MCP config for Claude Desktop, Hermes Agent, and other stdio MCP clients.

After granting Accessibility or Screen Recording permissions, restart the app that launches MacWright, then run:

./setup.sh --doctor

Manual steps (if you prefer)

npm install && npm run build
brew install cliclick

Then grant Accessibility permission to your terminal in System Settings > Privacy & Security > Accessibility.

Configure with Hermes Agent

Add MacWright to ~/.hermes/config.yaml and restart Hermes:

mcp_servers:
  macwright:
    command: "node"
    args: ["/absolute/path/to/macwright/dist/index.js"]
    timeout: 120
    connect_timeout: 30

Then ask Hermes to use the mcp_macwright_* tools. On macOS, grant Accessibility and Screen Recording permissions to the app or service that launches Hermes/MacWright, then restart that app after changing permissions.

npm run setup prints a ready-to-copy MCP config with the absolute dist/index.js path.

Configure with Claude Desktop

Paste into ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "macwright": {
      "command": "node",
      "args": ["/absolute/path/to/macwright/dist/index.js"]
    }
  }
}

npm run setup prints this with the correct absolute path filled in.

Permissions

macOS privacy permissions are enforced by TCC and cannot be bypassed silently by scripts. Grant permissions to the controlling app that launches MacWright, for example Hermes runner, Claude Desktop, Terminal, iTerm, or Ghostty. Restart that app after changing permissions.

Permission Needed for How to grant
Accessibility mouse, keyboard, scroll, drag, read_ui, ax_click, window focus/resize, System Events UI reads System Settings > Privacy & Security > Accessibility
Screen Recording screenshot, screen_parse, wait_for_change, visual verification System Settings > Privacy & Security > Screen Recording
Automation prompts AppleScript, System Events, app-specific control, Safari automation macOS prompts when first used; approve the controlling app
Input Monitoring may be requested for keyboard/input events on some systems System Settings > Privacy & Security > Input Monitoring
Safari JavaScript from Apple Events safari_js and DOM automation helpers Safari > Settings > Advanced > enable developer features, then Safari > Develop > Allow JavaScript from Apple Events

Run ./setup.sh --doctor to check the current controlling app. Run ./setup.sh --open-settings to open the relevant Privacy panes.

Safety note

MacWright can control your desktop and includes powerful tools such as run_shell and run_applescript. Treat it like local admin automation. Do not expose the stdio server over an unauthenticated network bridge, and only connect it to MCP clients you trust. See SECURITY.md.

Tools

Screenshots

screenshot — Capture the screen or a region. Returns JPEG by default (131KB vs 11.8MB for PNG).

{ "format": "jpg", "maxWidth": 1280 }
{ "format": "png" }
{ "x": 100, "y": 100, "width": 800, "height": 600 }

Avg: 431ms (JPEG 1280px) | 4110ms (PNG full)

screen_parse — Use OmniParser V2 to detect all UI elements on screen via YOLO + OCR + Florence-2. Returns structured elements with labels, types, and screen coordinates ready for clicking. Use for native app automation or anti-bot websites where DOM tools can't work. Requires OmniParser server on port 8650.

{}
{ "x": 100, "y": 100, "width": 800, "height": 600 }
{ "timeout": 90000 }

Returns elements with screen-pixel coordinates:

{ "elements": [{ "id": 0, "type": "text", "label": "File", "x": 126, "y": 13 },
               { "id": 34, "type": "icon", "label": "Allow", "x": 924, "y": 375, "interactive": true }],
  "total": 220, "textElements": 89, "iconElements": 131, "elapsed": 8.09 }

Avg: 8-20s depending on screen complexity


Mouse

click — Left click at coordinates. Avg: 284ms.

{ "x": 864, "y": 476 }
{ "x": 864, "y": 476, "screenshot": true }

double_click — Double-click. Avg: 469ms.

right_click — Right-click (context menu). Avg: 280ms.

triple_click — Triple-click (select line/paragraph).

move_mouse — Move cursor without clicking. Supports smooth movement with steps/duration. Avg: 266ms.

drag — Click-drag between two points. Avg: 362ms.

{ "startX": 100, "startY": 100, "endX": 500, "endY": 300 }

get_mouse_position — Returns current cursor position as {"x": N, "y": N}. Avg: 255ms.

hover_and_wait — Move mouse to coordinates or an AX element by label, wait, then screenshot. Great for revealing tooltips.

{ "x": 500, "y": 300, "delay": 1000 }
{ "text": "Wi-Fi", "app": "System Settings" }

inspect_element — Inspect the accessibility element at screen coordinates. Returns role, title, value, and available actions.

All mouse tools accept optional wait (ms) and screenshot (bool) params to chain actions. All support screenshotCoords: true to auto-scale coordinates from screenshot pixels to screen pixels.


Scroll

scroll — Scroll at screen coordinates using CGEvents. Moves mouse to position first, then scrolls. Target app must be frontmost. CGEvents scroll whatever element is under the cursor — use this for custom overflow containers (divs with overflow:auto/scroll). For page-level scrolling in Safari, prefer scroll_page.

{ "x": 500, "y": 500, "amount": 5 }
{ "x": 500, "y": 500, "amount": -5 }
{ "x": 500, "y": 500, "amount": 3, "horizontalAmount": 2 }
  • amount: lines to scroll (positive = down, negative = up)
  • horizontalAmount: optional horizontal scroll (positive = right, negative = left)

Keyboard

Requires Accessibility permission for your terminal app.

type_text — Type a string at the current cursor. Avg: 644ms (AppleScript) / 1204ms (cliclick fallback). Unicode/emoji auto-detected and pasted via clipboard.

{ "text": "Hello, World!" }
{ "text": "search query", "slowly": true }
{ "text": "search term", "submit": true, "wait": 1000 }
{ "text": "hello", "field": "Search", "app": "Finder" }
  • slowly: type one character at a time (50ms delay) — triggers autocomplete/keystroke handlers
  • submit: press Enter after typing
  • clear: select all (Cmd+A) before typing — replaces existing content
  • field: target a specific input field by its accessibility label
  • fields: batch mode — fill multiple fields in one call

press_key — Press a key or key combination. Avg: 342–762ms.

{ "key": "a", "modifiers": ["cmd"] }
{ "key": "return" }
{ "key": "tab" }

Named keys: return, tab, esc, delete, space, arrow-up/down/left/right, f1f16

Modifiers: cmd, ctrl, alt, shift


Clipboard

get_clipboard — Read clipboard text. Supports text, HTML, image, and file formats. Avg: 205ms.

set_clipboard — Write text to clipboard. Avg: 202ms.

{ "text": "https://example.com" }

Windows & Apps

get_screen_size — Returns screen dimensions, display info, dock position, dark mode status. Avg: 509ms.

get_frontmost_app — Returns frontmost app name, window title, bounds, menus, and focused element. Avg: 340ms.

list_windows — Lists all visible apps with names, titles, and window bounds. Avg: 375ms.

open_app — Launch or activate an app by name. Supports opening files/URLs, waiting for windows. Avg: 236ms.

{ "name": "Safari" }
{ "name": "TextEdit", "url": "/path/to/file.txt" }

focus_window — Bring an app's window to front. Supports title filter and bounds. Avg: 320ms.

{ "app": "Finder" }

resize_window — Resize and/or move an application window. Supports snapping to halves, thirds, quarters.

{ "app": "Safari", "x": 100, "y": 100, "width": 1200, "height": 800 }
{ "app": "Safari", "half": "left" }
{ "app": "Safari", "maximize": true }

close_window — Close an app's window. Supports title filter, close all, and save options.

quit_app — Quit an application. Supports force quit.


Native UI (Accessibility API)

These tools interact with native macOS applications using the Accessibility API — no screenshots or coordinates needed.

read_ui — Read all UI elements (buttons, inputs, checkboxes, menus, etc.) from a native app. Returns labels, roles, coordinates, and state.

{ "app": "System Settings" }
{ "app": "Finder", "role": "button" }
{ "app": "System Settings", "text": "Wi-Fi", "exact": true }
{ "app": "Notes", "interactiveOnly": true }

ax_click — Click a native UI element by its text label. Auto-retries until found.

{ "text": "General", "app": "System Settings" }
{ "text": "Allow", "app": "Safari", "timeout": 5000 }

ax_action — Perform accessibility actions: press, confirm, showMenu, increment, decrement, setValue, pick (for popups/dropdowns), focus, getValue.

{ "text": "Dark Mode", "action": "press", "app": "System Settings" }
{ "text": "Font Size", "action": "pick", "value": "14", "app": "TextEdit" }
{ "text": "Volume", "action": "getValue", "app": "System Settings" }

ax_drag — Drag from one AX element to another by label.

ax_read_table — Read native table/outline data. Supports row search, click/double-click on matched rows.

wait_for_ui — Poll the accessibility tree until an element appears (or disappears with gone: true).

{ "text": "Connected", "app": "System Settings", "timeout": 10000 }
{ "text": "Loading", "app": "Safari", "gone": true }

get_selected_text — Read selected text from any app via AXSelectedText.

click_menu_item — Click a menu bar item by path, list shortcuts, or execute keyboard shortcuts.

{ "menuPath": ["File", "Save"], "app": "TextEdit" }
{ "menuPath": ["Edit"], "list": true }

dismiss_sheet — Detect and dismiss modal sheets/dialogs.

open_spotlight — Open Spotlight, type a query, and launch apps.

navigate_file_dialog — Navigate macOS file open/save dialogs via Cmd+Shift+G.

navigate_system_pref — Open and navigate to a System Settings section.


Safari

safari_url — Navigate Safari to a URL or get the current URL. No Accessibility needed. Avg: 331–372ms. Supports waitUntil: 'load' or 'domcontentloaded' for reliable page load waiting (polls readyState, 15s max).

{ "url": "https://example.com", "waitUntil": "load", "screenshot": true }
{ "url": "https://example.com", "wait": 1000, "screenshot": true }
{}

safari_js — Execute JavaScript in the current Safari tab. Requires "Allow JavaScript from Apple Events" in Safari Developer settings. Multi-line code with return is auto-wrapped in an IIFE. Objects/arrays auto-serialized as JSON. Avg: ~350ms.

{ "code": "document.title" }
{ "code": "window.scrollY" }
{ "code": "(function(){ const x = 42; return x; })()" }
{ "code": "document.body.textContent", "iframe": "#editor_iframe" }
  • iframe: CSS selector of a same-origin iframe to execute code inside

safari_navigate_back — Go back in Safari history (equivalent to pressing the back button).

safari_navigate_forward — Go forward in Safari history (Cmd+]).

safari_tabs — Manage Safari tabs: list, open, close, or switch.

{ "action": "list" }
{ "action": "new", "url": "https://example.com" }
{ "action": "select", "index": 2 }
{ "action": "close" }

safari_reload — Reload the current Safari page. Equivalent to Cmd+R.

scroll_page — Scroll the current Safari page. Focus-free — works without making Safari frontmost. Returns new scroll position. Prefer over scroll for web page content.

{ "y": 500 }
{ "x": 300, "y": 0 }
{ "y": 0, "absolute": true }
  • absolute: use window.scrollTo() instead of scrollBy() — scroll to exact position

find_element — Resolve a CSS selector to screen coordinates using the viewport formula. Returns center point ready for click().

{ "selector": "[name='custname']" }
{ "selector": "button.submit", "all": true }
{ "selector": "a", "href": "/issues" }

click_element — Click a DOM element by CSS selector in one step (scrolls into view, then clicks). Supports auto-waiting.

{ "selector": "a.btn-primary", "wait": 1000 }
{ "selector": "button", "doubleClick": true }
{ "selector": "a[data-hovercard-type='issue']", "jsClick": true }
{ "selector": "#finish", "timeout": 10000 }
  • timeout: auto-wait for element to appear before clicking (polls every 200ms)
  • doubleClick: double-click instead of single click
  • modifiers: hold modifier keys (Alt, Control, Meta, Shift)
  • jsClick: use DOM .click() instead of native mouse — needed for SPAs (GitHub, React apps) where native clicks don't trigger JS navigation
  • href: filter by URL pattern (substring match on href attribute)

click_text — Click a visible element by its text content. No CSS selector needed. Supports auto-waiting.

{ "text": "Submit" }
{ "text": "Sign in", "elementType": "button" }
{ "text": "Exact match", "exact": true }
{ "text": "Issues", "jsClick": true }

hover_element — Move mouse over a DOM element without clicking. Triggers CSS :hover states, dropdown menus, tooltips.

wait_for_element — Wait until a CSS selector appears in the DOM. Useful after navigation or AJAX calls.

wait_for_text — Wait until specific text appears (or disappears with gone: true) on the page.

wait_for_url — Wait until the current URL contains (or stops containing with gone:true) a pattern.

wait_for_function — Wait until a JavaScript expression returns truthy. Most flexible wait tool.

get_page_text — Extract clean visible text from the page (strips scripts/styles/nav). Optionally scoped to a CSS selector.

fill_form — Fill multiple form fields at once. Supports text inputs, textareas, checkboxes, radio buttons, and select dropdowns. Triggers input/change events for React/Vue/Angular compatibility.

{
  "fields": [
    { "selector": "[name='custname']", "value": "Alice" },
    { "selector": "[name='size']", "value": "large", "type": "select" },
    { "selector": "[name='topping']", "value": "true", "type": "checkbox" }
  ]
}

get_element_attr — Get a property or attribute of a DOM element.

screenshot_element — Take a cropped screenshot of a specific DOM element by CSS selector.

get_page_info — Get the current page's title, URL, and document ready state in one call.

get_form_fields — Discover all form fields on the page with their types, names, and current values.

get_links — Extract all links from the current page with their text and URLs.

is_element_visible — Check whether an element exists AND is visible in the current viewport.

scroll_to_element — Scroll the page until an element is visible using scrollIntoView().

focus_element — Focus a DOM element without clicking (triggers focus event).

page_snapshot — Get a comprehensive snapshot of the current Safari page in one call: URL, title, h1, meta description, visible text preview, link count, form field count, image count, scroll position, viewport size, page height, and heading outline (h1-h3).

scroll_element — Scroll a specific DOM element (overflow container) in Safari.

drag_element — Drag one DOM element to another in Safari using CSS selectors.

get_table_data — Extract data from an HTML table as JSON rows keyed by header text.


Shell

run_shell — Execute a shell command and return stdout/stderr/exit code. Much faster than Terminal+clipboard. Homebrew PATH is included (jq, python3, git, node all available).

{ "command": "git log --oneline -10", "cwd": "/path/to/repo" }
{ "command": "jq '.field' file.json" }
{ "command": "python3 -c 'import json; ...'", "timeout": 10000 }

Non-zero exit codes are shown in output text but do NOT set isError (grep/diff/test return 1 legitimately). Only timeouts set isError.


AppleScript / JXA

run_applescript — Run AppleScript or JavaScript for Automation (JXA). Avg: 228ms.

{ "script": "return \"hello\"" }
{ "script": "Application('Finder').name()", "language": "JavaScript" }

Multiline scripts use a temp file approach to avoid escaping issues.


Utility

wait — Pause for N milliseconds. ~190ms server overhead.

{ "ms": 1000 }

wait_for_change — Take a baseline screenshot and poll until the screen changes. Useful for waiting on animations or loading states.

{ "app": "Safari", "timeout": 10000 }
{ "app": "Finder", "stable": true }
  • stable: wait until the screen stops changing (two consecutive identical screenshots)

send_notification — Show a macOS notification banner. No permissions needed.

{ "title": "Done", "message": "Task complete", "sound": true }

Coordinate System

MacWright uses logical pixels (not physical pixels). On a Retina display, screen coordinates are typically 1728x952.

Screenshots default to 1280px wide. To pass screenshot coordinates directly to tools, use screenshotCoords: true — this auto-scales from screenshot pixels to screen pixels. No manual math needed.

{ "x": 640, "y": 400, "screenshotCoords": true }

When combined with app, uses window-based scaling for higher accuracy.


Real-World Workflow Examples

Web form automation

safari_url({ "url": "https://example.com/form", "wait": 2000 })
page_snapshot()              // understand the page
get_form_fields()            // discover all form fields
fill_form({ "fields": [
  { "selector": "[name='email']", "value": "user@example.com" },
  { "selector": "[name='plan']", "value": "Pro", "type": "select" },
  { "selector": "[name='agree']", "value": "true", "type": "checkbox" }
]})
click_element({ "selector": "button[type='submit']", "wait": 2000 })
wait_for_url({ "pattern": "/success" })
get_page_text({ "selector": ".confirmation" })

Data extraction pipeline

safari_url({ "url": "https://news.ycombinator.com", "wait": 2000 })
safari_js({ "code": "Array.from(document.querySelectorAll('.titleline a')).map(a => ({title: a.textContent, url: a.href}))" })
// Process with shell:
run_shell({ "command": "python3 -c 'import json,sys; data=json.load(sys.stdin); [print(f\"{i+1}. {d[\"title\"]}\") for i,d in enumerate(data)]'", "input": "<output from safari_js>" })

Navigate and interact by visible text

safari_url({ "url": "https://example.com", "wait": 2000 })
click_text({ "text": "Sign in" })
wait_for_element({ "selector": "#login-form" })
fill_form({ "fields": [{ "selector": "#email", "value": "user@test.com" }] })
click_text({ "text": "Continue", "elementType": "button" })

Native app automation

open_app({ "name": "System Settings" })
ax_click({ "text": "Wi-Fi", "app": "System Settings" })
wait_for_ui({ "text": "Wi-Fi", "app": "System Settings" })
read_ui({ "app": "System Settings" })

System Settings automation

navigate_system_pref({ "section": "Displays" })
read_ui({ "app": "System Settings", "interactiveOnly": true })
ax_click({ "text": "Resolution", "app": "System Settings" })

Performance Summary

Tool Avg Time Notes
screenshot (JPEG 1280px) 431ms ~131KB — recommended default
screenshot (PNG full res) 4110ms ~11.8MB — use only when needed
get_screen_size 509ms JXA NSScreen
get_frontmost_app 340ms AppleScript
list_windows 375ms AppleScript visible apps
get_clipboard 205ms pbpaste
set_clipboard 202ms pbcopy via spawn
click 284ms cliclick, 10-click avg
move_mouse 266ms cliclick
double_click 469ms cliclick
right_click 280ms cliclick
drag 362ms cliclick
type_text 329–644ms AppleScript primary, cliclick fallback
press_key 342–762ms AppleScript primary, cliclick fallback
scroll 346ms cliclick + JXA CGEvent
open_app 236ms /usr/bin/open
focus_window 320ms AppleScript activate
run_applescript 228ms osascript
get_mouse_position 255ms cliclick p:
wait ~190ms overhead setTimeout + MCP round-trip
safari_url 331–372ms AppleScript
safari_js ~350ms Needs Safari Developer setting

Architecture

src/
├── index.ts              # MCP server entry point
└── tools/
    ├── screenshot.ts      # screenshot, screen_parse
    ├── screen.ts          # get_screen_size
    ├── mouse.ts           # click, double_click, right_click, triple_click, move_mouse, drag, get_mouse_position, hover_and_wait, inspect_element
    ├── keyboard.ts        # type_text, press_key
    ├── scroll.ts          # scroll
    ├── clipboard.ts       # get_clipboard, set_clipboard
    ├── shell.ts           # run_shell
    ├── applescript.ts     # run_applescript
    ├── notification.ts    # send_notification
    ├── wait.ts            # wait, wait_for_change
    ├── window/
    │   ├── management.ts  # get_frontmost_app, list_windows, get_screen_size helpers
    │   ├── window-actions.ts  # open_app, focus_window, resize_window, close_window, quit_app
    │   ├── dialog.ts      # dismiss_sheet, navigate_file_dialog, open_spotlight
    │   ├── system.ts      # click_menu_item, navigate_system_pref
    │   ├── native-ui.ts   # wait_for_ui, ax_click, ax_action, ax_drag
    │   └── read-ui.ts     # read_ui, ax_read_table, get_selected_text
    └── safari/
        ├── navigation.ts  # safari_url, safari_navigate_back/forward, safari_tabs, safari_reload
        ├── elements.ts    # find_element, click_element, click_text, hover_element, focus_element, scroll_to_element
        ├── waiting.ts     # wait_for_element, wait_for_text, wait_for_url, wait_for_function
        ├── forms.ts       # fill_form, get_form_fields
        ├── page-reading.ts  # get_page_text, get_page_info, get_links, get_element_attr, is_element_visible, get_table_data
        ├── page-snapshot.ts # page_snapshot, screenshot_element
        ├── scrolling.ts   # scroll_page, scroll_element, drag_element
        └── tabs.ts        # safari_tabs

Testing

npm test                  # portable schema/tool-catalog validation
npm run test:macos-smoke  # native desktop smoke suite on a configured Mac

All tools verified on Apple Silicon Mac (M-series), screen resolution 1728x952. Safari-specific tools require "Allow JavaScript from Apple Events" in Safari developer settings.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes in src/tools/
  4. Run npm run build to compile
  5. Run npm test to verify the MCP schema/tool catalog
  6. Run npm run test:macos-smoke when native desktop behavior changed
  7. Submit a pull request

License

MIT

About

MCP server for reliable native macOS desktop control from AI agents

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors