UI Automata

Declarative workflow engine for Windows UI automation, built for reliable unattended execution and AI agent use.

Automata.Demo.mp4

The problem

Windows work that is too interactive for a script and too tedious to do by hand — checking a service status in Event Viewer, extracting a value from a legacy desktop app, configuring a settings dialog with no API — is exactly what AI agents should handle. But every step in a Windows UI workflow can fail in ways a script cannot see:

Timing: a button click completes before the app finishes processing. The UI looks ready; it is not.
Transient disabled state: the element exists and is visible, but temporarily disabled. The click fires, nothing happens, no error is returned.
Popup interruptions: a modal dialog captures focus mid-step. The script waits for an outcome that will never come.
Stale handles: the app rebuilds its UI after a navigation. Cached references point to the wrong place; clicks land silently on the wrong target.
Focus loss: a keypress meant for one field lands on another.

These are not edge cases — they are routine in any real Windows application.

Why existing tools fall short

	UI Automata	AutoHotkey	UIPath	Selenium	Vision agents
Native Windows apps	✓	✓	✓	✗	✓
Structured recovery	✓	✗	partial	✗	✗
Agent-native	✓	✗	✗	✗	✓
Audit trail	✓	✗	✓	partial	✗
Execution speed	fast	fast	fast	fast	slow
Cost per run	low	low	high	low	high
Resolution-stable	✓	✗	partial	—	✗

AutoHotkey clicks pixel coordinates and Sleeps — no observable outcomes, no recovery. UIPath requires every edge case scripted in advance by a specialist. Selenium covers browsers only. pywinauto wraps UIA in Python but is fully imperative. Vision-based agents are slow, expensive per run, and fragile to layout changes.

The approach

Every action is an intent, not a command.

Each step declares an action, an expect condition, and an optional recovery handler. The engine runs the same lifecycle for every step:

Execute the action
Poll the expect condition every 100ms
Condition passes → advance; timeout → check recovery handlers, then retry, skip, or fail

- intent: click Open button
  action:
    type: Click
    scope: main_window
    selector: ">> [role=button][name=Open]"
  expect:
    type: DialogPresent
    scope: main_window
  timeout: 10s

No sleeps. No guessing. No silent failures. Recovery handlers are declared once and fire wherever their trigger condition is met; known failure modes are handled in one place, not scattered through step logic.

Elements are identified by role and name, not pixel coordinates — selectors survive resolution changes and most app updates.
Wrong-window interactions are caught at the execution layer before any action fires.
Every action, condition check, and recovery handler is logged; failures include the full execution trace and a UI tree dump.

The shadow DOM

Windows UI Automation is a cross-process RPC protocol — every element query is a round-trip to the target process. Walking a nested element path issues one call per level; a 20-step workflow that re-queries handles on every step pays that cost repeatedly.

ui-automata maintains a shadow DOM: a cached mirror of the live element tree. Handles are resolved once and reused. When a handle goes stale (the app rebuilt its UI), the engine walks up to the nearest live ancestor, re-queries downward, and continues without failing the step. This is the inverse of React's virtual DOM — instead of pushing changes into a UI, we efficiently read from one we do not control.

HWND locking: when a Root anchor is first resolved, the engine records the exact OS window handle. All subsequent lookups go directly to that HWND. Focus theft, title changes, new windows with similar names — none cause the anchor to drift. If the original window is destroyed, the workflow fails explicitly rather than silently attaching to something else.

Tier	Lifetime	On stale
Root	Process lifetime	Fatal — window is gone
Session	One open/close cycle	Re-resolved on next use
Stable	While parent window is open	Re-queried from nearest live ancestor
Ephemeral	Single phase	Released on phase exit

Selectors

CSS-like paths over Windows UI Automation properties:

Attribute	UIA property
`role`	Control type / accessibility role
`name`	Accessible name
`id`	UIA AutomationId (survives localization)

>> [role=edit][name='File name:']            # descendant edit field
>  [role=button][name^=Don][name$=Save]      # direct child: "Don't Save"
> [role=list item]:nth(0)                    # first list item
>> [role=list item][name~=Wing]:parent       # parent of matching item
>> [role=tab item]:nth(2) > [role=button]    # button inside third tab
>> [id=SettingsPageAbout_New]                # by AutomationId
>> [role=button|menu item]                   # OR: matches either role

String operators: = exact, ~= contains, ^= starts with, $= ends with. Combinators: > immediate child, >> any descendant. Modifiers: :nth(n), :parent, :ancestor(n). Button[name=Open] is shorthand for [role=button][name=Open].

Works across Win32, WPF, WinForms, WinUI, and UWP.

Capabilities

Conditions — usable as expect, precondition, recovery trigger, or flow-control predicate:

Element: ElementFound, ElementEnabled, ElementVisible, ElementHasText (exact / contains / starts_with / regex / non_empty), ElementHasChildren
Window: WindowWithAttribute (title, PID, automation ID), WindowWithState (active / visible), WindowClosed, DialogPresent, DialogAbsent, ForegroundIsDialog
Browser: TabWithAttribute (title / URL match on a CDP tab anchor), TabWithState (JS expression evaluated in a tab — truthy = true; use to wait for page readiness)
System: ProcessRunning, FileExists, ExecSucceeded (exit code 0), EvalCondition (boolean expression against outputs/locals), Always
Logic: AllOf, AnyOf, Not

Actions:

Interaction: Click, DoubleClick, Hover, ClickAt (fractional coordinates), Invoke (IInvokePattern — works on off-screen / virtualised elements), TypeText, SetValue (ValuePattern, no keystroke simulation), PressKey, Focus, ScrollIntoView, ActivateWindow, MinimizeWindow, CloseWindow, DismissDialog
Data: Extract (UIA attribute → output variable), Eval (expression → output variable), WriteOutput (output variable → file)
System: Exec (external process, capture stdout), MoveFile, Sleep
Browser: BrowserNavigate (navigate a tab anchor to a URL), BrowserEval (evaluate JS in a tab, store result)
Control: NoOp (wait for a condition without acting)

Control flow: phases can jump to any named phase — loops, branches, early exits. finally phases run unconditionally.

Composition: workflows declare input params and named outputs and call other workflows as subroutines.

Tooling: JSON Schema for autocomplete and inline validation; built-in linter catches unknown types, invalid selectors, missing fields, and undeclared references before the workflow runs.

Install (Windows)

PowerShell -ExecutionPolicy Bypass -Command "iwr https://raw.githubusercontent.com/visioncortex/ui-automata/refs/heads/main/install/install-windows.ps1 | iex"

Example (Windows 11)

# yaml-language-server: $schema=https://raw.githubusercontent.com/visioncortex/ui-automata/main/workflow-schema.json
name: notepad_hello
description: Open Notepad, type a message, and save the file.

defaults:
  timeout: 5s

launch:
  exe: notepad.exe
  wait: new_window

anchors:
  notepad:
    type: Root
    process: Notepad
    selector: "[name~=Notepad]"

  editor:
    type: Stable
    parent: notepad
    selector: ">> [role=document][name='Text editor']"

  saveas_dialog:
    type: Ephemeral
    parent: notepad
    selector: "> [role=dialog][name^=Save]"

phases:

  - name: type_text
    mount: [notepad, editor]  # mounted before steps run
    steps:
      - intent: type text into editor
        action:
          type: TypeText
          scope: editor
          selector: "*"
          text: "Hello Automata"
        expect:
          type: ElementHasText
          scope: editor
          selector: "*"
          pattern:
            contains: "Hello Automata"

  - name: save_file
    mount: [saveas_dialog]
    unmount: [saveas_dialog]
    steps:
      - intent: activate keyboard shortcut for Save As
        action:
          type: PressKey
          scope: notepad
          selector: "*"
          key: "ctrl+shift+s"
        expect:
          type: DialogPresent
          scope: notepad

      - intent: type filename in Save As dialog
        action:
          type: SetValue
          scope: saveas_dialog
          selector: ">> [role=edit][name='File name:']"
          value: "hello-world"
        expect:
          type: ElementHasText
          scope: saveas_dialog
          selector: ">> [role=edit][name='File name:']"
          pattern:
            contains: "hello-world"

      - intent: click Save button
        action:
          type: Invoke
          scope: saveas_dialog
          selector: ">> [role=button][name=Save]"
        expect:
          type: DialogAbsent
          scope: notepad

Example (Windows 10)

# yaml-language-server: $schema=https://raw.githubusercontent.com/visioncortex/ui-automata/main/workflow-schema.json
name: notepad_hello
description: Open Notepad, type a message, and save the file.

defaults:
  timeout: 5s

launch:
  exe: notepad.exe
  wait: new_pid

anchors:
  notepad:
    type: Root
    selector: "[name~=Notepad]"

  editor:
    type: Stable
    parent: notepad
    selector: ">> [role=edit][name='Text Editor']"

  saveas_dialog:
    type: Ephemeral
    parent: notepad
    selector: ">> [role=dialog][name='Save As']"

phases:

  - name: type_text
    mount: [notepad, editor]
    steps:
      - intent: type text into editor
        action:
          type: TypeText
          scope: editor
          selector: "*"
          text: "Hello Automata"
        expect:
          type: ElementHasText
          scope: editor
          selector: "*"
          pattern:
            contains: "Hello Automata"

  - name: save_file
    mount: [saveas_dialog]
    unmount: [saveas_dialog]
    steps:
      - intent: activate keyboard shortcut for Save As
        action:
          type: PressKey
          scope: notepad
          selector: "*"
          key: "ctrl+shift+s"
        expect:
          type: DialogPresent
          scope: notepad

      - intent: type filename in Save As dialog
        action:
          type: SetValue
          scope: saveas_dialog
          selector: ">> [role=combo box][name='File name:'] > [role=edit]"
          value: "hello-world"
        expect:
          type: ElementHasText
          scope: saveas_dialog
          selector: ">> [role=combo box][name='File name:'] > [role=edit]"
          pattern:
            contains: "hello-world"

      - intent: click Save button
        action:
          type: Invoke
          scope: saveas_dialog
          selector: ">> [role=button][name=Save]"
        expect:
          type: DialogAbsent
          scope: notepad

Built for the AI agent era

The project includes an MCP server (automata-agent) that exposes the full automation engine to AI agents. This is a separate component, not part of the open-source library.

Agent-driven authoring

The MCP server gives an agent access to the live desktop: it queries the element tree, tests selectors, runs individual actions, and observes results — the same discovery loop a human would do with an inspector tool. From that exploration it writes the workflow. A human provides intent and reviews the result.

What the agent can do

desktop: list windows, walk the UIA element tree, test selectors live
vision: OCR and visual layout capture for apps that do not fully expose UIA
app: launch apps, list installed apps, manage windows via the taskbar
window: minimize, maximize, restore, reposition, or screenshot a window by HWND
run_actions: run ad-hoc UI automation steps without a workflow file
start_workflow: run a named workflow and stream per-phase progress until completion
workflow: list workflows, check status, cancel runs, browse run history, lint YAML
input: raw mouse and keyboard input — works on any window regardless of UIA support
clipboard: read or write the Windows clipboard
browser: control Microsoft Edge via CDP — navigate, evaluate JavaScript, read the DOM
file: read, write, copy, move, delete, glob, stat
system: shell execution, process management, system diagnostics
resources: browse the embedded workflow library

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
automata-browser		automata-browser
automata-windows		automata-windows
install		install
ui-automata		ui-automata
workflows		workflows
.gitignore		.gitignore
Cargo.toml		Cargo.toml
README.md		README.md
workflow-schema.json		workflow-schema.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UI Automata

The problem

Why existing tools fall short

The approach

The shadow DOM

Selectors

Capabilities

Install (Windows)

Example (Windows 11)

Example (Windows 10)

Built for the AI agent era

Agent-driven authoring

What the agent can do

About

Uh oh!

Releases 1

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

UI Automata

The problem

Why existing tools fall short

The approach

The shadow DOM

Selectors

Capabilities

Install (Windows)

Example (Windows 11)

Example (Windows 10)

Built for the AI agent era

Agent-driven authoring

What the agent can do

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Contributors

Uh oh!

Languages