Skip to content

feat(tool-server): implement await-ui-element tool#396

Merged
hubgan merged 4 commits into
mainfrom
feat/wait-tool
Jun 24, 2026
Merged

feat(tool-server): implement await-ui-element tool#396
hubgan merged 4 commits into
mainfrom
feat/wait-tool

Conversation

@hubgan

@hubgan hubgan commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a await-ui-element tool that blocks until a UI condition is satisfied or a timeout elapses.

This is one of the most impactful missing tools. Without it, agents have to poll manually with screenshot → describe → check loops — slow, token-heavy, and unreliable (fixed sleep delays either over-wait or fire before the screen settles). await-ui-element moves the poll loop server-side: one call in, a definitive verdict out.

Works on iOS, Android, and Chromium (CDP), polling the same accessibility / DOM tree as describe.

Conditions

  • visible <selector> — wait until an element appears on screen
  • hidden <selector> — wait until an element disappears
  • exists <selector> — wait until an element is in the tree
  • text <selector> <expectedText> — wait until the matched element contains the text

selector is { text?, identifier?, role? } — every field provided must match (case-insensitive substring). Polls internally; default timeoutMs 5000, default pollIntervalMs 400. Returns { success: boolean, elapsed: number }, with a note explaining what was seen on timeout.

{ "udid": "<UDID>", "condition": "visible", "selector": { "text": "Continue" } }

Also usable as a step inside run-sequence, so a tap → await-ui-element → tap chain runs in a single call.

Notable details

  • Multi-match aware — a substring selector can hit several nodes, so conditions evaluate the whole match set: visible holds if any match is on-screen, hidden only if none is (a zero-area container can't flip the verdict).
  • hidden guard — a selector that matches nothing counts as already-hidden; the note flags when the selector never matched, so a typo doesn't read as a silent success.
  • Deadline-clamped poll — a large pollIntervalMs can't overshoot timeoutMs.
  • Abort-aware — both the time sleep and the poll loop stop promptly on cancellation.

Changes

  • tools/await-ui-element/index.ts — new tool (factory, mirrors describe's service resolution)
  • utils/setup-registry.ts — registers the tool
  • tools/run-sequence/index.ts — allows await-ui-element as a step + arg-doc + example
  • test/await-ui-element.test.ts — new test suite
  • skills/rules/argent.md + argent-device-interact/SKILL.md — document await-ui-element, replace the "don't poll screenshot" guidance with it

Test plan

  • time sleeps and succeeds without touching a device
  • visible succeeds once the element appears; times out with a diagnostic note when it doesn't
  • exists / hidden / text happy paths; text timeout note reports last-seen text
  • hidden instant success and the "selector never matched" flag
  • poll sleep clamped to the deadline (large pollIntervalMs can't overshoot timeoutMs)
  • prompt cancellation aborts both the poll loop and the time sleep
  • multi-match evaluation (findAll / evaluateMatches)
  • schema validation (required fields per condition, durationMs cap)

@hubgan hubgan marked this pull request as ready for review June 23, 2026 10:18

@latekvo latekvo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a nitpick but i think it might be quite imporant for the agent - let's name this tool one of these names whichever you believe to be the most fitting: wait-for, wait-for-ui or await-ui-element

The reason is that by default the only thing the agent sees is tool name.
The agent then has to manually call ToolSearch tool to view descriptions of the tools it wants to call.
Given that context, wait alone may be too broad or misleading to the agent.

Note that we already use await in our await_user_selection tool, so following that tool naming convention would make some sense.

Comment thread packages/skills/skills/argent-device-interact/SKILL.md Outdated
Comment thread packages/tool-server/src/tools/run-sequence/index.ts Outdated
Comment thread packages/tool-server/src/tools/run-sequence/index.ts Outdated
Comment thread packages/tool-server/src/tools/wait/index.ts Outdated
Comment thread packages/tool-server/src/tools/wait/index.ts Outdated
Comment thread packages/tool-server/src/tools/wait/index.ts Outdated
Comment thread packages/tool-server/src/tools/wait/index.ts Outdated
Comment thread packages/tool-server/test/wait.test.ts Outdated

@latekvo latekvo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed end-to-end, including running the built tool-server against a live iOS simulator over HTTP: visible / text succeed once the element appears, hidden correctly refuses to fire while the element is still on screen, an unmet wait inside run-sequence stops the sequence so the trailing tap never runs, and the schema rejects text without expectedText. Cross-platform tree handling, abort/cancellation propagation, and the per-fetch deadline all check out. All eight earlier review threads are addressed in this revision (replied + resolved individually).

The inline notes below are minor and don't block.

One item that can't be anchored to a line: the PR title and description still document the previous wait tool and its time / durationMs sleep condition, which were removed in the rename to await-ui-element. The test-plan checkboxes and the Changes file paths (tools/wait/index.ts, test/wait.test.ts) are stale relative to the merged code and would carry into the squash-commit message — worth refreshing before merge.

exists — the selector matches an element anywhere in the tree.
visible — the selector matches an element with a non-zero on-screen frame.
hidden — the selector matches nothing, or only a zero-area element (e.g. a spinner that disappeared).
text — the element matched by the selector contains expectedText (case-insensitive substring).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This describes text as checking "the element matched by the selector," but the selector is a substring match that can hit several nodes and the condition only inspects the first match in reading order. When a broad selector matches more than one element and a lower one — not the topmost — is the one containing expectedText, the wait reports failure even though an on-screen matching element does contain the text. The single-element phrasing doesn't signal that the verdict is keyed to the topmost match only, so an agent using a loose selector can be surprised by a false negative.


// Every node matching the selector in the subtree, EXCLUDING `root` itself.
// `root` is the synthetic full-screen container every describe adapter injects
// (iOS `AXGroup`, Android `hierarchy`/`Screen`, Chromium `html`; frame

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rationale states every adapter injects a synthetic full-screen root with frame 0,0,1,1, but on Chromium the root is the real <html> element (describeChromium walks document.documentElement) and its frame comes from getBoundingClientRect, not a synthetic 0,0,1,1. Excluding it is still correct, but the stated reasoning misdescribes the Chromium case. A side effect worth noting: because <html>'s own id / aria-label / author role sit on the excluded root, a selector targeting those attributes matches nothing on Chromium.

// describe prunes off-screen / zero-size nodes on Chromium and the compressed
// Android dump, and iOS AX only returns on-screen leaves — so a non-zero frame
// area is a cheap, reliable proxy for "visible".
function isVisible(node: DescribeNode): boolean {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findAll matches against every node in the tree, while describe's rendered body only emits nodes that pass its content/role filter. As a result a role- or identifier-only selector can match a structural container (e.g. an unlabeled AXGroup) that describe never lists, so visible / exists can hold for an element the agent never saw in describe's output. The comment frames the match set as mirroring format-tree, but only the root exclusion is shared — the per-node visibility gate is not.

// such a tree is not evidence the element is gone, so we must not let `hidden`
// resolve positively off it — otherwise "AX is down" reads as "element hidden".
function isBlindRead(data: DescribeTreeData): boolean {
return data.tree.children.length === 0 && Boolean(data.hint || data.should_restart);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isBlindRead only treats an empty tree as unreliable when hint or should_restart is set, and only the iOS path ever sets those. On Android and Chromium an empty tree is always treated as a confirmed read, so a hidden (or text-absence) wait resolves immediately on any empty tree, including a momentary blank frame mid-navigation. When the element matched on an earlier poll (everMatched true) and a later poll lands on a transient empty tree, hidden returns success with no caveat — which can release a gated tap against a screen that only briefly went blank.

@hubgan hubgan changed the title feat(tool-server): implement wait tool feat(tool-server): implement await-ui-element tool Jun 24, 2026
@hubgan hubgan merged commit 5558774 into main Jun 24, 2026
7 checks passed
@hubgan hubgan deleted the feat/wait-tool branch June 24, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants