feat(tool-server): implement await-ui-element tool#396
Conversation
latekvo
left a comment
There was a problem hiding this comment.
This sounds like a nitpick but i think it might be quite imporant for the agent - let's name this tool one of these names whichever you believe to be the most fitting: wait-for, wait-for-ui or await-ui-element
The reason is that by default the only thing the agent sees is tool name.
The agent then has to manually call ToolSearch tool to view descriptions of the tools it wants to call.
Given that context, wait alone may be too broad or misleading to the agent.
Note that we already use await in our await_user_selection tool, so following that tool naming convention would make some sense.
latekvo
left a comment
There was a problem hiding this comment.
Reviewed end-to-end, including running the built tool-server against a live iOS simulator over HTTP: visible / text succeed once the element appears, hidden correctly refuses to fire while the element is still on screen, an unmet wait inside run-sequence stops the sequence so the trailing tap never runs, and the schema rejects text without expectedText. Cross-platform tree handling, abort/cancellation propagation, and the per-fetch deadline all check out. All eight earlier review threads are addressed in this revision (replied + resolved individually).
The inline notes below are minor and don't block.
One item that can't be anchored to a line: the PR title and description still document the previous wait tool and its time / durationMs sleep condition, which were removed in the rename to await-ui-element. The test-plan checkboxes and the Changes file paths (tools/wait/index.ts, test/wait.test.ts) are stale relative to the merged code and would carry into the squash-commit message — worth refreshing before merge.
| exists — the selector matches an element anywhere in the tree. | ||
| visible — the selector matches an element with a non-zero on-screen frame. | ||
| hidden — the selector matches nothing, or only a zero-area element (e.g. a spinner that disappeared). | ||
| text — the element matched by the selector contains expectedText (case-insensitive substring). |
There was a problem hiding this comment.
This describes text as checking "the element matched by the selector," but the selector is a substring match that can hit several nodes and the condition only inspects the first match in reading order. When a broad selector matches more than one element and a lower one — not the topmost — is the one containing expectedText, the wait reports failure even though an on-screen matching element does contain the text. The single-element phrasing doesn't signal that the verdict is keyed to the topmost match only, so an agent using a loose selector can be surprised by a false negative.
|
|
||
| // Every node matching the selector in the subtree, EXCLUDING `root` itself. | ||
| // `root` is the synthetic full-screen container every describe adapter injects | ||
| // (iOS `AXGroup`, Android `hierarchy`/`Screen`, Chromium `html`; frame |
There was a problem hiding this comment.
This rationale states every adapter injects a synthetic full-screen root with frame 0,0,1,1, but on Chromium the root is the real <html> element (describeChromium walks document.documentElement) and its frame comes from getBoundingClientRect, not a synthetic 0,0,1,1. Excluding it is still correct, but the stated reasoning misdescribes the Chromium case. A side effect worth noting: because <html>'s own id / aria-label / author role sit on the excluded root, a selector targeting those attributes matches nothing on Chromium.
| // describe prunes off-screen / zero-size nodes on Chromium and the compressed | ||
| // Android dump, and iOS AX only returns on-screen leaves — so a non-zero frame | ||
| // area is a cheap, reliable proxy for "visible". | ||
| function isVisible(node: DescribeNode): boolean { |
There was a problem hiding this comment.
findAll matches against every node in the tree, while describe's rendered body only emits nodes that pass its content/role filter. As a result a role- or identifier-only selector can match a structural container (e.g. an unlabeled AXGroup) that describe never lists, so visible / exists can hold for an element the agent never saw in describe's output. The comment frames the match set as mirroring format-tree, but only the root exclusion is shared — the per-node visibility gate is not.
| // such a tree is not evidence the element is gone, so we must not let `hidden` | ||
| // resolve positively off it — otherwise "AX is down" reads as "element hidden". | ||
| function isBlindRead(data: DescribeTreeData): boolean { | ||
| return data.tree.children.length === 0 && Boolean(data.hint || data.should_restart); |
There was a problem hiding this comment.
isBlindRead only treats an empty tree as unreliable when hint or should_restart is set, and only the iOS path ever sets those. On Android and Chromium an empty tree is always treated as a confirmed read, so a hidden (or text-absence) wait resolves immediately on any empty tree, including a momentary blank frame mid-navigation. When the element matched on an earlier poll (everMatched true) and a later poll lands on a transient empty tree, hidden returns success with no caveat — which can release a gated tap against a screen that only briefly went blank.
Summary
Adds a
await-ui-elementtool that blocks until a UI condition is satisfied or a timeout elapses.This is one of the most impactful missing tools. Without it, agents have to poll manually with
screenshot → describe → checkloops — slow, token-heavy, and unreliable (fixedsleepdelays either over-wait or fire before the screen settles).await-ui-elementmoves the poll loop server-side: one call in, a definitive verdict out.Works on iOS, Android, and Chromium (CDP), polling the same accessibility / DOM tree as
describe.Conditions
visible <selector>— wait until an element appears on screenhidden <selector>— wait until an element disappearsexists <selector>— wait until an element is in the treetext <selector> <expectedText>— wait until the matched element contains the textselector is
{ text?, identifier?, role? }— every field provided must match (case-insensitive substring). Polls internally; defaulttimeoutMs5000, defaultpollIntervalMs400. Returns{ success: boolean, elapsed: number }, with anoteexplaining what was seen on timeout.{ "udid": "<UDID>", "condition": "visible", "selector": { "text": "Continue" } }Also usable as a step inside
run-sequence, so a tap → await-ui-element → tap chain runs in a single call.Notable details
visibleholds if any match is on-screen,hiddenonly if none is (a zero-area container can't flip the verdict).hiddenguard — a selector that matches nothing counts as already-hidden; thenoteflags when the selector never matched, so a typo doesn't read as a silent success.pollIntervalMscan't overshoottimeoutMs.timesleep and the poll loop stop promptly on cancellation.Changes
tools/await-ui-element/index.ts— new tool (factory, mirrorsdescribe's service resolution)utils/setup-registry.ts— registers the tooltools/run-sequence/index.ts— allowsawait-ui-elementas a step + arg-doc + exampletest/await-ui-element.test.ts— new test suiteskills/rules/argent.md+argent-device-interact/SKILL.md— documentawait-ui-element, replace the "don't pollscreenshot" guidance with itTest plan
timesleeps and succeeds without touching a devicevisiblesucceeds once the element appears; times out with a diagnostic note when it doesn'texists/hidden/texthappy paths;texttimeout note reports last-seen texthiddeninstant success and the "selector never matched" flagpollIntervalMscan't overshoottimeoutMs)timesleepfindAll/evaluateMatches)durationMscap)