Phase 1 LLM-driven Android smoke — fork-test bundle (green on run 25659967543) by rustam-callstack · Pull Request #1 · rustam-callstack/App

rustam-callstack · 2026-05-11T10:18:01Z

Summary

Single fork PR that captures every code change exercised by the green
fork-test runs that validate Phase 1 end-to-end. This is an intentionally
fork-only review surface: the upstream-bound subset lives on
feat/agent-device-smoke-llm-driver (linked to
Expensify/App#89896);
the fork-test workflow and auxiliary commits in here do not go
upstream.

What's inside

Phase 1 LLM-driven smoke engine (.github/scripts/):

File	Role
`agent-device-cli.ts`	Typed wrapper around the `agent-device` CLI (snapshot text → JSON, fill/press/etc.); per-command timeout — 30s default, 90s for `fill`
`agent-device-snapshot-signature.ts`	Structural SHA used as cache key (kinds + roles, no text content); filters RN dev-warning bubbles
`agent-device-expect.ts`	Predicate DSL — `snapshot.contains_text`, `snapshot.field_with_text(...).exists`, `appstate.foreground`
`agent-device-replay-cache.ts`	Cache load/lookup/diff helpers
`agent-device-llm-client.ts`	Anthropic `/v1/messages` caller with prompt cache, 3-retry backoff, token-budget kill switch, `DEBUG_LLM=1` verbose trace
`agent-device-llm-driver.ts`	Orchestrator: boot dance + per-step ladder (cache → LLM → bash fallback); refreshes `snap`+`app` after state-changing actions
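
The per-command timeout noted for `agent-device-cli.ts` could look roughly like the sketch below. Only the 30s default and the 90s `fill` override come from the table; `timeoutFor` and the constant names are hypothetical and are not taken from the actual script.

```typescript
// Hypothetical names; only the 30s default and the 90s `fill` override come from the table.
const DEFAULT_TIMEOUT_MS = 30_000;
const TIMEOUT_OVERRIDES_MS = new Map<string, number>([["fill", 90_000]]);

/** Returns the timeout budget, in milliseconds, for one agent-device subcommand. */
function timeoutFor(command: string): number {
    return TIMEOUT_OVERRIDES_MS.get(command) ?? DEFAULT_TIMEOUT_MS;
}

console.log(timeoutFor("snapshot")); // 30000
console.log(timeoutFor("fill")); // 90000
```

Using a lookup keeps read-only commands on the short tripwire while slow, state-changing ones get extra headroom.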

Test case + cache (tests/smoke/):

android-signin.testcase.txt — 4 numbered plain-English steps + expect: postconditions
cache/android-signin.testcase.json — committed seed cache (steps 1–3 deterministic, step 4 falls through to LLM by design)

Workflows (.github/workflows/):

smokeAndroidLLM.yml — upstream-bound canary (PR + dispatch trigger, Blacksmith runner, AWS S3 cache lookup via Rock)
smokeAndroidLLMForkTest.yml — fork-only sibling (ubuntu-latest, --local Rock build, no AWS)

Script alias (package.json):

smoke:android:llm → ts-node .github/scripts/agent-device-llm-driver.ts

Validation runs

Run	Result	Cache	Tokens
25553622590	✅ first green	seeded	42,980
25659967543	✅ green after snap-refresh+fill-timeout fixes	re-recorded	29,477
25664772390	✅ final cache-hit verified	`cache_hits=3 llm_runs=1`	5,649

The final run is the gold standard: 3 of 4 steps replayed from the committed cache with zero LLM cost; step 4's wait_for(magic_code) re-fires the LLM by design (post-state varies). That is a ~5× token reduction versus the cache-miss path (5,649 tokens vs 29,477 on the re-recording run 25659967543).

Fixes shipped on this branch (12 commits in order)

Boot-timeout diagnostics + bumped SignIn budget to 600s
Pixel Launcher ANR detection + dismiss-and-relaunch (force-stop on recovery)
Pre-emptive settings put global hide_error_dialogs 1
Disable Android Autofill at boot (autofill_service=null) — autofill was silently filling fields and producing incomplete cache records
verifyPostState accepts expect-pass even when signature drifts
Refresh snap+app after every state-changing tool batch — LLM was seeing stale UI
90s CLI timeout specifically for fill (30s wasn't enough for a 30-char string on a 2-core runner)
500ms settle gap after bash fallback before verifyPostState
Filter RN dev-warning nodes from signature — !, … bubbles appear non-deterministically and were rotating signatures
Cache re-signed with filter applied (computed locally from prior run artifacts, no extra CI cycle)

Reproduce

gh workflow run smokeAndroidLLMForkTest.yml \
  --ref feat/agent-device-smoke-llm-driver-fork-test \
  --repo rustam-callstack/App

Requires `ANTHROPIC_API_KEY` and `MAPBOX_SDK_DOWNLOAD_TOKEN` secrets
set on the fork.

🤖 Generated with Claude Code

Replaces the brittle bash assertion logic of Phase 0 with an LLM runner that takes plain-text test cases (numbered English steps with optional `expect:` postconditions) and uses Claude Sonnet to figure out the right agent-device CLI calls. A committed replay cache at tests/smoke/cache/<test>.json keeps the happy path deterministic and ~$0 in API spend; cache misses fall back to the LLM, and final-tier failures fall back to a Phase-0-style bash recipe so an Anthropic outage doesn't fail the build.

Phase 0 stays untouched. Phase 1 ships as `smokeAndroidLLM.yml` with `continue-on-error: true` for the first 2 weeks; flip to required once flake rate is at parity.

Files added:
- .github/scripts/agent-device-cli.ts (typed wrapper around the CLI)
- .github/scripts/agent-device-snapshot-signature.ts (structural cache key)
- .github/scripts/agent-device-expect.ts (postcondition DSL)
- .github/scripts/agent-device-replay-cache.ts (cache load/lookup/diff)
- .github/scripts/agent-device-llm-client.ts (Anthropic /v1/messages with prompt cache + backoff)
- .github/scripts/agent-device-llm-driver.ts (orchestrator)
- .github/workflows/smokeAndroidLLM.yml (PR + dispatch trigger)
- tests/smoke/android-signin.testcase.txt (4 numbered steps for SignIn flow)
- package.json: smoke:android:llm script

See plan: ~/.claude/plans/buzzing-mixing-dusk.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
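
The cache → LLM → bash fallback ladder described in this commit can be sketched as below. This is a simplified, synchronous illustration rather than the driver's code: `runStepLadder`, `TierAttempt` and `StepOutcome` are hypothetical names, and the real orchestrator is presumably asynchronous.

```typescript
type Tier = "cache" | "llm" | "bash";

interface StepOutcome {
    passed: boolean;
    tier: Tier | null; // which tier satisfied the step, or null if none did
}

// Each tier returns true when the step's postcondition held afterwards.
type TierAttempt = () => boolean;

function runStepLadder(attempts: Record<Tier, TierAttempt>): StepOutcome {
    const order: Tier[] = ["cache", "llm", "bash"];
    for (const tier of order) {
        try {
            if (attempts[tier]()) {
                return { passed: true, tier };
            }
        } catch {
            // A thrown tier (for example an API outage) counts as a miss and falls through.
        }
    }
    return { passed: false, tier: null };
}
```

Treating an exception as a miss is what lets the bash recipe cover for an unavailable LLM tier instead of failing the build outright.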

Mirrors smokeAndroidLLM.yml on ubuntu-latest with --local Rock build so the LLM-driven smoke can be exercised on the personal fork before merging upstream. Hard-guarded by github.repository to never run on the upstream repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1's first fork-test run timed out at the 360s SignIn-wait budget without uploading any diagnostics: the runner exited via fail() before writing snapshots/screenshots, so post-mortem only had logcat.

- 360s -> 600s. Phase 0 saw 294s on a warm AVD; the first run of a new workflow can't reuse that cache (key includes the workflow filename), so it pays the cold-prime cost and needs more headroom.
- Every 30s during the wait, dump probe snapshot text to artifacts so we can see the timeline of UI states the app traversed.
- On final timeout, capture snapshot + appstate + PNG screenshot before failing so the failure is debuggable from a single artifact upload.
- Don't let a transient snapshot exception kill the whole wait; log and retry. The agent-device CLI occasionally times out under emulator load and the next poll usually succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
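
A minimal sketch of the wait loop this commit describes: a 600s budget, a probe dump every 30s, a final dump on timeout, and tolerance for a transient snapshot failure. All names (`waitForScreen`, `WaitDeps`) and the 5s poll interval are assumptions. The dependencies are injected and synchronous so the sketch stays deterministic, and the final capture here covers only the snapshot text, not appstate or a screenshot.

```typescript
interface WaitDeps {
    snapshot: () => string; // may throw under emulator load
    now: () => number; // millisecond clock
    pause: (ms: number) => void; // stand-in for an awaited sleep
    dump: (label: string, text: string) => void; // stand-in for an artifact writer
}

function waitForScreen(
    deps: WaitDeps,
    isReady: (snap: string) => boolean,
    budgetMs = 600_000,
    dumpEveryMs = 30_000,
    pollMs = 5_000,
): boolean {
    const start = deps.now();
    let lastDump = start;
    let lastSnap = "";
    while (deps.now() - start < budgetMs) {
        try {
            lastSnap = deps.snapshot();
            if (isReady(lastSnap)) {
                return true;
            }
        } catch (err) {
            // Transient CLI failure: record it and keep polling instead of aborting the wait.
            deps.dump("probe-error", String(err));
        }
        if (deps.now() - lastDump >= dumpEveryMs) {
            deps.dump(`probe-${deps.now() - start}ms`, lastSnap);
            lastDump = deps.now();
        }
        deps.pause(pollMs);
    }
    // Capture the last known state before the caller fails the run.
    deps.dump("timeout-final", lastSnap);
    return false;
}
```

Because the final dump happens inside the wait itself, the caller can fail immediately afterwards without losing the evidence.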

Previous fork-test runs showed every probe stuck on a system "Pixel Launcher isn't responding" dialog with `Close app` / `Wait` buttons, sitting on top of our (correctly-foregrounded) Expensify activity. The 2-core ubuntu-latest runner can't keep up with Metro + APK launch + launcher init simultaneously, so the launcher ANRs and the accessibility tree gets captured by the dialog overlay.

Two fixes:

1. Pre-emptively `settings put global hide_error_dialogs 1` so the OS suppresses ANR dialogs system-wide (the underlying ANR still happens but the foreground app stays uncovered).
2. In-loop recovery: if the snapshot looks like an ANR dialog (exactly two buttons labelled "Close app" + "Wait"), press Wait to dismiss, then `am start` our activity to force-foreground, and continue polling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
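
The ANR heuristic above ("exactly two buttons labelled Close app + Wait") might be expressed as follows. The `SnapNode` shape is an assumption; the real snapshot JSON produced by the CLI wrapper may use different field names.

```typescript
// Assumed node shape; not taken from the actual snapshot format.
interface SnapNode {
    role: string;
    label: string;
}

function looksLikeAnrDialog(nodes: SnapNode[]): boolean {
    const buttonLabels = nodes
        .filter((node) => node.role === "button")
        .map((node) => node.label)
        .sort();
    // Exactly two buttons, and they must be the system dialog's pair.
    return buttonLabels.length === 2 && buttonLabels[0] === "Close app" && buttonLabels[1] === "Wait";
}
```

Requiring exactly two buttons keeps the check from firing on an app screen that merely happens to contain a "Wait" button.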

Captured from run 25553622590 (51m, 4/4 LLM steps green, magic-code reached). All 4 step entries plus structural pre/post signatures are committed so future PR runs can replay the happy path without a Claude API call.

Known caveat: step 2's recorded actions only contain `press` though the field gets typed end-to-end. The runner's recording path drops the fill action somewhere; the committed cache will not perfectly replay step 2, so cache-hit will fail expect-verification on that step and fall through to LLM. Tracking a fix; the smoke remains correct because expect runs against the live UI, not the cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…drift

The cache-hit run 25556053751 failed at step 3 because verifyPostState required *both* signature match AND expect-pass. The replay had pressed Continue successfully and the app advanced to magic-code, but the post-signature differed from what was recorded (cosmetic re-render, slightly different node count). Runner treated it as drift, fell through to LLM, LLM exhausted budget, bash fallback ran "press Continue" against the magic-code screen (no Continue button), step failed.

The signature is a structural hash; the expect predicate is an intentional deterministic check over the live UI. When expect passes the step has succeeded by the test author's own definition, even if the structural hash drifted. Re-prioritize: expect first, signature becomes advisory (warning, not failure).

Steps with no `expect:` clause still fail on signature drift; that's the only post-state check available there, and it stays useful as a "did anything visibly change?" tripwire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
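
The re-prioritized check can be summarized in a few lines. `verifyPostState` is the name used in the commit message, but the signature and return shape below are assumptions made for illustration.

```typescript
interface PostStateInput {
    expectResult: boolean | null; // null when the step has no `expect:` clause
    signatureMatches: boolean;
}

interface Verdict {
    ok: boolean;
    warning?: string;
}

function verifyPostState(input: PostStateInput): Verdict {
    if (input.expectResult !== null) {
        if (!input.expectResult) {
            return { ok: false };
        }
        // Expect passed: signature drift is only advisory.
        return input.signatureMatches ? { ok: true } : { ok: true, warning: "signature drift" };
    }
    // No expect clause: the structural signature is the only post-state check left.
    return { ok: input.signatureMatches };
}
```

With this ordering, the step-3 scenario above (expect passes, signature drifts) returns `ok: true` with a warning instead of falling through to the LLM.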

Diagnostic mode for tracking down the step-2 cache-recording bug (cache stores `press(text-field)` even though the email gets typed into the field).

With DEBUG_LLM=1:
- llm-client trace adds a `request` entry per call with the last user text + every prior tool_use in the thread. Each `response` entry now includes the LLM's tool_use blocks (id, name, full input args) and any text preview.
- driver dispatchTool fill/press log entry args, refToLocator result, executed-array length after the push, and surface throws separately so a silent CLI failure becomes visible.

Off by default (env-gated) so production runs stay slim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Temporary diagnostic config to capture the LLM's exact tool_use sequence in step 2. Revert both env+cache-delete once the recording bug is fixed. Fork-test only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two fixes for cache-hit reliability traced from run 25553622590's logcat:

1. Android autofill silently filled the email field after the LLM pressed it (FillRequestEventLogger entry at the exact moment of step 2's press, BeginSignIn API fired with the email a second later; the LLM never called fill). Cache then recorded only the press; replay on a different AVD snapshot where autofill state had rotated broke deterministically. Disabling autofill via `settings put secure autofill_service null` at boot forces the LLM to call fill explicitly so both record and replay are self-contained.
2. ANR recovery via `am start` brought a half-loaded MainActivity to the foreground (run 25560886459 stuck on splash for 600s after recovery). force-stop + agent-device open --relaunch guarantees a clean process spawn so the next launch re-runs JS init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
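
For reference, the boot-time settings and the force-stop half of the recovery could be assembled as argument vectors like this. The two `settings put` commands are quoted from the commit messages; routing them through `adb shell`, the function names, and the `packageName` parameter are assumptions. The `agent-device open --relaunch` half is left out because its full argument list is not shown here.

```typescript
type Argv = string[];

// Settings quoted from the commit messages; the `adb shell` prefix is an assumption
// about how the driver reaches the emulator.
function bootHardeningCommands(): Argv[] {
    return [
        ["adb", "shell", "settings", "put", "global", "hide_error_dialogs", "1"],
        ["adb", "shell", "settings", "put", "secure", "autofill_service", "null"],
    ];
}

// `packageName` is a hypothetical parameter; the app id is not named in this PR text.
function forceStopCommand(packageName: string): Argv {
    return ["adb", "shell", "am", "force-stop", packageName];
}
```

Returning argument arrays rather than shell strings avoids quoting issues if they are later handed to something like Node's `child_process.execFileSync`.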

…eout

Run 25568731827's LLM trace revealed the core failure pattern: every step-2 user message carried `snapshot.node_count=10` with the same pre-step text-field. `snap` is never refreshed after fill/press, so the LLM sees its own actions had "no effect", retries the same fill, gets caught by seen-hash dedup, then burns the wall-clock budget.

Three fixes:

1. After every batch of tool calls in runLLMStep that contains fill/press/wait, refresh `snap` + `app` so the next round sees the live state. snapshot/wait_for/back/dismiss_keyboard already refreshed via dispatchTool's onSnap callback; fill/press didn't.
2. agent-device fill gets its own 90s CLI timeout (was 30s). The 30-char email took >30s to type on the 2-core ubuntu-latest; adCli.fill threw, the action wasn't pushed to executed[], and the device did get partially-typed text but the runner thought the call failed. Read-only commands keep the 30s tripwire.
3. 500ms settle gap after bash fallback before verifyPostState so the typed text propagates through React Native's onChange before the predicate snapshot reads back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
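
The first fix (refresh after any batch containing fill/press/wait) can be sketched as below. `runToolBatch`, `ToolCall` and the generic state parameter are hypothetical; the real `runLLMStep` presumably tracks `snap` and `app` separately and runs asynchronously.

```typescript
interface ToolCall {
    name: string;
    args?: Record<string, unknown>;
}

// Tool names taken from the commit message above.
const STATE_CHANGING_TOOLS = new Set(["fill", "press", "wait"]);

function batchChangesState(batch: ToolCall[]): boolean {
    return batch.some((call) => STATE_CHANGING_TOOLS.has(call.name));
}

// Runs every call, then returns fresh state only if the batch could have changed the UI.
function runToolBatch<S>(
    batch: ToolCall[],
    dispatch: (call: ToolCall) => void,
    refresh: () => S,
    current: S,
): S {
    for (const call of batch) {
        dispatch(call);
    }
    return batchChangesState(batch) ? refresh() : current;
}
```

Refreshing once per batch rather than once per call keeps the number of snapshot round-trips down while still giving the next LLM turn the live state.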

Captured from run 25659967543 (green end-to-end, all 4 steps reached magic-code with proper LLM tool sequence). This cache supersedes the prior seed from 25553622590, which was recorded with Android Autofill active — its step 2 stored a stale `press(text-field)` action that worked only because the framework was silently filling the field on focus, breaking cache-hit replay on AVD snapshots where autofill state had rotated. This cache contains the correct `fill(text-field, "rustam.zeinalov@…")` recorded against an autofill-disabled emulator. Signatures rotate relative to the old cache (autofill-related accessibility nodes are gone), but the role-based locators stay portable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Now that step 2's fill-recording bug is fixed and a clean cache is committed, restore the workflow to its production shape so the next dispatch exercises cache-hit happy path against the committed cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Compared step-1-pre snapshots of runs 25659967543 and 25662443061 on the same SignIn screen: one had 3 extra dev-warning nodes ("!, The result of getSnapshot should be cached..."), the other didn't. The structural signature included those nodes, so the cache key rotated between runs and replay never matched even though the user-visible UI was identical.

Drop those transient dev-mode bubbles from the signature: any group whose text starts with "!, ", any "!" indicator, and the specific warning text strings that pair with them. Dev-only by construction; they never reach release builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
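
A minimal sketch of a structural signature with this filter applied. The node shape, the function names, and the choice of SHA-256 are assumptions (the PR only says "structural SHA" over kinds + roles); the single warning string is the one quoted in the commit message.

```typescript
import { createHash } from "node:crypto";

// Assumed node shape; the real snapshot format may differ.
interface SnapNode {
    kind: string;
    role: string;
    text?: string;
}

const DEV_WARNING_PREFIXES = ["The result of getSnapshot should be cached"];

function isDevWarning(node: SnapNode): boolean {
    const text = node.text ?? "";
    return text === "!" || text.startsWith("!, ") || DEV_WARNING_PREFIXES.some((p) => text.startsWith(p));
}

function structuralSignature(nodes: SnapNode[]): string {
    // Text is used only to filter dev-warning nodes; it never enters the hash.
    const shape = nodes
        .filter((node) => !isDevWarning(node))
        .map((node) => `${node.kind}:${node.role}`)
        .join("|");
    return createHash("sha256").update(shape).digest("hex");
}
```

Because only kinds and roles are hashed, two snapshots of the same screen produce the same key whether or not the warning bubbles happened to render.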

Re-computed pre/post signatures locally from run 25662443061's step-N-{pre,post}.txt artifacts with the new filter (transient RN dev-warning nodes excluded). Verified the same signatures compute from run 25659967543's artifacts on the same UI despite that run having different dev-warning node counts — filter is doing its job. Action sequences unchanged (filter affects only signature, not locator resolution). Next dispatch should land cache_hits>=3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rustam-callstack and others added 12 commits May 8, 2026 11:22

rustam-callstack closed this May 11, 2026

rustam-callstack reopened this May 11, 2026

rustam-callstack and others added 2 commits May 11, 2026 12:28

rustam-callstack mentioned this pull request May 11, 2026

[NoQA] Add LLM-driven Android emulator smoke (agent-device · Phase 1) Expensify/App#90181

Draft

24 tasks

rustam-callstack force-pushed the feat/agent-device-smoke-llm-driver-fork-test branch from a9080fd to e899dae Compare May 11, 2026 13:25

rustam-callstack wants to merge 14 commits into main from feat/agent-device-smoke-llm-driver-fork-test
