Add CI script and hardened skill for AI-driven E2E tests#25443

Draft
iangmaia wants to merge 21 commits into trunk from iangmaia/ci-ai-e2e-tests

Conversation

@iangmaia
Contributor

@iangmaia iangmaia commented Mar 24, 2026

Summary

  • CI entry point (Scripts/ci/run-ai-e2e-tests.sh) that manages simulator, WDA lifecycle, and runs Claude Code with a locked-down --allowedTools allowlist
  • Wrapper scripts (wda-curl.sh, wp-api.sh, launch-app.sh) replace raw curl — validate methods, reject path traversal, read credentials from env vars
  • CI-specific skill (ci-test-runner) teaches Claude how to drive E2E tests using only the wrapper scripts

Existing local dev skills (ai-test-runner, ios-sim-navigation) are not modified.

Ref: AINFRA-2176

Test plan

  • Run ./Scripts/ci/run-ai-e2e-tests.sh locally with a booted simulator and test site credentials
  • Verify wrapper scripts reject bad input (wp-api.sh GET "../../etc/passwd" → error)
  • Run a simple test case (users-screen-loads.md) end-to-end
  • Verify results.md is written with correct pass/fail status
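
As a rough illustration of the input checks the test plan exercises, a wrapper might validate the method and path before any request is sent. This is a minimal sketch, not the PR's actual `wp-api.sh`: the function name `validate_request`, the set of allowed methods, and the error messages are all assumptions made for illustration, and credential handling is left out.

```shell
# Hypothetical sketch; the real wp-api.sh may validate differently.
# validate_request METHOD PATH -> returns 0 if allowed, 1 otherwise.
validate_request() {
  req_method=$1
  req_path=$2

  # Allow only an explicit set of HTTP methods (this set is an assumption).
  case "$req_method" in
    GET|POST|PUT|DELETE) ;;
    *) echo "error: method not allowed: $req_method" >&2; return 1 ;;
  esac

  # Reject any path that contains a ".." segment, which blocks traversal.
  case "/$req_path/" in
    */../*) echo "error: path traversal rejected: $req_path" >&2; return 1 ;;
  esac

  return 0
}

# Usage: the traversal attempt from the test plan is refused.
validate_request GET "../../etc/passwd" || echo "rejected"
```

Wrapping the path in slashes before matching means a `..` segment is caught at the start, middle, or end of the path, while names that merely contain two dots (such as `a..b`) are left alone.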

🤖 Generated with Claude Code

Introduce a locked-down Claude Code setup for running AI E2E tests in CI:

- CI entry point (Scripts/ci/run-ai-e2e-tests.sh) that manages the full
  lifecycle: simulator, WDA, Claude Code with --allowedTools, results
- Wrapper scripts (wda-curl.sh, wp-api.sh, launch-app.sh) that replace
  raw curl — validate methods, reject path traversal, read credentials
  from env vars so Claude never sees them in commands
- CI-specific skill (ci-test-runner) with all WDA interaction patterns
  using wrapper scripts instead of raw curl

Ref: AINFRA-2176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dangermattic
Collaborator

1 Warning
⚠️ This PR is larger than 500 lines of changes. Please consider splitting it into smaller PRs for easier and faster reviews.
1 Message
📖 This PR is still a Draft: some checks will be skipped.

Generated by 🚫 Danger

@wpmobilebot
Contributor

wpmobilebot commented Mar 24, 2026

📲 You can test the changes from this Pull Request in WordPress by scanning the QR code below to install the corresponding build.
App Name: WordPress
Configuration: Release-Alpha
Build Number: 31845
Version: PR #25443
Bundle ID: org.wordpress.alpha
Commit: 23c8799
Installation URL: 1p8cigq5n80og
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@wpmobilebot
Contributor

wpmobilebot commented Mar 24, 2026

📲 You can test the changes from this Pull Request in Jetpack by scanning the QR code below to install the corresponding build.
App Name: Jetpack
Configuration: Release-Alpha
Build Number: 31845
Version: PR #25443
Bundle ID: com.jetpack.alpha
Commit: 23c8799
Installation URL: 1abo8k6s94618
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

Merge the CI entry point into a single .buildkite/commands script that:
- Checks for "Testing" label on PR (skips early if missing)
- Downloads build artifacts and installs app on simulator
- Runs Claude Code with locked-down --allowedTools

Added as an inline step in pipeline.yml (depends on build_jetpack,
soft_fail, 30min timeout). Remove the separate Scripts/ci entry point.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iangmaia iangmaia added the Testing Unit and UI Tests and Tooling label Mar 25, 2026
iangmaia and others added 3 commits March 25, 2026 21:33
- BUILDKITE_PULL_REQUEST_LABELS is comma-separated, not semicolon-separated
- Fix missing spaces after [[ in conditional tests
- Install Node.js via brew if npm is not available
- Add explicit return to get_booted_udid function

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
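
The label check described in these commits could look roughly like the sketch below. It assumes the labels arrive in `BUILDKITE_PULL_REQUEST_LABELS` as a comma-separated list with no spaces around the commas; the helper name `has_label` and the printed messages are hypothetical and not taken from the PR.

```shell
# Hypothetical sketch of a comma-separated label check.
# has_label LABELS WANTED -> 0 if WANTED is one of the comma-separated LABELS.
has_label() {
  label_list=$1
  wanted_label=$2
  # Surround both sides with commas so only whole labels match.
  case ",$label_list," in
    *",$wanted_label,"*) return 0 ;;
    *) return 1 ;;
  esac
}

if has_label "${BUILDKITE_PULL_REQUEST_LABELS:-}" "Testing"; then
  echo "Testing label found: running AI E2E tests"
else
  echo "Testing label missing: skipping"
fi
```

Matching on `,Testing,` rather than on the bare word avoids a false positive for a label that merely ends in "Testing".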
Extract WDA build to a separate build-wda.sh script for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Comment on lines +3 to 14
  "deny": [
    "Read(./.env)",
    "Read(./.env.*)",
    "Read(./.git/**)",
    "Read(./DerivedData/**)",
    "Read(./build/**)",
    "Read(./build-products-*.tar)",
    "Read(./**/*.mobileprovision)",
    "Read(./**/*.p12)",
    "Read(./**/*secret*)"
  ]
}
Contributor

This is something we should do in all repos. Nice!

What's the rationale behind blocking DerivedData? Performance or security?

@mokagio
Contributor

mokagio commented Mar 27, 2026

Interesting Claude-related failure:

image

I wonder how to deal with this? Can we change the tests so that they call fewer tools? Or should we bump the tool threshold?

iangmaia and others added 2 commits March 27, 2026 18:37
- New tap-element.sh combines find+click into a single call, cutting
  turns per tap from 2-3 to 1. Tries accessibility ID first, falls
  back to label.
- Reduce CLAUDE_MAX_TURNS from 120 to 80 so failed tests bail out
  faster (gem completes most tests in 15-55 turns).
- Extend Buildkite timeout from 60 to 90 minutes to ensure all 11
  tests can complete.
- Update ci-test-runner skill to promote tap-element.sh as the
  preferred tap method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
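
The find-then-click fallback described for `tap-element.sh` can be sketched as follows. Everything here is illustrative: `find_element` and `click_element` are hypothetical stubs standing in for the WDA calls the real script makes, and the element ID is made up so the example runs without a simulator.

```shell
# Hypothetical stub: pretend only a label lookup for "Publish" succeeds.
# A real script would query WDA here instead.
find_element() {
  strategy=$1
  value=$2
  if [ "$strategy" = "label" ] && [ "$value" = "Publish" ]; then
    echo "element-42"
    return 0
  fi
  return 1
}

# Hypothetical stub for the click request.
click_element() {
  echo "clicked $1"
}

# tap_element NAME -> finds and clicks in one call.
# Tries the accessibility ID first, then falls back to the label.
tap_element() {
  name=$1
  element_id=$(find_element "accessibility id" "$name") \
    || element_id=$(find_element "label" "$name") \
    || { echo "error: no element matches: $name" >&2; return 1; }
  click_element "$element_id"
}

# Usage: the accessibility ID lookup fails, the label lookup succeeds.
tap_element "Publish"
```

Folding both lookups and the click into one function is what lets the agent spend a single tool call per tap rather than one per step.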
- Increase CLAUDE_MAX_TURNS from 80 to 100 — 80 was too tight for
  complex tests like scheduled post that need date picker interaction.
- Hard-cap screenshots at 3 per test in take-ai-test-screenshot.sh.
  After the limit, the script returns a message instead of capturing.
- Strengthen the skill to make clear that screenshots are only for
  recording failures, never for UI inspection during normal flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
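
One way to enforce a per-test screenshot cap like the one described here is a counter file, as in the sketch below. It is an assumption-laden illustration rather than the contents of `take-ai-test-screenshot.sh`: the function name, the counter file, and the messages are hypothetical, and the actual capture is replaced by an `echo`.

```shell
MAX_SCREENSHOTS=3

# take_screenshot COUNTER_FILE -> "captures" until the cap is reached,
# then returns a message instead. The counter file is a hypothetical
# mechanism; the real script may track the count differently.
take_screenshot() {
  counter_file=$1
  count=0
  if [ -f "$counter_file" ]; then
    count=$(cat "$counter_file")
  fi

  if [ "$count" -ge "$MAX_SCREENSHOTS" ]; then
    echo "screenshot limit reached ($MAX_SCREENSHOTS per test)"
    return 0
  fi

  count=$((count + 1))
  echo "$count" > "$counter_file"
  # A real script would capture the simulator screen at this point.
  echo "captured screenshot $count"
}
```

Returning a plain message with a zero exit status once the limit is hit, instead of failing, keeps the test run going while still denying further captures.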
@sonarqubecloud

Labels

[Status] DO NOT MERGE Testing Unit and UI Tests and Tooling

4 participants