Add CI script and hardened skill for AI-driven E2E tests #25443
Draft
Introduce a locked-down Claude Code setup for running AI E2E tests in CI:

- CI entry point (Scripts/ci/run-ai-e2e-tests.sh) that manages the full lifecycle: simulator, WDA, Claude Code with --allowedTools, results
- Wrapper scripts (wda-curl.sh, wp-api.sh, launch-app.sh) that replace raw curl: validate methods, reject path traversal, read credentials from env vars so Claude never sees them in commands
- CI-specific skill (ci-test-runner) with all WDA interaction patterns using wrapper scripts instead of raw curl

Ref: AINFRA-2176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
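For readers who want a feel for what "validate methods, reject path traversal" could look like inside a wrapper, here is a minimal sketch. It is not the PR's actual wp-api.sh: the function name `validate_request` and the method allowlist are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of wrapper-style request validation.
# validate_request and its method allowlist are illustrative assumptions,
# not code taken from this PR.

# Returns 0 when the method/path pair is acceptable, 1 otherwise.
validate_request() {
  local method="$1" path="$2"

  # Allow only a fixed set of HTTP methods (assumed allowlist).
  case "$method" in
    GET|POST|PUT|DELETE) ;;
    *) echo "error: method not allowed: $method" >&2; return 1 ;;
  esac

  # Reject any ".." path segment, which is how traversal escapes the API root.
  case "/$path/" in
    */../*) echo "error: path traversal rejected: $path" >&2; return 1 ;;
  esac

  # Reject absolute URLs so a caller cannot point the request at another host.
  case "$path" in
    *://*) echo "error: absolute URLs are not allowed: $path" >&2; return 1 ;;
  esac

  return 0
}
```

A wrapper built this way would call `validate_request "$1" "$2" || exit 1` before touching curl, and only then read credentials from environment variables, so they never appear in the command line Claude composes. The sketch only checks literal `..` segments; percent-encoded variants would need separate handling.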
Contributor
| App Name | WordPress |
| --- | --- |
| Configuration | Release-Alpha |
| Build Number | 31845 |
| Version | PR #25443 |
| Bundle ID | org.wordpress.alpha |
| Commit | 23c8799 |
| Installation URL | 1p8cigq5n80og |
Contributor
| App Name | Jetpack |
| --- | --- |
| Configuration | Release-Alpha |
| Build Number | 31845 |
| Version | PR #25443 |
| Bundle ID | com.jetpack.alpha |
| Commit | 23c8799 |
| Installation URL | 1abo8k6s94618 |
Merge the CI entry point into a single .buildkite/commands script that:

- Checks for "Testing" label on PR (skips early if missing)
- Downloads build artifacts and installs app on simulator
- Runs Claude Code with locked-down --allowedTools

Added as an inline step in pipeline.yml (depends on build_jetpack, soft_fail, 30min timeout). Remove the separate Scripts/ci entry point.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 1bf7265 to e99456e
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BUILDKITE_PULL_REQUEST_LABELS is comma-separated, not semicolon-separated
- Fix missing spaces after [[ in conditional tests
- Install Node.js via brew if npm is not available
- Add explicit return to get_booted_udid function

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
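The label fix above can be sketched roughly as follows. The helper name `has_label` is hypothetical and the body is an illustration, not the code from this PR; it only assumes that BUILDKITE_PULL_REQUEST_LABELS holds a comma-separated list, as the commit message states.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the wanted label appears in a
# comma-separated label list. By default the list is read from
# BUILDKITE_PULL_REQUEST_LABELS; a second argument overrides it.
has_label() {
  local wanted="$1"
  local labels="${2-${BUILDKITE_PULL_REQUEST_LABELS:-}}"

  # Wrap both sides in commas so only whole labels match:
  # "Testing" must not match "Not Testing".
  case ",${labels}," in
    *",${wanted},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible use in a CI step (note the space after "[["):
#   if [[ "${BUILDKITE_PULL_REQUEST:-false}" != "false" ]] && ! has_label "Testing"; then
#     echo "No 'Testing' label on this PR, skipping."
#     exit 0
#   fi
```

Matching on `,label,` keeps the check to plain pattern matching, which also behaves the same on the older bash that ships with macOS agents. It assumes there is no whitespace around the commas.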
Extract WDA build to a separate build-wda.sh script for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mokagio reviewed Mar 27, 2026
Comment on lines +3 to 14
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./.git/**)",
      "Read(./DerivedData/**)",
      "Read(./build/**)",
      "Read(./build-products-*.tar)",
      "Read(./**/*.mobileprovision)",
      "Read(./**/*.p12)",
      "Read(./**/*secret*)"
    ]
  }
Contributor
This is something we should do in all repos. Nice!
What's the rationale behind blocking DerivedData? Performance or security?
Contributor
- New tap-element.sh combines find+click into a single call, cutting turns per tap from 2-3 to 1. Tries accessibility ID first, falls back to label.
- Reduce CLAUDE_MAX_TURNS from 120 to 80 so failed tests bail out faster (gem completes most tests in 15-55 turns).
- Extend Buildkite timeout from 60 to 90 minutes to ensure all 11 tests can complete.
- Update ci-test-runner skill to promote tap-element.sh as the preferred tap method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
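The "accessibility ID first, then label" fallback could look something like the sketch below. This is not the PR's tap-element.sh: `find_element` and `click_element` are hypothetical hooks, stubbed here so the control flow runs without a simulator. In a real script they would go through the WDA wrapper instead.

```shell
#!/usr/bin/env bash
# Sketch of a combined find+click with a lookup fallback.
# find_element and click_element are hypothetical stand-ins for WDA calls.

# Stub: pretend the screen has one element with accessibility ID
# "post-button" and another whose label is "Publish".
# Prints an element ID on success, returns 1 when nothing matches.
find_element() {
  local strategy="$1" value="$2"
  case "$strategy:$value" in
    "accessibility id:post-button") echo "elem-1" ;;
    "label:Publish") echo "elem-2" ;;
    *) return 1 ;;
  esac
}

# Stub: report the click instead of sending it to the simulator.
click_element() {
  echo "clicked $1"
}

# Try the accessibility ID first, fall back to the label, then click.
tap_element() {
  local target="$1" element_id
  element_id="$(find_element "accessibility id" "$target")" \
    || element_id="$(find_element "label" "$target")" \
    || { echo "error: no element matches '$target'" >&2; return 1; }
  click_element "$element_id"
}
```

Folding both lookups and the click into one command is what turns a two or three step exchange into a single tool call, which is where the saving in turns comes from.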
- Increase CLAUDE_MAX_TURNS from 80 to 100: 80 was too tight for complex tests like scheduled post that need date picker interaction.
- Hard-cap screenshots at 3 per test in take-ai-test-screenshot.sh. After the limit, the script returns a message instead of capturing.
- Strengthen the skill to make clear that screenshots are only for recording failures, never for UI inspection during normal flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
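One way to enforce a per-test screenshot cap is a small counter file next to the test's output, as in the sketch below. The function name `take_screenshot`, the `.screenshot-count` file, and the output naming are assumptions for illustration; the actual take-ai-test-screenshot.sh may track the limit differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a hard screenshot cap per test directory.
MAX_SCREENSHOTS=3

take_screenshot() {
  local test_dir="$1" count_file count
  count_file="$test_dir/.screenshot-count"
  count=0

  # Load how many screenshots this test has already taken.
  if [[ -f "$count_file" ]]; then
    count="$(cat "$count_file")"
  fi

  # Past the limit, return a message instead of capturing.
  if (( count >= MAX_SCREENSHOTS )); then
    echo "Screenshot limit ($MAX_SCREENSHOTS) reached for this test; continue without one."
    return 0
  fi

  count=$(( count + 1 ))
  echo "$count" > "$count_file"

  # The real capture would happen here, for example with
  # `xcrun simctl io booted screenshot "$test_dir/screenshot-$count.png"`.
  echo "screenshot-$count.png"
}
```

Returning success with a message, rather than failing, lets the agent keep going with the test instead of spending turns retrying the capture.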
Summary
- CI entry point (Scripts/ci/run-ai-e2e-tests.sh) that manages simulator, WDA lifecycle, and runs Claude Code with a locked-down --allowedTools allowlist
- Wrapper scripts (wda-curl.sh, wp-api.sh, launch-app.sh) replace raw curl: validate methods, reject path traversal, read credentials from env vars
- CI-specific skill (ci-test-runner) teaches Claude how to drive E2E tests using only the wrapper scripts

Existing local dev skills (ai-test-runner, ios-sim-navigation) are not modified.

Ref: AINFRA-2176
Test plan
- Run ./Scripts/ci/run-ai-e2e-tests.sh locally with a booted simulator and test site credentials
- Wrapper scripts reject path traversal (wp-api.sh GET "../../etc/passwd" → error)
- Single test (users-screen-loads.md) runs end-to-end
- results.md is written with correct pass/fail status

🤖 Generated with Claude Code