[e2e] Separate event-log-race-repro harness noise from real regressions#2190
The event-log-race-repro job classified every non-completed run into an `other` grab-bag and gated on it. Hook-resume timing races (HOOK_RESUME_FAILED, NO_WAKE_BRANCH) and transport errors in the repro driver, none of which are event-log corruption, landed there and failed the job, drowning the signal it exists to detect.

- Route harness-side, non-corruption outcomes to a dedicated non-gating `infra` bucket (test + renderer). The job now gates only on regression-class outcomes (CORRUPTED_EVENT_LOG / USER_ERROR / RUNTIME_ERROR / stuck / other).
- Sort regressions ahead of infra rows so the 20-row cap never hides a real failure behind a flood of harness-timing rows.
- Raise the default hook/sleep iteration ceiling (5 -> 8). Because the run short-circuits via returnOnWake the moment the hook wins, this widens the window for the delayed resume to land before the sleep budget is exhausted, at almost no runtime cost, restoring the wake-branch coverage the scenario exists to exercise. Add a beforeAll guard that warns when the resume ceiling is not below the sleep budget.
- Make the renderer's pure helpers unit-testable and add node:test coverage, wired into the Lint workflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
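The bucketing, gating, and row-ordering rules in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual test or renderer code: the function names (`classify`, `isRegression`, `regressionCount`, `visibleRows`, `checkExitCode`) and the `{ runId, outcome }` row shape are assumptions, and only the two harness-side codes named above are treated as `infra`.

```javascript
// Sketch only: bucket names come from the PR text; function names, the row
// shape, and the 'completed' / 'stuck' outcome strings are assumptions.
const ERROR_CODES = new Set(['CORRUPTED_EVENT_LOG', 'USER_ERROR', 'RUNTIME_ERROR']);

// Harness-side outcomes the PR names explicitly. Transport errors would also
// land in `infra`, but the PR does not name their codes, so they are omitted.
const INFRA_OUTCOMES = new Set(['HOOK_RESUME_FAILED', 'NO_WAKE_BRANCH']);

// Buckets that are reported but never fail the job.
const NON_GATING_BUCKETS = new Set(['completed', 'infra']);

// Map a raw run outcome to its reporting bucket.
function classify(outcome) {
  if (outcome === 'completed' || outcome === 'stuck') return outcome;
  if (ERROR_CODES.has(outcome)) return outcome;
  if (INFRA_OUTCOMES.has(outcome)) return 'infra';
  return 'other'; // unknown outcomes stay gating
}

// A run is regression-class unless it lands in a non-gating bucket.
function isRegression(outcome) {
  return !NON_GATING_BUCKETS.has(classify(outcome));
}

// Count the runs that should fail the job.
function regressionCount(rows) {
  return rows.filter((row) => isRegression(row.outcome)).length;
}

// Put regression rows first, then apply the display cap. Array.prototype.sort
// is stable, so rows keep their relative order within each group.
function visibleRows(rows, cap = 20) {
  return [...rows]
    .sort((a, b) => Number(isRegression(b.outcome)) - Number(isRegression(a.outcome)))
    .slice(0, cap);
}

// Exit code for a `--check`-style gate: 0 when only infra/completed rows exist.
function checkExitCode(rows) {
  return regressionCount(rows) > 0 ? 1 : 0;
}
```

Sorting before slicing is what keeps a single corruption row visible even when it arrives after more than 20 infra rows.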
🦋 Changeset detected

Latest commit: e100a60

The changes in this PR will be included in the next version bump. This PR includes changesets to release 0 packages.
🧪 E2E Test Results

❌ Some tests failed

❌ Failed Tests
- ▲ Vercel Production (3 failed): express (1 failed), nextjs-webpack (2 failed)
- 💻 Local Development (1 failed): express-stable (1 failed)
- 📋 Other (1 failed): e2e-vercel-prod-tanstack-start (1 failed)

Details by Category
- ❌ ▲ Vercel Production
- ❌ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 📋 Other

❌ Some E2E test jobs failed. Check the workflow run for details.
📊 Benchmark Results

Result tables not preserved. Each benchmark was reported for 💻 Local Development and ▲ Production (Vercel):
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)

Stream Benchmarks (includes TTFB metrics)
- workflow with stream
- stream pipeline with 5 transform steps (1MB)
- 10 parallel streams (1MB each)
- fan-out fan-in 10 streams (1MB each)

Summary: Fastest Framework by World and Fastest World by Framework, with the winner determined by most benchmark wins.

❌ Some benchmark jobs failed. Check the workflow run for details.
* origin/main:
  [world-vercel] Retry transient response-body parse failures in the HTTP client (#2204)
  Add virtualization to the trace viewer (#2205)
  Trace viewer: scroll-load events past an auto-load cap (#2200)
  fix(core): resolve forwarded stream keys across deployments (#2191)
  [e2e] Improve error labeling in event-log-race-repro CI job (#2190)
  [core] Harden event pagination response parsing (#2180) (#2179)
  Add loading skeleton to the new trace viewer (#2164)
  Add tooltip components + apply on up/down detail pane (#2163)
  fix(swc-plugin): allow wasm host imports during link (#2174)
  [test] Forward-port reused-sleep replay divergence test (#2172)
  [e2e] Add `event-log-race-repro` label for triggering CI stress-test (#2159)
  Version Packages (beta) (#2162)
  fix(world-local): skip Nov 2025 ghost versions on npm (#2168)
  fix(core,errors): classify SDK encryption failures as RUNTIME_ERROR (#2145)
  [web-shared][web] Fix events tab search (#2107)
  Version Packages (beta) (#2147)
  Allow setting workflow attributes from steps (#2157)
  Better search handling on the trace viewer (#2144)
  [docs] Document experimental attributes feature (#2141)
Summary

The `event-log-race-repro` job is currently red for the wrong reason. In a recent run, it reported 1231 of 2000 runs "did not complete cleanly", yet `CORRUPTED_EVENT_LOG` / `USER_ERROR` / `RUNTIME_ERROR` were all 0. The entire count was harness-side noise.

What the `other` column was

The repro sorts each run into `completed`, `CORRUPTED_EVENT_LOG`, `USER_ERROR`, `RUNTIME_ERROR`, `stuck`, or `other`. Only the three named error codes mapped to their own bucket; everything else fell into `other`, and the `--check` gate failed the job on any non-completed run. So `other` conflated genuine corruption with:

- `HOOK_RESUME_FAILED`: the run completed fine, but the harness's own `resumeHook()` rejected (typically because the sleep branch already finished the run and disposed the hook).
- `NO_WAKE_BRANCH`: the run completed but the sleep branch won every iteration, so the wake path was never taken.

What was actually going wrong

A timing race in the test harness, not an SDK bug. The workflow's sleep budget is `iterations × sleepMs = 5 × 5000 = 25000ms`; the harness resumes the hook at `resumeDelayMs + rand(resumeJitterMs)`, which can be as late as 25000ms, so the resume ceiling exactly equals the budget. On a fast/warm deployment the workflow exhausts its sleeps before the resume lands, so the resume hits a disposed hook and/or the wake branch is never taken.

The smoking gun: the #2 retry against the same already-warm deployment jumped 767→1231 `other` while `stuck` dropped to 0 and the step scenarios stayed perfect. Faster execution means the resume misses the window more often. It also means most "green" runs weren't even exercising the wake-branch shape the repro is meant to stress.

Changes

- Route harness-side, non-corruption outcomes to a dedicated non-gating `infra` bucket (test + renderer). The job now gates only on regression-class outcomes (`CORRUPTED_EVENT_LOG` / `USER_ERROR` / `RUNTIME_ERROR` / `stuck` / `other`); `infra` is reported but never fails the job. The PR comment now reads "No event-log regressions… N runs hit harness-side `infra` outcomes (… do not fail the job)."
- Sort regressions ahead of infra rows so the 20-row cap never hides a real failure behind a flood of harness-timing rows.
- Raise the default `iterations` ceiling (5 → 8). Because the run short-circuits via `returnOnWake` the moment the hook wins, a higher ceiling widens the window for the resume to land before sleep-budget exhaustion at almost no runtime cost. Added a `beforeAll` guard that warns when the resume ceiling is not below the sleep budget.
- Make the renderer's pure helpers unit-testable and cover them with `node:test` (run via a new lightweight Lint job).

Tests

`node --test .github/scripts/**/*.test.js` (8 tests) covers:

- `infra` is non-gating;
- `regressionCount` ignores `completed` / `infra`;
- an all-infra results file (mirroring the production comment) yields 0 regressions;
- regressions sort ahead of infra;
- `--check` exits 0 on infra-only and 1 on a corruption-class outcome.

Note: the full 2000-run repro only runs against a deployment behind the `event-log-race-repro` label, so the reclassification's end-to-end effect is best confirmed by adding that label to this PR.

🤖 Generated with Claude Code
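The timing arithmetic behind the race, and the kind of check the `beforeAll` guard performs, can be illustrated with a small sketch. The helper names and the `warn` callback are hypothetical, and the 20000 + 5000 split between `resumeDelayMs` and `resumeJitterMs` is an assumption: the description only states that their sum reaches 25000ms.

```javascript
// Sketch only: helper names are hypothetical, not the harness's real API.

// Total time the workflow can spend sleeping before it gives up on the hook.
function sleepBudgetMs(iterations, sleepMs) {
  return iterations * sleepMs;
}

// Latest moment the harness may resume the hook: delay plus maximum jitter.
function resumeCeilingMs(resumeDelayMs, resumeJitterMs) {
  return resumeDelayMs + resumeJitterMs;
}

// The resume is only guaranteed to land in time when the ceiling is strictly
// below the sleep budget.
function resumeFitsBudget(cfg) {
  return (
    resumeCeilingMs(cfg.resumeDelayMs, cfg.resumeJitterMs) <
    sleepBudgetMs(cfg.iterations, cfg.sleepMs)
  );
}

// Guard in the spirit of the PR's beforeAll check: warn, but do not throw.
// Returns true when a warning was emitted.
function warnIfResumeCeilingTooHigh(cfg, warn = console.warn) {
  if (resumeFitsBudget(cfg)) return false;
  warn(
    `resume ceiling ${resumeCeilingMs(cfg.resumeDelayMs, cfg.resumeJitterMs)}ms ` +
      `is not below sleep budget ${sleepBudgetMs(cfg.iterations, cfg.sleepMs)}ms`
  );
  return true;
}

// Assumed split of the 25000ms ceiling; only the sum comes from the PR.
const base = { sleepMs: 5000, resumeDelayMs: 20000, resumeJitterMs: 5000 };

// Old default: budget 5 × 5000 = 25000ms equals the ceiling, so the guard warns.
const warnedBefore = warnIfResumeCeilingTooHigh({ ...base, iterations: 5 });

// New default: budget 8 × 5000 = 40000ms leaves 15000ms of headroom.
const warnedAfter = warnIfResumeCeilingTooHigh({ ...base, iterations: 8 });
```

Under these assumptions `warnedBefore` is `true` and `warnedAfter` is `false`, matching the description's point that the old ceiling left no margin for the resume to land.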