devlow-bench: percentile-based comparison and run retries #93950

Merged
wbinnssmith merged 1 commit into canary from wbinnssmith/devlow-bench-retries-percentiles on May 20, 2026

@wbinnssmith
Member

A few changes to devlow-bench so the comparison output is easier to read and so a couple of flaky page-load runs don't quietly poison the stats.

compare

  • Show p50 / p90 / p99 instead of mean / p50 / p90, with Δ p50 and a single Mann–Whitney p-value. Welch's t-test and Δ mean are gone because bad runs were skewing the mean.
  • Hoist the modal sample count into an n = 7 per metric banner so only the outlier rows carry (n).
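
As a rough illustration of the compare output described above (not the code from this PR), the sketch below computes p50 / p90 / p99 and hoists the modal sample count into a banner. The helper names `percentile`, `modalCount`, and `formatRows` are hypothetical, the nearest-rank percentile definition is an assumption, and the Mann–Whitney p-value is left out.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
// Assumption: devlow-bench may use a different interpolation scheme.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Most common sample count across metrics. On a tie, the count that
// reached the top frequency first wins.
function modalCount(counts: number[]): number {
  const freq = new Map<number, number>();
  let best = counts[0] ?? 0;
  for (const c of counts) {
    const f = (freq.get(c) ?? 0) + 1;
    freq.set(c, f);
    if (f > (freq.get(best) ?? 0)) best = c;
  }
  return best;
}

// One banner line with the modal n, then one row per metric.
// Only rows whose sample count differs from the modal n carry "(n=...)".
function formatRows(metrics: Record<string, number[]>): string[] {
  const modal = modalCount(Object.values(metrics).map((s) => s.length));
  const lines = [`n = ${modal} per metric`];
  for (const [name, samples] of Object.entries(metrics)) {
    const suffix = samples.length === modal ? "" : ` (n=${samples.length})`;
    const [p50, p90, p99] = [50, 90, 99].map((p) => percentile(samples, p));
    lines.push(`${name}${suffix}: p50=${p50} p90=${p90} p99=${p99}`);
  }
  return lines;
}
```

With seven samples per metric, p90 and p99 both resolve to the largest sample under nearest-rank, so they mostly act as a "worst run" indicator at this sample size.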

runner

  • Each attempt's measurements are buffered locally and only merged on success. Failed runs are retried (capped at 2× warmup+n) until we have n clean samples.
  • Per-variant retry line plus an end-of-run summary noting which variants recovered and which fell short.
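
The buffering and retry behaviour above could look roughly like the following. This is a minimal sketch, not the runner from this PR: `collectRuns`, `Attempt`, and `Recorder` are hypothetical names, the loop is synchronous for brevity (a real runner would be async), the cap is read as 2 × (warmup + n) attempts, and warmup runs are assumed to be successful runs whose measurements are discarded.

```typescript
type Recorder = (value: number) => void;
// One benchmark run: reports measurements through `record`, throws on failure.
type Attempt = (record: Recorder) => void;

interface CollectResult {
  samples: number[];  // measurements merged from successful, non-warmup runs
  failures: number;   // attempts that threw and were retried
  fellShort: boolean; // true if the attempt cap was hit before n clean runs
}

function collectRuns(attempt: Attempt, warmup: number, n: number): CollectResult {
  const wanted = warmup + n;
  const maxAttempts = 2 * wanted; // assumed reading of "2× warmup+n"
  const samples: number[] = [];
  let successes = 0;
  let failures = 0;

  for (let i = 0; i < maxAttempts && successes < wanted; i++) {
    const buffer: number[] = []; // local to this attempt
    try {
      attempt((value) => {
        buffer.push(value);
      });
      successes++;
      // Merge only after the whole attempt succeeded, and skip warmup runs.
      if (successes > warmup) samples.push(...buffer);
    } catch {
      failures++; // the partial buffer is dropped with the failed attempt
    }
  }

  return { samples, failures, fellShort: successes < wanted };
}
```

Because a failed attempt's buffer never reaches `samples`, a run that records a few measurements and then hits an error page leaves no trace in the statistics; it only shows up in the `failures` count.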

browser

  • hardNavigation / reload now throw when the navigation response is missing or non-2xx. Previously a 200-committed-but-broken page was being recorded as a real sample.
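
A check along these lines would turn a missing or non-2xx navigation response into a thrown error. It is a sketch under assumptions: the browser layer is taken to expose a Playwright-style response (where `page.goto()` and `page.reload()` resolve to a response object or `null`, and the response has `status()` and `url()`), and `assertNavigationOk` is a hypothetical helper, not a function from devlow-bench.

```typescript
// Minimal shape of the navigation response this sketch relies on.
// Assumption: modelled on Playwright's Response; only these two methods are used.
interface NavigationResponse {
  status(): number;
  url(): string;
}

// Throws when the navigation produced no response or a non-2xx status,
// so the caller treats the run as failed instead of recording a sample.
function assertNavigationOk(
  response: NavigationResponse | null,
  action: string
): NavigationResponse {
  if (response === null) {
    throw new Error(`${action}: navigation returned no response`);
  }
  const status = response.status();
  if (status < 200 || status >= 300) {
    throw new Error(`${action}: ${response.url()} responded with HTTP ${status}`);
  }
  return response;
}
```

A caller would wrap each navigation, for example `assertNavigationOk(await page.goto(url), "hardNavigation")`, and let the error propagate so the run counts as failed.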

The trigger was a run where 2 of 7 cold-build attempts hit an error page, dragging the mean response size from 45 MB to 33 MB while the p50 was unchanged.
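
To make the arithmetic concrete, here is a small worked example. The individual sample values are hypothetical, chosen only so that the totals match the figures quoted above (five good runs at 45 MB, two error pages at 3 MB).

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Median for an odd number of samples (7 here): the middle value after sorting.
function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

// Hypothetical response sizes in MB for 7 cold-build attempts.
const clean = [45, 45, 45, 45, 45, 45, 45];
const polluted = [45, 45, 45, 45, 45, 3, 3]; // 2 of 7 hit an error page

console.log(mean(clean), median(clean));       // prints: 45 45
console.log(mean(polluted), median(polluted)); // prints: 33 45
```

The two bad runs pull the mean down by 12 MB, while the median still lands on a good run because more than half of the samples are clean.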

…etection

- compare: drop mean and Welch's p; show p50/p90/p99, Δ p50, and a single
  Mann–Whitney p-value. Hoist the modal n into a banner so only outlier rows
  carry (n).
- runner: buffer each attempt's samples and merge only on success. Retry
  failed runs until n clean samples are collected (capped at 2× warmup+n).
  Per-variant and end-of-run failure summaries.
- browser: hardNavigation/reload throw when the response is missing or
  non-2xx, so error pages become failed runs instead of polluted samples.
@wbinnssmith wbinnssmith requested review from lukesandberg and sokra May 19, 2026 22:45
@wbinnssmith wbinnssmith marked this pull request as ready for review May 19, 2026 22:52
@github-actions
Contributor

github-actions Bot commented May 19, 2026

Tests Passed

Commit: bea3012

@wbinnssmith wbinnssmith merged commit 56825e5 into canary May 20, 2026
286 of 294 checks passed
@wbinnssmith wbinnssmith deleted the wbinnssmith/devlow-bench-retries-percentiles branch May 20, 2026 21:27
wbinnssmith added a commit that referenced this pull request May 20, 2026
Version bump so we can publish the changes from #93950 (and the
previously-unpublished 0.3.5 work) to npm.

Stacked on top of #93950.
