feat(dashboard): level-profile chart (6 lines) + HTML legend + sign-gated parity band#193
Conversation
…band
- Add Level Profile chart: per-level speed (MiB/s, left axis) + ratio
(right axis) with filled+smoothed lines, 4 toggle checkboxes for the
Rust/FFI × speed/ratio cross-product, and target/scenario/stage
filters. Toggles use dataset.hidden + chart.update("none") so the
chart is not recreated on every interaction.
- Replace Chart.js built-in legend on both charts with a custom HTML
legend: full-row click target (fixes the misaligned click box where
only the colour swatch responded while wrapped labels drifted), and
a scrollable container so no series is silently hidden when there
are too many to fit. Click toggles dataset visibility via the same
Chart.js meta.hidden mechanism, preserving the prior UX.
- Sign-gate "Outside parity band" classification by metric direction:
ratio regressions only when delta > BAND_HIGH (Rust produced larger
output), throughput regressions only when delta < BAND_LOW (Rust
slower), peak-alloc regressions only when delta > BAND_HIGH (Rust
allocated more). Wins (deviation in our favour) are surfaced as a
separate "notable wins not counted" tally and no longer inflate the
regression count.
Closes #183
…n state
- renderProfileChart now patches labels + datasets in place and calls
chart.update("none") on every selector change. The previous path
destroyed and recreated the chart on each switch, contradicting the
"repopulate without recreating the chart" acceptance criterion and
dropping internal Chart.js state (animation continuity, scale state)
between scenario/stage/target switches.
- Custom HTML legend now uses chart.isDatasetVisible(i) to decide
whether to apply the hidden-series style. The previous check on
meta.hidden missed datasets constructed with dataset.hidden = true
(where Chart.js leaves meta.hidden === null), so the legend row
rendered as visible even when the series was hidden.
- Profile chart legend clicks now flow through profileSeriesVisibility
and sync the matching checkbox, so toggling a series via the legend
no longer produces a checkbox/series UI mismatch and survives the
next scenario switch. Delta-chart legend keeps the simpler direct
toggle path.
- Removed bandLabelForRow — was unused.
Adds a fourth dropdown ("Snapshot") that pins the 4 profile lines to a
single benchmark run rather than always picking the latest record per
(level, metric). Default value is "latest" (previous behaviour). The
dropdown lists every generated_at value (or commit_sha when timestamp
is missing) that produced data for the currently chosen target /
scenario / stage, and repopulates when those upstream filters change
while preserving the active selection where possible.
Use case: compare two specific commits side-by-side by opening the
dashboard in two browser tabs and pinning each tab to a different
snapshot — the 4 smoothed lines then describe exactly one commit per
view, with no implicit "newest wins" merging across runs.
- Add a Source dropdown to the Level Profile chart, mirroring the delta chart's Source filter. Without it, stages that have multiple source values (e.g. decompress benches emitting rust_stream and c_stream) would silently collide on the same level keys and render a mixed/incorrect profile. The source value is now part of the filter cascade (target/scenario/stage → source → snapshot), and the row-selection key includes source so decompress profiles are unambiguous. - Render custom HTML legend rows as <button type="button"> instead of <div>, with aria-pressed reflecting series visibility. Native buttons give keyboard activation (Enter/Space), focus styling, and a button role for screen readers without any extra JS, so the legend is operable without a mouse.
Replace the 4-line per-stage profile (which needed stage + source dropdowns) with a 6-line strategy-comparison view rendered for a single snapshot: - Rust compress speed (MiB/s, left axis) - FFI compress speed (MiB/s, left axis) - Rust decompress speed (MiB/s, left axis, dashed) - FFI decompress speed (MiB/s, left axis, dashed) - Rust output ratio (right axis, dashed) — compressed/input, lower=better - FFI output ratio (right axis, dashed) Decompress rows collapse across the source dimension (rust_stream / c_stream) because each row already carries both rust_value and ffi_value — source is bench bookkeeping, not a comparison side. Stage no longer needs a dropdown because both compress and decompress live on the chart at once. Right axis title is now explicit about ratio semantics (compressed/input, lower=better) so readers don't assume the conventional input/output definition where higher is better. Filters: Target + Scenario + Snapshot. Six independent toggle checkboxes for the series. Legend rows remain accessible buttons with aria-pressed.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a Level Profile chart (per-level compress/decompress speed and ratio), a screen-reader-friendly HTML legend replacing Chart.js's legend, persistent profile-series visibility and snapshot semantics, and metric-directional regression detection integrated with existing filters and initialization. ChangesBenchmark Dashboard UI & Logic Enhancement
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Adds a redesigned bench dashboard view that complements the existing delta-over-time chart with a per-compression-level “Level Profile” visualization, plus a custom HTML legend and corrected (sign-aware) “outside parity band” regression classification.
Changes:
- Replace Chart.js built-in legends with a custom HTML legend (button rows with
aria-pressed) and add scrollable legend containers for both charts. - Add a new “Level Profile” chart plotting Rust vs FFI compress/decompress throughput (MiB/s) and compression output ratio (compressed/input) across levels, with target/scenario/snapshot filters and per-series checkboxes.
- Update status/regression classification to be metric-direction aware (wins no longer count as “outside parity band”).
buildProfileData() previously stored only one row per
${level}|decompress_speed and let the most-recent-by-generated_at row
win. In practice both decompress source variants (rust_stream and
c_stream) share the same generated_at within a single snapshot, so the
"winner" was iteration-order-dependent — typically c_stream — and the
chart silently reflected only one stream variant instead of both.
Replace the "keep one row" path with a sum/count accumulator and emit
the arithmetic mean of rust_value and ffi_value across all matching
rows. For compress and ratio there is exactly one source per (level,
snapshot), so the mean degenerates to the single observation and the
displayed value is unchanged. For decompress, both stream variants now
contribute equally, giving a deterministic Rust/FFI comparison that
doesn't depend on which source row arrived first.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
Drop the profile chart's own Target / Scenario selectors and read those values (plus Level) from the existing page-wide filter row that drives the delta chart. The top selectors already support an `__all__` sentinel which the profile chart interprets as "average across that dimension": - Target = all → mean across every reported target - Scenario = all → mean across every corpus - Level = all → every level on the X-axis (no narrowing) When one of those is pinned to a specific value, the profile narrows to that value. Level = specific renders a single X-axis point. Snapshot stays on the profile section because the top chart's From/To pair defines a time window, not a single-snapshot pin — the two controls express different intents. The top-filter change handler now fans out to both `rerender()` (delta chart) and the profile chart, repopulating the profile's snapshot list when target/scenario/level change. Profile-status text reflects all three dimensions so it's clear when a view is cumulative vs narrowed.
buildProfileData() previously summed every matching row regardless of
generated_at when the snapshot dropdown was on "latest", turning the
"most recent record per (level, logical-metric)" intent into a
time-average across the entire history of that target+scenario+level.
Replace the single-pass sum with a two-pass aggregation:
Pass 1 — collect every matching row and, when snapshot=latest,
remember the most recent generated_at per (level, lm).
Pass 2 — fold each row into the running mean only if its
generated_at matches the chosen snapshot for that key. For
"latest" this restricts the mean to one snapshot per key,
then averages across source variants (rust_stream +
c_stream for decompress) and across any "all" dimension
(scenarios when scenario=all, etc.) within that snapshot.
For a pinned snapshot, matchesFilter() already narrowed us
down, so this just collapses source.
Also pass "none" to chart.update() in the HTML legend click handler
so toggling a series is instant — the default 1s animation was
visibly janky with six datasets on the profile chart.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/bench-dashboard/index.html:
- Around line 749-751: The legend rerender replaces buttons
(container.replaceChildren()) causing keyboard focus loss; update
renderHtmlLegend to save the currently focused legend button (e.g., detect if
document.activeElement is within container and capture a stable identifier like
a data-index, aria-label, or dataset key) before calling
container.replaceChildren(), then after creating new buttons restore focus to
the matching new button (querySelector by that identifier) and ensure it remains
focusable. Apply the same focus-preservation pattern to the other
legend-rerender blocks at the other chart legend functions (the blocks around
the other replaceChildren calls) and ensure aria-pressed/aria-checked and
tabindex are preserved on the restored button so keyboard users can continue
toggling with Enter/Space.
- Around line 931-955: The bug is that latestPerKey is keyed only by
`${row.level}|${lm}` so when filter.target or filter.scenario is `__all__` you
must make the cohort-local key include those dimensions; update the code that
builds key (used in latestPerKey, matchingRows, and the winner lookup) to append
cohort identifiers such as row.target and row.scenario (or other cohort fields
that `matchesFilter` uses) when the filter for that dimension is `__all__`,
keeping the existing behavior when a specific target/scenario is selected;
adjust any logic that compares snapshots (sameSnapshotLabel,
compareByGeneratedAt) to still operate on the chosen winner row.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b5d6ac02-0911-495d-bbba-6fe17081a5b6
📒 Files selected for processing (1)
.github/bench-dashboard/index.html
Two correctness/UX fixes in the bench dashboard:
* renderHtmlLegend() now captures document.activeElement's data-
legendIndex before replaceChildren() and restores focus to the
matching new button after rebuild. Previously every keyboard
toggle (Enter/Space) dropped focus out of the legend, forcing the
user to Tab back in between every series flip.
* buildProfileData() now keys latestPerKey by (level, logical-metric)
PLUS any cumulative ("__all__") dimension (target, scenario,
level), giving each cohort its own most-recent snapshot before the
cross-cohort mean. The previous key collapsed every cohort onto a
single global snapshot per (level, lm), silently dropping any
cohort whose timestamp didn't match — biasing the result toward
whichever target/scenario ran most recently instead of producing
the intended unbiased mean across cohorts.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
renderHtmlLegend() now snapshots container.scrollTop before replaceChildren() and restores it after focus restoration. When a toggle was triggered from outside the legend (a profile-series checkbox above the chart, or a top-filter change) the existing focus-restore path never fired and a long legend silently jumped back to the top on every re-render. Restoring scrollTop keeps the visible window stable regardless of which control fired the rebuild. Done AFTER focus() because focus() on a partially-visible element can itself scroll the container.
PROFILE_LATEST mode chose the per-key "most recent" row via compareByGeneratedAt() which only inspects generated_at. When records lack a parseable timestamp but carry commit_sha (loaded payloads from older CI runs, local-run artifacts, etc.) every pairwise comparison returned 0 and the "latest" winner was whichever row appeared first in iteration order — arbitrary and not stable across renders. Subsequent filtering by sameSnapshotLabel() (which uses generated_at || commit_sha || "local-run" as identity) could then silently restrict to an inconsistent cohort. Add compareSnapshotIdentity() — same primary ordering as compareByGeneratedAt() (timestamp), with a lexicographic commit_sha fallback when both rows have no parseable timestamp. Not chronologically meaningful in the fallback case, but deterministic across runs and consistent with snapshotLabel()'s identity, which is what the cohort-locking logic actually requires. No regression test added: the dashboard is a single-file HTML view with no JS test harness, so this is verified manually by loading a payload with timestamp-less records.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
.github/bench-dashboard/index.html:1209
PROFILE_LATESTstatus text currently reports “latest per level”, but the selection logic is actually “latest per cohort and per (level, logical metric)” (and then averaged across sources/cohorts). Consider updating the status string to reflect that it’s not necessarily a single pinned snapshot, to avoid confusing readers when points come from different runs.
const snapshotNote = filter.snapshot === PROFILE_LATEST
? "latest per level"
: `snapshot=${filter.snapshot}`;
el("profile-status").textContent =
`Showing ${levelCount} level point(s) — target=${filter.target}, scenario=${filter.scenario}, level=${filter.level} (${snapshotNote}).`;
The Level Profile section header said "Six lines for a single snapshot" but the default PROFILE_LATEST mode picks each (level, logical-metric, cohort) point from its own most-recent snapshot independently, so the displayed points can mix generated_at values across the X-axis. Rewrite the section description to spell out both modes explicitly, and amend the profile-status note so the user always sees which mode is active and what it implies for cross-point provenance.
Reopens the work from #192 (closed) after a redesign of the Level Profile chart per maintainer feedback.
Summary
compressed/input, lower = better__all__(default), which the profile chart interprets as "average across that dimension" (cumulative view across every cohort). A profile-specific Snapshot dropdown pins the six lines to a singlegenerated_at/commit_sharun, or tolatest(most-recent snapshot per cohort+level+metric, picked per-cohort so cumulative mode stays unbiased toward whichever target/scenario ran most recently).rust_stream+c_stream) per side — both stream variants measure valid decompression numbers, the arithmetic mean is deterministic regardless of iteration order. Compress and ratio have one source per (level, snapshot), so the mean degenerates to the single observation.chart.update("none"). The chart instance is reused across every filter/snapshot/toggle change — no destroy/recreate.<button type="button">witharia-pressedreflecting visibility, full-row click target (fixes the misaligned click box where wrapped labels drifted off the swatch), keyboard activation (Enter/Space), focus-visible outline, screen-reader role.compression_ratio: only flag whendelta > BAND_HIGH(Rust produced larger output).throughput_bytes_per_sec: only flag whendelta < BAND_LOW(Rust slower than FFI).peak_alloc_bytes: only flag whendelta > BAND_HIGH(Rust allocated more).Closes #183
Supersedes #192
Test plan
benchmark-relative.json; existing delta-time chart still renders and respects every filter.__all__on Target / Scenario / Level produces a cumulative arithmetic mean across that dimension; pinning a specific value narrows to that cohort.generated_at/commit_sha;latestuses each cohort's own most-recent snapshot (no global-latest bias when Target or Scenario is__all__).Summary by CodeRabbit
New Features
Improvements