ci(bench): decouple regression alert from publish — fixes stuck baseline by polaz · Pull Request #159 · structured-world/structured-zstd

polaz · 2026-05-17T21:23:18Z

Summary

Splits the bench regression alert out of the benchmark-aggregate save path into its own benchmark-regression-check job that runs only on regular developer PRs.

Before: Store benchmark results step had fail-on-alert: true (on main push) and save-data-file: true in the same action call. Alert → step exits 1 → save never runs → baseline freezes → next main push hits the same alert vs the same stale baseline → freeze forever.

After:

benchmark-aggregate keeps a save-only step (auto-push + save-data-file for main push, no alert/fail). Baseline always advances when bench shards pass.
New benchmark-regression-check job runs read-only comparison + comment + fail-on-alert. Triggers ONLY when ALL of:
- event is pull_request
- head repo == base repo (excludes fork PRs that have no token access)
- author is not release-plz[bot]
- head branch does not start with release-plz-

Why this design

Main push is the immutable historical record. Regressions are visible on the dashboard at dev/bench over time — no need to red-CI them on main.
release-plz PRs are version-bump only, perf comparison is pure noise — no gate.
Fork PRs have no gh-pages token access so the alert step would emit a confusing comparison-failed line anyway.
A red alert on a regular developer PR posts a comment on the head commit + pings @polaz so the regression is acknowledged before merge.

Currently broken since

SHA 1e0f6954 (16 May 02:28 KYIV) — compress/level_9_lazy/decodecorpus-z000033/matrix/pure_rust ratio 1.67 vs baseline fa2e9ff6. Every main push since either failed at the same alert or got cancelled by concurrency. Bench dashboard at dev/bench has been stale for two days.

After this lands, the next main push that completes will advance the baseline regardless of regression magnitude. The level_9_lazy regression itself still needs investigation as a separate issue.

Test plan

YAML syntax validated (python3 -c yaml.safe_load)
New job's if: evaluates correctly:
- This PR: pull_request event, same-repo head, polaz author, branch fix/#158-... → runs gate
- release-plz PRs: author check excludes → skipped
- main push: event check excludes → skipped
benchmark-aggregate save step now strictly gated on push && main so PR runs don't try to call it (the save action only runs once main merge happens)

Closes #158

Summary by CodeRabbit

Chores
- Implemented automated benchmark regression testing on pull requests to detect performance regressions, with alerts posted to PR comments and CI failure when thresholds are exceeded.
- Enhanced benchmark baseline saving on main branch pushes with improved alert management, now separating baseline storage from regression detection workflows.

Previously the bench regression alert was wired into the `benchmark-aggregate` job alongside `save-data-file` and `auto-push`. On a main push that triggered the alert, the step exited with code 1 before `save-data-file` ran — freezing the baseline in gh-pages and cascading the same alert across every subsequent main push. Split the responsibilities: - `benchmark-aggregate` keeps the save/auto-push path, gated on `github.event_name == 'push' && github.ref == 'refs/heads/main'`. No `fail-on-alert`, no `comment-on-alert` here. - New `benchmark-regression-check` job runs ONLY on regular developer PRs from this repo (excludes release-plz PRs by author + branch prefix; excludes push events; excludes fork PRs). Read-only comparison: alerts via commit comment + `@polaz` ping, fail-on-alert red, no save, no push. Effect: main pushes always publish + advance the baseline; regression alerts surface in PR conversation where they can be discussed before merge; release-plz PRs (version-bump only) get no perf gate at all.

coderabbitai · 2026-05-17T21:23:28Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e46a42c-3e0e-4496-8f47-72ded8216c79

📥 Commits

Reviewing files that changed from the base of the PR and between 740ff2a and ad33451.

📒 Files selected for processing (1)

.github/workflows/ci.yml

📝 Walkthrough

Walkthrough

CI workflow refactors benchmark baseline handling by separating main-push baseline save from regression alerting. Main pushes now unconditionally persist baseline data to gh-pages without failing on regressions. A new regression-check job runs only on developer PRs to enforce alert thresholds and comment on commits, unblocking baseline publication.

Changes

Benchmark Baseline and Regression Check Separation

Layer / File(s)	Summary
Main-push baseline save (no fail gate) `.github/workflows/ci.yml`	The "Store benchmark results" step is replaced with "Save benchmark baseline (main push only)" gated to `push` on `refs/heads/main`. It enables `auto-push` and `save-data-file` while disabling alert comments and fail-on-alert, ensuring baseline advancement on main regardless of regression thresholds.
PR regression check gate `.github/workflows/ci.yml`	New `benchmark-regression-check` job runs only for developer PRs (excluding pushes, release-plz bot, and release-plz-prefixed refs). It downloads aggregated benchmark artifacts and runs comparison-only mode with alert comments enabled and fail-on-alert true, gating only the regression-check job without blocking baseline publication.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A baseline that flows, unobstructed and true,
Main branches push forward with regressions in view.
While PRs stand guard with alerts ringing clear,
The dashboard marches on—no stuck gates here! 🔔✨

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/#158-bench-regression-not-a-gate

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-17T21:25:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR separates benchmark baseline publishing from regression alerting so main-branch benchmark data can continue advancing even when a regression is detected.

Changes:

Converts the benchmark aggregate action call into a main-push-only baseline save path.
Adds a separate PR-only benchmark regression check job for same-repo developer PRs.
Excludes main pushes, fork PRs, and release-plz PRs from the alert gate.

+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          # Read-only comparison: no baseline write, no gh-pages push.
+          # The save path lives in `benchmark-aggregate` and runs on
+          # main push only.
+          auto-push: false
+          save-data-file: false
+          # Surface regressions to the PR author + reviewer. Comment is
+          # posted on the head commit so it shows in the PR conversation.
+          comment-on-alert: true


Release-plz PRs are version-bump only (no source-code changes), so running the full bench matrix (build × 3 targets + 27 strategy shards on main / 3 on PR + aggregate + pages + regression) is pure CI waste — ~30 min wall + GitHub-hosted runner minutes consumed for zero usable output. Add an `if:` filter on `bench-matrix` that excludes: - `release-plz[bot]` as PR author - branch names starting with `release-plz-` The skip cascades through `needs:` so every downstream bench job (`bench-build`, `benchmark`, `benchmark-aggregate`, `benchmark-pages`, `benchmark-regression-check`) is automatically gated out by GH Actions' default skip-when-needs-skipped semantics. `benchmark-regression-check` already had its own equivalent filter from PR #159; this commit moves the gate one job upstream so the whole pipeline noops cleanly instead of running shards and then discarding the merged report at the gh-pages step (which is gated on `push && main` only).

…bles + tune CI bench budgets (#165) * perf(fse): replace next_state linear search with donor-parity flat tables `FSETable::next_state(symbol, idx)` previously called `SymbolStates::get` which scanned `Vec<State>` per symbol via `.iter().find(state.contains(idx))`. On a typical level_3_dfast encode of decodecorpus-z000033 this fired ~3 × N_sequences times (LL + ML + OF per sequence), with each call walking through 5–30 states. `encode_sequences` showed up at 13.7% Rust self-time vs `ZSTD_encodeSequences` 6.4% on the C donor. Root cause: the donor-parity precomputed tables (`deltaNbBits`, `deltaFindState`, flat `nextStateTable`) were already built in `build_table_from_probabilities` for the donor `FSE_encodeSymbol` arithmetic — used to populate `Vec<State>` and then discarded. Change: - Add `state_table_flat: Box<[u16]>` and `symbol_tt: [SymbolTT; 256]` to `FSETable`. Populated in `build_table_from_probabilities` from the same intermediates that fed the legacy `Vec<State>` push loop. Donor parity: `state_table_flat` mirrors `nextStateTable` byte for byte (u16, table_size entries); `SymbolTT` mirrors `FSE_symbolCompressionTransform`. - `FSETable::next_state` now returns `State` by value, computed via donor arithmetic: ```text value = (1 << acc_log) + idx nb_bits = (value + delta_nb_bits) >> 16 baseline = idx & !((1 << nb_bits) - 1) next_index = state_table_flat[(value >> nb_bits) + delta_find_state] ``` O(1) lookup, no Vec deref, no Option wrap, no linear scan. - `FSETable::start_state` returns `State` by value (was `&State`) to match the new shape so callers don't juggle lifetimes; still backed by the existing `Vec<State>` storage (called once per stream, not hot). - `State` gains `Copy` (4 fields, all Copy). - Retired methods: `SymbolStates::get` and `State::contains` (callers removed). `State.last_index` kept (used by the BTreeSet dedup in the builder) with `#[allow(dead_code)]` since it is no longer read on the encode hot path. - Caller-side: `encode_sequences` (`blocks/compressed.rs`) and the internal `FSEEncoder` glue (`fse/fse_encoder.rs`) now store `Option<State>` instead of `Option<&State>` — natural fit for the new by-value return shape. Measurement (standalone, 200 iters, level_3_dfast / z000033, same session A/B): | fixture | baseline | after | delta | |---------|---------:|------:|------:| | z000033 (target, compressible) | 20543 µs | 18286 µs | **−11.0%** | | 1 MiB pseudo-random | 699 µs | 712 µs | +2% noise | | 1 MiB repeating pattern | 1179 µs | 1183 µs | neutral | z000033 is the canonical Phase-7 ratio-gap fixture — the encode_sequences path is hot exactly there. Random / RLE-shape inputs barely touch encode_sequences (raw fast-path / single short block) so the change is correctly neutral. Gates green: - `cargo nextest run -p structured-zstd --features dict_builder bench_internals` 526/526 - `cargo test --doc --features dict_builder bench_internals` 12/12 - `cargo clippy --features dict_builder bench_internals --all-targets -- -D warnings` clean - `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS * ci(bench): tune criterion budgets + split fast/lazy shards (#164) criterion 0.8 hard-asserts `sample_size >= 10` (`benchmark_group.rs:97`, `lib.rs:519`) so cutting the sample count is not an option without forking criterion. Two complementary changes here drop the worst-case shard wall under the 120-min CI cap while preserving (or improving) measurement quality. ## 1. `configure_group` budget tuning (`zstd/benches/compare_ffi.rs`) | Class | Old | New | Δ per side (×2) | |-------|-----|-----|----------------| | Small (1–10 KiB) | 3 s + ~3 s default warmup | 1 s + 0.2 s warmup | -8 s | | Corpus / Entropy (1 MiB) | 8 s + ~3 s default warmup | 3 s + 0.5 s warmup | -15 s | | Large / Silesia (16–100 MiB) | 10 s + 0.5 s | **20 s** + 0.5 s | +20 s | Small / Corpus / Entropy: the old budgets were wall-bound by the measurement window (criterion fit samples in less time than allotted). Shrinking the budget reclaims that headroom; sample count stays at 30 / 10 respectively so measurement quality is unchanged on fast-per-iter benches. Large / Silesia: the old 10 s was too tight. i686 / level_22_btultra2 / 100 MiB takes ~2 s per iter × 10 samples ≈ 20 s wall, so the budget was producing "increase target time" warnings + occasional flaky measurements on the slowest combos. Widening to 20 s removes the warning envelope without affecting wall on faster combos (criterion exits the budget early once samples complete). ## 2. Split `fast` and `lazy` shards (`.github/workflows/ci.yml`) `lazy` carried 11 levels (5–15) and `fast` 8 levels (-7..=-1, 1) — together ~50% of the main-push bench surface. The `lazy` shard at 120 min remained the consistent CI bottleneck. New split: - `fast-neg` (-7..=-3), `fast-pos` (-2..=-1, 1) - `lazy-lower` (5..=9), `lazy-upper` (10..=15) Other shards unchanged: dfast (2,3), greedy (4), btopt (16,17), btultra (18,19), btultra2 (20..=22). Total strategy groups now 9 (was 7) × 3 targets = **27 main-push shards (was 21)**. ## 3. .gitignore: add `CLAUDE.md` Project-local Claude rules file (created in an earlier session) is private to the maintainer's setup, mirrors `.claude/` and `AGENTS.md` which are already ignored. Stops `CLAUDE.md` from showing up in `git status` after every session. ## Expected impact - Worst-case shard wall: 120-min cap → ~30–40 min headroom (lazy now split in half + ~60% measurement savings on Small/Corpus/Entropy). - Large/Silesia measurement-quality regression: fixed. - Total CI bench parallel jobs go from 21 to 27 — same `runs-on: ubuntu-latest` matrix expands, GitHub-hosted runner quota accommodates fine. * ci(bench): skip entire bench pipeline on release-plz PRs (#164) Release-plz PRs are version-bump only (no source-code changes), so running the full bench matrix (build × 3 targets + 27 strategy shards on main / 3 on PR + aggregate + pages + regression) is pure CI waste — ~30 min wall + GitHub-hosted runner minutes consumed for zero usable output. Add an `if:` filter on `bench-matrix` that excludes: - `release-plz[bot]` as PR author - branch names starting with `release-plz-` The skip cascades through `needs:` so every downstream bench job (`bench-build`, `benchmark`, `benchmark-aggregate`, `benchmark-pages`, `benchmark-regression-check`) is automatically gated out by GH Actions' default skip-when-needs-skipped semantics. `benchmark-regression-check` already had its own equivalent filter from PR #159; this commit moves the gate one job upstream so the whole pipeline noops cleanly instead of running shards and then discarding the merged report at the gh-pages step (which is gated on `push && main` only). * fix(fse): switch SymbolTT.delta_nb_bits to u32 for 16-bit-target safety CodeRabbit caught a 16-bit-target overflow in the new `FSETable::next_state` flat-table arithmetic. The donor 16.16 fixed-point value `delta_nb_bits` was stored as `usize`, but the `<<16` / `>>16` shifts assume at least 32-bit width. On 16-bit targets (AVR, MSP430, no-atomic Cortex-M0 — explicitly supported profiles for this crate per the `critical-section` feature docs) `usize` is 16 bits and those shifts silently overflow to zero, breaking the encode arithmetic. Donor `fse_compress.c` uses `U32` throughout the same fixed-point math for the same reason — this aligns us with that invariant. Change: - `SymbolTT.delta_nb_bits: usize` → `u32` - Build-time arithmetic in `build_table_from_probabilities` now computes `delta_nb_bits` in `u32` (`-1 | 1` and `probability > 1` arms both updated to `<< 16` on u32 operands). - The dedup-loop fixed-point math (`current_value + delta_nb_bits`, `current_value >> num_bits`) switched to u32 to keep the same invariant. - `FSETable::next_state` casts `value` to u32 at the start of the arithmetic, then back to `usize` only at the final indexing sites (`1usize << nb_bits`, `value >> nb_bits` slot lookup). Behavior unchanged on 32-bit / 64-bit targets; only the silent-wrap bug on 16-bit `usize` targets is fixed.

Copilot AI review requested due to automatic review settings May 17, 2026 21:23

polaz merged commit bbc9db4 into main May 17, 2026
5 of 6 checks passed

polaz deleted the fix/#158-bench-regression-not-a-gate branch May 17, 2026 21:23

Copilot started reviewing on behalf of polaz May 17, 2026 21:24 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

This was referenced May 17, 2026

ci(bench): bump per-target timeout_minutes to 120 — current 45/60 caps still hit by lazy/fast shards #160

Closed

ci(bench): bump bench shard timeout_minutes to 120 — fixes stuck publish #161

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(bench): decouple regression alert from publish — fixes stuck baseline#159

ci(bench): decouple regression alert from publish — fixes stuck baseline#159
polaz merged 1 commit into
mainfrom
fix/#158-bench-regression-not-a-gate

polaz commented May 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 17, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

codecov Bot commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

polaz commented May 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this design

Currently broken since

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

codecov Bot commented May 17, 2026

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

polaz commented May 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 17, 2026 •

edited

Loading