ci(bench): decouple regression alert from publish - fixes stuck baseline #159
Previously the bench regression alert was wired into the `benchmark-aggregate` job alongside `save-data-file` and `auto-push`. On a main push that triggered the alert, the step exited with code 1 before `save-data-file` ran, freezing the baseline in gh-pages and cascading the same alert across every subsequent main push.

Split the responsibilities:

- `benchmark-aggregate` keeps the save/auto-push path, gated on `github.event_name == 'push' && github.ref == 'refs/heads/main'`. No `fail-on-alert`, no `comment-on-alert` here.
- New `benchmark-regression-check` job runs ONLY on regular developer PRs from this repo (excludes release-plz PRs by author + branch prefix; excludes push events; excludes fork PRs). Read-only comparison: alerts via commit comment + `@polaz` ping, fail-on-alert red, no save, no push.

Effect: main pushes always publish + advance the baseline; regression alerts surface in PR conversation where they can be discussed before merge; release-plz PRs (version-bump only) get no perf gate at all.
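The stuck-baseline behavior described above can be replayed in a few lines. This is a minimal sketch, not the workflow's actual logic: the `replay` function, the `coupled` flag, and the 1.5 alert threshold are hypothetical names and values chosen for illustration, and the ratio is assumed to be current result divided by stored baseline.

```rust
/// Replays bench results from successive main pushes against a stored baseline.
/// `coupled = true` models the old layout (alert and save in one step);
/// `coupled = false` models the split layout (main push only saves).
/// Returns the final baseline and the number of failed main-push runs.
/// All names and the threshold are hypothetical.
fn replay(results: &[f64], start_baseline: f64, threshold: f64, coupled: bool) -> (f64, usize) {
    let mut baseline = start_baseline;
    let mut failed = 0;
    for &current in results {
        // Assumed alert rule: current / baseline above the threshold.
        let regressed = current / baseline > threshold;
        if coupled && regressed {
            // Old layout: the step exits 1 before the save runs,
            // so the baseline is left untouched for the next push.
            failed += 1;
            continue;
        }
        // Save path: the baseline advances to the latest result.
        baseline = current;
    }
    (baseline, failed)
}
```

With a baseline of 100 and three later results near 167, the coupled layout fails every run and never moves the baseline, while the split layout records the first result and compares later pushes against it.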
Walkthrough

CI workflow refactors benchmark baseline handling by separating main-push baseline save from regression alerting. Main pushes now unconditionally persist baseline data to gh-pages without failing on regressions. A new regression-check job runs only on developer PRs to enforce alert thresholds and comment on commits, unblocking baseline publication.

Changes

Benchmark Baseline and Regression Check Separation
Codecov Report

✅ All modified and coverable lines are covered by tests.
Pull request overview
This PR separates benchmark baseline publishing from regression alerting so main-branch benchmark data can continue advancing even when a regression is detected.
Changes:
- Converts the benchmark aggregate action call into a main-push-only baseline save path.
- Adds a separate PR-only benchmark regression check job for same-repo developer PRs.
- Excludes main pushes, fork PRs, and release-plz PRs from the alert gate.
    github-token: ${{ secrets.GITHUB_TOKEN }}
    # Read-only comparison: no baseline write, no gh-pages push.
    # The save path lives in `benchmark-aggregate` and runs on
    # main push only.
    auto-push: false
    save-data-file: false
    # Surface regressions to the PR author + reviewer. Comment is
    # posted on the head commit so it shows in the PR conversation.
    comment-on-alert: true
…bles + tune CI bench budgets (#165)

* perf(fse): replace next_state linear search with donor-parity flat tables

`FSETable::next_state(symbol, idx)` previously called `SymbolStates::get`, which scanned `Vec<State>` per symbol via `.iter().find(state.contains(idx))`. On a typical level_3_dfast encode of decodecorpus-z000033 this fired ~3 × N_sequences times (LL + ML + OF per sequence), with each call walking through 5–30 states. `encode_sequences` showed up at 13.7% Rust self-time vs `ZSTD_encodeSequences` 6.4% on the C donor.

Root cause: the donor-parity precomputed tables (`deltaNbBits`, `deltaFindState`, flat `nextStateTable`) were already built in `build_table_from_probabilities` for the donor `FSE_encodeSymbol` arithmetic, used to populate `Vec<State>` and then discarded.

Change:

- Add `state_table_flat: Box<[u16]>` and `symbol_tt: [SymbolTT; 256]` to `FSETable`. Populated in `build_table_from_probabilities` from the same intermediates that fed the legacy `Vec<State>` push loop. Donor parity: `state_table_flat` mirrors `nextStateTable` byte for byte (u16, table_size entries); `SymbolTT` mirrors `FSE_symbolCompressionTransform`.
- `FSETable::next_state` now returns `State` by value, computed via donor arithmetic:

  ```text
  value = (1 << acc_log) + idx
  nb_bits = (value + delta_nb_bits) >> 16
  baseline = idx & !((1 << nb_bits) - 1)
  next_index = state_table_flat[(value >> nb_bits) + delta_find_state]
  ```

  O(1) lookup, no Vec deref, no Option wrap, no linear scan.
- `FSETable::start_state` returns `State` by value (was `&State`) to match the new shape so callers don't juggle lifetimes; still backed by the existing `Vec<State>` storage (called once per stream, not hot).
- `State` gains `Copy` (4 fields, all Copy).
- Retired methods: `SymbolStates::get` and `State::contains` (callers removed). `State.last_index` kept (used by the BTreeSet dedup in the builder) with `#[allow(dead_code)]` since it is no longer read on the encode hot path.
- Caller-side: `encode_sequences` (`blocks/compressed.rs`) and the internal `FSEEncoder` glue (`fse/fse_encoder.rs`) now store `Option<State>` instead of `Option<&State>`, a natural fit for the new by-value return shape.

Measurement (standalone, 200 iters, level_3_dfast / z000033, same session A/B):

| fixture | baseline | after | delta |
|---------|---------:|------:|------:|
| z000033 (target, compressible) | 20543 µs | 18286 µs | **−11.0%** |
| 1 MiB pseudo-random | 699 µs | 712 µs | +2% noise |
| 1 MiB repeating pattern | 1179 µs | 1183 µs | neutral |

z000033 is the canonical Phase-7 ratio-gap fixture; the encode_sequences path is hot exactly there. Random / RLE-shape inputs barely touch encode_sequences (raw fast-path / single short block) so the change is correctly neutral.

Gates green:

- `cargo nextest run -p structured-zstd --features dict_builder bench_internals` 526/526
- `cargo test --doc --features dict_builder bench_internals` 12/12
- `cargo clippy --features dict_builder bench_internals --all-targets -- -D warnings` clean
- `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS

* ci(bench): tune criterion budgets + split fast/lazy shards (#164)

criterion 0.8 hard-asserts `sample_size >= 10` (`benchmark_group.rs:97`, `lib.rs:519`) so cutting the sample count is not an option without forking criterion. Two complementary changes here drop the worst-case shard wall under the 120-min CI cap while preserving (or improving) measurement quality.

## 1. `configure_group` budget tuning (`zstd/benches/compare_ffi.rs`)

| Class | Old | New | Δ per side (×2) |
|-------|-----|-----|----------------|
| Small (1–10 KiB) | 3 s + ~3 s default warmup | 1 s + 0.2 s warmup | -8 s |
| Corpus / Entropy (1 MiB) | 8 s + ~3 s default warmup | 3 s + 0.5 s warmup | -15 s |
| Large / Silesia (16–100 MiB) | 10 s + 0.5 s | **20 s** + 0.5 s | +20 s |

Small / Corpus / Entropy: the old budgets were wall-bound by the measurement window (criterion fit samples in less time than allotted). Shrinking the budget reclaims that headroom; sample count stays at 30 / 10 respectively so measurement quality is unchanged on fast-per-iter benches.

Large / Silesia: the old 10 s was too tight. i686 / level_22_btultra2 / 100 MiB takes ~2 s per iter × 10 samples ≈ 20 s wall, so the budget was producing "increase target time" warnings + occasional flaky measurements on the slowest combos. Widening to 20 s removes the warning envelope without affecting wall on faster combos (criterion exits the budget early once samples complete).

## 2. Split `fast` and `lazy` shards (`.github/workflows/ci.yml`)

`lazy` carried 11 levels (5–15) and `fast` 8 levels (-7..=-1, 1), together ~50% of the main-push bench surface. The `lazy` shard at 120 min remained the consistent CI bottleneck.

New split:

- `fast-neg` (-7..=-3), `fast-pos` (-2..=-1, 1)
- `lazy-lower` (5..=9), `lazy-upper` (10..=15)

Other shards unchanged: dfast (2,3), greedy (4), btopt (16,17), btultra (18,19), btultra2 (20..=22). Total strategy groups now 9 (was 7) × 3 targets = **27 main-push shards (was 21)**.

## 3. .gitignore: add `CLAUDE.md`

Project-local Claude rules file (created in an earlier session) is private to the maintainer's setup, mirrors `.claude/` and `AGENTS.md` which are already ignored. Stops `CLAUDE.md` from showing up in `git status` after every session.

## Expected impact

- Worst-case shard wall: 120-min cap → ~30–40 min headroom (lazy now split in half + ~60% measurement savings on Small/Corpus/Entropy).
- Large/Silesia measurement-quality regression: fixed.
- Total CI bench parallel jobs go from 21 to 27; the same `runs-on: ubuntu-latest` matrix expands, GitHub-hosted runner quota accommodates fine.

* ci(bench): skip entire bench pipeline on release-plz PRs (#164)

Release-plz PRs are version-bump only (no source-code changes), so running the full bench matrix (build × 3 targets + 27 strategy shards on main / 3 on PR + aggregate + pages + regression) is pure CI waste: ~30 min wall + GitHub-hosted runner minutes consumed for zero usable output.

Add an `if:` filter on `bench-matrix` that excludes:

- `release-plz[bot]` as PR author
- branch names starting with `release-plz-`

The skip cascades through `needs:` so every downstream bench job (`bench-build`, `benchmark`, `benchmark-aggregate`, `benchmark-pages`, `benchmark-regression-check`) is automatically gated out by GH Actions' default skip-when-needs-skipped semantics.

`benchmark-regression-check` already had its own equivalent filter from PR #159; this commit moves the gate one job upstream so the whole pipeline noops cleanly instead of running shards and then discarding the merged report at the gh-pages step (which is gated on `push && main` only).

* fix(fse): switch SymbolTT.delta_nb_bits to u32 for 16-bit-target safety

CodeRabbit caught a 16-bit-target overflow in the new `FSETable::next_state` flat-table arithmetic. The donor 16.16 fixed-point value `delta_nb_bits` was stored as `usize`, but the `<<16` / `>>16` shifts assume at least 32-bit width. On 16-bit targets (AVR, MSP430, no-atomic Cortex-M0; explicitly supported profiles for this crate per the `critical-section` feature docs) `usize` is 16 bits and those shifts silently overflow to zero, breaking the encode arithmetic.

Donor `fse_compress.c` uses `U32` throughout the same fixed-point math for the same reason; this aligns us with that invariant.

Change:

- `SymbolTT.delta_nb_bits: usize` → `u32`
- Build-time arithmetic in `build_table_from_probabilities` now computes `delta_nb_bits` in `u32` (`-1 | 1` and `probability > 1` arms both updated to `<< 16` on u32 operands).
- The dedup-loop fixed-point math (`current_value + delta_nb_bits`, `current_value >> num_bits`) switched to u32 to keep the same invariant.
- `FSETable::next_state` casts `value` to u32 at the start of the arithmetic, then back to `usize` only at the final indexing sites (`1usize << nb_bits`, `value >> nb_bits` slot lookup).

Behavior unchanged on 32-bit / 64-bit targets; only the silent-wrap bug on 16-bit `usize` targets is fixed.
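The `next_state` arithmetic from these commits can be sketched in isolation. This is an illustration under stated assumptions, not the crate's code: the struct layouts, the `NextState` name, and the signed `i32` type for `delta_find_state` (chosen to mirror the donor's signed field) are all hypothetical, and the fixed-point math is kept in `u32` as the follow-up fix describes.

```rust
/// Hypothetical per-symbol transform; field names follow the commit message,
/// the types are assumptions.
#[derive(Clone, Copy)]
struct SymbolTT {
    delta_nb_bits: u32,    // 16.16 fixed-point value, kept in u32
    delta_find_state: i32, // assumed signed, as in the donor struct
}

/// Hypothetical by-value result of the lookup.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NextState {
    nb_bits: u32,
    baseline: usize,
    next_index: u16,
}

/// O(1) next-state lookup: no scan over per-symbol states.
fn next_state(acc_log: u32, idx: usize, tt: SymbolTT, state_table_flat: &[u16]) -> NextState {
    // Do the fixed-point arithmetic in u32 so `>> 16` is well defined
    // even when usize is only 16 bits wide.
    let value = (1u32 << acc_log) + idx as u32;
    let nb_bits = (value + tt.delta_nb_bits) >> 16;
    // Clear the low nb_bits bits of the index to get the baseline.
    let baseline = idx & !((1usize << nb_bits) - 1);
    // Convert to usize only at the final indexing site.
    let slot = (value >> nb_bits) as i32 + tt.delta_find_state;
    NextState {
        nb_bits,
        baseline,
        next_index: state_table_flat[slot as usize],
    }
}
```

For a symbol with normalized count 3 in an 8-entry table (`acc_log = 3`), the donor formulas give `delta_nb_bits = (2 << 16) - 12` and, if it is the first symbol, `delta_find_state = -3`. With those values, indices 0 to 3 need 1 bit and indices 4 to 7 need 2 bits, and every lookup lands in one of the symbol's three slots.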
Summary
Splits the bench regression alert out of the `benchmark-aggregate` save path into its own `benchmark-regression-check` job that runs only on regular developer PRs.

Before:

`Store benchmark results` step had `fail-on-alert: true` (on main push) and `save-data-file: true` in the same action call. Alert → step exits 1 → save never runs → baseline freezes → next main push hits the same alert vs the same stale baseline → freeze forever.

After:

- `benchmark-aggregate` keeps a save-only step (`auto-push` + `save-data-file` for main push, no alert/fail). Baseline always advances when bench shards pass.
- `benchmark-regression-check` job runs read-only comparison + comment + fail-on-alert. Triggers ONLY when ALL of:
  - event is `pull_request`
  - PR author is not `release-plz[bot]`
  - branch does not start with `release-plz-`

Why this design

- […] `dev/bench` over time; no need to red-CI them on main.
- […] `@polaz` so the regression is acknowledged before merge.

Currently broken since

SHA `1e0f6954` (16 May 02:28 KYIV): `compress/level_9_lazy/decodecorpus-z000033/matrix/pure_rust` ratio 1.67 vs baseline `fa2e9ff6`. Every main push since either failed at the same alert or got cancelled by concurrency. Bench dashboard at `dev/bench` has been stale for two days.

After this lands, the next main push that completes will advance the baseline regardless of regression magnitude. The `level_9_lazy` regression itself still needs investigation as a separate issue.

Test plan

- […] (`python3 -c yaml.safe_load`)
- `if:` evaluates correctly: `pull_request` event, same-repo head, `polaz` author, branch `fix/#158-...` → runs gate
- `benchmark-aggregate` save step now strictly gated on `push && main` so PR runs don't try to call it (the save action only runs once main merge happens)

Closes #158
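The `if:` case in the test plan can be modelled as two plain predicates. This is a minimal sketch, assuming a hypothetical `Event` struct standing in for the `github` context; it encodes only the conditions named in this PR, and the real workflow expressions may be written differently.

```rust
/// Hypothetical stand-in for the fields of the `github` context
/// that the two job conditions look at.
struct Event<'a> {
    event_name: &'a str,     // e.g. "push" or "pull_request"
    git_ref: &'a str,        // e.g. "refs/heads/main"
    pr_author: &'a str,      // PR author login, empty on push
    head_branch: &'a str,    // PR head branch, empty on push
    head_is_same_repo: bool, // false for fork PRs
}

/// Save path: only a push to main publishes and advances the baseline.
fn publishes_baseline(e: &Event<'_>) -> bool {
    e.event_name == "push" && e.git_ref == "refs/heads/main"
}

/// Alert gate: same-repo developer PRs only, release-plz PRs excluded
/// by both author and branch prefix.
fn runs_regression_check(e: &Event<'_>) -> bool {
    e.event_name == "pull_request"
        && e.head_is_same_repo
        && e.pr_author != "release-plz[bot]"
        && !e.head_branch.starts_with("release-plz-")
}
```

Under this model no event satisfies both predicates, which is the point of the split: a run either publishes the baseline or gates on regressions, never both.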