Skip to content

ci(bench): decouple regression alert from publish — fixes stuck baseline#159

Merged
polaz merged 1 commit into
mainfrom
fix/#158-bench-regression-not-a-gate
May 17, 2026
Merged

ci(bench): decouple regression alert from publish — fixes stuck baseline#159
polaz merged 1 commit into
mainfrom
fix/#158-bench-regression-not-a-gate

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 17, 2026

Summary

Splits the bench regression alert out of the benchmark-aggregate save path into its own benchmark-regression-check job that runs only on regular developer PRs.

Before: Store benchmark results step had fail-on-alert: true (on main push) and save-data-file: true in the same action call. Alert → step exits 1 → save never runs → baseline freezes → next main push hits the same alert vs the same stale baseline → freeze forever.

After:

  • benchmark-aggregate keeps a save-only step (auto-push + save-data-file for main push, no alert/fail). Baseline always advances when bench shards pass.
  • New benchmark-regression-check job runs read-only comparison + comment + fail-on-alert. Triggers ONLY when ALL of:
    • event is pull_request
    • head repo == base repo (excludes fork PRs that have no token access)
    • author is not release-plz[bot]
    • head branch does not start with release-plz-

Why this design

  • Main push is the immutable historical record. Regressions are visible on the dashboard at dev/bench over time — no need to red-CI them on main.
  • release-plz PRs are version-bump only, perf comparison is pure noise — no gate.
  • Fork PRs have no gh-pages token access so the alert step would emit a confusing comparison-failed line anyway.
  • A red alert on a regular developer PR posts a comment on the head commit + pings @polaz so the regression is acknowledged before merge.

Currently broken since

SHA 1e0f6954 (16 May 02:28 KYIV) — compress/level_9_lazy/decodecorpus-z000033/matrix/pure_rust ratio 1.67 vs baseline fa2e9ff6. Every main push since either failed at the same alert or got cancelled by concurrency. Bench dashboard at dev/bench has been stale for two days.

After this lands, the next main push that completes will advance the baseline regardless of regression magnitude. The level_9_lazy regression itself still needs investigation as a separate issue.

Test plan

  • YAML syntax validated (python3 -c yaml.safe_load)
  • New job's if: evaluates correctly:
    • This PR: pull_request event, same-repo head, polaz author, branch fix/#158-... → runs gate
    • release-plz PRs: author check excludes → skipped
    • main push: event check excludes → skipped
  • benchmark-aggregate save step now strictly gated on push && main so PR runs don't try to call it (the save action only runs once main merge happens)

Closes #158

Summary by CodeRabbit

  • Chores
    • Implemented automated benchmark regression testing on pull requests to detect performance regressions, with alerts posted to PR comments and CI failure when thresholds are exceeded.
    • Enhanced benchmark baseline saving on main branch pushes with improved alert management, now separating baseline storage from regression detection workflows.

Review Change Stack

Previously the bench regression alert was wired into the
`benchmark-aggregate` job alongside `save-data-file` and `auto-push`.
On a main push that triggered the alert, the step exited with code 1
before `save-data-file` ran — freezing the baseline in gh-pages and
cascading the same alert across every subsequent main push.

Split the responsibilities:

- `benchmark-aggregate` keeps the save/auto-push path, gated on
  `github.event_name == 'push' && github.ref == 'refs/heads/main'`.
  No `fail-on-alert`, no `comment-on-alert` here.
- New `benchmark-regression-check` job runs ONLY on regular developer
  PRs from this repo (excludes release-plz PRs by author + branch
  prefix; excludes push events; excludes fork PRs). Read-only
  comparison: alerts via commit comment + `@polaz` ping, fail-on-alert
  red, no save, no push.

Effect: main pushes always publish + advance the baseline; regression
alerts surface in PR conversation where they can be discussed before
merge; release-plz PRs (version-bump only) get no perf gate at all.
Copilot AI review requested due to automatic review settings May 17, 2026 21:23
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e46a42c-3e0e-4496-8f47-72ded8216c79

📥 Commits

Reviewing files that changed from the base of the PR and between 740ff2a and ad33451.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml

📝 Walkthrough

Walkthrough

CI workflow refactors benchmark baseline handling by separating main-push baseline save from regression alerting. Main pushes now unconditionally persist baseline data to gh-pages without failing on regressions. A new regression-check job runs only on developer PRs to enforce alert thresholds and comment on commits, unblocking baseline publication.

Changes

Benchmark Baseline and Regression Check Separation

Layer / File(s) Summary
Main-push baseline save (no fail gate)
.github/workflows/ci.yml
The "Store benchmark results" step is replaced with "Save benchmark baseline (main push only)" gated to push on refs/heads/main. It enables auto-push and save-data-file while disabling alert comments and fail-on-alert, ensuring baseline advancement on main regardless of regression thresholds.
PR regression check gate
.github/workflows/ci.yml
New benchmark-regression-check job runs only for developer PRs (excluding pushes, release-plz bot, and release-plz-prefixed refs). It downloads aggregated benchmark artifacts and runs comparison-only mode with alert comments enabled and fail-on-alert true, gating only the regression-check job without blocking baseline publication.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A baseline that flows, unobstructed and true,
Main branches push forward with regressions in view.
While PRs stand guard with alerts ringing clear,
The dashboard marches on—no stuck gates here! 🔔✨

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/#158-bench-regression-not-a-gate

Comment @coderabbitai help to get the list of available commands and usage tips.

@polaz polaz merged commit bbc9db4 into main May 17, 2026
5 of 6 checks passed
@polaz polaz deleted the fix/#158-bench-regression-not-a-gate branch May 17, 2026 21:23
@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR separates benchmark baseline publishing from regression alerting so main-branch benchmark data can continue advancing even when a regression is detected.

Changes:

  • Converts the benchmark aggregate action call into a main-push-only baseline save path.
  • Adds a separate PR-only benchmark regression check job for same-repo developer PRs.
  • Excludes main pushes, fork PRs, and release-plz PRs from the alert gate.

Comment thread .github/workflows/ci.yml
Comment on lines +698 to +706
github-token: ${{ secrets.GITHUB_TOKEN }}
# Read-only comparison: no baseline write, no gh-pages push.
# The save path lives in `benchmark-aggregate` and runs on
# main push only.
auto-push: false
save-data-file: false
# Surface regressions to the PR author + reviewer. Comment is
# posted on the head commit so it shows in the PR conversation.
comment-on-alert: true
polaz added a commit that referenced this pull request May 18, 2026
Release-plz PRs are version-bump only (no source-code changes), so
running the full bench matrix (build × 3 targets + 27 strategy
shards on main / 3 on PR + aggregate + pages + regression) is pure
CI waste — ~30 min wall + GitHub-hosted runner minutes consumed
for zero usable output.

Add an `if:` filter on `bench-matrix` that excludes:
- `release-plz[bot]` as PR author
- branch names starting with `release-plz-`

The skip cascades through `needs:` so every downstream bench job
(`bench-build`, `benchmark`, `benchmark-aggregate`,
`benchmark-pages`, `benchmark-regression-check`) is automatically
gated out by GH Actions' default skip-when-needs-skipped semantics.

`benchmark-regression-check` already had its own equivalent filter
from PR #159; this commit moves the gate one job upstream so the
whole pipeline noops cleanly instead of running shards and then
discarding the merged report at the gh-pages step (which is gated
on `push && main` only).
polaz added a commit that referenced this pull request May 18, 2026
…bles + tune CI bench budgets (#165)

* perf(fse): replace next_state linear search with donor-parity flat tables

`FSETable::next_state(symbol, idx)` previously called
`SymbolStates::get` which scanned `Vec<State>` per symbol via
`.iter().find(state.contains(idx))`. On a typical level_3_dfast
encode of decodecorpus-z000033 this fired ~3 × N_sequences times
(LL + ML + OF per sequence), with each call walking through
5–30 states. `encode_sequences` showed up at 13.7% Rust self-time
vs `ZSTD_encodeSequences` 6.4% on the C donor.

Root cause: the donor-parity precomputed tables (`deltaNbBits`,
`deltaFindState`, flat `nextStateTable`) were already built in
`build_table_from_probabilities` for the donor `FSE_encodeSymbol`
arithmetic — used to populate `Vec<State>` and then discarded.

Change:

- Add `state_table_flat: Box<[u16]>` and `symbol_tt: [SymbolTT; 256]`
  to `FSETable`. Populated in `build_table_from_probabilities` from
  the same intermediates that fed the legacy `Vec<State>` push loop.
  Donor parity: `state_table_flat` mirrors `nextStateTable` byte for
  byte (u16, table_size entries); `SymbolTT` mirrors
  `FSE_symbolCompressionTransform`.

- `FSETable::next_state` now returns `State` by value, computed via
  donor arithmetic:
  ```text
  value      = (1 << acc_log) + idx
  nb_bits    = (value + delta_nb_bits) >> 16
  baseline   = idx & !((1 << nb_bits) - 1)
  next_index = state_table_flat[(value >> nb_bits) + delta_find_state]
  ```
  O(1) lookup, no Vec deref, no Option wrap, no linear scan.

- `FSETable::start_state` returns `State` by value (was `&State`) to
  match the new shape so callers don't juggle lifetimes; still backed
  by the existing `Vec<State>` storage (called once per stream, not
  hot).

- `State` gains `Copy` (4 fields, all Copy).

- Retired methods: `SymbolStates::get` and `State::contains` (callers
  removed). `State.last_index` kept (used by the BTreeSet dedup in
  the builder) with `#[allow(dead_code)]` since it is no longer read
  on the encode hot path.

- Caller-side: `encode_sequences` (`blocks/compressed.rs`) and the
  internal `FSEEncoder` glue (`fse/fse_encoder.rs`) now store
  `Option<State>` instead of `Option<&State>` — natural fit for the
  new by-value return shape.

Measurement (standalone, 200 iters, level_3_dfast / z000033, same
session A/B):

| fixture | baseline | after | delta |
|---------|---------:|------:|------:|
| z000033 (target, compressible) | 20543 µs | 18286 µs | **−11.0%** |
| 1 MiB pseudo-random | 699 µs | 712 µs | +2% noise |
| 1 MiB repeating pattern | 1179 µs | 1183 µs | neutral |

z000033 is the canonical Phase-7 ratio-gap fixture — the encode_sequences
path is hot exactly there. Random / RLE-shape inputs barely touch
encode_sequences (raw fast-path / single short block) so the change is
correctly neutral.

Gates green:
- `cargo nextest run -p structured-zstd --features dict_builder bench_internals` 526/526
- `cargo test --doc --features dict_builder bench_internals` 12/12
- `cargo clippy --features dict_builder bench_internals --all-targets -- -D warnings` clean
- `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS

* ci(bench): tune criterion budgets + split fast/lazy shards (#164)

criterion 0.8 hard-asserts `sample_size >= 10` (`benchmark_group.rs:97`,
`lib.rs:519`) so cutting the sample count is not an option without
forking criterion. Two complementary changes here drop the worst-case
shard wall under the 120-min CI cap while preserving (or improving)
measurement quality.

## 1. `configure_group` budget tuning (`zstd/benches/compare_ffi.rs`)

| Class | Old | New | Δ per side (×2) |
|-------|-----|-----|----------------|
| Small (1–10 KiB) | 3 s + ~3 s default warmup | 1 s + 0.2 s warmup | -8 s |
| Corpus / Entropy (1 MiB) | 8 s + ~3 s default warmup | 3 s + 0.5 s warmup | -15 s |
| Large / Silesia (16–100 MiB) | 10 s + 0.5 s | **20 s** + 0.5 s | +20 s |

Small / Corpus / Entropy: the old budgets were wall-bound by the
measurement window (criterion fit samples in less time than allotted).
Shrinking the budget reclaims that headroom; sample count stays at
30 / 10 respectively so measurement quality is unchanged on
fast-per-iter benches.

Large / Silesia: the old 10 s was too tight. i686 / level_22_btultra2 /
100 MiB takes ~2 s per iter × 10 samples ≈ 20 s wall, so the budget
was producing "increase target time" warnings + occasional flaky
measurements on the slowest combos. Widening to 20 s removes the
warning envelope without affecting wall on faster combos (criterion
exits the budget early once samples complete).

## 2. Split `fast` and `lazy` shards (`.github/workflows/ci.yml`)

`lazy` carried 11 levels (5–15) and `fast` 8 levels (-7..=-1, 1) —
together ~50% of the main-push bench surface. The `lazy` shard at
120 min remained the consistent CI bottleneck.

New split:
- `fast-neg` (-7..=-3), `fast-pos` (-2..=-1, 1)
- `lazy-lower` (5..=9), `lazy-upper` (10..=15)

Other shards unchanged: dfast (2,3), greedy (4), btopt (16,17),
btultra (18,19), btultra2 (20..=22). Total strategy groups now 9
(was 7) × 3 targets = **27 main-push shards (was 21)**.

## 3. .gitignore: add `CLAUDE.md`

Project-local Claude rules file (created in an earlier session) is
private to the maintainer's setup, mirrors `.claude/` and `AGENTS.md`
which are already ignored. Stops `CLAUDE.md` from showing up in
`git status` after every session.

## Expected impact

- Worst-case shard wall: 120-min cap → ~30–40 min headroom (lazy now
  split in half + ~60% measurement savings on Small/Corpus/Entropy).
- Large/Silesia measurement-quality regression: fixed.
- Total CI bench parallel jobs go from 21 to 27 — same `runs-on:
  ubuntu-latest` matrix expands, GitHub-hosted runner quota
  accommodates fine.

* ci(bench): skip entire bench pipeline on release-plz PRs (#164)

Release-plz PRs are version-bump only (no source-code changes), so
running the full bench matrix (build × 3 targets + 27 strategy
shards on main / 3 on PR + aggregate + pages + regression) is pure
CI waste — ~30 min wall + GitHub-hosted runner minutes consumed
for zero usable output.

Add an `if:` filter on `bench-matrix` that excludes:
- `release-plz[bot]` as PR author
- branch names starting with `release-plz-`

The skip cascades through `needs:` so every downstream bench job
(`bench-build`, `benchmark`, `benchmark-aggregate`,
`benchmark-pages`, `benchmark-regression-check`) is automatically
gated out by GH Actions' default skip-when-needs-skipped semantics.

`benchmark-regression-check` already had its own equivalent filter
from PR #159; this commit moves the gate one job upstream so the
whole pipeline noops cleanly instead of running shards and then
discarding the merged report at the gh-pages step (which is gated
on `push && main` only).

* fix(fse): switch SymbolTT.delta_nb_bits to u32 for 16-bit-target safety

CodeRabbit caught a 16-bit-target overflow in the new
`FSETable::next_state` flat-table arithmetic. The donor 16.16
fixed-point value `delta_nb_bits` was stored as `usize`, but the
`<<16` / `>>16` shifts assume at least 32-bit width. On 16-bit
targets (AVR, MSP430, no-atomic Cortex-M0 — explicitly supported
profiles for this crate per the `critical-section` feature docs)
`usize` is 16 bits and those shifts silently overflow to zero,
breaking the encode arithmetic.

Donor `fse_compress.c` uses `U32` throughout the same fixed-point
math for the same reason — this aligns us with that invariant.

Change:
- `SymbolTT.delta_nb_bits: usize` → `u32`
- Build-time arithmetic in `build_table_from_probabilities` now
  computes `delta_nb_bits` in `u32` (`-1 | 1` and `probability > 1`
  arms both updated to `<< 16` on u32 operands).
- The dedup-loop fixed-point math (`current_value + delta_nb_bits`,
  `current_value >> num_bits`) switched to u32 to keep the same
  invariant.
- `FSETable::next_state` casts `value` to u32 at the start of the
  arithmetic, then back to `usize` only at the final indexing sites
  (`1usize << nb_bits`, `value >> nb_bits` slot lookup).

Behavior unchanged on 32-bit / 64-bit targets; only the silent-wrap
bug on 16-bit `usize` targets is fixed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci(bench): remove fail-on-alert — regression is informational, not a gate

2 participants