perf(fse): replace next_state linear search with donor-parity flat tables + tune CI bench budgets#165
Conversation
…bles `FSETable::next_state(symbol, idx)` previously called `SymbolStates::get` which scanned `Vec<State>` per symbol via `.iter().find(state.contains(idx))`. On a typical level_3_dfast encode of decodecorpus-z000033 this fired ~3 × N_sequences times (LL + ML + OF per sequence), with each call walking through 5–30 states. `encode_sequences` showed up at 13.7% Rust self-time vs `ZSTD_encodeSequences` 6.4% on the C donor. Root cause: the donor-parity precomputed tables (`deltaNbBits`, `deltaFindState`, flat `nextStateTable`) were already built in `build_table_from_probabilities` for the donor `FSE_encodeSymbol` arithmetic — used to populate `Vec<State>` and then discarded. Change: - Add `state_table_flat: Box<[u16]>` and `symbol_tt: [SymbolTT; 256]` to `FSETable`. Populated in `build_table_from_probabilities` from the same intermediates that fed the legacy `Vec<State>` push loop. Donor parity: `state_table_flat` mirrors `nextStateTable` byte for byte (u16, table_size entries); `SymbolTT` mirrors `FSE_symbolCompressionTransform`. - `FSETable::next_state` now returns `State` by value, computed via donor arithmetic: ```text value = (1 << acc_log) + idx nb_bits = (value + delta_nb_bits) >> 16 baseline = idx & !((1 << nb_bits) - 1) next_index = state_table_flat[(value >> nb_bits) + delta_find_state] ``` O(1) lookup, no Vec deref, no Option wrap, no linear scan. - `FSETable::start_state` returns `State` by value (was `&State`) to match the new shape so callers don't juggle lifetimes; still backed by the existing `Vec<State>` storage (called once per stream, not hot). - `State` gains `Copy` (4 fields, all Copy). - Retired methods: `SymbolStates::get` and `State::contains` (callers removed). `State.last_index` kept (used by the BTreeSet dedup in the builder) with `#[allow(dead_code)]` since it is no longer read on the encode hot path. - Caller-side: `encode_sequences` (`blocks/compressed.rs`) and the internal `FSEEncoder` glue (`fse/fse_encoder.rs`) now store `Option<State>` instead of `Option<&State>` — natural fit for the new by-value return shape. Measurement (standalone, 200 iters, level_3_dfast / z000033, same session A/B): | fixture | baseline | after | delta | |---------|---------:|------:|------:| | z000033 (target, compressible) | 20543 µs | 18286 µs | **−11.0%** | | 1 MiB pseudo-random | 699 µs | 712 µs | +2% noise | | 1 MiB repeating pattern | 1179 µs | 1183 µs | neutral | z000033 is the canonical Phase-7 ratio-gap fixture — the encode_sequences path is hot exactly there. Random / RLE-shape inputs barely touch encode_sequences (raw fast-path / single short block) so the change is correctly neutral. Gates green: - `cargo nextest run -p structured-zstd --features dict_builder bench_internals` 526/526 - `cargo test --doc --features dict_builder bench_internals` 12/12 - `cargo clippy --features dict_builder bench_internals --all-targets -- -D warnings` clean - `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS
criterion 0.8 hard-asserts `sample_size >= 10` (`benchmark_group.rs:97`, `lib.rs:519`) so cutting the sample count is not an option without forking criterion. Two complementary changes here drop the worst-case shard wall under the 120-min CI cap while preserving (or improving) measurement quality. ## 1. `configure_group` budget tuning (`zstd/benches/compare_ffi.rs`) | Class | Old | New | Δ per side (×2) | |-------|-----|-----|----------------| | Small (1–10 KiB) | 3 s + ~3 s default warmup | 1 s + 0.2 s warmup | -8 s | | Corpus / Entropy (1 MiB) | 8 s + ~3 s default warmup | 3 s + 0.5 s warmup | -15 s | | Large / Silesia (16–100 MiB) | 10 s + 0.5 s | **20 s** + 0.5 s | +20 s | Small / Corpus / Entropy: the old budgets were wall-bound by the measurement window (criterion fit samples in less time than allotted). Shrinking the budget reclaims that headroom; sample count stays at 30 / 10 respectively so measurement quality is unchanged on fast-per-iter benches. Large / Silesia: the old 10 s was too tight. i686 / level_22_btultra2 / 100 MiB takes ~2 s per iter × 10 samples ≈ 20 s wall, so the budget was producing "increase target time" warnings + occasional flaky measurements on the slowest combos. Widening to 20 s removes the warning envelope without affecting wall on faster combos (criterion exits the budget early once samples complete). ## 2. Split `fast` and `lazy` shards (`.github/workflows/ci.yml`) `lazy` carried 11 levels (5–15) and `fast` 8 levels (-7..=-1, 1) — together ~50% of the main-push bench surface. The `lazy` shard at 120 min remained the consistent CI bottleneck. New split: - `fast-neg` (-7..=-3), `fast-pos` (-2..=-1, 1) - `lazy-lower` (5..=9), `lazy-upper` (10..=15) Other shards unchanged: dfast (2,3), greedy (4), btopt (16,17), btultra (18,19), btultra2 (20..=22). Total strategy groups now 9 (was 7) × 3 targets = **27 main-push shards (was 21)**. ## 3. .gitignore: add `CLAUDE.md` Project-local Claude rules file (created in an earlier session) is private to the maintainer's setup, mirrors `.claude/` and `AGENTS.md` which are already ignored. Stops `CLAUDE.md` from showing up in `git status` after every session. ## Expected impact - Worst-case shard wall: 120-min cap → ~30–40 min headroom (lazy now split in half + ~60% measurement savings on Small/Corpus/Entropy). - Large/Silesia measurement-quality regression: fixed. - Total CI bench parallel jobs go from 21 to 27 — same `runs-on: ubuntu-latest` matrix expands, GitHub-hosted runner quota accommodates fine.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR replaces linear-scan FSE next-state lookups with donor-parity O(1) arithmetic and flat tables, updates FSE table construction to emit per-symbol deltas and a flat state table, tunes benchmark timing, and expands CI shard splitting for main; also adds ChangesFSE O(1) state transition optimization
Sequence Diagram(s)N/A Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Replaces FSE encoder's per-sequence linear next_state scan with O(1) donor-parity arithmetic by retaining the flat next-state table and per-symbol coding transform already computed at table build time, and tunes CI bench wall-clock budgets / shard sharding to fit the 120-min cap.
Changes:
- Adds
state_table_flat: Box<[u16]>+symbol_tt: [SymbolTT; 256]toFSETable; rewritesnext_state(and adjustsstart_state) to returnStateby value via donor arithmetic; removesSymbolStates::getandState::contains; makesState: Copy. - Re-tunes
configure_groupbudgets (Small 3→1s, Corpus/Entropy 8→3s, Large/Silesia 10→20s + warmups), and splits CI bench shardsfastandlazyinto smaller sub-shards. - Adds
CLAUDE.mdto.gitignore.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| zstd/src/fse/fse_encoder.rs | Stores donor flat next-state table and per-symbol TT on FSETable; converts hot-path next_state to O(1) arithmetic; retires linear scan. |
| zstd/benches/compare_ffi.rs | Adjusts criterion measurement/warmup budgets per scenario class to fit CI wall-clock cap. |
| .github/workflows/ci.yml | Splits fast and lazy bench shards; updates accompanying comments to reflect 9 strategy groups / 27 shards. |
| .gitignore | Ignores project-local CLAUDE.md. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 147-150: The SymbolTT field delta_nb_bits is used as a 16.16
fixed-point value but currently stored as usize causing 16-bit-target shift
overflow; change delta_nb_bits to a u32 (keep delta_find_state as-is) and
perform all fixed-point arithmetic and <<16 / >>16 shifts in u32, only
converting to usize at the final index or table lookup sites (where
delta_nb_bits is used to compute an index/offset). Update all usages in
fse_encoder.rs that do the 16.16 shifts (the same spots mentioned in the review)
to cast operands to u32 before shifting and cast results to usize only for array
indexing or API calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 10256d8d-7b87-4c80-84a3-8bd0c2338cc4
📒 Files selected for processing (4)
.github/workflows/ci.yml.gitignorezstd/benches/compare_ffi.rszstd/src/fse/fse_encoder.rs
Release-plz PRs are version-bump only (no source-code changes), so running the full bench matrix (build × 3 targets + 27 strategy shards on main / 3 on PR + aggregate + pages + regression) is pure CI waste — ~30 min wall + GitHub-hosted runner minutes consumed for zero usable output. Add an `if:` filter on `bench-matrix` that excludes: - `release-plz[bot]` as PR author - branch names starting with `release-plz-` The skip cascades through `needs:` so every downstream bench job (`bench-build`, `benchmark`, `benchmark-aggregate`, `benchmark-pages`, `benchmark-regression-check`) is automatically gated out by GH Actions' default skip-when-needs-skipped semantics. `benchmark-regression-check` already had its own equivalent filter from PR #159; this commit moves the gate one job upstream so the whole pipeline noops cleanly instead of running shards and then discarding the merged report at the gh-pages step (which is gated on `push && main` only).
CodeRabbit caught a 16-bit-target overflow in the new `FSETable::next_state` flat-table arithmetic. The donor 16.16 fixed-point value `delta_nb_bits` was stored as `usize`, but the `<<16` / `>>16` shifts assume at least 32-bit width. On 16-bit targets (AVR, MSP430, no-atomic Cortex-M0 — explicitly supported profiles for this crate per the `critical-section` feature docs) `usize` is 16 bits and those shifts silently overflow to zero, breaking the encode arithmetic. Donor `fse_compress.c` uses `U32` throughout the same fixed-point math for the same reason — this aligns us with that invariant. Change: - `SymbolTT.delta_nb_bits: usize` → `u32` - Build-time arithmetic in `build_table_from_probabilities` now computes `delta_nb_bits` in `u32` (`-1 | 1` and `probability > 1` arms both updated to `<< 16` on u32 operands). - The dedup-loop fixed-point math (`current_value + delta_nb_bits`, `current_value >> num_bits`) switched to u32 to keep the same invariant. - `FSETable::next_state` casts `value` to u32 at the start of the arithmetic, then back to `usize` only at the final indexing sites (`1usize << nb_bits`, `value >> nb_bits` slot lookup). Behavior unchanged on 32-bit / 64-bit targets; only the silent-wrap bug on 16-bit `usize` targets is fixed.
…E-encode per candidate (#168) * perf(huff0): replace FSE-encode-per-candidate with cheap entropy proxy #167. `HuffmanTable::build_from_counts` previously called `try_table_description_size` per `min_table_log..=11` candidate, which ran a full FSE-encode of the weight stream against a freshly-built FSE table just to count bytes (~31 % inclusive on the 4 KiB compress profile after #166). Replace with `cheap_desc_size_proxy(weights)`: an integer entropy estimate that reproduces the donor `HUF_writeCTable_wksp` decision (FSE vs raw nibble) without touching the FSE encoder. - FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin weight histogram + 8 B header overhead (empirical upper bound for the `acc_log = 6` weight FSE table seen in `encode_weight_description`). - Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable when `weights.len() <= 128`. - Return min of the two when both are representable. Validated: - 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`) - `compare_ffi` ratio REPORT sweep across all scenarios × all levels: no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19 *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers selection to a marginally tighter `table_log` on those cells. Every small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast R=150 ≤ C=157, …). Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`): 47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the pre-#165 baseline). Gap vs donor 6.7× → 4.8×. * fix(huff0): use ceiling division in cheap_desc_size_proxy entropy bound `cheap_desc_size_proxy` claims to be a conservative entropy upper bound, but used truncating integer division for `total / c` before `ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the proxy under-shot the real entropy ≥ 2 bits per symbol. Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the `ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863). small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster: the tighter estimate cuts off more candidates earlier in the `min_table_log..=11` loop's monotone-increase break). Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok, raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp branch — the prior test only hit a handful of input shapes, leaving the `(true, false)` / `(false, true)` / `(false, false)` arms and the early-out without coverage. * fix(huff0): cheap_desc_size_proxy off-by-one + drop per-candidate Vec alloc Three threads from PR #168 review. - **fse_ok off-by-one.** `encode_weight_description` rejects only `encoded.len() >= 128`, so `encoded.len() == 127` is the largest accepted FSE payload and the total serialized description (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the boundary. The proxy's `fse_size` includes the length byte — accept `<= 128`, not `<= 127`. As written the proxy would skip a valid candidate at the exact boundary and force a worse fallback. - **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called `table.weights()` per `table_log` candidate just to score `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace with a stack-allocated `[u8; 256]` buffer reused across iterations (counts.len() max is 256). The Vec from `build_donor_limited_weights` already carries the same weight values; just copy them into the buffer. - **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact` compared `proxy(weights)` (N items) against `table.try_table_description_size()` (which trims `[..N-1]` internally). Fix: trim `weights` before calling the proxy so both paths score the same serialized slice. All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep (small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3): unchanged — no new R>C cells, no regression. * test(huff0): rebuild conservative-vs-exact fixtures from build_from_counts; align proxy docs `cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped every hand-curated fixture because their weight vectors ([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all failed `huffman_weight_sum_is_power_of_two` — Kraft equality must hold for a valid Huffman weight set, and the synthetic arrays never did. The loop body was unreachable; the test passed trivially. - Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` → `table.weights()`. The encoder's own output is Kraft-valid by construction. Counts inputs cover skewed, uniform, geometric, wide alphabets, and near-raw-limit cases. - Add an `exercised > 0` assertion to fail loud if a future refactor silently skips all fixtures again. - Fix `raw_floor`: the writer's raw representation for the trimmed slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte), not `weights.len().div_ceil(2)`. Doc fixes: - `cheap_desc_size_proxy` docstring: "representable when result is < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte prefix tops out at exactly 128. The code at the `fse_ok` site already uses `<= 128`; the doc was out of sync. - `cheap_desc_size_proxy` docstring: explicitly document that `n == 0` returns `None` (raw could in principle encode an empty slice as just the length byte, but production callers never hand `n == 0` here — `build_from_counts` short-circuits on `symbol_cardinality <= 1`).
Summary
Two changes in one PR (per scope expansion request).
1. FSE encode hot path: replace linear search with donor-parity flat tables — perf
FSETable::next_statepreviously linear-scanned aVec<State>per symbol to find the entry whose[baseline, last_index]range contained the current state index. For each emitted sequence: 3 such scans (LL, ML, OF). The donor (fse_compress.c::FSE_encodeSymbol) does the same lookup in O(1) via flatnextStateTable+ per-symbolFSE_symbolCompressionTransform.The flat tables were already computed in our
build_table_from_probabilitiesto seed the per-symbolVec<State>, then discarded. This PR stores them onFSETable(state_table_flat: Box<[u16]>+symbol_tt: [SymbolTT; 256]) and switchesnext_stateto donor arithmetic:StatebecomesCopy;next_state/start_statenow returnStateby value;SymbolStates::getretired. DeadState.last_indexkept (used by the builder's BTreeSet dedup key) with#[allow(dead_code)].Standalone A/B (200 iters, level 3, same session):
Random / repeating-pattern inputs barely touch
encode_sequences(raw fast-path / single short block) so the change is correctly neutral there.2. CI bench tuning — wall-time
criterion 0.8 hard-asserts
sample_size >= 10, sosample_size(3)is off the table without forking criterion. Two complementary tweaks bring the worst-case shard (lazy) under the 120-min cap:configure_groupbudgets:Small/Corpus/Entropy were budget-bound — the wall budget exceeded what samples actually needed. Large/Silesia was the opposite — 10 s was too tight on i686 / level_22_btultra2 / 100 MiB (~2 s × 10 samples ≈ 20 s wall), producing persistent "increase target time" warnings. The 20 s bump matches the realistic floor without affecting fast combos (criterion exits the budget early once samples complete).
Shard split (
.github/workflows/ci.yml):fast(8 levels) →fast-neg(-7..=-3) +fast-pos(-2..=-1, 1)lazy(11 levels) →lazy-lower(5..=9) +lazy-upper(10..=15)Other shards unchanged. Total strategy groups go from 7 to 9 → main-push matrix expands from 21 to 27 jobs, each smaller.
3. Bonus: add
CLAUDE.mdto.gitignoreProject-local Claude rules file, mirrors the already-ignored
.claude/andAGENTS.md. Stops untracked file from showing up ingit status.Test plan
cargo nextest run -p structured-zstd --features "dict_builder bench_internals"526/526 ✅cargo test --doc --features "dict_builder bench_internals"12/12 ✅cargo clippy --features "dict_builder bench_internals" --all-targets -- -D warnings✅cargo fmt --check✅level22_sequences_match_donor_on_corpus_proxy(ratio gate) PASS ✅python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))"✅Closes #164
Summary by CodeRabbit
Refactor
Chores