perf(fse): replace next_state linear search with donor-parity flat tables + tune CI bench budgets by polaz · Pull Request #165 · structured-world/structured-zstd

polaz · 2026-05-18T00:57:20Z

Summary

Two changes in one PR (per scope expansion request).

1. FSE encode hot path: replace linear search with donor-parity flat tables — perf

FSETable::next_state previously linear-scanned a Vec<State> per symbol to find the entry whose [baseline, last_index] range contained the current state index. For each emitted sequence: 3 such scans (LL, ML, OF). The donor (fse_compress.c::FSE_encodeSymbol) does the same lookup in O(1) via flat nextStateTable + per-symbol FSE_symbolCompressionTransform.

The flat tables were already computed in our build_table_from_probabilities to seed the per-symbol Vec<State>, then discarded. This PR stores them on FSETable (state_table_flat: Box<[u16]> + symbol_tt: [SymbolTT; 256]) and switches next_state to donor arithmetic:

let value = (1 << acc_log) + idx;
let nb_bits = (value + tt.delta_nb_bits) >> 16;
let baseline = idx & !((1 << nb_bits) - 1);
let next_index = state_table_flat[(value >> nb_bits) + tt.delta_find_state];

State becomes Copy; next_state / start_state now return State by value; SymbolStates::get retired. Dead State.last_index kept (used by the builder's BTreeSet dedup key) with #[allow(dead_code)].

Standalone A/B (200 iters, level 3, same session):

Fixture	Baseline	After	Δ
z000033 (target, compressible)	20543 µs	18286 µs	−11.0%
1 MiB pseudo-random	699 µs	712 µs	+2% (noise)
1 MiB repeating pattern	1179 µs	1183 µs	neutral

Random / repeating-pattern inputs barely touch encode_sequences (raw fast-path / single short block) so the change is correctly neutral there.

2. CI bench tuning — wall-time

criterion 0.8 hard-asserts sample_size >= 10, so sample_size(3) is off the table without forking criterion. Two complementary tweaks bring the worst-case shard (lazy) under the 120-min cap:

configure_group budgets:

Class	Old	New
Small (1–10 KiB)	3 s + ~3 s warmup	1 s + 0.2 s
Corpus/Entropy (1 MiB)	8 s + ~3 s warmup	3 s + 0.5 s
Large/Silesia (16–100 MiB)	10 s + 0.5 s	20 s + 0.5 s

Small/Corpus/Entropy were budget-bound — the wall budget exceeded what samples actually needed. Large/Silesia was the opposite — 10 s was too tight on i686 / level_22_btultra2 / 100 MiB (~2 s × 10 samples ≈ 20 s wall), producing persistent "increase target time" warnings. The 20 s bump matches the realistic floor without affecting fast combos (criterion exits the budget early once samples complete).

Shard split (.github/workflows/ci.yml):

fast (8 levels) → fast-neg (-7..=-3) + fast-pos (-2..=-1, 1)
lazy (11 levels) → lazy-lower (5..=9) + lazy-upper (10..=15)

Other shards unchanged. Total strategy groups go from 7 to 9 → main-push matrix expands from 21 to 27 jobs, each smaller.

3. Bonus: add `CLAUDE.md` to `.gitignore`

Project-local Claude rules file, mirrors the already-ignored .claude/ and AGENTS.md. Stops untracked file from showing up in git status.

Test plan

cargo nextest run -p structured-zstd --features "dict_builder bench_internals" 526/526 ✅
cargo test --doc --features "dict_builder bench_internals" 12/12 ✅
cargo clippy --features "dict_builder bench_internals" --all-targets -- -D warnings ✅
cargo fmt --check ✅
level22_sequences_match_donor_on_corpus_proxy (ratio gate) PASS ✅
python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))" ✅
Standalone A/B numbers above

Closes #164

Summary by CodeRabbit

Refactor
- Improved compression encoder internals to reduce overhead and speed up encoding hot paths.
Chores
- Increased CI benchmark sharding on main to broaden coverage across workload types.
- Adjusted benchmark measurement budgets and warm-up timing to better fit CI runtime constraints.
- Added a development file pattern to .gitignore.

…bles `FSETable::next_state(symbol, idx)` previously called `SymbolStates::get` which scanned `Vec<State>` per symbol via `.iter().find(state.contains(idx))`. On a typical level_3_dfast encode of decodecorpus-z000033 this fired ~3 × N_sequences times (LL + ML + OF per sequence), with each call walking through 5–30 states. `encode_sequences` showed up at 13.7% Rust self-time vs `ZSTD_encodeSequences` 6.4% on the C donor. Root cause: the donor-parity precomputed tables (`deltaNbBits`, `deltaFindState`, flat `nextStateTable`) were already built in `build_table_from_probabilities` for the donor `FSE_encodeSymbol` arithmetic — used to populate `Vec<State>` and then discarded. Change: - Add `state_table_flat: Box<[u16]>` and `symbol_tt: [SymbolTT; 256]` to `FSETable`. Populated in `build_table_from_probabilities` from the same intermediates that fed the legacy `Vec<State>` push loop. Donor parity: `state_table_flat` mirrors `nextStateTable` byte for byte (u16, table_size entries); `SymbolTT` mirrors `FSE_symbolCompressionTransform`. - `FSETable::next_state` now returns `State` by value, computed via donor arithmetic: ```text value = (1 << acc_log) + idx nb_bits = (value + delta_nb_bits) >> 16 baseline = idx & !((1 << nb_bits) - 1) next_index = state_table_flat[(value >> nb_bits) + delta_find_state] ``` O(1) lookup, no Vec deref, no Option wrap, no linear scan. - `FSETable::start_state` returns `State` by value (was `&State`) to match the new shape so callers don't juggle lifetimes; still backed by the existing `Vec<State>` storage (called once per stream, not hot). - `State` gains `Copy` (4 fields, all Copy). - Retired methods: `SymbolStates::get` and `State::contains` (callers removed). `State.last_index` kept (used by the BTreeSet dedup in the builder) with `#[allow(dead_code)]` since it is no longer read on the encode hot path. - Caller-side: `encode_sequences` (`blocks/compressed.rs`) and the internal `FSEEncoder` glue (`fse/fse_encoder.rs`) now store `Option<State>` instead of `Option<&State>` — natural fit for the new by-value return shape. Measurement (standalone, 200 iters, level_3_dfast / z000033, same session A/B): | fixture | baseline | after | delta | |---------|---------:|------:|------:| | z000033 (target, compressible) | 20543 µs | 18286 µs | **−11.0%** | | 1 MiB pseudo-random | 699 µs | 712 µs | +2% noise | | 1 MiB repeating pattern | 1179 µs | 1183 µs | neutral | z000033 is the canonical Phase-7 ratio-gap fixture — the encode_sequences path is hot exactly there. Random / RLE-shape inputs barely touch encode_sequences (raw fast-path / single short block) so the change is correctly neutral. Gates green: - `cargo nextest run -p structured-zstd --features dict_builder bench_internals` 526/526 - `cargo test --doc --features dict_builder bench_internals` 12/12 - `cargo clippy --features dict_builder bench_internals --all-targets -- -D warnings` clean - `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS

criterion 0.8 hard-asserts `sample_size >= 10` (`benchmark_group.rs:97`, `lib.rs:519`) so cutting the sample count is not an option without forking criterion. Two complementary changes here drop the worst-case shard wall under the 120-min CI cap while preserving (or improving) measurement quality. ## 1. `configure_group` budget tuning (`zstd/benches/compare_ffi.rs`) | Class | Old | New | Δ per side (×2) | |-------|-----|-----|----------------| | Small (1–10 KiB) | 3 s + ~3 s default warmup | 1 s + 0.2 s warmup | -8 s | | Corpus / Entropy (1 MiB) | 8 s + ~3 s default warmup | 3 s + 0.5 s warmup | -15 s | | Large / Silesia (16–100 MiB) | 10 s + 0.5 s | **20 s** + 0.5 s | +20 s | Small / Corpus / Entropy: the old budgets were wall-bound by the measurement window (criterion fit samples in less time than allotted). Shrinking the budget reclaims that headroom; sample count stays at 30 / 10 respectively so measurement quality is unchanged on fast-per-iter benches. Large / Silesia: the old 10 s was too tight. i686 / level_22_btultra2 / 100 MiB takes ~2 s per iter × 10 samples ≈ 20 s wall, so the budget was producing "increase target time" warnings + occasional flaky measurements on the slowest combos. Widening to 20 s removes the warning envelope without affecting wall on faster combos (criterion exits the budget early once samples complete). ## 2. Split `fast` and `lazy` shards (`.github/workflows/ci.yml`) `lazy` carried 11 levels (5–15) and `fast` 8 levels (-7..=-1, 1) — together ~50% of the main-push bench surface. The `lazy` shard at 120 min remained the consistent CI bottleneck. New split: - `fast-neg` (-7..=-3), `fast-pos` (-2..=-1, 1) - `lazy-lower` (5..=9), `lazy-upper` (10..=15) Other shards unchanged: dfast (2,3), greedy (4), btopt (16,17), btultra (18,19), btultra2 (20..=22). Total strategy groups now 9 (was 7) × 3 targets = **27 main-push shards (was 21)**. ## 3. .gitignore: add `CLAUDE.md` Project-local Claude rules file (created in an earlier session) is private to the maintainer's setup, mirrors `.claude/` and `AGENTS.md` which are already ignored. Stops `CLAUDE.md` from showing up in `git status` after every session. ## Expected impact - Worst-case shard wall: 120-min cap → ~30–40 min headroom (lazy now split in half + ~60% measurement savings on Small/Corpus/Entropy). - Large/Silesia measurement-quality regression: fixed. - Total CI bench parallel jobs go from 21 to 27 — same `runs-on: ubuntu-latest` matrix expands, GitHub-hosted runner quota accommodates fine.

coderabbitai · 2026-05-18T00:57:31Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 711d762f-16db-4bfe-be5b-0fb02f159dd5

📥 Commits

Reviewing files that changed from the base of the PR and between b0ddfce and ba25219.

📒 Files selected for processing (2)

.github/workflows/ci.yml
zstd/src/fse/fse_encoder.rs

📝 Walkthrough

Walkthrough

This PR replaces linear-scan FSE next-state lookups with donor-parity O(1) arithmetic and flat tables, updates FSE table construction to emit per-symbol deltas and a flat state table, tunes benchmark timing, and expands CI shard splitting for main; also adds CLAUDE.md to .gitignore.

Changes

FSE O(1) state transition optimization

Layer / File(s)	Summary
FSETable redesign with O(1) next_state arithmetic `zstd/src/fse/fse_encoder.rs`	Introduces `SymbolTT` struct; expands `FSETable` to store `state_table_flat` and `symbol_tt`; rewrites `FSETable::next_state`/`start_state` to return `State` by value and compute next state via per-symbol deltas + flat table reads; marks `State` as `Copy` and annotates `last_index` as unused.
FSETable construction with donor-parity tables `zstd/src/fse/fse_encoder.rs`	Builds flat donor `state_table_flat` from intermediate state structures; computes and stores per-symbol `(delta_nb_bits, delta_find_state)` using 16.16 fixed-point arithmetic; replaces O(N²) deduplication with a `BTreeSet`; assembles final `FSETable` with the new fields.
Benchmark timing configuration by scenario class `zstd/benches/compare_ffi.rs`	Tunes Criterion `sample_size`, `measurement_time`, and `warm_up_time` by `ScenarioClass`: reduces measurement budget and adds warm-up for `Small`, `Corpus`, `Entropy`; increases measurement_time for `Large`/`Silesia`; expands comments linking budgets to CI wall-time.
CI benchmark sharding expansion for main-branch coverage `.github/workflows/ci.yml`	Adds `if` guard to skip bench pipeline for `release-plz[bot]` version-bump PRs; updates comments and non-PR `shards.json` to split `fast`→`fast-neg`/`fast-pos` and `lazy`→`lazy-lower`/`lazy-upper`, yielding nine strategy groups per target on `push: main`.
Configuration: ignore file `.gitignore`	Adds `CLAUDE.md` to ignored paths.

Sequence Diagram(s)

N/A

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

structured-world/structured-zstd#143: Related benchmark timing and CI sharding changes.
structured-world/structured-zstd#144: Related Criterion configure_group timing adjustments.
structured-world/structured-zstd#153: Related CI benchmark matrix changes and timeout adjustments.

Poem

🐰 Flat tables hum, no longer slow,
arithmetic hops where scans would go,
benchmarks tuned, CI splits the load,
a rabbit cheers the faster road—
tiny hops, big wins in tow.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the primary changes: FSE encoder optimization (donor-parity flat tables) and CI benchmark tuning.
Linked Issues check	✅ Passed	All linked issue `#164` requirements are met: FSE next_state linear scan replaced with O(1) donor-parity arithmetic, flat tables persisted, roundtrip verified, and benchmark improvements measured.
Out of Scope Changes check	✅ Passed	Changes are within scope: FSE encoder optimization, CI benchmark tuning, and .gitignore update are all justified by the linked issue or ancillary improvements.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/#164-fse-flat-tables

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-18T01:01:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

Replaces FSE encoder's per-sequence linear next_state scan with O(1) donor-parity arithmetic by retaining the flat next-state table and per-symbol coding transform already computed at table build time, and tunes CI bench wall-clock budgets / shard sharding to fit the 120-min cap.

Changes:

Adds state_table_flat: Box<[u16]> + symbol_tt: [SymbolTT; 256] to FSETable; rewrites next_state (and adjusts start_state) to return State by value via donor arithmetic; removes SymbolStates::get and State::contains; makes State: Copy.
Re-tunes configure_group budgets (Small 3→1s, Corpus/Entropy 8→3s, Large/Silesia 10→20s + warmups), and splits CI bench shards fast and lazy into smaller sub-shards.
Adds CLAUDE.md to .gitignore.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

File	Description
zstd/src/fse/fse_encoder.rs	Stores donor flat next-state table and per-symbol TT on FSETable; converts hot-path `next_state` to O(1) arithmetic; retires linear scan.
zstd/benches/compare_ffi.rs	Adjusts criterion measurement/warmup budgets per scenario class to fit CI wall-clock cap.
.github/workflows/ci.yml	Splits `fast` and `lazy` bench shards; updates accompanying comments to reflect 9 strategy groups / 27 shards.
.gitignore	Ignores project-local `CLAUDE.md`.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 147-150: The SymbolTT field delta_nb_bits is used as a 16.16
fixed-point value but currently stored as usize causing 16-bit-target shift
overflow; change delta_nb_bits to a u32 (keep delta_find_state as-is) and
perform all fixed-point arithmetic and <<16 / >>16 shifts in u32, only
converting to usize at the final index or table lookup sites (where
delta_nb_bits is used to compute an index/offset). Update all usages in
fse_encoder.rs that do the 16.16 shifts (the same spots mentioned in the review)
to cast operands to u32 before shifting and cast results to usize only for array
indexing or API calls.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 10256d8d-7b87-4c80-84a3-8bd0c2338cc4

📥 Commits

Reviewing files that changed from the base of the PR and between f5a56dc and b0ddfce.

📒 Files selected for processing (4)

.github/workflows/ci.yml
.gitignore
zstd/benches/compare_ffi.rs
zstd/src/fse/fse_encoder.rs

Release-plz PRs are version-bump only (no source-code changes), so running the full bench matrix (build × 3 targets + 27 strategy shards on main / 3 on PR + aggregate + pages + regression) is pure CI waste — ~30 min wall + GitHub-hosted runner minutes consumed for zero usable output. Add an `if:` filter on `bench-matrix` that excludes: - `release-plz[bot]` as PR author - branch names starting with `release-plz-` The skip cascades through `needs:` so every downstream bench job (`bench-build`, `benchmark`, `benchmark-aggregate`, `benchmark-pages`, `benchmark-regression-check`) is automatically gated out by GH Actions' default skip-when-needs-skipped semantics. `benchmark-regression-check` already had its own equivalent filter from PR #159; this commit moves the gate one job upstream so the whole pipeline noops cleanly instead of running shards and then discarding the merged report at the gh-pages step (which is gated on `push && main` only).

CodeRabbit caught a 16-bit-target overflow in the new `FSETable::next_state` flat-table arithmetic. The donor 16.16 fixed-point value `delta_nb_bits` was stored as `usize`, but the `<<16` / `>>16` shifts assume at least 32-bit width. On 16-bit targets (AVR, MSP430, no-atomic Cortex-M0 — explicitly supported profiles for this crate per the `critical-section` feature docs) `usize` is 16 bits and those shifts silently overflow to zero, breaking the encode arithmetic. Donor `fse_compress.c` uses `U32` throughout the same fixed-point math for the same reason — this aligns us with that invariant. Change: - `SymbolTT.delta_nb_bits: usize` → `u32` - Build-time arithmetic in `build_table_from_probabilities` now computes `delta_nb_bits` in `u32` (`-1 | 1` and `probability > 1` arms both updated to `<< 16` on u32 operands). - The dedup-loop fixed-point math (`current_value + delta_nb_bits`, `current_value >> num_bits`) switched to u32 to keep the same invariant. - `FSETable::next_state` casts `value` to u32 at the start of the arithmetic, then back to `usize` only at the final indexing sites (`1usize << nb_bits`, `value >> nb_bits` slot lookup). Behavior unchanged on 32-bit / 64-bit targets; only the silent-wrap bug on 16-bit `usize` targets is fixed.

Copilot

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated no new comments.

…E-encode per candidate (#168) * perf(huff0): replace FSE-encode-per-candidate with cheap entropy proxy #167. `HuffmanTable::build_from_counts` previously called `try_table_description_size` per `min_table_log..=11` candidate, which ran a full FSE-encode of the weight stream against a freshly-built FSE table just to count bytes (~31 % inclusive on the 4 KiB compress profile after #166). Replace with `cheap_desc_size_proxy(weights)`: an integer entropy estimate that reproduces the donor `HUF_writeCTable_wksp` decision (FSE vs raw nibble) without touching the FSE encoder. - FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin weight histogram + 8 B header overhead (empirical upper bound for the `acc_log = 6` weight FSE table seen in `encode_weight_description`). - Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable when `weights.len() <= 128`. - Return min of the two when both are representable. Validated: - 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`) - `compare_ffi` ratio REPORT sweep across all scenarios × all levels: no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19 *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers selection to a marginally tighter `table_log` on those cells. Every small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast R=150 ≤ C=157, …). Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`): 47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the pre-#165 baseline). Gap vs donor 6.7× → 4.8×. * fix(huff0): use ceiling division in cheap_desc_size_proxy entropy bound `cheap_desc_size_proxy` claims to be a conservative entropy upper bound, but used truncating integer division for `total / c` before `ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the proxy under-shot the real entropy ≥ 2 bits per symbol. Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the `ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863). small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster: the tighter estimate cuts off more candidates earlier in the `min_table_log..=11` loop's monotone-increase break). Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok, raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp branch — the prior test only hit a handful of input shapes, leaving the `(true, false)` / `(false, true)` / `(false, false)` arms and the early-out without coverage. * fix(huff0): cheap_desc_size_proxy off-by-one + drop per-candidate Vec alloc Three threads from PR #168 review. - **fse_ok off-by-one.** `encode_weight_description` rejects only `encoded.len() >= 128`, so `encoded.len() == 127` is the largest accepted FSE payload and the total serialized description (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the boundary. The proxy's `fse_size` includes the length byte — accept `<= 128`, not `<= 127`. As written the proxy would skip a valid candidate at the exact boundary and force a worse fallback. - **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called `table.weights()` per `table_log` candidate just to score `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace with a stack-allocated `[u8; 256]` buffer reused across iterations (counts.len() max is 256). The Vec from `build_donor_limited_weights` already carries the same weight values; just copy them into the buffer. - **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact` compared `proxy(weights)` (N items) against `table.try_table_description_size()` (which trims `[..N-1]` internally). Fix: trim `weights` before calling the proxy so both paths score the same serialized slice. All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep (small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3): unchanged — no new R>C cells, no regression. * test(huff0): rebuild conservative-vs-exact fixtures from build_from_counts; align proxy docs `cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped every hand-curated fixture because their weight vectors ([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all failed `huffman_weight_sum_is_power_of_two` — Kraft equality must hold for a valid Huffman weight set, and the synthetic arrays never did. The loop body was unreachable; the test passed trivially. - Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` → `table.weights()`. The encoder's own output is Kraft-valid by construction. Counts inputs cover skewed, uniform, geometric, wide alphabets, and near-raw-limit cases. - Add an `exercised > 0` assertion to fail loud if a future refactor silently skips all fixtures again. - Fix `raw_floor`: the writer's raw representation for the trimmed slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte), not `weights.len().div_ceil(2)`. Doc fixes: - `cheap_desc_size_proxy` docstring: "representable when result is < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte prefix tops out at exactly 128. The code at the `fse_ok` site already uses `<= 128`; the doc was out of sync. - `cheap_desc_size_proxy` docstring: explicitly document that `n == 0` returns `None` (raw could in principle encode an empty slice as just the length byte, but production callers never hand `n == 0` here — `build_from_counts` short-circuits on `symbol_cardinality <= 1`).

polaz added 2 commits May 18, 2026 03:47

Copilot AI review requested due to automatic review settings May 18, 2026 00:57

Copilot started reviewing on behalf of polaz May 18, 2026 01:01 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread zstd/src/fse/fse_encoder.rs

polaz added 2 commits May 18, 2026 04:14

polaz requested a review from Copilot May 18, 2026 01:18

Copilot started reviewing on behalf of polaz May 18, 2026 01:18 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

polaz merged commit 70c0830 into main May 18, 2026
25 checks passed

polaz deleted the perf/#164-fse-flat-tables branch May 18, 2026 01:20

sw-release-bot Bot mentioned this pull request May 17, 2026

chore: release v0.0.22 #156

Open

coderabbitai Bot mentioned this pull request May 18, 2026

perf(fse): donor FSE_buildCTable_wksp parity — drop per-symbol Vec<State> #166

Merged

coderabbitai Bot mentioned this pull request May 18, 2026

perf(huff0): investigate shared Huffman weight-description cache between planner and emitter [SPIKE] #169

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(fse): replace next_state linear search with donor-parity flat tables + tune CI bench budgets#165

perf(fse): replace next_state linear search with donor-parity flat tables + tune CI bench budgets#165
polaz merged 4 commits into
mainfrom
perf/#164-fse-flat-tables

polaz commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

codecov Bot commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

polaz commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. FSE encode hot path: replace linear search with donor-parity flat tables — perf

2. CI bench tuning — wall-time

3. Bonus: add CLAUDE.md to .gitignore

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

codecov Bot commented May 18, 2026

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

polaz commented May 18, 2026 •

edited by coderabbitai Bot

Loading

3. Bonus: add `CLAUDE.md` to `.gitignore`

coderabbitai Bot commented May 18, 2026 •

edited

Loading