Skip to content

perf(fse): replace next_state linear search with donor-parity flat tables + tune CI bench budgets#165

Merged
polaz merged 4 commits into
mainfrom
perf/#164-fse-flat-tables
May 18, 2026
Merged

perf(fse): replace next_state linear search with donor-parity flat tables + tune CI bench budgets#165
polaz merged 4 commits into
mainfrom
perf/#164-fse-flat-tables

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 18, 2026

Summary

Two changes in one PR (per scope expansion request).

1. FSE encode hot path: replace linear search with donor-parity flat tables — perf

FSETable::next_state previously linear-scanned a Vec<State> per symbol to find the entry whose [baseline, last_index] range contained the current state index. For each emitted sequence: 3 such scans (LL, ML, OF). The donor (fse_compress.c::FSE_encodeSymbol) does the same lookup in O(1) via flat nextStateTable + per-symbol FSE_symbolCompressionTransform.

The flat tables were already computed in our build_table_from_probabilities to seed the per-symbol Vec<State>, then discarded. This PR stores them on FSETable (state_table_flat: Box<[u16]> + symbol_tt: [SymbolTT; 256]) and switches next_state to donor arithmetic:

let value = (1 << acc_log) + idx;
let nb_bits = (value + tt.delta_nb_bits) >> 16;
let baseline = idx & !((1 << nb_bits) - 1);
let next_index = state_table_flat[(value >> nb_bits) + tt.delta_find_state];

State becomes Copy; next_state / start_state now return State by value; SymbolStates::get retired. Dead State.last_index kept (used by the builder's BTreeSet dedup key) with #[allow(dead_code)].

Standalone A/B (200 iters, level 3, same session):

Fixture Baseline After Δ
z000033 (target, compressible) 20543 µs 18286 µs −11.0%
1 MiB pseudo-random 699 µs 712 µs +2% (noise)
1 MiB repeating pattern 1179 µs 1183 µs neutral

Random / repeating-pattern inputs barely touch encode_sequences (raw fast-path / single short block) so the change is correctly neutral there.

2. CI bench tuning — wall-time

criterion 0.8 hard-asserts sample_size >= 10, so sample_size(3) is off the table without forking criterion. Two complementary tweaks bring the worst-case shard (lazy) under the 120-min cap:

configure_group budgets:

Class Old New
Small (1–10 KiB) 3 s + ~3 s warmup 1 s + 0.2 s
Corpus/Entropy (1 MiB) 8 s + ~3 s warmup 3 s + 0.5 s
Large/Silesia (16–100 MiB) 10 s + 0.5 s 20 s + 0.5 s

Small/Corpus/Entropy were budget-bound — the wall budget exceeded what samples actually needed. Large/Silesia was the opposite — 10 s was too tight on i686 / level_22_btultra2 / 100 MiB (~2 s × 10 samples ≈ 20 s wall), producing persistent "increase target time" warnings. The 20 s bump matches the realistic floor without affecting fast combos (criterion exits the budget early once samples complete).

Shard split (.github/workflows/ci.yml):

  • fast (8 levels) → fast-neg (-7..=-3) + fast-pos (-2..=-1, 1)
  • lazy (11 levels) → lazy-lower (5..=9) + lazy-upper (10..=15)

Other shards unchanged. Total strategy groups go from 7 to 9 → main-push matrix expands from 21 to 27 jobs, each smaller.

3. Bonus: add CLAUDE.md to .gitignore

Project-local Claude rules file, mirrors the already-ignored .claude/ and AGENTS.md. Stops untracked file from showing up in git status.

Test plan

  • cargo nextest run -p structured-zstd --features "dict_builder bench_internals" 526/526 ✅
  • cargo test --doc --features "dict_builder bench_internals" 12/12 ✅
  • cargo clippy --features "dict_builder bench_internals" --all-targets -- -D warnings
  • cargo fmt --check
  • level22_sequences_match_donor_on_corpus_proxy (ratio gate) PASS ✅
  • python3 -c "yaml.safe_load(open('.github/workflows/ci.yml'))"
  • Standalone A/B numbers above

Closes #164

Summary by CodeRabbit

  • Refactor

    • Improved compression encoder internals to reduce overhead and speed up encoding hot paths.
  • Chores

    • Increased CI benchmark sharding on main to broaden coverage across workload types.
    • Adjusted benchmark measurement budgets and warm-up timing to better fit CI runtime constraints.
    • Added a development file pattern to .gitignore.

Review Change Stack

polaz added 2 commits May 18, 2026 03:47
…bles

`FSETable::next_state(symbol, idx)` previously called
`SymbolStates::get` which scanned `Vec<State>` per symbol via
`.iter().find(state.contains(idx))`. On a typical level_3_dfast
encode of decodecorpus-z000033 this fired ~3 × N_sequences times
(LL + ML + OF per sequence), with each call walking through
5–30 states. `encode_sequences` showed up at 13.7% Rust self-time
vs `ZSTD_encodeSequences` 6.4% on the C donor.

Root cause: the donor-parity precomputed tables (`deltaNbBits`,
`deltaFindState`, flat `nextStateTable`) were already built in
`build_table_from_probabilities` for the donor `FSE_encodeSymbol`
arithmetic — used to populate `Vec<State>` and then discarded.

Change:

- Add `state_table_flat: Box<[u16]>` and `symbol_tt: [SymbolTT; 256]`
  to `FSETable`. Populated in `build_table_from_probabilities` from
  the same intermediates that fed the legacy `Vec<State>` push loop.
  Donor parity: `state_table_flat` mirrors `nextStateTable` byte for
  byte (u16, table_size entries); `SymbolTT` mirrors
  `FSE_symbolCompressionTransform`.

- `FSETable::next_state` now returns `State` by value, computed via
  donor arithmetic:
  ```text
  value      = (1 << acc_log) + idx
  nb_bits    = (value + delta_nb_bits) >> 16
  baseline   = idx & !((1 << nb_bits) - 1)
  next_index = state_table_flat[(value >> nb_bits) + delta_find_state]
  ```
  O(1) lookup, no Vec deref, no Option wrap, no linear scan.

- `FSETable::start_state` returns `State` by value (was `&State`) to
  match the new shape so callers don't juggle lifetimes; still backed
  by the existing `Vec<State>` storage (called once per stream, not
  hot).

- `State` gains `Copy` (4 fields, all Copy).

- Retired methods: `SymbolStates::get` and `State::contains` (callers
  removed). `State.last_index` kept (used by the BTreeSet dedup in
  the builder) with `#[allow(dead_code)]` since it is no longer read
  on the encode hot path.

- Caller-side: `encode_sequences` (`blocks/compressed.rs`) and the
  internal `FSEEncoder` glue (`fse/fse_encoder.rs`) now store
  `Option<State>` instead of `Option<&State>` — natural fit for the
  new by-value return shape.

Measurement (standalone, 200 iters, level_3_dfast / z000033, same
session A/B):

| fixture | baseline | after | delta |
|---------|---------:|------:|------:|
| z000033 (target, compressible) | 20543 µs | 18286 µs | **−11.0%** |
| 1 MiB pseudo-random | 699 µs | 712 µs | +2% noise |
| 1 MiB repeating pattern | 1179 µs | 1183 µs | neutral |

z000033 is the canonical Phase-7 ratio-gap fixture — the encode_sequences
path is hot exactly there. Random / RLE-shape inputs barely touch
encode_sequences (raw fast-path / single short block) so the change is
correctly neutral.

Gates green:
- `cargo nextest run -p structured-zstd --features dict_builder bench_internals` 526/526
- `cargo test --doc --features dict_builder bench_internals` 12/12
- `cargo clippy --features dict_builder bench_internals --all-targets -- -D warnings` clean
- `level22_sequences_match_donor_on_corpus_proxy` ratio gate PASS
criterion 0.8 hard-asserts `sample_size >= 10` (`benchmark_group.rs:97`,
`lib.rs:519`) so cutting the sample count is not an option without
forking criterion. Two complementary changes here drop the worst-case
shard wall under the 120-min CI cap while preserving (or improving)
measurement quality.

## 1. `configure_group` budget tuning (`zstd/benches/compare_ffi.rs`)

| Class | Old | New | Δ per side (×2) |
|-------|-----|-----|----------------|
| Small (1–10 KiB) | 3 s + ~3 s default warmup | 1 s + 0.2 s warmup | -8 s |
| Corpus / Entropy (1 MiB) | 8 s + ~3 s default warmup | 3 s + 0.5 s warmup | -15 s |
| Large / Silesia (16–100 MiB) | 10 s + 0.5 s | **20 s** + 0.5 s | +20 s |

Small / Corpus / Entropy: the old budgets were wall-bound by the
measurement window (criterion fit samples in less time than allotted).
Shrinking the budget reclaims that headroom; sample count stays at
30 / 10 respectively so measurement quality is unchanged on
fast-per-iter benches.

Large / Silesia: the old 10 s was too tight. i686 / level_22_btultra2 /
100 MiB takes ~2 s per iter × 10 samples ≈ 20 s wall, so the budget
was producing "increase target time" warnings + occasional flaky
measurements on the slowest combos. Widening to 20 s removes the
warning envelope without affecting wall on faster combos (criterion
exits the budget early once samples complete).

## 2. Split `fast` and `lazy` shards (`.github/workflows/ci.yml`)

`lazy` carried 11 levels (5–15) and `fast` 8 levels (-7..=-1, 1) —
together ~50% of the main-push bench surface. The `lazy` shard at
120 min remained the consistent CI bottleneck.

New split:
- `fast-neg` (-7..=-3), `fast-pos` (-2..=-1, 1)
- `lazy-lower` (5..=9), `lazy-upper` (10..=15)

Other shards unchanged: dfast (2,3), greedy (4), btopt (16,17),
btultra (18,19), btultra2 (20..=22). Total strategy groups now 9
(was 7) × 3 targets = **27 main-push shards (was 21)**.

## 3. .gitignore: add `CLAUDE.md`

Project-local Claude rules file (created in an earlier session) is
private to the maintainer's setup, mirrors `.claude/` and `AGENTS.md`
which are already ignored. Stops `CLAUDE.md` from showing up in
`git status` after every session.

## Expected impact

- Worst-case shard wall: 120-min cap → ~30–40 min headroom (lazy now
  split in half + ~60% measurement savings on Small/Corpus/Entropy).
- Large/Silesia measurement-quality regression: fixed.
- Total CI bench parallel jobs go from 21 to 27 — same `runs-on:
  ubuntu-latest` matrix expands, GitHub-hosted runner quota
  accommodates fine.
Copilot AI review requested due to automatic review settings May 18, 2026 00:57
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 711d762f-16db-4bfe-be5b-0fb02f159dd5

📥 Commits

Reviewing files that changed from the base of the PR and between b0ddfce and ba25219.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml
  • zstd/src/fse/fse_encoder.rs

📝 Walkthrough

Walkthrough

This PR replaces linear-scan FSE next-state lookups with donor-parity O(1) arithmetic and flat tables, updates FSE table construction to emit per-symbol deltas and a flat state table, tunes benchmark timing, and expands CI shard splitting for main; also adds CLAUDE.md to .gitignore.

Changes

FSE O(1) state transition optimization

Layer / File(s) Summary
FSETable redesign with O(1) next_state arithmetic
zstd/src/fse/fse_encoder.rs
Introduces SymbolTT struct; expands FSETable to store state_table_flat and symbol_tt; rewrites FSETable::next_state/start_state to return State by value and compute next state via per-symbol deltas + flat table reads; marks State as Copy and annotates last_index as unused.
FSETable construction with donor-parity tables
zstd/src/fse/fse_encoder.rs
Builds flat donor state_table_flat from intermediate state structures; computes and stores per-symbol (delta_nb_bits, delta_find_state) using 16.16 fixed-point arithmetic; replaces O(N²) deduplication with a BTreeSet; assembles final FSETable with the new fields.
Benchmark timing configuration by scenario class
zstd/benches/compare_ffi.rs
Tunes Criterion sample_size, measurement_time, and warm_up_time by ScenarioClass: reduces measurement budget and adds warm-up for Small, Corpus, Entropy; increases measurement_time for Large/Silesia; expands comments linking budgets to CI wall-time.
CI benchmark sharding expansion for main-branch coverage
.github/workflows/ci.yml
Adds if guard to skip bench pipeline for release-plz[bot] version-bump PRs; updates comments and non-PR shards.json to split fastfast-neg/fast-pos and lazylazy-lower/lazy-upper, yielding nine strategy groups per target on push: main.
Configuration: ignore file
.gitignore
Adds CLAUDE.md to ignored paths.

Sequence Diagram(s)

N/A

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Poem

🐰 Flat tables hum, no longer slow,
arithmetic hops where scans would go,
benchmarks tuned, CI splits the load,
a rabbit cheers the faster road—
tiny hops, big wins in tow.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary changes: FSE encoder optimization (donor-parity flat tables) and CI benchmark tuning.
Linked Issues check ✅ Passed All linked issue #164 requirements are met: FSE next_state linear scan replaced with O(1) donor-parity arithmetic, flat tables persisted, roundtrip verified, and benchmark improvements measured.
Out of Scope Changes check ✅ Passed Changes are within scope: FSE encoder optimization, CI benchmark tuning, and .gitignore update are all justified by the linked issue or ancillary improvements.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#164-fse-flat-tables

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Replaces FSE encoder's per-sequence linear next_state scan with O(1) donor-parity arithmetic by retaining the flat next-state table and per-symbol coding transform already computed at table build time, and tunes CI bench wall-clock budgets / shard sharding to fit the 120-min cap.

Changes:

  • Adds state_table_flat: Box<[u16]> + symbol_tt: [SymbolTT; 256] to FSETable; rewrites next_state (and adjusts start_state) to return State by value via donor arithmetic; removes SymbolStates::get and State::contains; makes State: Copy.
  • Re-tunes configure_group budgets (Small 3→1s, Corpus/Entropy 8→3s, Large/Silesia 10→20s + warmups), and splits CI bench shards fast and lazy into smaller sub-shards.
  • Adds CLAUDE.md to .gitignore.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

File Description
zstd/src/fse/fse_encoder.rs Stores donor flat next-state table and per-symbol TT on FSETable; converts hot-path next_state to O(1) arithmetic; retires linear scan.
zstd/benches/compare_ffi.rs Adjusts criterion measurement/warmup budgets per scenario class to fit CI wall-clock cap.
.github/workflows/ci.yml Splits fast and lazy bench shards; updates accompanying comments to reflect 9 strategy groups / 27 shards.
.gitignore Ignores project-local CLAUDE.md.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around line 147-150: The SymbolTT field delta_nb_bits is used as a 16.16
fixed-point value but currently stored as usize causing 16-bit-target shift
overflow; change delta_nb_bits to a u32 (keep delta_find_state as-is) and
perform all fixed-point arithmetic and <<16 / >>16 shifts in u32, only
converting to usize at the final index or table lookup sites (where
delta_nb_bits is used to compute an index/offset). Update all usages in
fse_encoder.rs that do the 16.16 shifts (the same spots mentioned in the review)
to cast operands to u32 before shifting and cast results to usize only for array
indexing or API calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 10256d8d-7b87-4c80-84a3-8bd0c2338cc4

📥 Commits

Reviewing files that changed from the base of the PR and between f5a56dc and b0ddfce.

📒 Files selected for processing (4)
  • .github/workflows/ci.yml
  • .gitignore
  • zstd/benches/compare_ffi.rs
  • zstd/src/fse/fse_encoder.rs

Comment thread zstd/src/fse/fse_encoder.rs
polaz added 2 commits May 18, 2026 04:14
Release-plz PRs are version-bump only (no source-code changes), so
running the full bench matrix (build × 3 targets + 27 strategy
shards on main / 3 on PR + aggregate + pages + regression) is pure
CI waste — ~30 min wall + GitHub-hosted runner minutes consumed
for zero usable output.

Add an `if:` filter on `bench-matrix` that excludes:
- `release-plz[bot]` as PR author
- branch names starting with `release-plz-`

The skip cascades through `needs:` so every downstream bench job
(`bench-build`, `benchmark`, `benchmark-aggregate`,
`benchmark-pages`, `benchmark-regression-check`) is automatically
gated out by GH Actions' default skip-when-needs-skipped semantics.

`benchmark-regression-check` already had its own equivalent filter
from PR #159; this commit moves the gate one job upstream so the
whole pipeline noops cleanly instead of running shards and then
discarding the merged report at the gh-pages step (which is gated
on `push && main` only).
CodeRabbit caught a 16-bit-target overflow in the new
`FSETable::next_state` flat-table arithmetic. The donor 16.16
fixed-point value `delta_nb_bits` was stored as `usize`, but the
`<<16` / `>>16` shifts assume at least 32-bit width. On 16-bit
targets (AVR, MSP430, no-atomic Cortex-M0 — explicitly supported
profiles for this crate per the `critical-section` feature docs)
`usize` is 16 bits and those shifts silently overflow to zero,
breaking the encode arithmetic.

Donor `fse_compress.c` uses `U32` throughout the same fixed-point
math for the same reason — this aligns us with that invariant.

Change:
- `SymbolTT.delta_nb_bits: usize` → `u32`
- Build-time arithmetic in `build_table_from_probabilities` now
  computes `delta_nb_bits` in `u32` (`-1 | 1` and `probability > 1`
  arms both updated to `<< 16` on u32 operands).
- The dedup-loop fixed-point math (`current_value + delta_nb_bits`,
  `current_value >> num_bits`) switched to u32 to keep the same
  invariant.
- `FSETable::next_state` casts `value` to u32 at the start of the
  arithmetic, then back to `usize` only at the final indexing sites
  (`1usize << nb_bits`, `value >> nb_bits` slot lookup).

Behavior unchanged on 32-bit / 64-bit targets; only the silent-wrap
bug on 16-bit `usize` targets is fixed.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 70c0830 into main May 18, 2026
25 checks passed
@polaz polaz deleted the perf/#164-fse-flat-tables branch May 18, 2026 01:20
polaz added a commit that referenced this pull request May 18, 2026
…E-encode per candidate (#168)

* perf(huff0): replace FSE-encode-per-candidate with cheap entropy proxy

#167. `HuffmanTable::build_from_counts` previously called
`try_table_description_size` per `min_table_log..=11` candidate, which
ran a full FSE-encode of the weight stream against a freshly-built FSE
table just to count bytes (~31 % inclusive on the 4 KiB compress
profile after #166).

Replace with `cheap_desc_size_proxy(weights)`: an integer entropy
estimate that reproduces the donor `HUF_writeCTable_wksp` decision
(FSE vs raw nibble) without touching the FSE encoder.

- FSE estimate: `sum(c_i * ceil_log2(total / c_i))` over the 13-bin
  weight histogram + 8 B header overhead (empirical upper bound for
  the `acc_log = 6` weight FSE table seen in
  `encode_weight_description`).
- Raw nibble: exact `weights.len().div_ceil(2) + 1`, representable
  when `weights.len() <= 128`.
- Return min of the two when both are representable.

Validated:
- 502/502 lib tests (incl. new `cheap_desc_size_proxy_is_conservative_vs_exact`)
- `compare_ffi` ratio REPORT sweep across all scenarios × all levels:
  no new `rust_bytes > ffi_bytes` cells; `decodecorpus-z000033` L18/L19
  *improved* by 442 B each (R=443 434 → 442 992) — the proxy steers
  selection to a marginally tighter `table_log` on those cells. Every
  small-4k-log-lines cell preserved (L1_fast R=154 ≤ C=157, L2_dfast
  R=150 ≤ C=157, …).

Speed (`compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`):
47.4 µs → 34.2 µs (−28 % on top of #166; cumulative −59 % vs the
pre-#165 baseline). Gap vs donor 6.7× → 4.8×.

* fix(huff0): use ceiling division in cheap_desc_size_proxy entropy bound

`cheap_desc_size_proxy` claims to be a conservative entropy upper
bound, but used truncating integer division for `total / c` before
`ceil_log2`. For non-integer ratios (e.g. `total=10, c=4 → 2.5`) this
truncated to `2`, the subsequent `ceil_log2` emitted `1` bit, and the
proxy under-shot the real entropy ≥ 2 bits per symbol.

Switch to `total.div_ceil(c)` so the ceiling is taken BEFORE the
`ceil_log2` step. Ratio sweep across `compare_ffi` corpus × every
supported level: no new `rust_bytes > ffi_bytes` cells; small-4k-log
unchanged; z000033 L18/L19 *improved* (R=442 992 → 442 863).
small-4k-log L2_dfast pure_rust 34.2 µs → 32.8 µs (slightly faster:
the tighter estimate cuts off more candidates earlier in the
`min_table_log..=11` loop's monotone-increase break).

Added `cheap_desc_size_proxy_edge_cases` covering every `(fse_ok,
raw_ok)` arm plus the `n == 0` early-out and the `ratio <= 1` clamp
branch — the prior test only hit a handful of input shapes, leaving
the `(true, false)` / `(false, true)` / `(false, false)` arms and the
early-out without coverage.

* fix(huff0): cheap_desc_size_proxy off-by-one + drop per-candidate Vec alloc

Three threads from PR #168 review.

- **fse_ok off-by-one.** `encode_weight_description` rejects only
  `encoded.len() >= 128`, so `encoded.len() == 127` is the largest
  accepted FSE payload and the total serialized description
  (`encoded.len() + 1` length-byte prefix) is exactly 128 B at the
  boundary. The proxy's `fse_size` includes the length byte — accept
  `<= 128`, not `<= 127`. As written the proxy would skip a valid
  candidate at the exact boundary and force a worse fallback.

- **Per-candidate `Vec<u8>` alloc.** `build_from_counts` called
  `table.weights()` per `table_log` candidate just to score
  `desc_size` — fresh `Vec<u8>` allocation each iteration. Replace
  with a stack-allocated `[u8; 256]` buffer reused across iterations
  (counts.len() max is 256). The Vec from
  `build_donor_limited_weights` already carries the same weight
  values; just copy them into the buffer.

- **Test slice-length mismatch.** `cheap_desc_size_proxy_is_conservative_vs_exact`
  compared `proxy(weights)` (N items) against
  `table.try_table_description_size()` (which trims `[..N-1]`
  internally). Fix: trim `weights` before calling the proxy so both
  paths score the same serialized slice.

All 503/503 lib tests pass, clippy / fmt clean. Targeted ratio sweep
(small-4k-log L1-L3, z000033 L1-L3, large-log-stream L1-L3):
unchanged — no new R>C cells, no regression.

* test(huff0): rebuild conservative-vs-exact fixtures from build_from_counts; align proxy docs

`cheap_desc_size_proxy_is_conservative_vs_exact` silently skipped
every hand-curated fixture because their weight vectors
([1,2,3,4,5], [1;13], [6;50], (0..13).cycle().take(120), ...) all
failed `huffman_weight_sum_is_power_of_two` — Kraft equality must
hold for a valid Huffman weight set, and the synthetic arrays never
did. The loop body was unreachable; the test passed trivially.

- Rebuild fixtures via `HuffmanTable::build_from_counts(counts)` →
  `table.weights()`. The encoder's own output is Kraft-valid by
  construction. Counts inputs cover skewed, uniform, geometric, wide
  alphabets, and near-raw-limit cases.
- Add an `exercised > 0` assertion to fail loud if a future refactor
  silently skips all fixtures again.
- Fix `raw_floor`: the writer's raw representation for the trimmed
  slice is `trimmed.len().div_ceil(2) + 1` (nibbles + length byte),
  not `weights.len().div_ceil(2)`.

Doc fixes:
- `cheap_desc_size_proxy` docstring: "representable when result is
  < 128 bytes" → "<= 128 bytes" — `encode_weight_description` rejects
  only `encoded.len() >= 128`, so total `encoded.len() + 1` length-byte
  prefix tops out at exactly 128. The code at the `fse_ok` site
  already uses `<= 128`; the doc was out of sync.
- `cheap_desc_size_proxy` docstring: explicitly document that
  `n == 0` returns `None` (raw could in principle encode an empty
  slice as just the length byte, but production callers never hand
  `n == 0` here — `build_from_counts` short-circuits on
  `symbol_cardinality <= 1`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(encoding): #111 Phase 7 — replace FSE encode linear search with donor-parity flat tables

2 participants