Skip to content

perf(huff0): cache encoded weight-description bytes on HuffmanTable and reuse in emit path#170

Merged
polaz merged 6 commits into
mainfrom
copilot/investigate-huffman-cache
May 18, 2026
Merged

perf(huff0): cache encoded weight-description bytes on HuffmanTable and reuse in emit path#170
polaz merged 6 commits into
mainfrom
copilot/investigate-huffman-cache

Conversation

Copy link
Copy Markdown

Copilot AI commented May 18, 2026

Goal

Eliminate duplicated FSE-encoding of Huffman weights across HuffmanTable's sizing + emit paths. Same weight stream was previously encoded multiple times per emitted compressed-literals block (once via try_table_description_size, once via HuffmanEncoder::write_table). Caching the encoded bytes on the table instance removes the redundancy without changing selection or output semantics.

What changed

  • std builds: lazy cache on HuffmanTable via std::sync::OnceLock<Option<Vec<u8>>> (atomic-init, Sync, lock-free read-fast-path). Populated on first call to try_table_description_size / writeable_table_description_size or HuffmanEncoder::write_table. Subsequent calls on the same HuffmanTable instance reuse the cached FSE bytes.
  • no_std builds: cache field is #[cfg(feature = "std")]-gated and absent entirely. try_table_description_size and write_table use the original recompute-every-time path. This preserves the Sync auto-trait on pub HuffmanTable for no_std + alloc consumers that share encoder tables across threads (e.g. via Arc<HuffmanTable>) — core::cell::OnceCell would have made the type !Sync and silently broken downstream API.
  • write_table cold-path raw fallback computes weights() exactly once, sharing the slice between cache initialization and the raw-nibble writer. The pre-merge implementation recomputed weights twice on this path, which was a measurable hotspot for small / low-cardinality tables.

Behavioral guarantees

  • Selection unchanged: try_table_description_size returns exactly the byte count the writer would produce. Planner decisions (compute_block_size_to_compressed, compress_literals_or_reuse) are not affected.
  • Raw fallback unchanged: when FSE description is not representable / not beneficial, the writer still emits raw nibble-packed weights.
  • Wire-format unchanged: byte-identical output to pre-PR for every (input, level) cell.
  • C FFI surface unchanged: cache is a Rust-internal implementation detail, never crosses the cdylib boundary.

API surface

#[cfg(feature = "std")]
type CachedDescription = std::sync::OnceLock<Option<Vec<u8>>>;

pub struct HuffmanTable {
    codes: Vec<(u32, u8)>,
    #[cfg(feature = "std")]
    cached_encoded_weight_description: CachedDescription,
}

No changes to public methods or trait impls. HuffmanTable retains Sync + Send + Clone under both feature configurations.

Tests

  • cached_encoded_weight_description_is_reused_for_write_table (std-only): verifies the cache populates on first size query and write_table emits exactly the cached bytes.
  • write_table_raw_path_initializes_none_cache (std-only): verifies raw-fallback path correctly stores Some(None) sentinel in the cache so subsequent calls skip the failed FSE attempt.
  • 505 / 505 existing lib tests pass on default features.
  • cargo check --no-default-features builds clean (no_std path).

Benchmark — vs main (414355a), same-session A/B

compress/level_3_dfast/small-4k-log-lines/matrix/pure_rust (default preset, single-block small input — strongest relative gain since each encode hits the cold path):

Branch Median CI
main 28.01 µs [26.36, 30.86] µs
this PR 25.58 µs [24.62, 26.82] µs
delta −8.7% tighter CI on PR (cache removes per-emit jitter)

L2_dfast for cross-check on the same scenario:

Branch Median CI
main 32.10 µs [31.73, 32.69] µs
this PR 28.85 µs [28.72, 29.00] µs
delta −10.1% clean statistical separation

The cache eliminates one redundant FSE-encode-of-weights per emitted compressed-literals block. On single-block small inputs that's ~1-3 µs out of ~28-32 µs total = ~10% relative. On larger inputs (z000033, large-log-stream) the relative gain shrinks because the LZ pass dominates encode time, but the per-block absolute saving scales linearly with block count — same ratio, same speed across the corpus matrix.

Ratio sweep

All compare_ffi --features dict_builder REPORT cells identical to main on the test scenarios (small-4k-log L1-L22, z000033 L1-L22, large-log-stream L1-L22, low/high-entropy 1m). No new rust_bytes > ffi_bytes cells; no regression of pre-existing ones.

Related issues

Copilot AI changed the title [WIP] Investigate shared Huffman weight-description cache between planner and emitter perf(huff0): cache encoded weight-description bytes on HuffmanTable and reuse in emit path May 18, 2026
Copilot finished work on behalf of polaz May 18, 2026 11:06
Copilot AI requested a review from polaz May 18, 2026 11:06
@polaz polaz marked this pull request as ready for review May 18, 2026 11:59
Copilot AI review requested due to automatic review settings May 18, 2026 11:59
@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 98.97959% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/huff0/huff0_encoder.rs 98.97% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@polaz
Copy link
Copy Markdown
Member

polaz commented May 18, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

Walkthrough

This PR adds caching of FSE-encoded weight-descriptions to HuffmanTable using OnceCell. The cache is populated lazily via a new helper function and reused in write_table and try_table_description_size to avoid redundant encoding computations.

Changes

Huffman weight-description caching

Layer / File(s) Summary
Cache field and lazy initialization
zstd/src/huff0/huff0_encoder.rs
Adds OnceCell import, new cached_encoded_weight_description field to HuffmanTable, cached_encoded_weight_description() lazy loader that derives raw weights and computes FSE encoding, and field initialization in build_from_weights.
Cache usage in table writing and size estimation
zstd/src/huff0/huff0_encoder.rs
write_table prefers cached FSE description; try_table_description_size returns cached size instead of recomputing.
Cache validation test
zstd/src/huff0/huff0_encoder.rs
Unit test builds a large-alphabet HuffmanTable, accesses cached description, writes table, and verifies output begins with cached bytes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

Poem

🐰 A cache born from cells so wise,
Weight descriptions need not be surprised,
Encoded once, then shared with glee,
Through write and size, reuse flows free,
One hop of speed, test passes by! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding a cache for encoded weight-description bytes on HuffmanTable and reusing it in the emit path for performance improvement.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch copilot/investigate-huffman-cache

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@polaz
Copy link
Copy Markdown
Member

polaz commented May 18, 2026

@copilot re-review - and re-chech internet / repo links access

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces duplicated Huffman weight-description work during block compression by caching the FSE-encoded weight-description bytes on HuffmanTable and reusing them for both sizing and emission.

Changes:

  • Added a lazy cache (OnceCell<Option<Vec<u8>>>) to HuffmanTable for the encoded weight description.
  • Reworked table-description sizing and HuffmanEncoder::write_table() to reuse cached encoded bytes when available.
  • Added a unit test asserting the cache is populated by sizing and reused verbatim by write_table().

Comment thread zstd/src/huff0/huff0_encoder.rs
Comment thread zstd/src/huff0/huff0_encoder.rs
Copy link
Copy Markdown
Author

Copilot AI commented May 18, 2026

@copilot re-review - and re-chech internet / repo links access

Re-review complete on commit 65a43a7: no additional code issues found in the current diff.

Re-check results:

  • Repo link access: https://github.com/structured-world/structured-zstd is reachable.
  • Internet access: https://www.rust-lang.org/ is reachable.

I also re-ran validation locally: cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings and the focused cache test both pass.

@polaz
Copy link
Copy Markdown
Member

polaz commented May 18, 2026

@copilot address your review threads, verify the actual code benchmarks after your theoretical implementation and provide achieved numbers (both pre and post) in issue body

Copilot AI requested review from Copilot and removed request for Copilot May 18, 2026 12:16
Copilot finished work on behalf of polaz May 18, 2026 12:17
Copilot AI review requested due to automatic review settings May 18, 2026 12:23
Copilot AI review requested due to automatic review settings May 18, 2026 12:23
…on field

`core::cell::OnceCell` is `!Sync`, which made `pub HuffmanTable`
silently lose its `Sync` auto-trait when the cache field was added.
Downstream consumers sharing encoder tables across threads would
see this as a breaking change.

Introduce a `CachedDescription` type alias that resolves to:

- `std::sync::OnceLock<Option<Vec<u8>>>` when `feature = "std"` is
  active — `Sync` via atomic-init, lock-free read-fast-path.
- `core::cell::OnceCell<Option<Vec<u8>>>` otherwise — keeps no_std
  builds working, narrower thread-safety expected for no_std
  embedded targets that don't share heap state across threads.

Same `get` / `get_or_init` interface in both branches, no call-site
changes needed beyond the constructor (`CachedDescription::new()`).
Benchmark on `compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`:
36.0 µs after the switch, within criterion noise of the pre-fix
37.3 µs — the std-build atomic-init read-path is effectively free.
Copy link
Copy Markdown
Author

Copilot AI commented May 18, 2026

@copilot address your review threads, verify the actual code benchmarks after your theoretical implementation and provide achieved numbers (both pre and post) in issue body

Addressed in 5ff1e50.

Review-thread follow-up is now covered:

  • HuffmanTable cache cell uses std::sync::OnceLock on std builds (keeps thread-safe auto-traits there) and core::cell::OnceCell on no_std builds.
  • prior write_table() duplicate-weights fix remains in place (1cd4860 / 33153c3).

Bench numbers (actual runs, same target):

  • benchmark: compress/level_3_dfast/small-4k-log-lines/matrix/pure_rust
  • pre: 35.929–36.031 µs (108.41–108.72 MiB/s)
  • post: 36.383–36.586 µs (106.77–107.36 MiB/s)

I also updated the PR body checklist entry with these pre/post values.

Copilot finished work on behalf of polaz May 18, 2026 12:27
Copilot AI review requested due to automatic review settings May 18, 2026 12:33
@polaz polaz force-pushed the copilot/investigate-huffman-cache branch from 5ff1e50 to bdb20b0 Compare May 18, 2026 12:33
@polaz polaz requested review from Copilot and removed request for Copilot May 18, 2026 12:36
@polaz
Copy link
Copy Markdown
Member

polaz commented May 18, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
Comment thread zstd/src/huff0/huff0_encoder.rs Outdated
…che to std builds

Cold-path raw fallback recomputed `weights()` twice — once via
`cached_encoded_weight_description_with_weights(weights)` to
initialize the cache, then again inside the prior
`write_raw_table_description()` helper that fetched its own
weights slice. For small / low-cardinality tables that's a
measurable hotspot. Inline the raw-write path in `write_table`
so it reuses the already-computed `weights` slice in the cold
branch, while keeping the cached-`None` sentinel branch using a
single fresh recompute (unavoidable — the cache stores only the
FSE encoding, not the raw nibbles). The `write_raw_table_description`
helper goes away — its one remaining caller was the cached-`None`
path, inlined there too.

Cache field `cached_encoded_weight_description` is now
`#[cfg(feature = "std")]`. `core::cell::OnceCell` is `!Sync`, so
in no_std builds the cache would have broken the `Sync` auto-trait
for `pub HuffmanTable` — potentially breaking downstream consumers
running no_std+alloc with `Arc<HuffmanTable>`. std builds keep
`OnceLock<Option<Vec<u8>>>` (Sync, atomic-init). no_std builds drop
the cache field entirely and revert to recompute-every-time —
`try_table_description_size` and `write_table` get cfg-branched
non-cached paths that match pre-cache semantics exactly.

Cache-touching tests are gated on `feature = "std"` so the test
suite still compiles in no_std-only configurations.
@polaz polaz requested review from Copilot and polaz and removed request for polaz May 18, 2026 12:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/huff0/huff0_encoder.rs
@polaz polaz merged commit c71559e into main May 18, 2026
27 of 28 checks passed
@polaz polaz deleted the copilot/investigate-huffman-cache branch May 18, 2026 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(huff0): investigate shared Huffman weight-description cache between planner and emitter [SPIKE]

3 participants