perf(huff0): cache encoded weight-description bytes on HuffmanTable and reuse in emit path#170
Conversation
HuffmanTable and reuse in emit path
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
📝 WalkthroughWalkthroughThis PR adds caching of FSE-encoded weight-descriptions to ChangesHuffman weight-description caching
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@copilot re-review - and re-chech internet / repo links access |
There was a problem hiding this comment.
Pull request overview
This PR reduces duplicated Huffman weight-description work during block compression by caching the FSE-encoded weight-description bytes on HuffmanTable and reusing them for both sizing and emission.
Changes:
- Added a lazy cache (
OnceCell<Option<Vec<u8>>>) toHuffmanTablefor the encoded weight description. - Reworked table-description sizing and
HuffmanEncoder::write_table()to reuse cached encoded bytes when available. - Added a unit test asserting the cache is populated by sizing and reused verbatim by
write_table().
Re-review complete on commit Re-check results:
I also re-ran validation locally: |
|
@copilot address your review threads, verify the actual code benchmarks after your theoretical implementation and provide achieved numbers (both pre and post) in issue body |
Agent-Logs-Url: https://github.com/structured-world/structured-zstd/sessions/6bdb0283-45b1-4909-8dde-e9b2114fbf4c Co-authored-by: polaz <4152123+polaz@users.noreply.github.com>
…back Agent-Logs-Url: https://github.com/structured-world/structured-zstd/sessions/29cef8e6-3ed8-4ed3-a365-84f0ebf18546 Co-authored-by: polaz <4152123+polaz@users.noreply.github.com>
Agent-Logs-Url: https://github.com/structured-world/structured-zstd/sessions/29cef8e6-3ed8-4ed3-a365-84f0ebf18546 Co-authored-by: polaz <4152123+polaz@users.noreply.github.com>
…on field `core::cell::OnceCell` is `!Sync`, which made `pub HuffmanTable` silently lose its `Sync` auto-trait when the cache field was added. Downstream consumers sharing encoder tables across threads would see this as a breaking change. Introduce a `CachedDescription` type alias that resolves to: - `std::sync::OnceLock<Option<Vec<u8>>>` when `feature = "std"` is active — `Sync` via atomic-init, lock-free read-fast-path. - `core::cell::OnceCell<Option<Vec<u8>>>` otherwise — keeps no_std builds working, narrower thread-safety expected for no_std embedded targets that don't share heap state across threads. Same `get` / `get_or_init` interface in both branches, no call-site changes needed beyond the constructor (`CachedDescription::new()`). Benchmark on `compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust`: 36.0 µs after the switch, within criterion noise of the pre-fix 37.3 µs — the std-build atomic-init read-path is effectively free.
Addressed in Review-thread follow-up is now covered:
Bench numbers (actual runs, same target):
I also updated the PR body checklist entry with these pre/post values. |
5ff1e50 to
bdb20b0
Compare
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
…che to std builds Cold-path raw fallback recomputed `weights()` twice — once via `cached_encoded_weight_description_with_weights(weights)` to initialize the cache, then again inside the prior `write_raw_table_description()` helper that fetched its own weights slice. For small / low-cardinality tables that's a measurable hotspot. Inline the raw-write path in `write_table` so it reuses the already-computed `weights` slice in the cold branch, while keeping the cached-`None` sentinel branch using a single fresh recompute (unavoidable — the cache stores only the FSE encoding, not the raw nibbles). The `write_raw_table_description` helper goes away — its one remaining caller was the cached-`None` path, inlined there too. Cache field `cached_encoded_weight_description` is now `#[cfg(feature = "std")]`. `core::cell::OnceCell` is `!Sync`, so in no_std builds the cache would have broken the `Sync` auto-trait for `pub HuffmanTable` — potentially breaking downstream consumers running no_std+alloc with `Arc<HuffmanTable>`. std builds keep `OnceLock<Option<Vec<u8>>>` (Sync, atomic-init). no_std builds drop the cache field entirely and revert to recompute-every-time — `try_table_description_size` and `write_table` get cfg-branched non-cached paths that match pre-cache semantics exactly. Cache-touching tests are gated on `feature = "std"` so the test suite still compiles in no_std-only configurations.
Goal
Eliminate duplicated FSE-encoding of Huffman weights across
HuffmanTable's sizing + emit paths. Same weight stream was previously encoded multiple times per emitted compressed-literals block (once viatry_table_description_size, once viaHuffmanEncoder::write_table). Caching the encoded bytes on the table instance removes the redundancy without changing selection or output semantics.What changed
stdbuilds: lazy cache onHuffmanTableviastd::sync::OnceLock<Option<Vec<u8>>>(atomic-init,Sync, lock-free read-fast-path). Populated on first call totry_table_description_size/writeable_table_description_sizeorHuffmanEncoder::write_table. Subsequent calls on the sameHuffmanTableinstance reuse the cached FSE bytes.no_stdbuilds: cache field is#[cfg(feature = "std")]-gated and absent entirely.try_table_description_sizeandwrite_tableuse the original recompute-every-time path. This preserves theSyncauto-trait onpub HuffmanTableforno_std + allocconsumers that share encoder tables across threads (e.g. viaArc<HuffmanTable>) —core::cell::OnceCellwould have made the type!Syncand silently broken downstream API.write_tablecold-path raw fallback computesweights()exactly once, sharing the slice between cache initialization and the raw-nibble writer. The pre-merge implementation recomputed weights twice on this path, which was a measurable hotspot for small / low-cardinality tables.Behavioral guarantees
try_table_description_sizereturns exactly the byte count the writer would produce. Planner decisions (compute_block_size_to_compressed,compress_literals_or_reuse) are not affected.cdylibboundary.API surface
No changes to public methods or trait impls.
HuffmanTableretainsSync + Send + Cloneunder both feature configurations.Tests
cached_encoded_weight_description_is_reused_for_write_table(std-only): verifies the cache populates on first size query andwrite_tableemits exactly the cached bytes.write_table_raw_path_initializes_none_cache(std-only): verifies raw-fallback path correctly storesSome(None)sentinel in the cache so subsequent calls skip the failed FSE attempt.cargo check --no-default-featuresbuilds clean (no_std path).Benchmark — vs
main(414355a), same-session A/Bcompress/level_3_dfast/small-4k-log-lines/matrix/pure_rust(default preset, single-block small input — strongest relative gain since each encode hits the cold path):mainL2_dfast for cross-check on the same scenario:
mainThe cache eliminates one redundant FSE-encode-of-weights per emitted compressed-literals block. On single-block small inputs that's ~1-3 µs out of ~28-32 µs total = ~10% relative. On larger inputs (z000033, large-log-stream) the relative gain shrinks because the LZ pass dominates encode time, but the per-block absolute saving scales linearly with block count — same ratio, same speed across the corpus matrix.
Ratio sweep
All
compare_ffi --features dict_builderREPORT cells identical tomainon the test scenarios (small-4k-log L1-L22, z000033 L1-L22, large-log-stream L1-L22, low/high-entropy 1m). No newrust_bytes > ffi_bytescells; no regression of pre-existing ones.Related issues