Status
Investigation / spike. Outcome may be "doesn't pay off, close." That's an acceptable result: the point is to measure, not to commit to landing.
Context
After #168 + #167, the FSE-encode of Huffman weights still runs twice per emitted compressed-literals block just to measure the description size, and a third time for the actual emit:
Planner — compute_block_size_to_compressed (compressed.rs:458): builds a fresh HuffmanTable, calls writeable_table_description_size (= try_table_description_size) which FSE-encodes the weight stream to count bytes. Used in payload + desc to decide raw-vs-FSE literals.
Emitter — compress_literals_or_reuse (compressed.rs:1546): independently builds another fresh HuffmanTable (same literals → same counts → same weights), calls writeable_table_description_size again (same FSE-encode), feeds it into decide_huff_reuse_like_encoder. Then HuffmanEncoder::write_table FSE-encodes the weights a THIRD time (the actual emit).
So we currently pay 3 FSE-encodes of weights per emitted block for what is mathematically the same computation on the same input.
What was tried and rejected
#167-followup attempt (commit drafted, then reverted on this branch in the post-#168 work session 2026-05-18): replacing writeable_table_description_size with the cheap_desc_size_proxy upper bound. Result:
Speed: ~1-2 µs / block — marginal.
Ratio: decodecorpus-z000033 L18 / L19 worsened by +13 B each (already R > C donor cells; the proxy's conservative upper-bound biases the planner toward raw literals on borderline blocks). Per project rule "Ratio first — if rust_bytes > ffi_bytes we lose vs donor → real bug", any worsening of an already-R-above-C cell is unacceptable.
Single-shot proxy in the planner is not the right path. The planner needs the exact size; the cost we want to eliminate is the duplication, not the exactness.
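To see why a conservative upper bound costs ratio on borderline blocks, here is a minimal sketch of the raw-vs-compressed decision. The function name `choose_compressed` and every byte count are hypothetical, picked only to illustrate the effect; none of them come from the codebase or from the decodecorpus measurement.

```rust
/// Hypothetical planner decision: use compressed literals only when the
/// Huffman payload plus the table description beats the raw size.
fn choose_compressed(raw_len: usize, payload_len: usize, desc_len: usize) -> bool {
    payload_len + desc_len < raw_len
}

fn main() {
    // Illustrative numbers for a borderline block (not measured data).
    let raw_len = 200;
    let payload_len = 170;
    let exact_desc = 25; // exact FSE-encoded description size
    let proxy_desc = 32; // conservative upper bound, always >= exact

    // Exact size: 170 + 25 = 195 < 200, so compression wins by 5 bytes.
    assert!(choose_compressed(raw_len, payload_len, exact_desc));

    // Proxy size: 170 + 32 = 202 >= 200, so the planner falls back to raw
    // literals and the block ends up 5 bytes larger than necessary.
    assert!(!choose_compressed(raw_len, payload_len, proxy_desc));
}
```

The overestimate only matters when the exact total sits just under the raw size, which is why the regression shows up on a handful of cells rather than across the whole sweep.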
Proposed investigation
Share the FSE-encoded weight description between the three call sites. Concrete shape to try:
Lift table construction out of the duplicate code path. Make compute_block_size_to_compressed (planner) and compress_literals_or_reuse (emitter) consume a single HuffmanTable instance built once per logical block, instead of building two copies from the same counts.
Cache the encoded description on HuffmanTable. Add description: OnceCell<Option<Vec<u8>>> (or equivalent interior-mutability slot for &self access — core::cell::OnceCell on the assumption the table isn't shared across threads inside one frame). Populate lazily on the first writeable_table_description_size call; reuse the bytes verbatim in HuffmanEncoder::write_table (currently calls encode_weight_description again).
Confirm the emitter's write_table path can ingest pre-encoded bytes — donor HUF_writeCTable_wksp produces the exact same byte stream we'd cache from the FSE encoder; emit becomes writer.append_bytes(&cached) instead of re-running FSE.
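The shape above could be sketched roughly as follows. This is a simplified stand-in, not the real types: the struct carries only the fields needed for the cache, `encode_weight_description` is a placeholder that copies the weights instead of running FSE, `cached_description` and the `encodes` counter are hypothetical helpers added for illustration, and `write_table` is placed on the table here even though the real one lives on `HuffmanEncoder`.

```rust
use std::cell::{Cell, OnceCell};

/// Simplified stand-in for the real `HuffmanTable`.
struct HuffmanTable {
    weights: Vec<u8>,
    /// Lazily populated description; an inner `None` means "not encodable".
    description: OnceCell<Option<Vec<u8>>>,
    /// Instrumentation for this sketch only: counts actual encodes.
    encodes: Cell<u32>,
}

impl HuffmanTable {
    fn new(weights: Vec<u8>) -> Self {
        Self { weights, description: OnceCell::new(), encodes: Cell::new(0) }
    }

    /// Placeholder for the FSE weight encoder. It copies the weights so the
    /// example stays self-contained; the real encoder is the costly part.
    fn encode_weight_description(&self) -> Option<Vec<u8>> {
        self.encodes.set(self.encodes.get() + 1);
        if self.weights.is_empty() { None } else { Some(self.weights.clone()) }
    }

    /// Encodes on first use, then returns the cached bytes.
    fn cached_description(&self) -> Option<&[u8]> {
        self.description
            .get_or_init(|| self.encode_weight_description())
            .as_deref()
    }

    /// Exact size for the planner and for the reuse decision.
    fn writeable_table_description_size(&self) -> Option<usize> {
        self.cached_description().map(|d| d.len())
    }

    /// Emit path: append the cached bytes instead of re-encoding.
    fn write_table(&self, out: &mut Vec<u8>) -> bool {
        match self.cached_description() {
            Some(bytes) => {
                out.extend_from_slice(bytes);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let table = HuffmanTable::new(vec![4, 3, 2, 2, 1, 1]);
    let planner_size = table.writeable_table_description_size(); // planner
    let emitter_size = table.writeable_table_description_size(); // emitter
    let mut out = Vec::new();
    let written = table.write_table(&mut out); // actual emit

    assert_eq!(planner_size, emitter_size);
    assert!(written);
    assert_eq!(Some(out.len()), planner_size);
    assert_eq!(table.encodes.get(), 1); // three call sites, one encode
}
```

Note that `OnceCell` makes the containing type `!Sync`. If the table ever had to be shared by reference across threads, `std::sync::OnceLock` offers the same `get_or_init` interface at the cost of synchronization.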
Acceptance criteria (kill-switch: must hit ALL)
No new R > C cells in the compare_ffi REPORT sweep across every supported (scenario, level).
… f61dde38 on this branch, or the latest main HEAD).
Measurable speedup (>= 5 %) on compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust vs the same baseline. Below 5 % is not worth the refactor surface.
… cargo clippy -- -D warnings clean.
If any of these fail, close without landing. The investigation will have produced a measurement + rationale that future work can reference; that is the deliverable, not the code.
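The two fully legible criteria (no new R > C cells, at least 5 % speedup) can be written down as a mechanical check. This is only a sketch: `SpikeResult`, its fields, and the definition of speedup as the relative reduction in time per iteration are assumptions made for illustration, not part of the benchmark harness.

```rust
/// Hypothetical summary of one spike run; field names are illustrative.
struct SpikeResult {
    /// Cells where rust_bytes > ffi_bytes that were not R > C before.
    new_r_above_c_cells: usize,
    /// Time per iteration on the baseline commit, in nanoseconds.
    baseline_ns: f64,
    /// Time per iteration with the shared description, in nanoseconds.
    candidate_ns: f64,
}

/// Relative time saved versus the baseline (0.05 means 5 %).
fn speedup(r: &SpikeResult) -> f64 {
    (r.baseline_ns - r.candidate_ns) / r.baseline_ns
}

/// Ratio first: any new R > C cell fails regardless of speed.
fn passes_kill_switch(r: &SpikeResult) -> bool {
    r.new_r_above_c_cells == 0 && speedup(r) >= 0.05
}

fn main() {
    // Illustrative timings, not measurements.
    let fast_and_clean = SpikeResult { new_r_above_c_cells: 0, baseline_ns: 1000.0, candidate_ns: 940.0 };
    let too_slow = SpikeResult { new_r_above_c_cells: 0, baseline_ns: 1000.0, candidate_ns: 970.0 };
    let ratio_regression = SpikeResult { new_r_above_c_cells: 1, baseline_ns: 1000.0, candidate_ns: 800.0 };

    assert!(passes_kill_switch(&fast_and_clean));
    assert!(!passes_kill_switch(&too_slow));
    assert!(!passes_kill_switch(&ratio_regression));
}
```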
Surface area
zstd/src/huff0/huff0_encoder.rs: HuffmanTable struct + lazy cache, HuffmanEncoder::write_table rework.
zstd/src/encoding/blocks/compressed.rs: compute_block_size_to_compressed + compress_literals_or_reuse plumbing (share the table).
Estimated size: ~150-300 LoC of code + tests, single PR.
Related
… table_log selection).