Skip to content

perf(huff0): investigate shared Huffman weight-description cache between planner and emitter [SPIKE] #169

@polaz

Description

@polaz

Status

Investigation / spike. Outcome may be "doesn't pay off, close." That's an acceptable result — the point is to measure, not to commit to landing.

Context

After #168 + #167, the FSE-encode of Huffman weights still runs twice per emitted compressed-literals block:

  1. Plannercompute_block_size_to_compressed (compressed.rs:458): builds a fresh HuffmanTable, calls writeable_table_description_size (= try_table_description_size) which FSE-encodes the weight stream to count bytes. Used in payload + desc to decide raw-vs-FSE literals.
  2. Emittercompress_literals_or_reuse (compressed.rs:1546): independently builds another fresh HuffmanTable (same literals → same counts → same weights), calls writeable_table_description_size again (same FSE-encode), feeds it into decide_huff_reuse_like_encoder. Then HuffmanEncoder::write_table FSE-encodes the weights a THIRD time (the actual emit).

So we currently pay 3 FSE-encodes of weights per emitted block for what is mathematically the same computation on the same input.

What was tried and rejected

#167-followup attempt (commit drafted, then reverted on this branch in the post-#168 work session 2026-05-18): replacing writeable_table_description_size with the cheap_desc_size_proxy upper bound. Result:

  • Speed: ~1-2 µs / block — marginal.
  • Ratio: decodecorpus-z000033 L18 / L19 worsened by +13 B each (already R > C donor cells; the proxy's conservative upper-bound biases the planner toward raw literals on borderline blocks). Per project rule "Ratio first — if rust_bytes > ffi_bytes we lose vs donor → real bug", any worsening of an already-R-above-C cell is unacceptable.

Single-shot proxy in the planner is not the right path. The planner needs the exact size; the cost we want to eliminate is the duplication, not the exactness.

Proposed investigation

Share the FSE-encoded weight description between the three call sites. Concrete shape to try:

  1. Lift table construction out of the duplicate code path. Make compute_block_size_to_compressed (planner) and compress_literals_or_reuse (emitter) consume a single HuffmanTable instance built once per logical block, instead of building two copies from the same counts.
  2. Cache the encoded description on HuffmanTable. Add description: OnceCell<Option<Vec<u8>>> (or equivalent interior-mutability slot for &self access — core::cell::OnceCell on the assumption the table isn't shared across threads inside one frame). Populate lazily on the first writeable_table_description_size call; reuse the bytes verbatim in HuffmanEncoder::write_table (currently calls encode_weight_description again).
  3. Confirm the emitter's write_table path can ingest pre-encoded bytes — donor HUF_writeCTable_wksp produces the exact same byte stream we'd cache from the FSE encoder; emit becomes writer.append_bytes(&cached) instead of re-running FSE.

Acceptance criteria (kill-switch — must hit ALL)

  • No new R > C cells in the compare_ffi REPORT sweep across every supported (scenario, level).
  • No worsening (no positive delta) of any existing R > C cell vs the post-perf(huff0): #167 cheap entropy proxy for table_log selection — no FSE-encode per candidate #168 / pre-investigation baseline (f61dde38 on this branch, or the latest main HEAD).
  • Measurable speedup (>= 5 %) on compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust vs the same baseline. Below 5 % is not worth the refactor surface.
  • All 503/503 lib tests pass; cargo clippy -- -D warnings clean.

If any of these fail, close without landing. The investigation will have produced a measurement + rationale that future work can reference — that is the deliverable, not the code.

Surface area

  • zstd/src/huff0/huff0_encoder.rsHuffmanTable struct + lazy cache, HuffmanEncoder::write_table rework.
  • zstd/src/encoding/blocks/compressed.rscompute_block_size_to_compressed + compress_literals_or_reuse plumbing (share the table).

Estimated size: ~150-300 LoC of code + tests, single PR.

Related

Metadata

Metadata

Assignees

Labels

P2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions