perf(huff0): investigate shared Huffman weight-description cache between planner and emitter [SPIKE]

## Status

**Investigation / spike.** Outcome may be "doesn't pay off, close." That's an acceptable result — the point is to measure, not to commit to landing.

## Context

After #168 + #167, the FSE-encode of Huffman weights still runs **twice per emitted compressed-literals block**:

1. **Planner** — `compute_block_size_to_compressed` (`compressed.rs:458`): builds a fresh `HuffmanTable`, calls `writeable_table_description_size` (= `try_table_description_size`) which FSE-encodes the weight stream to count bytes. Used in `payload + desc` to decide raw-vs-FSE literals.
2. **Emitter** — `compress_literals_or_reuse` (`compressed.rs:1546`): independently builds another fresh `HuffmanTable` (same `literals` → same counts → same weights), calls `writeable_table_description_size` again (same FSE-encode), feeds it into `decide_huff_reuse_like_encoder`. Then `HuffmanEncoder::write_table` FSE-encodes the weights a THIRD time (the actual emit).

So we currently pay **3 FSE-encodes of weights per emitted block** for what is mathematically the same computation on the same input.

## What was tried and rejected

#167-followup attempt (commit drafted, then reverted on this branch in the post-#168 work session 2026-05-18): replacing `writeable_table_description_size` with the `cheap_desc_size_proxy` upper bound. Result:

- Speed: ~1-2 µs / block — marginal.
- Ratio: `decodecorpus-z000033` L18 / L19 worsened by **+13 B each** (already R > C donor cells; the proxy's conservative upper-bound biases the planner toward raw literals on borderline blocks). Per project rule "Ratio first — if `rust_bytes > ffi_bytes` we lose vs donor → real bug", any worsening of an already-R-above-C cell is unacceptable.

Single-shot proxy in the planner is **not** the right path. The planner needs the exact size; the cost we want to eliminate is the duplication, not the exactness.

## Proposed investigation

Share the FSE-encoded weight description between the three call sites. Concrete shape to try:

1. **Lift table construction out of the duplicate code path.** Make `compute_block_size_to_compressed` (planner) and `compress_literals_or_reuse` (emitter) consume a **single** `HuffmanTable` instance built once per logical block, instead of building two copies from the same `counts`.
2. **Cache the encoded description on `HuffmanTable`.** Add `description: OnceCell<Option<Vec<u8>>>` (or equivalent interior-mutability slot for `&self` access — `core::cell::OnceCell` on the assumption the table isn't shared across threads inside one frame). Populate lazily on the first `writeable_table_description_size` call; reuse the bytes verbatim in `HuffmanEncoder::write_table` (currently calls `encode_weight_description` again).
3. **Confirm the emitter's `write_table` path can ingest pre-encoded bytes** — donor `HUF_writeCTable_wksp` produces the exact same byte stream we'd cache from the FSE encoder; emit becomes `writer.append_bytes(&cached)` instead of re-running FSE.

## Acceptance criteria (kill-switch — must hit ALL)

- **No new R > C cells** in the compare_ffi REPORT sweep across every supported `(scenario, level)`.
- **No worsening** (no positive delta) of any existing R > C cell vs the post-#168 / pre-investigation baseline (`f61dde38` on this branch, or the latest `main` HEAD).
- **Measurable speedup** (>= 5 %) on `compress/level_2_dfast/small-4k-log-lines/matrix/pure_rust` vs the same baseline. Below 5 % is not worth the refactor surface.
- All 503/503 lib tests pass; `cargo clippy -- -D warnings` clean.

If any of these fail, **close without landing**. The investigation will have produced a measurement + rationale that future work can reference — that is the deliverable, not the code.

## Surface area

- `zstd/src/huff0/huff0_encoder.rs` — `HuffmanTable` struct + lazy cache, `HuffmanEncoder::write_table` rework.
- `zstd/src/encoding/blocks/compressed.rs` — `compute_block_size_to_compressed` + `compress_literals_or_reuse` plumbing (share the table).

Estimated size: ~150-300 LoC of code + tests, single PR.

## Related

- Part of #111 Phase 7 Lane A (block compressor entropy coding speed).
- Follow-up from #167 / PR #168 (cheap entropy proxy for `table_log` selection).
- Builds on #166 (FSE_buildCTable_wksp parity, drop per-symbol Vec<State>).
- Surfaced 2026-05-18 during #168 review cycle.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(huff0): investigate shared Huffman weight-description cache between planner and emitter [SPIKE] #169

Status

Context

What was tried and rejected

Proposed investigation

Acceptance criteria (kill-switch — must hit ALL)

Surface area

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

perf(huff0): investigate shared Huffman weight-description cache between planner and emitter [SPIKE] #169

Description

Status

Context

What was tried and rejected

Proposed investigation

Acceptance criteria (kill-switch — must hit ALL)

Surface area

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions