perf(decode): collapse FrameDecoder → block_decoder → ... 9-layer chain to donor's 3-layer ZSTD_decompressBlock_internal shape #265

@polaz

Context

Layer count from FrameDecoder::decode_to_slice_trusted to actual byte-level work:

FrameDecoder::decode_to_slice_trusted
  → decode_block_content_from_slice<DirectScratch>
    → decompress_block_inplace<DirectScratch>
      → decompress_block_inplace_with_parts<UserSliceBackend>
        → decode_and_execute_sequences<UserSliceBackend>
          → run_pipelined_sequence_loop<UserSliceBackend>
            → execute_one_sequence_pipelined<UserSliceBackend>
              → buffer.push / buffer.repeat_lookahead_prefetched
                → UserSliceBackend::extend / extend_from_within_unchecked
                  → simd_copy::copy_bytes_overshooting
                    → single_op_copy_16 / chunked SIMD

That is 9+ layers of function-call indirection per sequence. Most are marked #[inline(always)], but LLVM does not always flatten nesting this deep under fat LTO with codegen-units=1, and even when it does, the resulting IR bloat puts pressure on later optimization passes. PR #263 collapsed the last 4 layers for UserSliceBackend's donor inline path; the outer 5 remain.
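
To make the shape concrete, here is a minimal sketch of a layered chain next to a flat equivalent. All names in it (Backend, VecBackend, decode_layered, decode_flat) are hypothetical stand-ins, not the crate's real items, and the "work" is reduced to appending literal runs. Both variants produce the same bytes; the difference is only how much the optimizer has to see through.

```rust
// Hypothetical stand-in for an output backend; not the crate's real trait.
trait Backend {
    fn extend(&mut self, bytes: &[u8]);
}

struct VecBackend(Vec<u8>);

impl Backend for VecBackend {
    #[inline(always)]
    fn extend(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
}

// Layered shape: each generic wrapper only forwards to the next one and
// relies on the inline hint being applied at every level.
#[inline(always)]
fn execute_one<B: Backend>(out: &mut B, literals: &[u8]) {
    out.extend(literals);
}

#[inline(always)]
fn run_loop<B: Backend>(out: &mut B, runs: &[&[u8]]) {
    for &run in runs {
        execute_one(out, run);
    }
}

#[inline(always)]
fn decode_layered<B: Backend>(out: &mut B, runs: &[&[u8]]) {
    run_loop(out, runs);
}

// Flat shape: the same work written as one body for one concrete output type,
// so nothing depends on the inliner seeing through the wrappers.
fn decode_flat(out: &mut Vec<u8>, runs: &[&[u8]]) {
    for &run in runs {
        out.extend_from_slice(run);
    }
}
```

Whether the layered version actually compiles to the same machine code as the flat one is exactly what has to be checked in the disassembly; the sketch only shows that the two are functionally interchangeable.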

Donor zstd's equivalent shape (lib/decompress/zstd_decompress_block.c::ZSTD_decompressBlock_internal):

ZSTD_decompressFrame
  → ZSTD_decompressBlock_internal
    → ZSTD_decodeLiteralsBlock + ZSTD_decompressSequences_body

3 layers max, all explicitly FORCE_INLINE_TEMPLATE. The compiler emits one flat ZSTD_decompressFrame with the body of every inner function pasted in line.

Proposal

Collapse the decoder's outer block-driver chain into a single donor-shape decompress_block_donor function for the UserSliceBackend direct-decode path. Specifically:

  1. New block_decoder_donor.rs module with unsafe fn decompress_block_donor(...) — mirrors donor's ZSTD_decompressBlock_internal shape: receives header, source ptr, dest ptr, scratch refs; dispatches block type (Raw / RLE / Compressed); for Compressed, decodes literals THEN runs the sequence loop inline (no decompress_block_inplace_* wrappers).
  2. FrameDecoder::decode_to_slice_trusted calls this directly for compressed blocks, skipping the BlockDecoder layer entirely.
  3. Existing block_decoder.rs stays for FlatBuf / RingBuffer paths — preserves the legacy abstraction surface on backends that need it.
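
The block-type dispatch in step 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the proposed implementation: it uses safe slices instead of the raw pointers and scratch refs the proposal describes, every name in it (BlockHeader, BlockError, parse_block_header, decompress_block_sketch) is hypothetical, and the Compressed arm is deliberately left as a stub where the inlined literals decode and sequence loop would go. The header layout follows the Zstandard format (3 bytes, little-endian: last-block bit, 2-bit type, 21-bit size).

```rust
#[derive(Debug, PartialEq)]
enum BlockType { Raw, Rle, Compressed, Reserved }

#[derive(Debug, PartialEq)]
struct BlockHeader { last: bool, block_type: BlockType, size: usize }

#[derive(Debug, PartialEq)]
enum BlockError { Truncated, DestTooSmall, ReservedType, CompressedNotSketched }

// Parse the 3-byte little-endian block header:
// bit 0 = last block, bits 1-2 = block type, bits 3-23 = block size.
fn parse_block_header(raw: [u8; 3]) -> BlockHeader {
    let v = u32::from(raw[0]) | (u32::from(raw[1]) << 8) | (u32::from(raw[2]) << 16);
    let block_type = match (v >> 1) & 0b11 {
        0 => BlockType::Raw,
        1 => BlockType::Rle,
        2 => BlockType::Compressed,
        _ => BlockType::Reserved,
    };
    BlockHeader { last: v & 1 == 1, block_type, size: (v >> 3) as usize }
}

// Single dispatch point for one block.
// Returns (bytes consumed from `src`, bytes written to `dst`).
fn decompress_block_sketch(
    header: &BlockHeader,
    src: &[u8],
    dst: &mut [u8],
) -> Result<(usize, usize), BlockError> {
    match header.block_type {
        BlockType::Raw => {
            // Raw: copy `size` bytes straight through.
            let input = src.get(..header.size).ok_or(BlockError::Truncated)?;
            let out = dst.get_mut(..header.size).ok_or(BlockError::DestTooSmall)?;
            out.copy_from_slice(input);
            Ok((header.size, header.size))
        }
        BlockType::Rle => {
            // RLE: one source byte repeated `size` times.
            let byte = *src.first().ok_or(BlockError::Truncated)?;
            let out = dst.get_mut(..header.size).ok_or(BlockError::DestTooSmall)?;
            out.fill(byte);
            Ok((1, header.size))
        }
        // Stub: in the proposal, literals decoding and the sequence loop
        // would be written inline here rather than behind wrapper functions.
        BlockType::Compressed => Err(BlockError::CompressedNotSketched),
        BlockType::Reserved => Err(BlockError::ReservedType),
    }
}
```

For example, the header bytes [35, 0, 0] parse as a last RLE block of size 4, so a one-byte source of 0xAB fills the first four destination bytes with 0xAB and reports one byte consumed.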

Out of scope (separate refactors): legacy decode_all chain, RingBuffer multi-segment streaming.

Acceptance criteria

  • block_decoder_donor.rs with the collapsed body.
  • FrameDecoder::decode_to_slice_trusted dispatches to it on the UserSliceBackend + non-dict path.
  • Layer count from entry to byte work drops from 9+ to ≤ 4 (verify via cargo asm / disasm).
  • All 637+ nextest tests pass.
  • Bench delta on i9 across decompress/level_-1_fast/{decodecorpus-z000033/c_stream, low-entropy-1m/rust_stream}/matrix/pure_rust_direct and small-10k-random/level_-6_fast/c_stream: ≥ 5% time reduction averaged, no regression on any single fixture.
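
The bench criterion can be read as a small predicate over per-fixture timings. The sketch below assumes that "averaged" means the arithmetic mean of per-fixture fractional time reductions and that "no regression" means no fixture gets slower at all; a real check would probably want a noise threshold. The FixtureTiming type and the function names are hypothetical.

```rust
// One fixture's timings. Type and field names are hypothetical.
struct FixtureTiming {
    name: &'static str,
    baseline_ns: f64,
    candidate_ns: f64,
}

// Fractional time reduction: positive means the candidate is faster.
fn reduction(t: &FixtureTiming) -> f64 {
    (t.baseline_ns - t.candidate_ns) / t.baseline_ns
}

// Names of fixtures where the candidate is slower than the baseline.
fn regressed(fixtures: &[FixtureTiming]) -> Vec<&'static str> {
    fixtures
        .iter()
        .filter(|t| reduction(t) < 0.0)
        .map(|t| t.name)
        .collect()
}

// True when the mean reduction is at least 5% and no fixture regressed.
fn meets_bench_criterion(fixtures: &[FixtureTiming]) -> bool {
    if fixtures.is_empty() {
        return false;
    }
    let mean = fixtures.iter().map(reduction).sum::<f64>() / fixtures.len() as f64;
    regressed(fixtures).is_empty() && mean >= 0.05
}
```

Note that under this reading a large win on one fixture cannot compensate for a slowdown on another: the regression check is applied per fixture before the average is considered.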

Part of #247.

Labels: P1-high (High priority, core functionality), enhancement (New feature or request), performance (Performance optimization)