Context
Layer count from FrameDecoder::decode_to_slice_trusted to actual byte-level work:
FrameDecoder::decode_to_slice_trusted
→ decode_block_content_from_slice<DirectScratch>
→ decompress_block_inplace<DirectScratch>
→ decompress_block_inplace_with_parts<UserSliceBackend>
→ decode_and_execute_sequences<UserSliceBackend>
→ run_pipelined_sequence_loop<UserSliceBackend>
→ execute_one_sequence_pipelined<UserSliceBackend>
→ buffer.push / buffer.repeat_lookahead_prefetched
→ UserSliceBackend::extend / extend_from_within_unchecked
→ simd_copy::copy_bytes_overshooting
→ single_op_copy_16 / chunked SIMD
That's 9+ layers of function-call indirection per sequence. Most are #[inline(always)], but LLVM doesn't always honour the attribute through deep nesting under fat LTO + codegen-units=1, and even when it does, the resulting IR bloat increases pressure on later passes. PR #263 collapsed the last 4 layers for UserSliceBackend's donor inline path; the outer 5 remain.
Donor zstd's equivalent shape (lib/decompress/zstd_decompress_block.c::ZSTD_decompressBlock_internal):
ZSTD_decompressFrame
→ ZSTD_decompressBlock_internal
→ ZSTD_decodeLiteralsBlock + ZSTD_decompressSequences_body
3 layers max, with the hot bodies explicitly FORCE_INLINE_TEMPLATE, so the compiler pastes each inner body in line into its caller instead of emitting a call per layer.
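For readers coming from the Rust side: the closest Rust analogue of the FORCE_INLINE_TEMPLATE pattern is probably an #[inline(always)] body parameterised by a const generic, wrapped by thin monomorphic entry points. The sketch below only illustrates that shape. The names copy_body, copy_short and copy_long are hypothetical and do not exist in this crate or in zstd.

```rust
/// Shared body, "templated" on a compile-time flag. Each entry point that
/// names a concrete `BULK` value gets its own specialised copy, and the
/// `if BULK` test is resolved at compile time (hypothetical example).
#[inline(always)]
fn copy_body<const BULK: bool>(dst: &mut Vec<u8>, src: &[u8]) -> usize {
    if BULK {
        // Bulk path: one slice-level append.
        dst.extend_from_slice(src);
    } else {
        // Short path: byte-at-a-time append.
        for &b in src {
            dst.push(b);
        }
    }
    src.len()
}

/// Thin entry point specialised for short copies.
pub fn copy_short(dst: &mut Vec<u8>, src: &[u8]) -> usize {
    copy_body::<false>(dst, src)
}

/// Thin entry point specialised for bulk copies.
pub fn copy_long(dst: &mut Vec<u8>, src: &[u8]) -> usize {
    copy_body::<true>(dst, src)
}
```

Both entry points produce the same bytes; the point is that there is one source-level body and one call layer, not a chain of generic wrappers that each depend on the inliner.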
Proposal
Collapse the decoder's outer block-driver chain into a single donor-shape decompress_block_donor function for the UserSliceBackend direct-decode path. Specifically:
- New block_decoder_donor.rs module with unsafe fn decompress_block_donor(...), mirroring donor's ZSTD_decompressBlock_internal shape: receives header, source ptr, dest ptr, scratch refs; dispatches block type (Raw / RLE / Compressed); for Compressed, decodes literals, then runs the sequence loop inline (no decompress_block_inplace_* wrappers).
- FrameDecoder::decode_to_slice_trusted calls this directly for compressed blocks, skipping the BlockDecoder layer entirely.
- Existing block_decoder.rs stays for FlatBuf / RingBuffer paths, preserving the legacy abstraction surface on backends that need it.
Out of scope (separate refactors): legacy decode_all chain, RingBuffer multi-segment streaming.
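A minimal sketch of the flat block-driver shape proposed above, written in safe Rust so it can be run as-is. Everything here is an assumption made for illustration: the Block and Sequence types, the name decompress_block_sketch, and the choice to pass literals and sequences already decoded (the real function would take raw pointers, parse the block header, and entropy-decode both). It shows only the control flow: one function dispatches on block type and runs the sequence loop in its own body.

```rust
/// Hypothetical, simplified block. A real decoder parses this from the
/// bitstream; here literals and sequences arrive already decoded.
pub enum Block<'a> {
    Raw(&'a [u8]),
    Rle { byte: u8, len: usize },
    Compressed { literals: &'a [u8], sequences: &'a [Sequence] },
}

/// Hypothetical decoded sequence: copy `lit_len` literals, then copy
/// `match_len` bytes starting `offset` bytes back in the output.
#[derive(Clone, Copy)]
pub struct Sequence {
    pub lit_len: usize,
    pub offset: usize,
    pub match_len: usize,
}

/// Flat driver: dispatches on block type and, for compressed blocks, runs the
/// sequence loop right here. Writes at `dst[pos..]` and returns the number of
/// bytes written, or `None` on malformed input or insufficient space.
pub fn decompress_block_sketch(block: &Block, dst: &mut [u8], pos: usize) -> Option<usize> {
    let mut out = pos;
    match *block {
        Block::Raw(src) => {
            dst.get_mut(out..out.checked_add(src.len())?)?.copy_from_slice(src);
            out += src.len();
        }
        Block::Rle { byte, len } => {
            dst.get_mut(out..out.checked_add(len)?)?.fill(byte);
            out += len;
        }
        Block::Compressed { literals, sequences } => {
            let mut lit = 0usize;
            for seq in sequences {
                // 1. Copy this sequence's literals.
                let lits = literals.get(lit..lit.checked_add(seq.lit_len)?)?;
                dst.get_mut(out..out.checked_add(seq.lit_len)?)?.copy_from_slice(lits);
                lit += seq.lit_len;
                out += seq.lit_len;
                // 2. Copy the match from earlier output. The offset may reach
                //    before `pos`, into bytes written by previous blocks.
                if seq.offset == 0 || seq.offset > out {
                    return None;
                }
                if out.checked_add(seq.match_len)? > dst.len() {
                    return None;
                }
                // Forward byte copy, so overlapping matches repeat correctly.
                for i in 0..seq.match_len {
                    dst[out + i] = dst[out + i - seq.offset];
                }
                out += seq.match_len;
            }
            // 3. Trailing literals after the last sequence.
            let rest = literals.get(lit..)?;
            dst.get_mut(out..out.checked_add(rest.len())?)?.copy_from_slice(rest);
            out += rest.len();
        }
    }
    Some(out - pos)
}
```

The bounds checks stand in for the guarantees the trusted path would rely on instead; a donor-shape unsafe version would drop them and use the overshooting SIMD copies already listed in the call chain.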
Acceptance criteria
- block_decoder_donor.rs with the collapsed body.
- FrameDecoder::decode_to_slice_trusted dispatches to it on the UserSliceBackend + non-dict path.
- [...] cargo asm / disasm).
- decompress/level_-1_fast/{decodecorpus-z000033/c_stream, low-entropy-1m/rust_stream}/matrix/pure_rust_direct and small-10k-random/level_-6_fast/c_stream: ≥ 5% time reduction averaged, no regression on any single fixture.
Part of #247.