perf(decode): SIMD-16 fast path for short offsets {1, 2, 4} #213
Replaces the 8-byte phase-pattern inner loop with 16-byte SIMD stores
for all short offsets (1..7).
For offset ∈ {1, 2, 4} the repeating period divides 16, so a single
pre-built 16-byte chunk feeds the entire loop with zero phase tracking
(inner loop = one 16-byte SIMD store + one add).
For offset ∈ {3, 5, 6, 7} the period does not divide 16, so each
16-byte window starts at a different sub-position. Pre-build all
`offset` phase-shifted 16-byte windows (LCM(offset, 16) bytes worth =
48/80/48/112 B steady-state period); the cursor advances the phase
by `16 % offset` per iteration. Inner loop = one 16-byte SIMD store
+ small modulo, same 16-byte width as the divides-16 fast path.
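The phase-window scheme described above can be sketched roughly as follows. This is an illustrative model only, not the code in `decode_buffer.rs`: the function name `repeat_phase_windows` is hypothetical, it writes into a `Vec<u8>` instead of the decode buffer, and `extend_from_slice` on a 16-byte array stands in for the SIMD store.

```rust
/// Hypothetical sketch: expand `base` (the `offset` bytes being repeated)
/// into `len` output bytes using pre-built 16-byte phase windows.
fn repeat_phase_windows(base: &[u8], len: usize) -> Vec<u8> {
    let offset = base.len();
    assert!((1..=7).contains(&offset), "short offsets only");

    // Setup: one window per phase, windows[p][i] = base[(p + i) % offset].
    // The table is 7 x 16 = 112 bytes; only the first `offset` rows are used.
    let mut windows = [[0u8; 16]; 7];
    for (p, window) in windows.iter_mut().enumerate().take(offset) {
        for (i, b) in window.iter_mut().enumerate() {
            *b = base[(p + i) % offset];
        }
    }

    // Each 16-byte store moves the position within the period by 16 % offset.
    let step = 16 % offset;
    let mut phase = 0;
    let mut out = Vec::with_capacity(len);
    while out.len() + 16 <= len {
        out.extend_from_slice(&windows[phase]); // stands in for one SIMD store
        phase = (phase + step) % offset;
    }

    // Tail: fewer than 16 bytes remain, taken from the current phase window.
    let rem = len - out.len();
    out.extend_from_slice(&windows[phase][..rem]);
    out
}
```

For offset 3 the phase sequence is 0, 1, 2, 0, ..., so the pattern of windows repeats every 3 × 16 = 48 bytes, which matches the LCM(3, 16) figure quoted above.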
Stack budget: 7 × 16 = 112 B phase-pattern table (vs 7 × 8 = 56 B
previously). Inner-loop store width doubles from 8 → 16 bytes.
Adds a regression test that compares `DecodeBuffer::repeat` output
against the canonical `output[i] = base[i % offset]` reference for
every offset 1..=7 across 25 match-lengths covering both the
chunk-aligned cases (16/32/48/64/128) and the tail-path remainders.
…edback
A first attempt extended the SIMD-16 path to all short offsets via 16-byte
phase-shifted windows (LCM(offset, 16) = 48/80/48/112 B steady-state). On
Intel i9-9900K the doubled inner-loop store width was offset by the larger
7x16 = 112 B phase-pattern setup cost (2x the 8-byte setup), so the
16-byte version regressed every measured scenario on
`decodecorpus-z000033` except `level_1_fast` (where it broke even).
Root cause: real short-offset matches are short enough that setup
dominates total cost; doubling setup for marginal inner-loop wins is a
net loss on realistic input.
Keep the SIMD-16 fast path only where the period divides 16
(offset in {1, 2, 4}): there setup is trivially small (one 16-byte
chunk) and the inner-loop store is genuinely 16 bytes at zero overhead.
Restore the 8-byte phase-pattern path (7x8 = 56 B setup) for
offset in {3, 5, 6, 7} which proved the fastest measured option on
this workload.
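As a rough illustration of the resulting split, the path selection reduces to a divisibility check on the offset. The enum and function names below are hypothetical and are not taken from the crate; the setup sizes are the figures quoted in this message.

```rust
/// Hypothetical labels for the two short-offset paths.
#[derive(Debug, PartialEq, Eq)]
enum ShortOffsetPath {
    /// One pre-built 16-byte chunk, no phase tracking.
    Simd16,
    /// Existing 8-byte phase-pattern table.
    Phase8,
}

/// Pick the path for a short offset (1..=7): the 16-byte path applies
/// only when the repeating period divides 16, i.e. offset in {1, 2, 4}.
fn pick_path(offset: usize) -> ShortOffsetPath {
    assert!((1..=7).contains(&offset), "short offsets only");
    if 16 % offset == 0 {
        ShortOffsetPath::Simd16
    } else {
        ShortOffsetPath::Phase8
    }
}

/// Setup bytes built per call, using the sizes stated in the text:
/// one 16-byte chunk versus a 7 x 8 = 56-byte phase-pattern table.
fn setup_bytes(offset: usize) -> usize {
    match pick_path(offset) {
        ShortOffsetPath::Simd16 => 16,
        ShortOffsetPath::Phase8 => 7 * 8,
    }
}
```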
Pull request overview
Optimizes the decoder’s short-offset overlapping match expansion by adding a specialized 16-byte chunk emit path for offsets whose repeating period divides 16 (offset ∈ {1, 2, 4}), while keeping the existing 8-byte phase-pattern logic for other short offsets (3, 5, 6, 7). This targets decompression hot-path performance without changing public APIs.
Changes:
- Add a SIMD-friendly 16-byte repeat loop for offsets 1/2/4 using a single prebuilt 16-byte pattern chunk.
- Retain and document the existing 8-byte phase-pattern path for offsets 3/5/6/7.
- Add a regression test that validates short-offset repeat output against a canonical reference for offsets 1..=7 across multiple match lengths.
Summary
Replaces the 8-byte phase-pattern inner loop in `repeat_short_offset`
with a 16-byte SIMD store for the three short offsets whose repeating
period divides 16: `offset ∈ {1, 2, 4}` (RLE byte, 2-byte alternation,
4-byte aligned-word repeat). Offsets {3, 5, 6, 7} stay on the existing
8-byte phase-pattern path.
For `offset ∈ {1, 2, 4}` the period divides 16, so a single pre-built
16-byte chunk feeds the entire loop with zero phase tracking. Inner
loop = one 16-byte SIMD store + one add.
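A minimal sketch of that fast path, under stated assumptions: the function name `repeat_period_divides_16` is hypothetical, the output goes to a `Vec<u8>` instead of the real decode buffer, and `extend_from_slice` on a 16-byte array stands in for the SIMD store. It is meant to show why no phase tracking is needed, not to mirror the actual implementation.

```rust
/// Hypothetical sketch: expand `base` (the `offset` bytes being repeated)
/// into `len` output bytes for offsets whose period divides 16.
fn repeat_period_divides_16(base: &[u8], len: usize) -> Vec<u8> {
    let offset = base.len();
    assert!(matches!(offset, 1 | 2 | 4), "period must divide 16");

    // Setup: a single 16-byte chunk holding the pattern 16 / offset times.
    let mut chunk = [0u8; 16];
    for (i, b) in chunk.iter_mut().enumerate() {
        *b = base[i % offset];
    }

    // Inner loop: because 16 % offset == 0, every chunk starts at phase 0,
    // so the same chunk is stored each time and only the cursor advances.
    let mut out = Vec::with_capacity(len);
    while out.len() + 16 <= len {
        out.extend_from_slice(&chunk); // stands in for one 16-byte SIMD store
    }

    // Tail: fewer than 16 bytes remain, taken from the front of the chunk.
    let rem = len - out.len();
    out.extend_from_slice(&chunk[..rem]);
    out
}
```

For example, a 2-byte base `[1, 2]` expanded to 5 bytes gives `[1, 2, 1, 2, 1]`, all of it served by the tail copy since no full 16-byte store fits.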
Why the SIMD-16 path is restricted to `offset ∈ {1, 2, 4}`
A first attempt extended the 16-byte path to all short offsets via
phase-shifted 16-byte windows for offsets {3, 5, 6, 7} (LCM(offset, 16)
= 48 / 80 / 48 / 112 byte steady-state period). On
`decodecorpus-z000033` the doubled inner-loop store width was offset by
a 7×16 = 112 B phase-pattern setup cost (2× the existing 8-byte setup),
which regressed every measured scenario except `level_1_fast` (where it
broke even).
Real short-offset matches on this workload are short enough that setup
dominates total per-call cost; doubling setup for marginal inner-loop
wins is a net loss. The 8-byte phase-pattern path (7×8 = 56 B setup)
turned out to be the fastest measured option for {3, 5, 6, 7} on
realistic input. The 16-byte fast path is therefore kept only where
setup is trivially small (one 16-byte chunk), i.e. for offsets whose
period divides 16.
Benchmarks
Intel i9-9900K (BMI2 + AVX2, Fedora 44, Linux 7.0):
Direction: branch faster than main where Δ is negative.
Scenarios (columns: main ms, branch ms):
- decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust
- decompress/level_1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust
- decompress/level_3_dfast/decodecorpus-z000033/c_stream/matrix/pure_rust
- decompress/level_4_greedy/decodecorpus-z000033/c_stream/matrix/pure_rust
All four scenarios statistically significant (p < 0.05). Largest win on
`level_1_fast` where the fast strategy emits the highest fraction of
short-offset literal repeats.
Tests
`cargo clippy --lib --tests -- -D warnings` clean.
`repeat_short_offset_matches_canonical_for_all_offsets_and_lengths` compares
`DecodeBuffer::repeat` output against the canonical
`output[i] = base[i % offset]` reference for every offset 1..=7
across 25 match-lengths covering both the chunk-aligned cases
(16/32/48/64/128) and the tail-path remainders (1, 2, 3, 5, 9, 15,
17, 23, 25, 31, 33, 47, 49, 127, 4096).
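The shape of such a check might look like the sketch below. It does not call `DecodeBuffer::repeat`, whose signature is not shown here; a byte-at-a-time overlapping copy (`overlapping_copy`, a hypothetical stand-in) plays the role of the decoder under test, and the lengths are the ones listed above.

```rust
/// Canonical reference quoted above: output[i] = base[i % offset].
fn canonical_repeat(base: &[u8], len: usize) -> Vec<u8> {
    let offset = base.len();
    (0..len).map(|i| base[i % offset]).collect()
}

/// Hypothetical stand-in for the decoder under test: appends `len` bytes,
/// each read `offset` bytes behind the current end of `history`.
fn overlapping_copy(history: &mut Vec<u8>, offset: usize, len: usize) {
    for _ in 0..len {
        let b = history[history.len() - offset];
        history.push(b);
    }
}

/// Match lengths listed above: chunk-aligned cases, then tail remainders.
const LENGTHS: [usize; 20] = [
    16, 32, 48, 64, 128,
    1, 2, 3, 5, 9, 15, 17, 23, 25, 31, 33, 47, 49, 127, 4096,
];

/// Returns true when the stand-in agrees with the reference for every
/// offset 1..=7 and every listed length.
fn all_cases_match() -> bool {
    (1..=7usize).all(|offset| {
        // Distinct byte values make a wrong phase visible in the output.
        let base: Vec<u8> = (1..=offset as u8).collect();
        LENGTHS.iter().all(|&len| {
            let mut history = base.clone();
            overlapping_copy(&mut history, offset, len);
            &history[offset..] == canonical_repeat(&base, len).as_slice()
        })
    })
}
```

Using distinct bytes in `base` matters: with a constant base, a phase error in a chunked implementation would produce the same output as a correct one.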
Files
`zstd/src/decoding/decode_buffer.rs`: `repeat_short_offset` SIMD-16
fast path for offset ∈ {1, 2, 4}; regression test added.
Closes #209.