
perf(decode): SIMD-16 fast path for short offsets {1, 2, 4} #213

Merged
polaz merged 2 commits into main from perf/#209-simd16-short-offset
May 20, 2026

Conversation


@polaz polaz commented May 20, 2026

Summary

Replaces the 8-byte phase-pattern inner loop in repeat_short_offset
with a 16-byte SIMD store for the three short offsets whose repeating
period divides 16 — offset ∈ {1, 2, 4} (RLE byte, 2-byte alternation,
4-byte aligned-word repeat). Offsets {3, 5, 6, 7} stay on the existing
8-byte phase-pattern path.

For offset ∈ {1, 2, 4} the period divides 16, so a single pre-built
16-byte chunk feeds the entire loop with zero phase tracking. Inner
loop = one 16-byte SIMD store + one add.
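The divides-16 case can be sketched in safe Rust as follows. This is an illustration only, not the code in decode_buffer.rs: `build_chunk16` and `repeat_divides16` are hypothetical names, the output buffer is modeled as a plain `Vec<u8>`, and whether each 16-byte copy lowers to a single SIMD store depends on the compiler and target.

```rust
/// Tile the `offset`-byte seed into one 16-byte chunk.
/// Hypothetical helper; assumes the seed length is 1, 2, or 4,
/// so that the period divides 16.
fn build_chunk16(seed: &[u8]) -> [u8; 16] {
    let offset = seed.len();
    assert!(matches!(offset, 1 | 2 | 4), "period must divide 16");
    let mut chunk = [0u8; 16];
    for (i, b) in chunk.iter_mut().enumerate() {
        *b = seed[i % offset];
    }
    chunk
}

/// Append `len` bytes that repeat the last `offset` bytes of `out`.
/// Hypothetical stand-in for the fast path described above.
fn repeat_divides16(out: &mut Vec<u8>, offset: usize, len: usize) {
    assert!(offset <= out.len());
    let start = out.len() - offset;
    let chunk = build_chunk16(&out[start..]);
    let mut remaining = len;
    // 16 % offset == 0, so every chunk starts at phase 0:
    // the same chunk is valid for every iteration.
    while remaining >= 16 {
        out.extend_from_slice(&chunk);
        remaining -= 16;
    }
    // Tail: a prefix of the chunk is still phase-correct.
    out.extend_from_slice(&chunk[..remaining]);
}
```

Calling `repeat_divides16(&mut out, 2, 19)` on a buffer ending in two bytes A, B appends one full chunk plus a 3-byte tail, continuing the A, B alternation.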

Why the SIMD-16 path is restricted to offset ∈ {1, 2, 4}

A first attempt extended the 16-byte path to all short offsets via
phase-shifted 16-byte windows for offsets {3, 5, 6, 7} (LCM(offset, 16)
= 48 / 80 / 48 / 112 byte steady-state period). On
decodecorpus-z000033 the gain from the doubled inner-loop store width was
outweighed by the 7×16 = 112 B phase-pattern setup cost (2× the 56 B of
the existing 8-byte setup), so the attempt regressed every measured
scenario except level_1_fast (where it broke even).
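The period and setup figures quoted above can be checked with a few lines of arithmetic. The helper names below (`gcd`, `lcm`, `phase_table_bytes`) are illustrative and do not come from the PR.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Least common multiple: the number of bytes after which a pattern of
/// period `offset`, written in `window`-byte stores, returns to phase 0.
fn lcm(offset: usize, window: usize) -> usize {
    offset / gcd(offset, window) * window
}

/// Size of a table holding one pre-built window per phase,
/// sized for the largest short offset.
fn phase_table_bytes(max_offset: usize, window: usize) -> usize {
    max_offset * window
}
```

For a 16-byte window, `lcm` gives 48, 80, 48, and 112 for offsets 3, 5, 6, and 7, and `phase_table_bytes(7, 16)` is 112 against `phase_table_bytes(7, 8)` = 56.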

Real short-offset matches on this workload are short enough that setup
dominates total per-call cost; doubling setup for marginal inner-loop
wins is a net loss. The 8-byte phase-pattern path (7×8 = 56 B setup)
turned out to be the fastest measured option for {3, 5, 6, 7} on
realistic input. The 16-byte fast path is therefore kept only where
setup is trivially small (one 16-byte chunk), i.e. for offsets whose
period divides 16.
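The resulting split between the two paths reduces to a divisibility test. A minimal sketch, assuming a hypothetical `ShortOffsetPath` enum and `select_path` function that are not part of the PR:

```rust
/// Which inner loop handles a given short offset (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShortOffsetPath {
    /// One pre-built 16-byte chunk, no phase tracking.
    Simd16,
    /// Existing 8-byte phase-pattern loop.
    Phase8,
}

/// Pick the path for a short offset in 1..=7.
fn select_path(offset: usize) -> ShortOffsetPath {
    assert!((1..=7).contains(&offset), "short offsets are 1..=7");
    if 16 % offset == 0 {
        ShortOffsetPath::Simd16
    } else {
        ShortOffsetPath::Phase8
    }
}
```

Within 1..=7 the test `16 % offset == 0` holds exactly for 1, 2, and 4.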

Benchmarks

Intel i9-9900K (BMI2 + AVX2, Fedora 44, Linux 7.0):

Direction: Δ is the relative change in time from main to branch; negative means the branch is faster.

| Scenario | main (ms) | branch (ms) | Δ | p |
|---|---|---|---|---|
| decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust | 2.373 | 2.360 | −0.55 % (faster) | 0.00 |
| decompress/level_1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust | 6.965 | 6.816 | −2.14 % (faster) | 0.00 |
| decompress/level_3_dfast/decodecorpus-z000033/c_stream/matrix/pure_rust | 6.159 | 6.096 | −1.02 % (faster) | 0.00 |
| decompress/level_4_greedy/decodecorpus-z000033/c_stream/matrix/pure_rust | 6.127 | 6.064 | −1.03 % (faster) | 0.00 |

All four differences are statistically significant (p < 0.05). The largest
win is on level_1_fast, where the fast strategy emits the highest fraction
of short-offset literal repeats.
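The percentage column is consistent with a plain relative difference against main. The formula is not stated in the PR, so the sketch below assumes it; `relative_delta_percent` is a hypothetical name.

```rust
/// Relative change from `main_ms` to `branch_ms`, in percent.
/// Negative means the branch is faster. Assumed formula, inferred
/// from the table values rather than stated in the PR.
fn relative_delta_percent(main_ms: f64, branch_ms: f64) -> f64 {
    (branch_ms - main_ms) / main_ms * 100.0
}
```

Applied to the level_1_fast row (6.965 ms and 6.816 ms) this gives roughly −2.14 %.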

Tests

  • 539/539 nextest pass.
  • cargo clippy --lib --tests -- -D warnings clean.
  • New regression test
    repeat_short_offset_matches_canonical_for_all_offsets_and_lengths
    compares DecodeBuffer::repeat output against the canonical
    output[i] = base[i % offset] reference for every offset 1..=7
    across 25 match-lengths covering both the chunk-aligned cases
    (16/32/48/64/128) and the tail-path remainders (1, 2, 3, 5, 9, 15,
    17, 23, 25, 31, 33, 47, 49, 127, 4096).
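The shape of such a comparison can be sketched as below. This is not the test from the PR: `canonical_repeat` implements the output[i] = base[i % offset] reference named above, while `naive_overlapping_copy` is a hypothetical byte-by-byte stand-in for the implementation under test (the real test calls DecodeBuffer::repeat, whose signature is not shown here).

```rust
/// Canonical reference: output[i] = base[i % offset],
/// where `base` is the last `offset` bytes of history.
fn canonical_repeat(base: &[u8], len: usize) -> Vec<u8> {
    (0..len).map(|i| base[i % base.len()]).collect()
}

/// Hypothetical stand-in for the code under test: an overlapping copy
/// done one byte at a time, so freshly written bytes become sources.
fn naive_overlapping_copy(history: &[u8], offset: usize, len: usize) -> Vec<u8> {
    assert!(offset >= 1 && offset <= history.len());
    let mut buf = history.to_vec();
    let start = buf.len() - offset;
    for i in 0..len {
        let b = buf[start + i];
        buf.push(b);
    }
    // Return only the newly produced bytes.
    buf.split_off(history.len())
}
```

A test in this style loops over every offset in 1..=7 and every chosen match length, asserting that both functions return identical bytes.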

Files

  • zstd/src/decoding/decode_buffer.rs: repeat_short_offset SIMD-16
    fast path for offset ∈ {1, 2, 4}; regression test added.

Closes #209.

Summary by CodeRabbit

  • Bug Fixes

    • Improved decompression robustness through optimized buffer handling.
  • Tests

    • Added regression test suite validating decompression correctness across all supported data patterns and edge cases.


polaz added 2 commits May 20, 2026 16:07
Replaces the 8-byte phase-pattern inner loop with 16-byte SIMD stores
for all short offsets (1..7).

For offset ∈ {1, 2, 4} the repeating period divides 16, so a single
pre-built 16-byte chunk feeds the entire loop with zero phase tracking
(inner loop = one 16-byte SIMD store + one add).

For offset ∈ {3, 5, 6, 7} the period does not divide 16, so each
16-byte window starts at a different sub-position. Pre-build all
`offset` phase-shifted 16-byte windows (LCM(offset, 16) bytes worth =
48/80/48/112 B steady-state period); the cursor advances the phase
by `16 % offset` per iteration. Inner loop = one 16-byte SIMD store
+ small modulo, same 16-byte width as the divides-16 fast path.

Stack budget: 7 × 16 = 112 B phase-pattern table (vs 7 × 8 = 56 B
previously). Inner-loop store width doubles from 8 → 16 bytes.

Adds a regression test that compares `DecodeBuffer::repeat` output
against the canonical `output[i] = base[i % offset]` reference for
every offset 1..=7 across 25 match-lengths covering both the
chunk-aligned cases (16/32/48/64/128) and the tail-path remainders.
…edback

A first attempt extended the SIMD-16 path to all short offsets via 16-byte
phase-shifted windows (LCM(offset, 16) = 48/80/48/112 B steady-state). On
Intel i9-9900K the doubled inner-loop store width was offset by the larger
7x16 = 112 B phase-pattern setup cost (2x the 8-byte setup), so the
16-byte version regressed every measured scenario on
`decodecorpus-z000033` except `level_1_fast` (where it broke even).

Root cause: real short-offset matches are short enough that setup
dominates total cost; doubling setup for marginal inner-loop wins is a
net loss on realistic input.

Keep the SIMD-16 fast path only where the period divides 16
(offset in {1, 2, 4}): there setup is trivially small (one 16-byte
chunk) and the inner-loop store is genuinely 16 bytes at zero overhead.
Restore the 8-byte phase-pattern path (7x8 = 56 B setup) for
offset in {3, 5, 6, 7} which proved the fastest measured option on
this workload.
Copilot AI review requested due to automatic review settings May 20, 2026 13:15

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 30ae6a2d-e25c-4bad-a083-503380df85b3

📥 Commits

Reviewing files that changed from the base of the PR and between d68ce5b and 8dc383d.

📒 Files selected for processing (1)
  • zstd/src/decoding/decode_buffer.rs

Walkthrough

The pull request optimizes the decompression path for short-distance repeat matches by rewriting repeat_short_offset to emit 16-byte SIMD chunks for the most common offsets (1, 2, 4) while maintaining efficient 8-byte iteration for the remaining offsets (3, 5, 6, 7), with comprehensive regression test validation.

Changes

Short-offset repeat SIMD optimization

| Layer / File(s) | Summary |
|---|---|
| Optimized repeat_short_offset implementation (zstd/src/decoding/decode_buffer.rs) | Enforces offset <= 7 contract. Introduces a 16-byte fast path for offset ∈ {1,2,4} that builds the periodic pattern once and emits in 16-byte chunks with proper tail handling. Refactors offset ∈ {3,5,6,7} to use an 8-byte phase-pattern loop advancing by 8 % offset per iteration. |
| Regression test coverage (zstd/src/decoding/decode_buffer.rs) | Adds repeat_short_offset_matches_canonical_for_all_offsets_and_lengths unit test that exercises the repeat path for all offsets 1..=7 across a wide range of match lengths, validating decoded output matches the canonical base[i % offset] expansion and covering both SIMD-16 and phase-pattern behavior. |
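The 8-byte phase-pattern idea for offsets that do not divide the store width can be sketched as follows. This is a safe-Rust illustration under stated assumptions, not the code in decode_buffer.rs: `repeat_phase8` is a hypothetical name and the buffer is modeled as a `Vec<u8>`. The window table is 7 × 8 = 56 bytes, matching the setup size quoted in the PR description.

```rust
/// Append `len` bytes repeating the last `offset` bytes of `out`,
/// using pre-built 8-byte windows, one per phase (illustrative only).
fn repeat_phase8(out: &mut Vec<u8>, offset: usize, len: usize) {
    assert!((1..=7).contains(&offset) && offset <= out.len());
    let start = out.len() - offset;
    let mut seed = [0u8; 7];
    seed[..offset].copy_from_slice(&out[start..]);

    // windows[p][i] = seed[(p + i) % offset]: the 8 bytes that follow
    // when the pattern is currently at phase p.
    let mut windows = [[0u8; 8]; 7];
    for (p, window) in windows.iter_mut().enumerate().take(offset) {
        for (i, b) in window.iter_mut().enumerate() {
            *b = seed[(p + i) % offset];
        }
    }

    let mut phase = 0;
    let mut remaining = len;
    while remaining >= 8 {
        out.extend_from_slice(&windows[phase]);
        // Each 8-byte store moves the phase forward by 8 % offset.
        phase = (phase + 8) % offset;
        remaining -= 8;
    }
    // Tail: a prefix of the current phase's window.
    out.extend_from_slice(&windows[phase][..remaining]);
}
```

For offset 3 the phase sequence is 0, 2, 1, 0, and so on, so three distinct windows cover the whole loop.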

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • structured-world/structured-zstd#209: This PR implements the SIMD-16 fast path optimization for short offsets {1, 2, 4} described in the issue acceptance criteria.
  • structured-world/structured-zstd#189: Both changes address the decode hot-path for short-offset repeat operations; this PR refines the <8 short-offset logic while the related issue proposes extending the <16 SIMD wildcopy pattern.

Possibly related PRs

  • structured-world/structured-zstd#42: Both PRs directly modify DecodeBuffer::repeat_short_offset for offsets 1..=7, with this PR rewriting the algorithm and test coverage compared to the earlier short-offset implementation.

Poem

🐰 A rabbit hops through repeat patterns bright,
Building 16-byte chunks with SIMD might,
While shorter roads take phase-tracked strides so fine,
Each offset dances in its own design!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main optimization: SIMD-16 fast path implementation for short offsets {1, 2, 4}, which is the core change in this PR. |
| Linked Issues check | ✅ Passed | The PR implementation meets all primary coding objectives from issue #209: SIMD-16 fast path for offsets {1,2,4}, preserved 8-byte phase-pattern for {3,5,6,7}, added regression tests, and delivered measurable performance improvements. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: modifications are limited to decode_buffer.rs for repeat_short_offset, with no changes to LCM-based chunks for {3,5,6,7} or offset=8/≥16 paths. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |



codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/decode_buffer.rs | 97.50% | 1 Missing ⚠️ |


Copilot AI left a comment


Pull request overview

Optimizes the decoder’s short-offset overlapping match expansion by adding a specialized 16-byte chunk emit path for offsets whose repeating period divides 16 (offset ∈ {1, 2, 4}), while keeping the existing 8-byte phase-pattern logic for other short offsets (3, 5, 6, 7). This targets decompression hot-path performance without changing public APIs.

Changes:

  • Add a SIMD-friendly 16-byte repeat loop for offsets 1/2/4 using a single prebuilt 16-byte pattern chunk.
  • Retain and document the existing 8-byte phase-pattern path for offsets 3/5/6/7.
  • Add a regression test that validates short-offset repeat output against a canonical reference for offsets 1..=7 across multiple match lengths.

@polaz polaz merged commit 355b22d into main May 20, 2026
26 checks passed
@polaz polaz deleted the perf/#209-simd16-short-offset branch May 20, 2026 13:21
2 participants