perf(decode): SIMD-16 fast path for short offsets {1, 2, 4} by polaz · Pull Request #213 · structured-world/structured-zstd

polaz · 2026-05-20T13:15:34Z

Summary

Replaces the 8-byte phase-pattern inner loop in repeat_short_offset
with a 16-byte SIMD store for the three short offsets whose repeating
period divides 16 — offset ∈ {1, 2, 4} (RLE byte, 2-byte alternation,
4-byte aligned-word repeat). Offsets {3, 5, 6, 7} stay on the existing
8-byte phase-pattern path.

For offset ∈ {1, 2, 4} the period divides 16, so a single pre-built
16-byte chunk feeds the entire loop with zero phase tracking. Inner
loop = one 16-byte SIMD store + one add.
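
The divides-16 case can be illustrated with a minimal sketch. It assumes a plain `Vec<u8>` output instead of the crate's `DecodeBuffer`, and relies on the compiler turning the fixed 16-byte copy into a vector store rather than using explicit intrinsics. The name `repeat_divides_16` is hypothetical and is not taken from the PR.

```rust
/// Hypothetical helper (not the crate's API): appends `len` bytes that repeat
/// the last `offset` bytes of `out`, for offsets whose period divides 16.
fn repeat_divides_16(out: &mut Vec<u8>, offset: usize, len: usize) {
    assert!(matches!(offset, 1 | 2 | 4), "period must divide 16");
    assert!(out.len() >= offset, "need at least `offset` bytes of history");

    // Build the 16-byte chunk once: chunk[i] = base[i % offset].
    let base_start = out.len() - offset;
    let mut chunk = [0u8; 16];
    for (i, slot) in chunk.iter_mut().enumerate() {
        *slot = out[base_start + i % offset];
    }

    // Because 16 % offset == 0, every 16-byte window starts at phase 0,
    // so the same chunk is appended unchanged on each iteration.
    let mut remaining = len;
    while remaining >= 16 {
        out.extend_from_slice(&chunk);
        remaining -= 16;
    }
    // Tail: a prefix of the chunk is still phase-correct.
    out.extend_from_slice(&chunk[..remaining]);
}
```

With history `[9, 1, 2]`, `offset = 2` and `len = 5`, the sketch appends `1, 2, 1, 2, 1`. No phase variable is needed because the chunk length is a multiple of the period.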

Why the SIMD-16 path is restricted to `offset ∈ {1, 2, 4}`

A first attempt extended the 16-byte path to all short offsets via
phase-shifted 16-byte windows for offsets {3, 5, 6, 7} (LCM(offset, 16)
= 48 / 80 / 48 / 112 byte steady-state period, respectively). On
decodecorpus-z000033 the gain from the doubled inner-loop store width was
outweighed by the 7×16 = 112 B phase-pattern setup cost (2× the existing
8-byte setup), which regressed every measured scenario except level_1_fast
(where it broke even).

Real short-offset matches on this workload are short enough that setup
dominates total per-call cost; doubling setup for marginal inner-loop
wins is a net loss. The 8-byte phase-pattern path (7×8 = 56 B setup)
turned out to be the fastest measured option for {3, 5, 6, 7} on
realistic input. The 16-byte fast path is therefore kept only where
setup is trivially small (one 16-byte chunk), i.e. for offsets whose
period divides 16.
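
The period and setup figures quoted above follow from simple arithmetic, sketched here. The helper names (`gcd`, `lcm`, `steady_state_period`, `phase_table_bytes`) are illustrative and not part of the crate; the table size assumes one window per possible phase, up to 7 phases, as the description states.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Least common multiple.
fn lcm(a: usize, b: usize) -> usize {
    a / gcd(a, b) * b
}

/// Bytes emitted before a store loop of `width` bytes returns to phase 0
/// for a pattern that repeats every `offset` bytes.
fn steady_state_period(offset: usize, width: usize) -> usize {
    lcm(offset, width)
}

/// Size of a phase-pattern table holding up to 7 windows of `width` bytes.
fn phase_table_bytes(width: usize) -> usize {
    7 * width
}
```

For `width = 16` this gives 48, 80, 48 and 112 bytes for offsets 3, 5, 6 and 7, and exactly 16 bytes for offsets 1, 2 and 4, which is why a single chunk suffices there. The table grows from 56 B at `width = 8` to 112 B at `width = 16`.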

Benchmarks

Intel i9-9900K (BMI2 + AVX2, Fedora 44, Linux 7.0):

Direction: branch faster than main where Δ is negative.

| Scenario | `main` ms | `branch` ms | Δ vs `main` |
|---|---|---|---|
| `decompress/level_-1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust` | 2.373 | 2.360 | −0.55 % (faster) |
| `decompress/level_1_fast/decodecorpus-z000033/c_stream/matrix/pure_rust` | 6.965 | 6.816 | −2.14 % (faster) |
| `decompress/level_3_dfast/decodecorpus-z000033/c_stream/matrix/pure_rust` | 6.159 | 6.096 | −1.02 % (faster) |
| `decompress/level_4_greedy/decodecorpus-z000033/c_stream/matrix/pure_rust` | 6.127 | 6.064 | −1.03 % (faster) |

All four scenarios are statistically significant (p < 0.05). The largest
win is on level_1_fast, where the fast strategy emits the highest fraction
of short-offset literal repeats.
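
The percentages in the table are consistent with a relative change computed against `main`. The formula below is an assumption that reproduces the reported figures to two decimals; it is not taken from the benchmark harness, and `delta_percent` is a hypothetical name.

```rust
/// Relative change of `branch_ms` against `main_ms`, in percent.
/// Negative values mean the branch is faster.
fn delta_percent(main_ms: f64, branch_ms: f64) -> f64 {
    (branch_ms - main_ms) / main_ms * 100.0
}
```

For the level_1_fast row, `delta_percent(6.965, 6.816)` evaluates to roughly −2.14.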

Tests

- 539/539 nextest pass.
- `cargo clippy --lib --tests -- -D warnings` clean.
- New regression test
  `repeat_short_offset_matches_canonical_for_all_offsets_and_lengths`
  compares `DecodeBuffer::repeat` output against the canonical
  `output[i] = base[i % offset]` reference for every offset 1..=7
  across 25 match-lengths covering both the chunk-aligned cases
  (16/32/48/64/128) and the tail-path remainders (1, 2, 3, 5, 9, 15,
  17, 23, 25, 31, 33, 47, 49, 127, 4096).
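
The shape of such a check can be sketched as follows. This is not the test from the PR: `naive_overlapping_repeat` is a hypothetical byte-by-byte stand-in for `DecodeBuffer::repeat`, whose signature is not shown in this description, and the length list contains only the values enumerated above.

```rust
/// Canonical reference: output[i] = base[i % offset], with offset = base.len().
fn canonical_repeat(base: &[u8], len: usize) -> Vec<u8> {
    (0..len).map(|i| base[i % base.len()]).collect()
}

/// Hypothetical stand-in for the code under test: a byte-by-byte overlapping
/// copy from `offset` bytes back. Returns only the newly produced bytes.
fn naive_overlapping_repeat(history: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = history.to_vec();
    for _ in 0..len {
        let byte = out[out.len() - offset];
        out.push(byte);
    }
    out.split_off(history.len())
}

/// Runs every offset 1..=7 against the enumerated match lengths.
fn all_cases_match() -> bool {
    const LENGTHS: [usize; 20] = [
        16, 32, 48, 64, 128, // chunk-aligned cases
        1, 2, 3, 5, 9, 15, 17, 23, 25, 31, 33, 47, 49, 127, 4096, // tail remainders
    ];
    let history: Vec<u8> = (1..=32u8).collect();
    (1..=7usize).all(|offset| {
        let base = &history[history.len() - offset..];
        LENGTHS.iter().all(|&len| {
            naive_overlapping_repeat(&history, offset, len) == canonical_repeat(base, len)
        })
    })
}
```

Swapping the stand-in for the real repeat routine would turn this into a differential test of the optimized paths against the canonical expansion.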

Files

zstd/src/decoding/decode_buffer.rs — repeat_short_offset SIMD-16
fast path for offset ∈ {1, 2, 4}; regression test added.

Closes #209.

Summary by CodeRabbit

Bug Fixes
- Improved decompression robustness through optimized buffer handling.
Tests
- Added regression test suite validating decompression correctness across all supported data patterns and edge cases.

Replaces the 8-byte phase-pattern inner loop with 16-byte SIMD stores for all short offsets (1..7).

For offset ∈ {1, 2, 4} the repeating period divides 16, so a single pre-built 16-byte chunk feeds the entire loop with zero phase tracking (inner loop = one 16-byte SIMD store + one add).

For offset ∈ {3, 5, 6, 7} the period does not divide 16, so each 16-byte window starts at a different sub-position. Pre-build all `offset` phase-shifted 16-byte windows (LCM(offset, 16) bytes worth = 48/80/48/112 B steady-state period); the cursor advances the phase by `16 % offset` per iteration. Inner loop = one 16-byte SIMD store + small modulo, same 16-byte width as the divides-16 fast path.

Stack budget: 7 × 16 = 112 B phase-pattern table (vs 7 × 8 = 56 B previously). Inner-loop store width doubles from 8 → 16 bytes.

Adds a regression test that compares `DecodeBuffer::repeat` output against the canonical `output[i] = base[i % offset]` reference for every offset 1..=7 across 25 match-lengths covering both the chunk-aligned cases (16/32/48/64/128) and the tail-path remainders.

…edback

A first attempt extended the SIMD-16 path to all short offsets via 16-byte phase-shifted windows (LCM(offset, 16) = 48/80/48/112 B steady-state). On Intel i9-9900K the doubled inner-loop store width was offset by the larger 7×16 = 112 B phase-pattern setup cost (2× the 8-byte setup), so the 16-byte version regressed every measured scenario on `decodecorpus-z000033` except `level_1_fast` (where it broke even).

Root cause: real short-offset matches are short enough that setup dominates total cost; doubling setup for marginal inner-loop wins is a net loss on realistic input.

Keep the SIMD-16 fast path only where the period divides 16 (offset in {1, 2, 4}): there setup is trivially small (one 16-byte chunk) and the inner-loop store is genuinely 16 bytes at zero overhead. Restore the 8-byte phase-pattern path (7×8 = 56 B setup) for offset in {3, 5, 6, 7}, which proved the fastest measured option on this workload.

coderabbitai · 2026-05-20T13:15:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 30ae6a2d-e25c-4bad-a083-503380df85b3

📥 Commits

Reviewing files that changed from the base of the PR and between d68ce5b and 8dc383d.

📒 Files selected for processing (1)

zstd/src/decoding/decode_buffer.rs

Walkthrough

The pull request optimizes the decompression path for short-distance repeat matches by rewriting repeat_short_offset to emit 16-byte SIMD chunks for the most common offsets (1, 2, 4) while maintaining efficient 8-byte iteration for the remaining offsets (3, 5, 6, 7), with comprehensive regression test validation.

Changes

Short-offset repeat SIMD optimization

| Layer / File(s) | Summary |
|---|---|
| Optimized repeat_short_offset implementation `zstd/src/decoding/decode_buffer.rs` | Enforces `offset <= 7` contract. Introduces a 16-byte fast path for `offset ∈ {1,2,4}` that builds the periodic pattern once and emits in 16-byte chunks with proper tail handling. Refactors `offset ∈ {3,5,6,7}` to use an 8-byte phase-pattern loop advancing by `8 % offset` per iteration. |
| Regression test coverage `zstd/src/decoding/decode_buffer.rs` | Adds `repeat_short_offset_matches_canonical_for_all_offsets_and_lengths` unit test that exercises the repeat path for all offsets `1..=7` across a wide range of match lengths, validating decoded output matches the canonical `base[i % offset]` expansion and covering both SIMD-16 and phase-pattern behavior. |
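
The 8-byte phase-pattern loop summarized in the table can be sketched roughly as below. It assumes a `Vec<u8>` output and a table of up to 7 pre-built 8-byte windows (7 × 8 = 56 B); the function name and layout are hypothetical and may differ from `decode_buffer.rs`.

```rust
/// Hypothetical sketch: appends `len` bytes repeating the last `offset`
/// bytes of `out`, emitting 8 bytes per iteration with a tracked phase.
fn repeat_phase_pattern_8(out: &mut Vec<u8>, offset: usize, len: usize) {
    assert!((1..=7).contains(&offset), "short-offset contract: offset <= 7");
    assert!(out.len() >= offset, "need at least `offset` bytes of history");

    // patterns[p][i] = base[(p + i) % offset]; only the first `offset`
    // rows are filled and used.
    let base_start = out.len() - offset;
    let mut patterns = [[0u8; 8]; 7];
    for (p, row) in patterns.iter_mut().enumerate().take(offset) {
        for (i, slot) in row.iter_mut().enumerate() {
            *slot = out[base_start + (p + i) % offset];
        }
    }

    // Each 8-byte store moves the position within the period by 8 % offset.
    let step = 8 % offset;
    let mut phase = 0;
    let mut remaining = len;
    while remaining >= 8 {
        out.extend_from_slice(&patterns[phase]);
        phase += step;
        if phase >= offset {
            phase -= offset; // phase and step are both < offset, one wrap is enough
        }
        remaining -= 8;
    }
    // Tail: a prefix of the current phase's window finishes the match.
    out.extend_from_slice(&patterns[phase][..remaining]);
}
```

The conditional subtraction stands in for a modulo. For offsets that divide 8 the step is 0 and the phase never moves, which is the same property the 16-byte path exploits for offsets 1, 2 and 4.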

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

structured-world/structured-zstd#209: This PR implements the SIMD-16 fast path optimization for short offsets {1, 2, 4} described in the issue acceptance criteria.
structured-world/structured-zstd#189: Both changes address the decode hot-path for short-offset repeat operations; this PR refines the <8 short-offset logic while the related issue proposes extending the <16 SIMD wildcopy pattern.

Possibly related PRs

structured-world/structured-zstd#42: Both PRs directly modify DecodeBuffer::repeat_short_offset for offsets 1..=7, with this PR rewriting the algorithm and test coverage compared to the earlier short-offset implementation.

Poem

🐰 A rabbit hops through repeat patterns bright,
Building 16-byte chunks with SIMD might,
While shorter roads take phase-tracked strides so fine,
Each offset dances in its own design!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main optimization: SIMD-16 fast path implementation for short offsets {1, 2, 4}, which is the core change in this PR. |
| Linked Issues check | ✅ Passed | The PR implementation meets all primary coding objectives from issue `#209`: SIMD-16 fast path for offsets {1,2,4}, preserved 8-byte phase-pattern for {3,5,6,7}, added regression tests, and delivered measurable performance improvements. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: modifications are limited to decode_buffer.rs for repeat_short_offset, with no changes to LCM-based chunks for {3,5,6,7} or offset=8/≥16 paths. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


codecov · 2026-05-20T13:17:58Z

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/decode_buffer.rs | 97.50% | 1 Missing ⚠️ |


Copilot

Pull request overview

Optimizes the decoder’s short-offset overlapping match expansion by adding a specialized 16-byte chunk emit path for offsets whose repeating period divides 16 (offset ∈ {1, 2, 4}), while keeping the existing 8-byte phase-pattern logic for other short offsets (3, 5, 6, 7). This targets decompression hot-path performance without changing public APIs.

Changes:

Add a SIMD-friendly 16-byte repeat loop for offsets 1/2/4 using a single prebuilt 16-byte pattern chunk.
Retain and document the existing 8-byte phase-pattern path for offsets 3/5/6/7.
Add a regression test that validates short-offset repeat output against a canonical reference for offsets 1..=7 across multiple match lengths.

polaz added 2 commits May 20, 2026 16:07

Copilot AI review requested due to automatic review settings May 20, 2026 13:15

Copilot started reviewing on behalf of polaz May 20, 2026 13:16

Copilot AI reviewed May 20, 2026

polaz merged commit 355b22d into main May 20, 2026
26 checks passed

polaz deleted the perf/#209-simd16-short-offset branch May 20, 2026 13:21

sw-release-bot Bot mentioned this pull request May 20, 2026

chore: release v0.0.23 #203

Open

polaz mentioned this pull request May 20, 2026

perf(fse): elide bounds check on init_state + update_state decode reads #214

Merged
