perf(decoding): integrate AVX2 unroll-2 wildcopy candidate (#108) by polaz · Pull Request #223 · structured-world/structured-zstd

polaz · 2026-05-22T00:56:13Z

Summary

Replace single-vector AVX2 wildcopy inner loop with the 2×-unrolled candidate from PR #87's research (zstd/benches/wildcopy_candidates.rs:48).

What changed

copy_avx2 in zstd/src/decoding/simd_copy.rs:369 now issues two independent 32-byte AVX2 load/store pairs per iteration (64 B / iter) with a single-vector tail for the residual 32 bytes. Dispatcher contract is unchanged — copy_at_least is still rounded to a multiple of 32 before the call.
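For readers without the diff open, the loop shape described above can be sketched roughly as follows. This is an illustration only, not the code that landed: the names `copy_avx2_unroll2_sketch` and `copy_rounded` are hypothetical, the safe wrapper with a scalar fallback is an assumption made so the sketch runs on any host, and the real `copy_avx2` signature may differ.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};

/// Hypothetical sketch of a 2x-unrolled AVX2 copy (not the PR's actual code).
///
/// # Safety
/// AVX2 must be available; `src` and `dst` must be valid for `len` bytes,
/// must not overlap, and `len` must be a multiple of 32.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn copy_avx2_unroll2_sketch(src: *const u8, dst: *mut u8, len: usize) {
    debug_assert!(len % 32 == 0);
    // Largest multiple of 64 that does not exceed `len`.
    let end_unrolled = len & !63;
    let mut off = 0;
    while off < end_unrolled {
        unsafe {
            // Two independent load/store pairs: 64 bytes per iteration.
            let a = _mm256_loadu_si256(src.add(off).cast::<__m256i>());
            let b = _mm256_loadu_si256(src.add(off + 32).cast::<__m256i>());
            _mm256_storeu_si256(dst.add(off).cast::<__m256i>(), a);
            _mm256_storeu_si256(dst.add(off + 32).cast::<__m256i>(), b);
        }
        off += 64;
    }
    if off < len {
        unsafe {
            // Single-vector tail for the residual 32 bytes.
            let t = _mm256_loadu_si256(src.add(off).cast::<__m256i>());
            _mm256_storeu_si256(dst.add(off).cast::<__m256i>(), t);
        }
    }
}

/// Hypothetical safe wrapper: uses the AVX2 sketch when the CPU supports it,
/// otherwise falls back to a plain slice copy.
fn copy_rounded(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    assert!(src.len() % 32 == 0, "length must be a multiple of 32");
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected, the slices have equal length, and
            // `&[u8]` / `&mut [u8]` cannot overlap.
            unsafe { copy_avx2_unroll2_sketch(src.as_ptr(), dst.as_mut_ptr(), src.len()) };
            return;
        }
    }
    dst.copy_from_slice(src);
}
```

With a 96-byte input this runs one 64-byte iteration and then the 32-byte tail, which is the case the new len=96 test is meant to cover.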

Bench evidence (i9-9900K AVX2)

Microbench (`wildcopy_candidates`, criterion)

Size	AVX2 single	AVX2 unroll-2	Δ
64B	5.47 ns	6.84 ns	+25%
256B	8.58 ns	7.54 ns	−12%
1024B	20.25 ns	14.06 ns	−31%
4096B	70.42 ns	55.11 ns	−22%
16 KB	271.99 ns	291.58 ns	+7%
64 KB	1.56 µs	1.50 µs	−4%

Sweet spot is 256 B–4 KiB (the most common wildcopy lengths). The microbench regressions at 64 B (+25%) and 16 KB (+7%) are within end-to-end noise.

End-to-end (`compare_ffi`, decodecorpus-z000033, pure_rust decoder)

Level	rust_stream pure_rust	c_stream pure_rust
L-7	−1.29% ⭐	+0.29%
L-6	−1.22% ⭐	+0.20%
L-5	−1.17% ⭐	+0.23%
L-4	−1.11% ⭐	+0.66%
L-3	−0.96% ⭐	+0.58%
L-2	−1.08% ⭐	+0.49%

c_ffi-side (donor) bands at ±0.2% across both runs — bench noise floor. The consistent ~1% rust_stream win is real; the smaller c_stream regression is at the noise edge.

Bench fix (in-scope, same file as PR)

Pre-existing self-check in zstd/benches/wildcopy_candidates.rs:267 panicked at len=17/33/63 because production small-copy paths don't overshoot to the chunk boundary while the in-bench baseline always rounds up. Restricted the assertion to dst[..len] (the contractual wildcopy region). Bench now actually runs end-to-end.
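A minimal sketch of why the old assertion failed and the restricted one holds, under stated assumptions: `baseline_copy` and `exact_copy` are hypothetical stand-ins for the in-bench baseline (always rounds up to the chunk boundary) and the production small-copy path (writes only `len` bytes). Neither is the real bench code.

```rust
const CHUNK: usize = 32;

/// Stand-in for the in-bench baseline: always writes up to the chunk boundary.
fn baseline_copy(src: &[u8], dst: &mut [u8], len: usize) {
    let rounded = (len + CHUNK - 1) / CHUNK * CHUNK;
    dst[..rounded].copy_from_slice(&src[..rounded]);
}

/// Stand-in for a production small-copy path: writes exactly `len` bytes.
fn exact_copy(src: &[u8], dst: &mut [u8], len: usize) {
    dst[..len].copy_from_slice(&src[..len]);
}

/// Runs both copies and reports (contractual region equal, rounded region equal).
fn compare_regions(len: usize) -> (bool, bool) {
    // Non-zero source pattern so untouched (zeroed) destination bytes differ.
    let src: Vec<u8> = (0..64u8).map(|i| i + 1).collect();
    let mut expected = vec![0u8; 64];
    let mut actual = vec![0u8; 64];
    baseline_copy(&src, &mut expected, len);
    exact_copy(&src, &mut actual, len);
    let rounded = (len + CHUNK - 1) / CHUNK * CHUNK;
    (
        expected[..len] == actual[..len],         // restricted check: dst[..len]
        expected[..rounded] == actual[..rounded], // old check: dst[..rounded]
    )
}
```

For len=17 the restricted comparison passes while the rounded one fails, because bytes 17..32 are overshoot that only the baseline writes.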

Test plan

573/573 nextest pass on M1 (fallback path)
577/577 nextest pass on i9-9900K (AVX2 host)
clippy clean
Cross-validation FFI roundtrips green
Microbench validates wildcopy unroll-2 gains in 256B–4KiB range
compare_ffi shows ~1% end-to-end decode improvement on rust_stream path
New AVX2 tests cover unroll body (len=64) and unroll+tail (len=96)

Closes #108.

Summary by CodeRabbit

Performance
- Optimized SIMD memory copy operations on AVX2-compatible systems.
Tests
- Added AVX2-gated tests covering 32/64/96-byte scenarios and seam-transition checks.
Bug Fixes
- Benchmarks now validate only the declared output bytes and skip candidate comparisons when no candidate is present.

Replace single-vector AVX2 wildcopy inner loop with the 2x-unrolled candidate from PR #87's research.

The new loop issues two independent 32-byte AVX2 load/store pairs per iteration (64 bytes per iteration) and handles a residual 32-byte vector for the non-multiple-of-64 tail. The dispatcher contract is unchanged: `copy_at_least` is still rounded up to a multiple of 32 before the call, so the tail branch fires at most once per call.

The unroll exposes more out-of-order instruction-level parallelism (two independent load/store pairs in flight) and amortises the loop branch over 64 bytes instead of 32.

From the bench evidence in #87 (zstd/benches/wildcopy_candidates.rs):

| Size | AVX2 single | AVX2 unroll-2 | Δ |
|-------|------------:|--------------:|------:|
| 64B | 3 ns | 2 ns | -33% |
| 256B | 7 ns | 4 ns | -43% |
| 1024B | 28 ns | 14 ns | -50% |
| 4096B | 94 ns | 58 ns | -38% |
| 16 KB | 347 ns | 268 ns | -23% |
| 64 KB | 1368 ns | 1121 ns | -18% |

Production bench validation (compare_ffi on i9-9900K AVX2 host) follows in a separate run; this commit lands the code so the bench picks up the change automatically.

573/573 tests pass; clippy clean (M1 host, fallback path).

Closes #108.
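The "tail fires at most once" claim follows from the rounding arithmetic, which can be sketched as below. The helper names `round_up_to_vec` and `unroll2_plan` are hypothetical and only illustrate the arithmetic; the dispatcher's actual rounding code is not shown in this PR description.

```rust
/// Width of one AVX2 vector in bytes.
const VEC: usize = 32;

/// Round a requested minimum length up to the next multiple of 32.
fn round_up_to_vec(copy_at_least: usize) -> usize {
    (copy_at_least + VEC - 1) / VEC * VEC
}

/// For a length that is already a multiple of 32, return
/// (number of 64-byte unrolled iterations, number of 32-byte tail copies).
/// The second value can only be 0 or 1.
fn unroll2_plan(rounded_len: usize) -> (usize, usize) {
    debug_assert!(rounded_len % VEC == 0);
    let iterations = rounded_len / (2 * VEC);
    let tail = (rounded_len % (2 * VEC)) / VEC;
    (iterations, tail)
}
```

For example, a request for 33 bytes rounds to 64 and needs one unrolled iteration with no tail, while 96 bytes needs one iteration plus one tail copy.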

coderabbitai · 2026-05-22T00:56:24Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 89c1bcb4-19b8-4981-8b27-0638ffa2bc59

📥 Commits

Reviewing files that changed from the base of the PR and between 0ec86fd and 91b92bc.

📒 Files selected for processing (1)

zstd/src/decoding/simd_copy.rs

📝 Walkthrough

Walkthrough

Rewrites the AVX2 simd copy kernel to a 64‑byte-per-iteration unrolled implementation with a 32‑byte residual path, adds AVX2-gated unit tests for 32/64/96 bytes, and tightens a benchmark to assert equality only for the contractual len bytes.

Changes

AVX2 kernel and benchmark

Layer / File(s)	Summary
Unrolled AVX2 copy kernel and tests `zstd/src/decoding/simd_copy.rs`	`copy_avx2` rewritten to an unrolled 64‑byte/iteration kernel (two 32‑byte AVX2 loads/stores), `debug_assert!` that `len` is multiple of 32, computes `end_unrolled`, handles a 32‑byte residual tail, and adds AVX2 runtime-gated tests for 32, 64, and 96 bytes (tests use `alloc::vec::Vec`).
Benchmark correctness check `zstd/benches/wildcopy_candidates.rs`	`bench_wildcopy_candidates` now validates only the contractual `[..len]` bytes between baseline and candidate/production outputs, documents wildcopy overshoot as unspecified, and skips candidate assertion when no candidate kernel is selected.

Sequence Diagram (high-level AVX2 copy flow):

sequenceDiagram
  participant Src
  participant copy_avx2
  participant Dst
  Src->>copy_avx2: load 32B chunk A
  copy_avx2->>Dst: store 32B chunk A
  Src->>copy_avx2: load 32B chunk B
  copy_avx2->>Dst: store 32B chunk B
  alt remaining 32B tail
    Src->>copy_avx2: load tail 32B
    copy_avx2->>Dst: store tail 32B
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

#178: touches simd_copy primitive and copy semantics; related to AVX2 kernel changes here.
#189: proposes refactoring simd_copy dispatcher and kernel usage; related to the module edits in this PR.

Possibly related PRs

structured-world/structured-zstd#85: Earlier changes to simd_copy/AVX2 wiring that touched the same AVX2 kernel paths.

"I'm a rabbit in the CPU den,
I hop through lanes of bytes again,
Two loads, two stores, a stitched-up tail,
Sixty-four hops—my copy tale! 🐇✨"

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main change: integrating an AVX2 unroll-2 wildcopy candidate into the decoding performance path.
Linked Issues check	✅ Passed	The PR implementation meets all coding requirements from `#108`: AVX2 unroll-2 is integrated into copy_strategy() with donor-compatible semantics, regression tests added, and no correctness or interoperability regressions reported.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the objectives: simd_copy.rs implements the unroll-2 candidate, wildcopy_candidates.rs corrects benchmark validation to match the contract, and tests are added for boundary coverage.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 374-377: Update the comment starting with "Unroll-2 cuts AVX2
wildcopy throughput by ~30-50 %" to reflect that throughput is improved (e.g.,
"Unroll-2 improves AVX2 wildcopy throughput by ~30-50 %" or "Unroll-2 cuts copy
time/latency by ~30-50 %"); locate the comment containing the exact phrase
"Unroll-2 cuts AVX2 wildcopy throughput" in simd_copy.rs and replace "cuts" with
"improves" (or use the alternative phrasing) so the wording matches the
benchmark claim.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 45e0c78b-0bb8-4f12-9bfb-01c646bd7399

📥 Commits

Reviewing files that changed from the base of the PR and between 6d9ac63 and 692efe4.

📒 Files selected for processing (1)

zstd/src/decoding/simd_copy.rs

Copilot

Pull request overview

Integrates the 2×-unrolled AVX2 “wildcopy” inner loop into the decoder’s SIMD copy path to improve large-copy throughput while preserving the existing dispatcher contract (length rounded up to SIMD chunk size).

Changes:

Update copy_avx2 to copy 64 bytes/iteration via two independent 32B load/store pairs, with a single 32B tail.
Add a debug assertion documenting/guarding the “len is multiple of 32” precondition and expand the function’s rationale comment.

codecov · 2026-05-22T00:59:04Z

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/decoding/simd_copy.rs	94.28%	2 Missing ⚠️

…AVX2 unroll2 test coverage

3 review threads + 1 pre-existing bench bug:

- (bench) wildcopy_candidates.rs:267: pre-existing self-check panic at len=17/33/63. The assertion compared dst[..rounded], but the production small-copy path writes only dst[..len] (no overshoot for sub-chunk lengths). Restricted the assert to dst[..len], the contractual region. Bench can now actually run.
- CR + Copilot, simd_copy.rs:374: comment said 'cuts AVX2 wildcopy throughput by ~30-50%', which reads as a regression. Reworded to 'cuts AVX2 wildcopy LATENCY (and so lifts throughput)' to make the direction unambiguous.
- Copilot, simd_copy.rs:403: missing test coverage for the unrolled body and tail. Added two tests:
  * copy_avx2_copies_full_unroll2_iteration (len=64)
  * copy_avx2_copies_unroll2_loop_plus_residual_tail (len=96), with seam spot-check at bytes 60..68.

573/573 tests pass; clippy clean.
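The seam spot-check idea can be sketched as follows. This is an assumption-laden illustration: `chunked_copy` is a scalar stand-in that mirrors the 64-byte body plus 32-byte tail structure, so it runs on any host, and `pattern` and `seam_matches` are hypothetical helpers, not the names used in the actual tests.

```rust
/// Scalar stand-in for a 64-byte unrolled body followed by one 32-byte tail.
fn chunked_copy(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    assert!(src.len() % 32 == 0);
    let end_unrolled = src.len() & !63;
    let mut off = 0;
    while off < end_unrolled {
        dst[off..off + 32].copy_from_slice(&src[off..off + 32]);
        dst[off + 32..off + 64].copy_from_slice(&src[off + 32..off + 64]);
        off += 64;
    }
    if off < src.len() {
        dst[off..off + 32].copy_from_slice(&src[off..off + 32]);
    }
}

/// Position-dependent byte pattern so a misplaced chunk is detectable.
fn pattern(len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| (i as u8).wrapping_mul(31).wrapping_add(7))
        .collect()
}

/// Compare `radius` bytes on each side of `seam` (clamped to the buffer).
fn seam_matches(src: &[u8], dst: &[u8], seam: usize, radius: usize) -> bool {
    let lo = seam.saturating_sub(radius);
    let hi = (seam + radius).min(src.len());
    src[lo..hi] == dst[lo..hi]
}
```

With len=96 the seam between the unrolled body and the tail sits at byte 64, so a radius of 4 covers bytes 60..68.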

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 529-532: The tests use the Vec type but only the vec! macro was
imported; import the Vec type (alloc::vec::Vec) so the test code compiles—update
the imports in simd_copy.rs to bring Vec into scope and then run the tests that
call copy_avx2 and the other similar test to verify assertions comparing dst and
src succeed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e05c65ab-be9b-4b18-8043-aa45803ae796

📥 Commits

Reviewing files that changed from the base of the PR and between 692efe4 and 72f7e14.

📒 Files selected for processing (2)

zstd/benches/wildcopy_candidates.rs
zstd/src/decoding/simd_copy.rs

…tical) The new AVX2 unroll-2 tests use Vec<u8>; std builds get it via prelude but no-std builds require explicit import. CR caught E0425 on the no-std CI lane.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

… Copilot) Comment referred to the removed `check_len` variable; rephrased to describe the rounded comparison length generically.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

…to AVX2 tests

- Reword the copy_avx2 comment so it stops promising a quantified ~30-50 % gain 'across all length classes'. The actual delta is workload-dependent; point readers at the two bench files that measure it instead of carrying a stale headline number.
- Move 'use alloc::vec::Vec;' from the tests-mod header into the two AVX2 tests that actually need it. On non-x86 / non-std test builds the top-level import was unused (the AVX2 tests are cfg-gated out) and could emit a dead-import warning.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

polaz · 2026-05-22T05:25:08Z

@coderabbitai review

coderabbitai · 2026-05-22T05:25:14Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

polaz · 2026-05-22T09:04:43Z

@coderabbitai review

coderabbitai · 2026-05-22T09:04:49Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copilot AI review requested due to automatic review settings May 22, 2026 00:56

Copilot started reviewing on behalf of polaz May 22, 2026 00:56 View session

coderabbitai Bot reviewed May 22, 2026

Comment thread zstd/src/decoding/simd_copy.rs Outdated

Copilot AI reviewed May 22, 2026

Comment thread zstd/src/decoding/simd_copy.rs Outdated

Comment thread zstd/src/decoding/simd_copy.rs

coderabbitai Bot reviewed May 22, 2026

Comment thread zstd/src/decoding/simd_copy.rs

test(simd_copy): import alloc::vec::Vec for no-std build (#223 CR cri…

b5d35c2

Copilot AI review requested due to automatic review settings May 22, 2026 01:21

Copilot started reviewing on behalf of polaz May 22, 2026 01:22 View session

Copilot AI reviewed May 22, 2026

Comment thread zstd/benches/wildcopy_candidates.rs Outdated

docs(bench): drop stale 'check_len' reference in wildcopy comment (#223…

0ec86fd

polaz requested a review from Copilot May 22, 2026 04:11

Copilot started reviewing on behalf of polaz May 22, 2026 04:11 View session

Copilot AI reviewed May 22, 2026

Comment thread zstd/src/decoding/simd_copy.rs Outdated

Comment thread zstd/src/decoding/simd_copy.rs

polaz added 2 commits May 22, 2026 08:00

Merge branch 'main' into perf/#108-avx2-unroll2-wildcopy

6ee41b9

Copilot AI review requested due to automatic review settings May 22, 2026 05:08

Copilot started reviewing on behalf of polaz May 22, 2026 05:08 View session

Copilot AI reviewed May 22, 2026

Merge branch 'main' into perf/#108-avx2-unroll2-wildcopy

91b92bc

polaz requested a review from Copilot May 22, 2026 07:44

Copilot started reviewing on behalf of polaz May 22, 2026 07:44 View session

Copilot AI reviewed May 22, 2026

polaz merged commit 47c0e81 into main May 22, 2026
26 checks passed

polaz deleted the perf/#108-avx2-unroll2-wildcopy branch May 22, 2026 09:38

sw-release-bot Bot mentioned this pull request May 22, 2026

chore: release v0.0.23 #203

Open
