Skip to content

perf(decoding): integrate AVX2 unroll-2 wildcopy candidate (#108)#223

Merged
polaz merged 7 commits into
mainfrom
perf/#108-avx2-unroll2-wildcopy
May 22, 2026
Merged

perf(decoding): integrate AVX2 unroll-2 wildcopy candidate (#108)#223
polaz merged 7 commits into
mainfrom
perf/#108-avx2-unroll2-wildcopy

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 22, 2026

Summary

Replace single-vector AVX2 wildcopy inner loop with the 2×-unrolled candidate from PR #87's research (zstd/benches/wildcopy_candidates.rs:48).

What changed

copy_avx2 in zstd/src/decoding/simd_copy.rs:369 now issues two independent 32-byte AVX2 load/store pairs per iteration (64 B / iter) with a single-vector tail for the residual 32 bytes. Dispatcher contract is unchanged — copy_at_least is still rounded to a multiple of 32 before the call.

Bench evidence (i9-9900K AVX2)

Microbench (wildcopy_candidates, criterion)

Size AVX2 single AVX2 unroll-2 Δ
64B 5.47 ns 6.84 ns +25%
256B 8.58 ns 7.54 ns −12%
1024B 20.25 ns 14.06 ns −31%
4096B 70.42 ns 55.11 ns −22%
16 KB 271.99 ns 291.58 ns +7%
64 KB 1.56 µs 1.50 µs −4%

Sweet spot is 256B–4 KiB (the most common wildcopy length distribution). Small regressions at 64B and 16 KiB are within end-to-end noise.

End-to-end (compare_ffi, decodecorpus-z000033, pure_rust decoder)

Level rust_stream pure_rust c_stream pure_rust
L-7 −1.29% +0.29%
L-6 −1.22% +0.20%
L-5 −1.17% +0.23%
L-4 −1.11% +0.66%
L-3 −0.96% +0.58%
L-2 −1.08% +0.49%

c_ffi-side (donor) bands at ±0.2% across both runs — bench noise floor. The consistent ~1% rust_stream win is real; the smaller c_stream regression is at the noise edge.

Bench fix (in-scope, same file as PR)

Pre-existing self-check in zstd/benches/wildcopy_candidates.rs:267 panicked at len=17/33/63 because production small-copy paths don't overshoot to the chunk boundary while the in-bench baseline always rounds up. Restricted the assertion to dst[..len] (the contractual wildcopy region). Bench now actually runs end-to-end.

Test plan

  • 573/573 nextest pass on M1 (fallback path)
  • 577/577 nextest pass on i9-9900K (AVX2 host)
  • clippy clean
  • Cross-validation FFI roundtrips green
  • Microbench validates wildcopy unroll-2 gains in 256B–4KiB range
  • compare_ffi shows ~1% end-to-end decode improvement on rust_stream path
  • New AVX2 tests cover unroll body (len=64) and unroll+tail (len=96)

Closes #108.

Summary by CodeRabbit

  • Performance

    • Optimized SIMD memory copy operations on AVX2-compatible systems.
  • Tests

    • Added AVX2-gated tests covering 32/64/96-byte scenarios and seam-transition checks.
  • Bug Fixes

    • Benchmarks now validate only the declared output bytes and skip candidate comparisons when no candidate is present.

Review Change Stack

Replace single-vector AVX2 wildcopy inner loop with the 2x-unrolled
candidate from PR #87's research. The new loop issues two
independent 32-byte AVX2 load/store pairs per iteration (64 bytes
per cycle) and handles a residual 32-byte vector for the
non-multiple-of-64 tail. The dispatcher contract is unchanged —
`copy_at_least` is still rounded up to a multiple of 32 before
the call, so the tail branch fires at most once per call.

The unroll exposes more out-of-order instruction-level parallelism
(two independent load/store pairs in flight) and amortises the
loop branch over 64 bytes instead of 32. From the bench evidence
in #87 (zstd/benches/wildcopy_candidates.rs):

| Size  | AVX2 single | AVX2 unroll-2 | Δ     |
|-------|------------:|--------------:|------:|
|   64B |       3 ns  |        2 ns   | -33% |
|  256B |       7 ns  |        4 ns   | -43% |
| 1024B |      28 ns  |       14 ns   | -50% |
| 4096B |      94 ns  |       58 ns   | -38% |
| 16 KB |     347 ns  |      268 ns   | -23% |
| 64 KB |    1368 ns  |     1121 ns   | -18% |

Production bench validation (compare_ffi on i9-9900K AVX2 host)
follows in a separate run; this commit lands the code so the
bench picks up the change automatically.

573/573 tests pass; clippy clean (M1 host — fallback path).

Closes #108.
Copilot AI review requested due to automatic review settings May 22, 2026 00:56
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Warning

Rate limit exceeded

@polaz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 15 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 89c1bcb4-19b8-4981-8b27-0638ffa2bc59

📥 Commits

Reviewing files that changed from the base of the PR and between 0ec86fd and 91b92bc.

📒 Files selected for processing (1)
  • zstd/src/decoding/simd_copy.rs
📝 Walkthrough

Walkthrough

Rewrites the AVX2 simd copy kernel to a 64‑byte-per-iteration unrolled implementation with a 32‑byte residual path, adds AVX2-gated unit tests for 32/64/96 bytes, and tightens a benchmark to assert equality only for the contractual len bytes.

Changes

AVX2 kernel and benchmark

Layer / File(s) Summary
Unrolled AVX2 copy kernel and tests
zstd/src/decoding/simd_copy.rs
copy_avx2 rewritten to an unrolled 64‑byte/iteration kernel (two 32‑byte AVX2 loads/stores), debug_assert! that len is multiple of 32, computes end_unrolled, handles a 32‑byte residual tail, and adds AVX2 runtime-gated tests for 32, 64, and 96 bytes (tests use alloc::vec::Vec).
Benchmark correctness check
zstd/benches/wildcopy_candidates.rs
bench_wildcopy_candidates now validates only the contractual [..len] bytes between baseline and candidate/production outputs, documents wildcopy overshoot as unspecified, and skips candidate assertion when no candidate kernel is selected.

Sequence Diagram (high-level AVX2 copy flow):

sequenceDiagram
  participant Src
  participant copy_avx2
  participant Dst
  Src->>copy_avx2: load 32B chunk A
  copy_avx2->>Dst: store 32B chunk A
  Src->>copy_avx2: load 32B chunk B
  copy_avx2->>Dst: store 32B chunk B
  alt remaining 32B tail
    Src->>copy_avx2: load tail 32B
    copy_avx2->>Dst: store tail 32B
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #178: touches simd_copy primitive and copy semantics; related to AVX2 kernel changes here.
  • #189: proposes refactoring simd_copy dispatcher and kernel usage; related to the module edits in this PR.

Possibly related PRs

"I'm a rabbit in the CPU den,
I hop through lanes of bytes again,
Two loads, two stores, a stitched-up tail,
Sixty-four hops—my copy tale! 🐇✨"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: integrating an AVX2 unroll-2 wildcopy candidate into the decoding performance path.
Linked Issues check ✅ Passed The PR implementation meets all coding requirements from #108: AVX2 unroll-2 is integrated into copy_strategy() with donor-compatible semantics, regression tests added, and no correctness or interoperability regressions reported.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the objectives: simd_copy.rs implements the unroll-2 candidate, wildcopy_candidates.rs corrects benchmark validation to match the contract, and tests are added for boundary coverage.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#108-avx2-unroll2-wildcopy

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 374-377: Update the comment starting with "Unroll-2 cuts AVX2
wildcopy throughput by ~30-50 %" to reflect that throughput is improved (e.g.,
"Unroll-2 improves AVX2 wildcopy throughput by ~30-50 %" or "Unroll-2 cuts copy
time/latency by ~30-50 %"); locate the comment containing the exact phrase
"Unroll-2 cuts AVX2 wildcopy throughput" in simd_copy.rs and replace "cuts" with
"improves" (or use the alternative phrasing) so the wording matches the
benchmark claim.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 45e0c78b-0bb8-4f12-9bfb-01c646bd7399

📥 Commits

Reviewing files that changed from the base of the PR and between 6d9ac63 and 692efe4.

📒 Files selected for processing (1)
  • zstd/src/decoding/simd_copy.rs

Comment thread zstd/src/decoding/simd_copy.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Integrates the 2×-unrolled AVX2 “wildcopy” inner loop into the decoder’s SIMD copy path to improve large-copy throughput while preserving the existing dispatcher contract (length rounded up to SIMD chunk size).

Changes:

  • Update copy_avx2 to copy 64 bytes/iteration via two independent 32B load/store pairs, with a single 32B tail.
  • Add a debug assertion documenting/guarding the “len is multiple of 32” precondition and expand the function’s rationale comment.

Comment thread zstd/src/decoding/simd_copy.rs Outdated
Comment thread zstd/src/decoding/simd_copy.rs
@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/decoding/simd_copy.rs 94.28% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

…AVX2 unroll2 test coverage

3 review threads + 1 pre-existing bench bug:

- (bench) wildcopy_candidates.rs:267 — pre-existing self-check
  panic at len=17/33/63: assertion compared dst[..rounded] but
  production small-copy path writes only dst[..len] (no overshoot
  for sub-chunk lengths). Restricted assert to dst[..len], the
  contractual region. Bench can now actually run.

- CR + Copilot — simd_copy.rs:374 comment said 'cuts AVX2
  wildcopy throughput by ~30-50%' which reads as a regression.
  Reworded to 'cuts AVX2 wildcopy LATENCY (and so lifts
  throughput)' to make the direction unambiguous.

- Copilot — simd_copy.rs:403 missing test coverage for the
  unrolled body and tail. Added two tests:
  * copy_avx2_copies_full_unroll2_iteration (len=64)
  * copy_avx2_copies_unroll2_loop_plus_residual_tail (len=96)
  with seam spot-check at bytes 60..68.

573/573 tests pass; clippy clean.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 529-532: The tests use the Vec type but only the vec! macro was
imported; import the Vec type (alloc::vec::Vec) so the test code compiles—update
the imports in simd_copy.rs to bring Vec into scope and then run the tests that
call copy_avx2 and the other similar test to verify assertions comparing dst and
src succeed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e05c65ab-be9b-4b18-8043-aa45803ae796

📥 Commits

Reviewing files that changed from the base of the PR and between 692efe4 and 72f7e14.

📒 Files selected for processing (2)
  • zstd/benches/wildcopy_candidates.rs
  • zstd/src/decoding/simd_copy.rs

Comment thread zstd/src/decoding/simd_copy.rs
…tical)

The new AVX2 unroll-2 tests use Vec<u8>; std builds get it via
prelude but no-std builds require explicit import. CR caught
E0425 on the no-std CI lane.
Copilot AI review requested due to automatic review settings May 22, 2026 01:21
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread zstd/benches/wildcopy_candidates.rs Outdated
… Copilot)

Comment referred to the removed `check_len` variable; rephrased
to describe the rounded comparison length generically.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/decoding/simd_copy.rs Outdated
Comment thread zstd/src/decoding/simd_copy.rs
polaz added 2 commits May 22, 2026 08:00
…to AVX2 tests

- Reword the copy_avx2 comment so it stops promising a quantified
  ~30-50 % gain 'across all length classes'. The actual delta is
  workload-dependent — point readers at the two bench files that
  measure it instead of carrying a stale headline number.
- Move 'use alloc::vec::Vec;' from the tests-mod header into the two
  AVX2 tests that actually need it. On non-x86 / non-std test builds
  the top-level import was unused (the AVX2 tests are cfg-gated out)
  and could emit a dead-import warning.
Copilot AI review requested due to automatic review settings May 22, 2026 05:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@polaz polaz merged commit 47c0e81 into main May 22, 2026
26 checks passed
@polaz polaz deleted the perf/#108-avx2-unroll2-wildcopy branch May 22, 2026 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(decoding): integrate AVX2 unroll2 wildcopy candidate

2 participants