perf(decoding): integrate AVX2 unroll-2 wildcopy candidate (#108)#223
Conversation
Replace single-vector AVX2 wildcopy inner loop with the 2x-unrolled candidate from PR #87's research. The new loop issues two independent 32-byte AVX2 load/store pairs per iteration (64 bytes per cycle) and handles a residual 32-byte vector for the non-multiple-of-64 tail. The dispatcher contract is unchanged — `copy_at_least` is still rounded up to a multiple of 32 before the call, so the tail branch fires at most once per call. The unroll exposes more out-of-order instruction-level parallelism (two independent load/store pairs in flight) and amortises the loop branch over 64 bytes instead of 32. From the bench evidence in #87 (zstd/benches/wildcopy_candidates.rs): | Size | AVX2 single | AVX2 unroll-2 | Δ | |-------|------------:|--------------:|------:| | 64B | 3 ns | 2 ns | -33% | | 256B | 7 ns | 4 ns | -43% | | 1024B | 28 ns | 14 ns | -50% | | 4096B | 94 ns | 58 ns | -38% | | 16 KB | 347 ns | 268 ns | -23% | | 64 KB | 1368 ns | 1121 ns | -18% | Production bench validation (compare_ffi on i9-9900K AVX2 host) follows in a separate run; this commit lands the code so the bench picks up the change automatically. 573/573 tests pass; clippy clean (M1 host — fallback path). Closes #108.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughRewrites the AVX2 simd copy kernel to a 64‑byte-per-iteration unrolled implementation with a 32‑byte residual path, adds AVX2-gated unit tests for 32/64/96 bytes, and tightens a benchmark to assert equality only for the contractual ChangesAVX2 kernel and benchmark
Sequence Diagram (high-level AVX2 copy flow): sequenceDiagram
participant Src
participant copy_avx2
participant Dst
Src->>copy_avx2: load 32B chunk A
copy_avx2->>Dst: store 32B chunk A
Src->>copy_avx2: load 32B chunk B
copy_avx2->>Dst: store 32B chunk B
alt remaining 32B tail
Src->>copy_avx2: load tail 32B
copy_avx2->>Dst: store tail 32B
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 374-377: Update the comment starting with "Unroll-2 cuts AVX2
wildcopy throughput by ~30-50 %" to reflect that throughput is improved (e.g.,
"Unroll-2 improves AVX2 wildcopy throughput by ~30-50 %" or "Unroll-2 cuts copy
time/latency by ~30-50 %"); locate the comment containing the exact phrase
"Unroll-2 cuts AVX2 wildcopy throughput" in simd_copy.rs and replace "cuts" with
"improves" (or use the alternative phrasing) so the wording matches the
benchmark claim.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 45e0c78b-0bb8-4f12-9bfb-01c646bd7399
📒 Files selected for processing (1)
zstd/src/decoding/simd_copy.rs
There was a problem hiding this comment.
Pull request overview
Integrates the 2×-unrolled AVX2 “wildcopy” inner loop into the decoder’s SIMD copy path to improve large-copy throughput while preserving the existing dispatcher contract (length rounded up to SIMD chunk size).
Changes:
- Update
copy_avx2to copy 64 bytes/iteration via two independent 32B load/store pairs, with a single 32B tail. - Add a debug assertion documenting/guarding the “len is multiple of 32” precondition and expand the function’s rationale comment.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…AVX2 unroll2 test coverage 3 review threads + 1 pre-existing bench bug: - (bench) wildcopy_candidates.rs:267 — pre-existing self-check panic at len=17/33/63: assertion compared dst[..rounded] but production small-copy path writes only dst[..len] (no overshoot for sub-chunk lengths). Restricted assert to dst[..len], the contractual region. Bench can now actually run. - CR + Copilot — simd_copy.rs:374 comment said 'cuts AVX2 wildcopy throughput by ~30-50%' which reads as a regression. Reworded to 'cuts AVX2 wildcopy LATENCY (and so lifts throughput)' to make the direction unambiguous. - Copilot — simd_copy.rs:403 missing test coverage for the unrolled body and tail. Added two tests: * copy_avx2_copies_full_unroll2_iteration (len=64) * copy_avx2_copies_unroll2_loop_plus_residual_tail (len=96) with seam spot-check at bytes 60..68. 573/573 tests pass; clippy clean.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 529-532: The tests use the Vec type but only the vec! macro was
imported; import the Vec type (alloc::vec::Vec) so the test code compiles—update
the imports in simd_copy.rs to bring Vec into scope and then run the tests that
call copy_avx2 and the other similar test to verify assertions comparing dst and
src succeed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e05c65ab-be9b-4b18-8043-aa45803ae796
📒 Files selected for processing (2)
zstd/benches/wildcopy_candidates.rszstd/src/decoding/simd_copy.rs
…tical) The new AVX2 unroll-2 tests use Vec<u8>; std builds get it via prelude but no-std builds require explicit import. CR caught E0425 on the no-std CI lane.
… Copilot) Comment referred to the removed `check_len` variable; rephrased to describe the rounded comparison length generically.
…to AVX2 tests - Reword the copy_avx2 comment so it stops promising a quantified ~30-50 % gain 'across all length classes'. The actual delta is workload-dependent — point readers at the two bench files that measure it instead of carrying a stale headline number. - Move 'use alloc::vec::Vec;' from the tests-mod header into the two AVX2 tests that actually need it. On non-x86 / non-std test builds the top-level import was unused (the AVX2 tests are cfg-gated out) and could emit a dead-import warning.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Summary
Replace single-vector AVX2 wildcopy inner loop with the 2×-unrolled candidate from PR #87's research (
zstd/benches/wildcopy_candidates.rs:48).What changed
copy_avx2inzstd/src/decoding/simd_copy.rs:369now issues two independent 32-byte AVX2 load/store pairs per iteration (64 B / iter) with a single-vector tail for the residual 32 bytes. Dispatcher contract is unchanged —copy_at_leastis still rounded to a multiple of 32 before the call.Bench evidence (i9-9900K AVX2)
Microbench (
wildcopy_candidates, criterion)Sweet spot is 256B–4 KiB (the most common wildcopy length distribution). Small regressions at 64B and 16 KiB are within end-to-end noise.
End-to-end (
compare_ffi, decodecorpus-z000033, pure_rust decoder)c_ffi-side (donor) bands at ±0.2% across both runs — bench noise floor. The consistent ~1% rust_stream win is real; the smaller c_stream regression is at the noise edge.
Bench fix (in-scope, same file as PR)
Pre-existing self-check in
zstd/benches/wildcopy_candidates.rs:267panicked atlen=17/33/63because production small-copy paths don't overshoot to the chunk boundary while the in-bench baseline always rounds up. Restricted the assertion todst[..len](the contractual wildcopy region). Bench now actually runs end-to-end.Test plan
Closes #108.
Summary by CodeRabbit
Performance
Tests
Bug Fixes