Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds three decode-path performance changes: runtime-dispatched x86_64 BMI2 PEXT triple-bit extraction (disabled on Zen1/Zen2), stride-based multi-line prefetch with L1/T1 variants and N+2 lookahead, and a branchless, table-driven offset-history implementation with unit tests. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 75-88: The two unnecessary unsafe blocks around CPU feature
queries should be removed: call __cpuid directly (instead of unsafe {
__cpuid(...) }) when building the vendor array and when reading eax, so the
vendor computation (vendor, leaf0, is_amd) and the subsequent eax =
__cpuid(1).eax use safe calls; leave logic that returns TripleExtractDispatch {
use_pext: true } and the is_amd comparison unchanged. Ensure there are no
remaining unused unsafe wrappers around __cpuid to avoid the unused_unsafe
warning.
In `@zstd/src/decoding/prefetch.rs`:
- Around line 58-75: The prefetch_stride_x86 function uses a runtime hint
parameter but needs the hint as a const generic like the x86_64 fix; change the
signature of prefetch_stride_x86 to take a const generic hint (e.g. fn
prefetch_stride_x86<const HINT: i32>(slice: &[u8])) and replace uses of the
runtime hint variable with the const HINT in the unsafe _mm_prefetch call, then
update all callers to invoke the function with the const generic (e.g.
prefetch_stride_x86::<{_MM_HINT_T0}> or the appropriate hint constant) to mirror
the x86_64 change.
- Around line 27-42: The function prefetch_stride_x86_64 currently takes hint:
i32 at runtime which fails because core::arch::x86_64::_mm_prefetch requires a
compile-time constant; change prefetch_stride_x86_64 to accept a const generic
(e.g., const HINT: i32) and use that constant when calling _mm_prefetch, and
update all call sites to invoke prefetch_stride_x86_64::<HINT>(slice) with the
appropriate constant hint values; ensure the function body uses HINT rather than
a runtime parameter and remove the old hint argument from its signature.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8485e503-1db7-4a83-81f2-a12d9c9fdb38
📒 Files selected for processing (4)
zstd/src/bit_io/bit_reader_reverse.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/prefetch.rszstd/src/decoding/sequence_execution.rs
There was a problem hiding this comment.
Pull request overview
Performance-focused updates to the zstd decode hot path: minimizing branches in offset-history handling, adding a more aggressive prefetch pipeline (literals + match source), and introducing a runtime-dispatched BMI2 path for triple bit extraction.
Changes:
- Reworks
do_offset_history()to a table-driven, branch-minimized implementation and adds regression tests for offset-history behavior. - Adds N+2 lookahead literal prefetching and introduces L1/L2 prefetch variants (including an AArch64
prfmpath). - Adds a runtime-dispatched BMI2 (
pext) extraction path inpeek_bits_triple()on x86_64 (std builds), with AMD family gating.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
zstd/src/decoding/sequence_execution.rs |
Branch-minimized offset-history resolution + N+2 literal prefetch + new offset-history tests. |
zstd/src/decoding/prefetch.rs |
Adds stride-based multi-line prefetch and separate L1 (T0) vs L2 (T1) hint paths, incl. AArch64. |
zstd/src/decoding/decode_buffer.rs |
Prefetches match source with L2 hint and gates by match length. |
zstd/src/bit_io/bit_reader_reverse.rs |
Adds runtime-dispatched BMI2 pext path for triple field extraction on x86_64 (std). |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…raction - make sequence offset-history updates table-driven with targeted regression tests - add N+2 literal prefetch and multi-line stride prefetch with L1/L2 hints - route match-source prefetch through T1 with size gating - add runtime-dispatched BMI2 pext path for triple bit extraction with AMD Zen1/2 fallback Refs #69
- remove unnecessary unsafe wrappers around __cpuid dispatch checks - switch x86/x86_64 prefetch helpers to const-generic prefetch hints and fix fallback cfg gating - preserve ZeroOffset error path in branchless offset history and add regression test - sync PR/issue wording with current BMI2 3x pext implementation
1fd4266 to
18f32aa
Compare
|
@coderabbitai re-review |
|
🧠 Learnings used✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/prefetch.rs`:
- Around line 96-120: The loop currently matches on PrefetchHintAarch64 per
iteration causing a branch; change prefetch_stride_aarch64 to a const-generic
form so the hint is resolved at compile time (e.g. fn
prefetch_stride_aarch64<const L1: bool>(slice: &[u8], /* or no hint param */) or
if your PrefetchHintAarch64 is usable as a const, use fn
prefetch_stride_aarch64<const H: PrefetchHintAarch64>(slice: &[u8]) ), move the
conditional/asm selection out of the loop (use an if L1 { asm!("prfm
pldl1keep...") } else { asm!("prfm pldl2keep...") } inside the loop but with L1
const eliminates branching) and update callers to call the function with the
appropriate const generic (<true> or <false> or <H>) instead of passing a
runtime hint; keep the rest of the logic (CACHE_LINE, MAX_LINES, bounds)
unchanged and retain #[inline(always)].
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 294ed149-289e-41c5-9b53-412ff74a2905
📒 Files selected for processing (4)
zstd/src/bit_io/bit_reader_reverse.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/prefetch.rszstd/src/decoding/sequence_execution.rs
- make aarch64 prefetch hint compile-time via const generics and use readonly asm options - make N+2 literal prefetch range computation overflow-safe on usize targets - remove lit_len branch from do_offset_history via a single indexed rule table - add BMI2 triple-extract regression coverage against scalar reference
|
@coderabbitai re-review |
|
🧠 Learnings used✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 65-97: The CPUID-dependent policy in
detect_triple_extract_dispatch is hard to unit-test; extract the vendor/family
predicate into a pure helper function (e.g., should_use_pext(vendor: [u8;12],
family: u32) -> bool) and have detect_triple_extract_dispatch call that helper
to decide TripleExtractDispatch.use_pext. Keep existing CPUID reads in
detect_triple_extract_dispatch but move the logic that computes is_amd and the
family != 0x17 check into should_use_pext, and add table-driven unit tests for
should_use_pext covering AuthenticAMD+0x17 => false, AuthenticAMD+0x19 => true,
and a non-AMD vendor => true.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 80849646-c9d0-4a4a-b4b3-211a6719e801
📒 Files selected for processing (4)
zstd/src/bit_io/bit_reader_reverse.rszstd/src/decoding/decode_buffer.rszstd/src/decoding/prefetch.rszstd/src/decoding/sequence_execution.rs
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: 9d8f51e | Previous: 1b396b2 | Ratio |
|---|---|---|---|
compress/default/small-4k-log-lines/matrix/pure_rust |
6.611 ms |
4.594 ms |
1.44 |
compress/best/decodecorpus-z000033/matrix/pure_rust |
100.615 ms |
64.284 ms |
1.57 |
compress/best/low-entropy-1m/matrix/c_ffi |
1.589 ms |
1.159 ms |
1.37 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
- extract should_use_pext(vendor, family) from cpuid dispatch path - add table-driven tests for AMD Zen1/2 guard and non-AMD behavior - keep runtime dispatch semantics unchanged while improving branch coverage
Summary
do_offset_history()table-driven and branch-minimized in the sequence loopN+2literal prefetch in sequence execution (gated by size)64Blines, up to 4 lines) with L1 (T0) and L2 (T1) hintsT1with size gatingprfm pldl1keep/pldl2keeppextextraction path forpeek_bits_triple()on x86_64, with fallback disabled for AMD family0x17(Zen1/Zen2)Validation
cargo fmt --allcargo clippy -p structured-zstd --all-targets -- -D warningscargo build --workspacecargo nextest run --workspacecargo test --doc --workspaceBenchmark snapshot (local)
decompress/default/small-4k-log-lines/rust_stream/matrix/pure_rust:1439 ns/iterdecompress/default/small-4k-log-lines/rust_stream/matrix/c_ffi:466 ns/iterdecompress/default/decodecorpus-z000033/rust_stream/matrix/pure_rust:1,491,532 ns/iterdecompress/default/decodecorpus-z000033/rust_stream/matrix/c_ffi:380,672 ns/iterCloses #69
Summary by CodeRabbit