
perf(decoding): branchless offset history, prefetch pipeline, and BMI2 triple extract#90

Merged
polaz merged 4 commits into main from perf/#69-branchless-prefetch-pext
Apr 9, 2026

Conversation

@polaz polaz (Member) commented Apr 9, 2026

Summary

  • make do_offset_history() table-driven and branch-minimized in the sequence loop
  • add N+2 literal prefetch in sequence execution (gated by size)
  • add multi-line stride prefetch (64B lines, up to 4 lines) with L1 (T0) and L2 (T1) hints
  • route match-source prefetch via T1 with size gating
  • add AArch64 prefetch path using prfm pldl1keep / pldl2keep
  • add runtime-dispatched BMI2 3x pext extraction path for peek_bits_triple() on x86_64, with the pext path disabled (scalar fallback) on AMD family 0x17 (Zen1/Zen2)
  • add offset-history regression tests covering lit/non-lit and rep/new-offset paths
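The branchless offset-history rule can be sketched as follows. This is a minimal illustration of the RFC 8878 repeat-offset semantics only; `OffsetHistory` and `resolve` are hypothetical names, and the PR's actual `do_offset_history()` additionally preserves the zero-offset error path:

```rust
// Hypothetical sketch of a table-driven offset-history update per
// RFC 8878; names and layout are illustrative, not the PR's exact code.
#[derive(Debug, PartialEq)]
struct OffsetHistory([usize; 3]);

impl OffsetHistory {
    /// Resolve a sequence's raw offset value into a concrete match
    /// offset and update the three most-recent-offsets history.
    fn resolve(&mut self, offset_value: usize, lit_len: usize) -> usize {
        let [r0, r1, r2] = self.0;
        if offset_value > 3 {
            // Values above 3 encode a fresh offset of (value - 3).
            let offset = offset_value - 3;
            self.0 = [offset, r0, r1];
            return offset;
        }
        // Fold the lit_len == 0 shift into a single table index:
        // values 1..=3 become 0..=2, and a zero literal length bumps
        // the index by one (value 3 then means history[0] - 1).
        let idx = (offset_value - 1) + usize::from(lit_len == 0);
        let (offset, next) = match idx {
            0 => (r0, [r0, r1, r2]),        // repeat-1: history unchanged
            1 => (r1, [r1, r0, r2]),        // repeat-2: swap front two
            2 => (r2, [r2, r0, r1]),        // repeat-3: rotate to front
            _ => (r0 - 1, [r0 - 1, r0, r1]), // lit_len == 0, value 3
        };
        self.0 = next;
        offset
    }
}

fn main() {
    let mut h = OffsetHistory([1, 4, 8]);
    assert_eq!(h.resolve(8, 5), 5); // new offset: 8 - 3
    assert_eq!(h.0, [5, 1, 4]);
    assert_eq!(h.resolve(1, 0), 1); // lit_len == 0 shifts to repeat-2
    assert_eq!(h.0, [1, 5, 4]);
    println!("offset history: {:?}", h.0);
}
```

Folding the `lit_len == 0` shift into the table index is what removes the extra branch from the sequence loop; the four-way `match` over a dense index compiles down to a small jump table.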

Validation

  • cargo fmt --all
  • cargo clippy -p structured-zstd --all-targets -- -D warnings
  • cargo build --workspace
  • cargo nextest run --workspace
  • cargo test --doc --workspace

Benchmark snapshot (local)

  • decompress/default/small-4k-log-lines/rust_stream/matrix/pure_rust: 1439 ns/iter
  • decompress/default/small-4k-log-lines/rust_stream/matrix/c_ffi: 466 ns/iter
  • decompress/default/decodecorpus-z000033/rust_stream/matrix/pure_rust: 1,491,532 ns/iter
  • decompress/default/decodecorpus-z000033/rust_stream/matrix/c_ffi: 380,672 ns/iter

Closes #69

Summary by CodeRabbit

  • Chores
    • Improved CPU prefetching with separate L1/T1 strategies and stride-based prefetch for x86_64 and ARM64; short matches now skip prefetch to avoid waste. Added T1 prefetch entry point.
    • Added architecture-aware runtime dispatch to leverage faster triple-bit extraction on supported x86_64 CPUs.
  • Refactor
    • Reworked offset-history and sequence execution to be more branchless and efficient, with lookahead prefetching.
  • Tests
    • Added unit tests covering offset history and extraction dispatch correctness.

Copilot AI review requested due to automatic review settings April 9, 2026 07:22
@coderabbitai

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 76a5d1e4-75a7-4f92-a2cd-3821299acb1d

📥 Commits

Reviewing files that changed from the base of the PR and between 09e93b9 and 9d8f51e.

📒 Files selected for processing (1)
  • zstd/src/bit_io/bit_reader_reverse.rs

📝 Walkthrough

Walkthrough

Adds three decode-path performance changes: runtime-dispatched x86_64 BMI2 PEXT triple-bit extraction (disabled on Zen1/Zen2), stride-based multi-line prefetch with L1/T1 variants and N+2 lookahead, and a branchless, table-driven offset-history implementation with unit tests.
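The N+2 literal lookahead can be sketched roughly like this. `Sequence`, `execute_sequences`, and the cursor arithmetic below are illustrative assumptions, not the PR's code; `prefetch_slice` stands in for the crate's L1 helper, and the saturating range math mirrors the overflow-safety fix from a later commit:

```rust
// Sketch: while executing sequence n, prefetch the literals that
// sequence n+2 will copy, so the data arrives ~2 iterations early.

fn prefetch_slice(slice: &[u8]) {
    // Stand-in for the real helper (which issues _mm_prefetch / prfm);
    // black_box keeps the sketch portable and non-elidable.
    std::hint::black_box(slice);
}

struct Sequence {
    literal_length: usize, // match length, offset, etc. omitted
}

/// Returns the number of literal bytes consumed.
fn execute_sequences(literals: &[u8], sequences: &[Sequence]) -> usize {
    let mut lit_cursor = 0usize;
    for (n, seq) in sequences.iter().enumerate() {
        if let Some(ahead) = sequences.get(n + 2) {
            // Start of n+2's literals = cursor + literal lengths of the
            // two sequences still ahead of it; saturating to stay safe.
            let start = lit_cursor
                .saturating_add(seq.literal_length)
                .saturating_add(sequences[n + 1].literal_length);
            let end = start
                .saturating_add(ahead.literal_length)
                .min(literals.len());
            if start < end {
                prefetch_slice(&literals[start..end]);
            }
        }
        // ... copy seq.literal_length literals, then execute the match ...
        lit_cursor += seq.literal_length;
    }
    lit_cursor
}

fn main() {
    let literals = vec![0u8; 64];
    let seqs: Vec<Sequence> = [10usize, 12, 8, 6, 4]
        .into_iter()
        .map(|literal_length| Sequence { literal_length })
        .collect();
    assert_eq!(execute_sequences(&literals, &seqs), 40);
    println!("consumed all 40 literal bytes");
}
```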

Changes

  • BMI2 Runtime Dispatch — zstd/src/bit_io/bit_reader_reverse.rs
    Adds runtime dispatch (OnceLock + CPUID) to use BMI2 _pext_u64 for triple-bit extraction when appropriate, implements extract_triple_pext (#[target_feature(enable="bmi2")]) and try_extract_triple_with_pext, and falls back to scalar mask/shift. Adds BMI2 correctness tests and a policy-table unit test.
  • Prefetch System Refactor — zstd/src/decoding/prefetch.rs, zstd/src/decoding/decode_buffer.rs
    Splits prefetch into L1/T1 impls and exposes prefetch_slice_t1; implements stride-based multi-line prefetch (up to 4×64B) for x86/x86_64 and AArch64; changes prefetch_slice to call the L1 impl, switches match-source prefetch to T1, and skips prefetch for short matches; DecodeBuffer::repeat now forwards match_length.
  • Sequence Execution Optimizations — zstd/src/decoding/sequence_execution.rs
    Replaces per-iteration literal prefetch with N+2 lookahead (prefetch_literals_n_plus_two), removes the old helper, and rewrites do_offset_history into a branchless, rule-table-driven selector with conditional updates. Adds unit tests covering history behaviors.
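What the BMI2 path computes can be illustrated with a portable software PEXT, so the sketch runs on any target. The real code calls `core::arch::x86_64::_pext_u64` inside a `#[target_feature(enable = "bmi2")]` function selected at runtime via CPUID; the function names and field widths below are illustrative assumptions:

```rust
/// Portable reference for BMI2 PEXT: gather the bits of `value`
/// selected by `mask` into the low bits of the result.
fn pext_u64_soft(value: u64, mask: u64) -> u64 {
    let (mut out, mut bit, mut m) = (0u64, 0u32, mask);
    while m != 0 {
        let lowest = m & m.wrapping_neg(); // isolate lowest set mask bit
        if value & lowest != 0 {
            out |= 1u64 << bit;
        }
        bit += 1;
        m &= m - 1; // clear that mask bit
    }
    out
}

/// Scalar fallback: peel three bit fields off the low end with
/// shift-and-mask, the way a non-BMI2 path would.
fn peek_triple_scalar(word: u64, widths: [u32; 3]) -> [u64; 3] {
    let (mut shift, mut out) = (0u32, [0u64; 3]);
    for i in 0..3 {
        out[i] = (word >> shift) & ((1u64 << widths[i]) - 1);
        shift += widths[i];
    }
    out
}

/// PEXT-style extraction: one contiguous mask per field. On BMI2
/// hardware the three pext ops are independent and can overlap.
fn peek_triple_pext(word: u64, widths: [u32; 3]) -> [u64; 3] {
    let (mut shift, mut out) = (0u32, [0u64; 3]);
    for i in 0..3 {
        let mask = ((1u64 << widths[i]) - 1) << shift;
        out[i] = pext_u64_soft(word, mask);
        shift += widths[i];
    }
    out
}

fn main() {
    assert_eq!(peek_triple_scalar(0xABC, [4, 4, 4]), [0xC, 0xB, 0xA]);
    assert_eq!(
        peek_triple_scalar(0xDEAD_BEEF_1234_5678, [5, 6, 9]),
        peek_triple_pext(0xDEAD_BEEF_1234_5678, [5, 6, 9])
    );
    println!("scalar and pext-style extraction agree");
}
```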

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

🐇
I hopped through bits and nudged the cache ahead,
I taught offsets to twirl without a branchy tread,
I puffed small prefetch lines into the night,
And PEXT now hums when silicon is right,
A carrot-coded cheer beneath the CPU light 🥕

🚥 Pre-merge checks | ✅ 5 passed

  • Description check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The title accurately summarizes the three main performance optimizations in the PR: branchless offset history, prefetch pipeline improvements, and BMI2 triple extract. It is specific, clear, and directly reflects the changeset.
  • Linked Issues check — ✅ Passed. All three primary objectives from issue #69 are met: (1) do_offset_history() converted to a table-driven branchless implementation, (2) stride prefetch with multi-line and N+2 lookahead plus match-source T1 prefetch with AArch64 support, (3) runtime-dispatched BMI2 pext for triple extraction with fallback and an AMD Zen1/Zen2 guard. Code changes align with the acceptance criteria.
  • Out of Scope Changes check — ✅ Passed. All code changes are directly scoped to the three micro-optimizations defined in issue #69. No unrelated refactorings, documentation, or config changes are present.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 75-88: The two unnecessary unsafe blocks around CPU feature
queries should be removed: call __cpuid directly (instead of unsafe {
__cpuid(...) }) when building the vendor array and when reading eax, so the
vendor computation (vendor, leaf0, is_amd) and the subsequent eax =
__cpuid(1).eax use safe calls; leave logic that returns TripleExtractDispatch {
use_pext: true } and the is_amd comparison unchanged. Ensure there are no
remaining unused unsafe wrappers around __cpuid to avoid the unused_unsafe
warning.

In `@zstd/src/decoding/prefetch.rs`:
- Around line 58-75: The prefetch_stride_x86 function uses a runtime hint
parameter but needs the hint as a const generic like the x86_64 fix; change the
signature of prefetch_stride_x86 to take a const generic hint (e.g. fn
prefetch_stride_x86<const HINT: i32>(slice: &[u8])) and replace uses of the
runtime hint variable with the const HINT in the unsafe _mm_prefetch call, then
update all callers to invoke the function with the const generic (e.g.
prefetch_stride_x86::<{_MM_HINT_T0}> or the appropriate hint constant) to mirror
the x86_64 change.
- Around line 27-42: The function prefetch_stride_x86_64 currently takes hint:
i32 at runtime which fails because core::arch::x86_64::_mm_prefetch requires a
compile-time constant; change prefetch_stride_x86_64 to accept a const generic
(e.g., const HINT: i32) and use that constant when calling _mm_prefetch, and
update all call sites to invoke prefetch_stride_x86_64::<HINT>(slice) with the
appropriate constant hint values; ensure the function body uses HINT rather than
a runtime parameter and remove the old hint argument from its signature.
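The fix these comments ask for can be sketched as below, with hypothetical names. On x86_64, `_mm_prefetch` takes its hint as a const generic (`_MM_HINT_T0 = 3`, `_MM_HINT_T1 = 2`), so the hint must be threaded through as a compile-time constant; a no-op fallback keeps the sketch building on other targets:

```rust
// Sketch of a const-generic stride prefetch; not the PR's exact code.
const CACHE_LINE: usize = 64;
const MAX_LINES: usize = 4;

#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_stride<const HINT: i32>(slice: &[u8]) {
    use core::arch::x86_64::_mm_prefetch;
    // Touch one cache line per 64 bytes, capped at 4 lines.
    let lines = slice.len().div_ceil(CACHE_LINE).min(MAX_LINES);
    for line in 0..lines {
        // Safety: every prefetched address lies inside the slice, and
        // prefetch instructions never fault in any case.
        unsafe {
            _mm_prefetch::<HINT>(slice.as_ptr().add(line * CACHE_LINE).cast());
        }
    }
}

#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch_stride<const HINT: i32>(_slice: &[u8]) {}

fn main() {
    let buf = vec![0u8; 300];
    prefetch_stride::<3>(&buf); // _MM_HINT_T0: pull into L1
    prefetch_stride::<2>(&buf); // _MM_HINT_T1: pull into L2
    println!("issued prefetch hints for {} bytes", buf.len());
}
```

Callers select the hint at the call site (`prefetch_stride::<3>(...)`), so each monomorphized copy bakes the hint into the instruction with no runtime parameter.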

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8485e503-1db7-4a83-81f2-a12d9c9fdb38

📥 Commits

Reviewing files that changed from the base of the PR and between 1112571 and e61c7bf.

📒 Files selected for processing (4)
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs


Copilot AI left a comment


Pull request overview

Performance-focused updates to the zstd decode hot path: minimizing branches in offset-history handling, adding a more aggressive prefetch pipeline (literals + match source), and introducing a runtime-dispatched BMI2 path for triple bit extraction.

Changes:

  • Reworks do_offset_history() to a table-driven, branch-minimized implementation and adds regression tests for offset-history behavior.
  • Adds N+2 lookahead literal prefetching and introduces L1/L2 prefetch variants (including an AArch64 prfm path).
  • Adds a runtime-dispatched BMI2 (pext) extraction path in peek_bits_triple() on x86_64 (std builds), with AMD family gating.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File descriptions:
  • zstd/src/decoding/sequence_execution.rs — Branch-minimized offset-history resolution, N+2 literal prefetch, and new offset-history tests.
  • zstd/src/decoding/prefetch.rs — Adds stride-based multi-line prefetch and separate L1 (T0) vs L2 (T1) hint paths, incl. AArch64.
  • zstd/src/decoding/decode_buffer.rs — Prefetches the match source with an L2 hint, gated by match length.
  • zstd/src/bit_io/bit_reader_reverse.rs — Adds a runtime-dispatched BMI2 pext path for triple field extraction on x86_64 (std).

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 96.99571% with 7 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
zstd/src/bit_io/bit_reader_reverse.rs | 93.80% | 7 Missing ⚠️


polaz added 2 commits April 9, 2026 10:45
…raction

- make sequence offset-history updates table-driven with targeted regression tests
- add N+2 literal prefetch and multi-line stride prefetch with L1/L2 hints
- route match-source prefetch through T1 with size gating
- add runtime-dispatched BMI2 pext path for triple bit extraction with AMD Zen1/2 fallback

Refs #69
- remove unnecessary unsafe wrappers around __cpuid dispatch checks
- switch x86/x86_64 prefetch helpers to const-generic prefetch hints and fix fallback cfg gating
- preserve ZeroOffset error path in branchless offset history and add regression test
- sync PR/issue wording with current BMI2 3x pext implementation
@polaz polaz force-pushed the perf/#69-branchless-prefetch-pext branch from 1fd4266 to 18f32aa on April 9, 2026 07:45
@polaz polaz requested a review from Copilot April 9, 2026 07:47
@polaz polaz (Member Author) commented Apr 9, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Apr 9, 2026

@polaz: Sure! I'll re-review the PR now.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.
✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/prefetch.rs`:
- Around line 96-120: The loop currently matches on PrefetchHintAarch64 per
iteration causing a branch; change prefetch_stride_aarch64 to a const-generic
form so the hint is resolved at compile time (e.g. fn
prefetch_stride_aarch64<const L1: bool>(slice: &[u8], /* or no hint param */) or
if your PrefetchHintAarch64 is usable as a const, use fn
prefetch_stride_aarch64<const H: PrefetchHintAarch64>(slice: &[u8]) ), move the
conditional/asm selection out of the loop (use an if L1 { asm!("prfm
pldl1keep...") } else { asm!("prfm pldl2keep...") } inside the loop but with L1
const eliminates branching) and update callers to call the function with the
appropriate const generic (<true> or <false> or <H>) instead of passing a
runtime hint; keep the rest of the logic (CACHE_LINE, MAX_LINES, bounds)
unchanged and retain #[inline(always)].
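A const-generic AArch64 variant along the lines the comment suggests might look like this. Names are illustrative; a const bool picks `pldl1keep` vs `pldl2keep` so the selection folds away at monomorphization, and a no-op fallback keeps the sketch building off aarch64:

```rust
// Sketch only: the PR's real helper lives in zstd/src/decoding/prefetch.rs.
const CACHE_LINE: usize = 64;
const MAX_LINES: usize = 4;

#[cfg(target_arch = "aarch64")]
#[inline(always)]
fn prefetch_stride_aarch64<const L1: bool>(slice: &[u8]) {
    let lines = slice.len().div_ceil(CACHE_LINE).min(MAX_LINES);
    for line in 0..lines {
        // Safety: address is within the slice; prfm never faults.
        let addr = unsafe { slice.as_ptr().add(line * CACHE_LINE) };
        // L1 is a const generic, so this branch is resolved at compile
        // time and each instantiation contains a single prfm form.
        unsafe {
            if L1 {
                core::arch::asm!("prfm pldl1keep, [{0}]", in(reg) addr,
                                 options(nostack, readonly));
            } else {
                core::arch::asm!("prfm pldl2keep, [{0}]", in(reg) addr,
                                 options(nostack, readonly));
            }
        }
    }
}

#[cfg(not(target_arch = "aarch64"))]
#[inline(always)]
fn prefetch_stride_aarch64<const L1: bool>(_slice: &[u8]) {}

fn main() {
    let data = vec![1u8; 200];
    prefetch_stride_aarch64::<true>(&data);  // keep in L1
    prefetch_stride_aarch64::<false>(&data); // keep in L2
    println!("prefetch hints issued (no-op off aarch64)");
}
```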

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 294ed149-289e-41c5-9b53-412ff74a2905

📥 Commits

Reviewing files that changed from the base of the PR and between 1b396b2 and 18f32aa.

📒 Files selected for processing (4)
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

- make aarch64 prefetch hint compile-time via const generics and use readonly asm options
- make N+2 literal prefetch range computation overflow-safe on usize targets
- remove lit_len branch from do_offset_history via a single indexed rule table
- add BMI2 triple-extract regression coverage against scalar reference
@polaz polaz (Member Author) commented Apr 9, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Apr 9, 2026

@polaz Sure! I'll re-review the PR now.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T06:17:40.408Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), both `copy_with_checks` and `copy_with_nobranch_check` thread aggregate source/destination capacities (`m1_src_cap`, `m2_src_cap`, `f1_dst_cap`, `f2_dst_cap`) into `simd_copy::copy_bytes_overshooting` so the SIMD eligibility check (`min(src_len, dst_len) >= copy_multiple`) is applied uniformly. Do not flag the capacity parameters in `copy_with_nobranch_check` as unnecessary or inconsistent with `copy_with_checks`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-06T01:40:24.378Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T22:29:06.406Z
Learning: In `zstd/src/dictionary/fastcover.rs` (structured-world/structured-zstd), `FastCoverTuned.accel` is correctly populated in `train_fastcover_internal` using `accel: params.accel`, and `normalize_fastcover_params` clamps `accel` to `1..=10`. Do not flag `FastCoverTuned.accel` as unclamped or incorrectly set.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T16:00:30.438Z
Learning: In `zstd/src/encoding/match_generator.rs`, the `RowMatchGenerator` (used for `CompressionLevel::Level(4)` / `MatcherBackend::Row`) uses a stable 4-byte hash key, consistent with the 4-byte lookahead constraint shared by `HcMatchGenerator`. Previous-block tail positions are backfilled into the row tables before matching/skip begins (analogous to `backfill_boundary_positions` for the HC backend), and a regression test covers cross-boundary tail reuse. Do not flag missing backfill or key-width issues for the Row backend.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.
✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 65-97: The CPUID-dependent policy in
detect_triple_extract_dispatch is hard to unit-test; extract the vendor/family
predicate into a pure helper function (e.g., should_use_pext(vendor: [u8;12],
family: u32) -> bool) and have detect_triple_extract_dispatch call that helper
to decide TripleExtractDispatch.use_pext. Keep existing CPUID reads in
detect_triple_extract_dispatch but move the logic that computes is_amd and the
family != 0x17 check into should_use_pext, and add table-driven unit tests for
should_use_pext covering AuthenticAMD+0x17 => false, AuthenticAMD+0x19 => true,
and a non-AMD vendor => true.
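The pure predicate the comment proposes can be sketched as below. The vendor string and family come from CPUID in the caller; `should_use_pext` is the comment's own suggested name, and the family 0x17 guard reflects that PEXT is microcoded (and slow) on AMD Zen 1/Zen 2:

```rust
/// Pure, testable dispatch policy: given the CPUID vendor string and
/// CPU family, decide whether the BMI2 PEXT path should be used.
/// The unsafe __cpuid reads stay in the caller.
fn should_use_pext(vendor: &[u8; 12], family: u32) -> bool {
    let is_amd = vendor == b"AuthenticAMD";
    // AMD family 0x17 (Zen 1/Zen 2) implements PEXT in microcode with
    // very high latency; everywhere else with BMI2 it is fast.
    !(is_amd && family == 0x17)
}

fn main() {
    assert!(!should_use_pext(b"AuthenticAMD", 0x17)); // Zen1/Zen2: scalar path
    assert!(should_use_pext(b"AuthenticAMD", 0x19));  // Zen3+: pext ok
    assert!(should_use_pext(b"GenuineIntel", 0x06));  // Intel: pext ok
    println!("dispatch policy checks passed");
}
```

Note that the family fed into this predicate should be the effective family (base family plus extended family from CPUID leaf 1 EAX), so that 0x17 and 0x19 compare correctly.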

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80849646-c9d0-4a4a-b4b3-211a6719e801

📥 Commits

Reviewing files that changed from the base of the PR and between 1b396b2 and 09e93b9.

📒 Files selected for processing (4)
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


@sw-release-bot sw-release-bot bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

Benchmark suite | Current: 9d8f51e | Previous: 1b396b2 | Ratio
compress/default/small-4k-log-lines/matrix/pure_rust | 6.611 ms | 4.594 ms | 1.44
compress/best/decodecorpus-z000033/matrix/pure_rust | 100.615 ms | 64.284 ms | 1.57
compress/best/low-entropy-1m/matrix/c_ffi | 1.589 ms | 1.159 ms | 1.37

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

- extract should_use_pext(vendor, family) from cpuid dispatch path
- add table-driven tests for AMD Zen1/2 guard and non-AMD behavior
- keep runtime dispatch semantics unchanged while improving branch coverage

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@polaz polaz merged commit e101e88 into main Apr 9, 2026
17 checks passed
@polaz polaz deleted the perf/#69-branchless-prefetch-pext branch April 9, 2026 11:34
@sw-release-bot sw-release-bot bot mentioned this pull request Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

perf(decoding): branchless offset history, stride prefetch, BMI2 pext

2 participants