
perf(decoding): branchless offset history, prefetch pipeline, and BMI2 triple extract#90

Merged
polaz merged 4 commits into main from perf/#69-branchless-prefetch-pext
Apr 9, 2026

Conversation

@polaz polaz (Member) commented Apr 9, 2026

Summary

  • make do_offset_history() table-driven and branch-minimized in the sequence loop
  • add N+2 literal prefetch in sequence execution (gated by size)
  • add multi-line stride prefetch (64B lines, up to 4 lines) with L1 (T0) and L2 (T1) hints
  • route match-source prefetch via T1 with size gating
  • add AArch64 prefetch path using prfm pldl1keep / pldl2keep
  • add runtime-dispatched BMI2 3x pext extraction path for peek_bits_triple() on x86_64, with the pext path disabled (scalar fallback) on AMD family 0x17 (Zen1/Zen2)
  • add offset-history regression tests covering lit/non-lit and rep/new-offset paths
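The branchless offset-history rule can be sketched as follows. This is a minimal illustration of the RFC 8878 repeat-offset semantics only; `OffsetHistory` and `resolve` are hypothetical names, and the PR's actual `do_offset_history()` additionally preserves the zero-offset error path:

```rust
// Hypothetical sketch of a table-driven offset-history update per
// RFC 8878; names and layout are illustrative, not the PR's exact code.
#[derive(Debug, PartialEq)]
struct OffsetHistory([usize; 3]);

impl OffsetHistory {
    /// Resolve a sequence's raw offset value into a concrete match
    /// offset and update the three most-recent-offsets history.
    fn resolve(&mut self, offset_value: usize, lit_len: usize) -> usize {
        let [r0, r1, r2] = self.0;
        if offset_value > 3 {
            // Values above 3 encode a fresh offset of (value - 3).
            let offset = offset_value - 3;
            self.0 = [offset, r0, r1];
            return offset;
        }
        // Fold the lit_len == 0 shift into a single table index:
        // values 1..=3 become 0..=2, and a zero literal length bumps
        // the index by one (value 3 then means history[0] - 1).
        let idx = (offset_value - 1) + usize::from(lit_len == 0);
        let (offset, next) = match idx {
            0 => (r0, [r0, r1, r2]),        // repeat-1: history unchanged
            1 => (r1, [r1, r0, r2]),        // repeat-2: swap front two
            2 => (r2, [r2, r0, r1]),        // repeat-3: rotate to front
            _ => (r0 - 1, [r0 - 1, r0, r1]), // lit_len == 0, value 3
        };
        self.0 = next;
        offset
    }
}

fn main() {
    let mut h = OffsetHistory([1, 4, 8]);
    assert_eq!(h.resolve(8, 5), 5); // new offset: 8 - 3
    assert_eq!(h.0, [5, 1, 4]);
    assert_eq!(h.resolve(1, 0), 1); // lit_len == 0 shifts to repeat-2
    assert_eq!(h.0, [1, 5, 4]);
    println!("offset history: {:?}", h.0);
}
```

Folding the `lit_len == 0` shift into the table index is what removes the extra branch from the sequence loop; the four-way `match` over a dense index compiles down to a small jump table.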

Validation

  • cargo fmt --all
  • cargo clippy -p structured-zstd --all-targets -- -D warnings
  • cargo build --workspace
  • cargo nextest run --workspace
  • cargo test --doc --workspace

Benchmark snapshot (local)

  • decompress/default/small-4k-log-lines/rust_stream/matrix/pure_rust: 1439 ns/iter
  • decompress/default/small-4k-log-lines/rust_stream/matrix/c_ffi: 466 ns/iter
  • decompress/default/decodecorpus-z000033/rust_stream/matrix/pure_rust: 1,491,532 ns/iter
  • decompress/default/decodecorpus-z000033/rust_stream/matrix/c_ffi: 380,672 ns/iter

Closes #69

Summary by CodeRabbit

  • Chores
    • Improved CPU prefetching with separate L1/T1 strategies and stride-based prefetch for x86_64 and ARM64; short matches now skip prefetch to avoid waste. Added T1 prefetch entry point.
    • Added architecture-aware runtime dispatch to leverage faster triple-bit extraction on supported x86_64 CPUs.
  • Refactor
    • Reworked offset-history and sequence execution to be more branchless and efficient, with lookahead prefetching.
  • Tests
    • Added unit tests covering offset history and extraction dispatch correctness.

Copilot AI review requested due to automatic review settings April 9, 2026 07:22
@coderabbitai

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 76a5d1e4-75a7-4f92-a2cd-3821299acb1d

📥 Commits

Reviewing files that changed from the base of the PR and between 09e93b9 and 9d8f51e.

📒 Files selected for processing (1)
  • zstd/src/bit_io/bit_reader_reverse.rs

📝 Walkthrough

Walkthrough

Adds three decode-path performance changes: runtime-dispatched x86_64 BMI2 PEXT triple-bit extraction (disabled on Zen1/Zen2), stride-based multi-line prefetch with L1/T1 variants and N+2 lookahead, and a branchless, table-driven offset-history implementation with unit tests.
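The N+2 literal lookahead can be sketched roughly like this. `Sequence`, `execute_sequences`, and the cursor arithmetic below are illustrative assumptions, not the PR's code; `prefetch_slice` stands in for the crate's L1 helper, and the saturating range math mirrors the overflow-safety fix from a later commit:

```rust
// Sketch: while executing sequence n, prefetch the literals that
// sequence n+2 will copy, so the data arrives ~2 iterations early.

fn prefetch_slice(slice: &[u8]) {
    // Stand-in for the real helper (which issues _mm_prefetch / prfm);
    // black_box keeps the sketch portable and non-elidable.
    std::hint::black_box(slice);
}

struct Sequence {
    literal_length: usize, // match length, offset, etc. omitted
}

/// Returns the number of literal bytes consumed.
fn execute_sequences(literals: &[u8], sequences: &[Sequence]) -> usize {
    let mut lit_cursor = 0usize;
    for (n, seq) in sequences.iter().enumerate() {
        if let Some(ahead) = sequences.get(n + 2) {
            // Start of n+2's literals = cursor + literal lengths of the
            // two sequences still ahead of it; saturating to stay safe.
            let start = lit_cursor
                .saturating_add(seq.literal_length)
                .saturating_add(sequences[n + 1].literal_length);
            let end = start
                .saturating_add(ahead.literal_length)
                .min(literals.len());
            if start < end {
                prefetch_slice(&literals[start..end]);
            }
        }
        // ... copy seq.literal_length literals, then execute the match ...
        lit_cursor += seq.literal_length;
    }
    lit_cursor
}

fn main() {
    let literals = vec![0u8; 64];
    let seqs: Vec<Sequence> = [10usize, 12, 8, 6, 4]
        .into_iter()
        .map(|literal_length| Sequence { literal_length })
        .collect();
    assert_eq!(execute_sequences(&literals, &seqs), 40);
    println!("consumed all 40 literal bytes");
}
```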

Changes

  • BMI2 Runtime Dispatch — zstd/src/bit_io/bit_reader_reverse.rs
    Adds runtime dispatch (OnceLock + CPUID) to use BMI2 _pext_u64 for triple-bit extraction when appropriate, implements extract_triple_pext (#[target_feature(enable="bmi2")]) and try_extract_triple_with_pext, and falls back to scalar mask/shift. Adds BMI2 correctness tests and a policy-table unit test.
  • Prefetch System Refactor — zstd/src/decoding/prefetch.rs, zstd/src/decoding/decode_buffer.rs
    Splits prefetch into L1/T1 impls and exposes prefetch_slice_t1; implements stride-based multi-line prefetch (up to 4×64B) for x86/x86_64 and AArch64; changes prefetch_slice to call the L1 impl, switches match-source prefetch to T1, and skips prefetch for short matches; DecodeBuffer::repeat now forwards match_length.
  • Sequence Execution Optimizations — zstd/src/decoding/sequence_execution.rs
    Replaces per-iteration literal prefetch with N+2 lookahead (prefetch_literals_n_plus_two), removes the old helper, and rewrites do_offset_history into a branchless, rule-table-driven selector with conditional updates. Adds unit tests covering history behaviors.
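What the BMI2 path computes can be illustrated with a portable software PEXT, so the sketch runs on any target. The real code calls `core::arch::x86_64::_pext_u64` inside a `#[target_feature(enable = "bmi2")]` function selected at runtime via CPUID; the function names and field widths below are illustrative assumptions:

```rust
/// Portable reference for BMI2 PEXT: gather the bits of `value`
/// selected by `mask` into the low bits of the result.
fn pext_u64_soft(value: u64, mask: u64) -> u64 {
    let (mut out, mut bit, mut m) = (0u64, 0u32, mask);
    while m != 0 {
        let lowest = m & m.wrapping_neg(); // isolate lowest set mask bit
        if value & lowest != 0 {
            out |= 1u64 << bit;
        }
        bit += 1;
        m &= m - 1; // clear that mask bit
    }
    out
}

/// Scalar fallback: peel three bit fields off the low end with
/// shift-and-mask, the way a non-BMI2 path would.
fn peek_triple_scalar(word: u64, widths: [u32; 3]) -> [u64; 3] {
    let (mut shift, mut out) = (0u32, [0u64; 3]);
    for i in 0..3 {
        out[i] = (word >> shift) & ((1u64 << widths[i]) - 1);
        shift += widths[i];
    }
    out
}

/// PEXT-style extraction: one contiguous mask per field. On BMI2
/// hardware the three pext ops are independent and can overlap.
fn peek_triple_pext(word: u64, widths: [u32; 3]) -> [u64; 3] {
    let (mut shift, mut out) = (0u32, [0u64; 3]);
    for i in 0..3 {
        let mask = ((1u64 << widths[i]) - 1) << shift;
        out[i] = pext_u64_soft(word, mask);
        shift += widths[i];
    }
    out
}

fn main() {
    assert_eq!(peek_triple_scalar(0xABC, [4, 4, 4]), [0xC, 0xB, 0xA]);
    assert_eq!(
        peek_triple_scalar(0xDEAD_BEEF_1234_5678, [5, 6, 9]),
        peek_triple_pext(0xDEAD_BEEF_1234_5678, [5, 6, 9])
    );
    println!("scalar and pext-style extraction agree");
}
```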

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

🐇
I hopped through bits and nudged the cache ahead,
I taught offsets to twirl without a branchy tread,
I puffed small prefetch lines into the night,
And PEXT now hums when silicon is right,
A carrot-coded cheer beneath the CPU light 🥕

🚥 Pre-merge checks | ✅ 5 passed

  • Description check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The title accurately summarizes the three main performance optimizations in the PR: branchless offset history, prefetch pipeline improvements, and BMI2 triple extract. It is specific, clear, and directly reflects the changeset.
  • Linked Issues check — ✅ Passed. All three primary objectives from issue #69 are met: (1) do_offset_history() converted to a table-driven branchless implementation, (2) stride prefetch with multi-line and N+2 lookahead plus match-source T1 prefetch with AArch64 support, (3) runtime-dispatched BMI2 pext for triple extraction with fallback and an AMD Zen1/Zen2 guard. Code changes align with the acceptance criteria.
  • Out of Scope Changes check — ✅ Passed. All code changes are directly scoped to the three micro-optimizations defined in issue #69. No unrelated refactorings, documentation, or config changes are present.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 75-88: The two unnecessary unsafe blocks around CPU feature
queries should be removed: call __cpuid directly (instead of unsafe {
__cpuid(...) }) when building the vendor array and when reading eax, so the
vendor computation (vendor, leaf0, is_amd) and the subsequent eax =
__cpuid(1).eax use safe calls; leave logic that returns TripleExtractDispatch {
use_pext: true } and the is_amd comparison unchanged. Ensure there are no
remaining unused unsafe wrappers around __cpuid to avoid the unused_unsafe
warning.

In `@zstd/src/decoding/prefetch.rs`:
- Around line 58-75: The prefetch_stride_x86 function uses a runtime hint
parameter but needs the hint as a const generic like the x86_64 fix; change the
signature of prefetch_stride_x86 to take a const generic hint (e.g. fn
prefetch_stride_x86<const HINT: i32>(slice: &[u8])) and replace uses of the
runtime hint variable with the const HINT in the unsafe _mm_prefetch call, then
update all callers to invoke the function with the const generic (e.g.
prefetch_stride_x86::<{_MM_HINT_T0}> or the appropriate hint constant) to mirror
the x86_64 change.
- Around line 27-42: The function prefetch_stride_x86_64 currently takes hint:
i32 at runtime which fails because core::arch::x86_64::_mm_prefetch requires a
compile-time constant; change prefetch_stride_x86_64 to accept a const generic
(e.g., const HINT: i32) and use that constant when calling _mm_prefetch, and
update all call sites to invoke prefetch_stride_x86_64::<HINT>(slice) with the
appropriate constant hint values; ensure the function body uses HINT rather than
a runtime parameter and remove the old hint argument from its signature.
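The fix these comments ask for can be sketched as below, with hypothetical names. On x86_64, `_mm_prefetch` takes its hint as a const generic (`_MM_HINT_T0 = 3`, `_MM_HINT_T1 = 2`), so the hint must be threaded through as a compile-time constant; a no-op fallback keeps the sketch building on other targets:

```rust
// Sketch of a const-generic stride prefetch; not the PR's exact code.
const CACHE_LINE: usize = 64;
const MAX_LINES: usize = 4;

#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_stride<const HINT: i32>(slice: &[u8]) {
    use core::arch::x86_64::_mm_prefetch;
    // Touch one cache line per 64 bytes, capped at 4 lines.
    let lines = slice.len().div_ceil(CACHE_LINE).min(MAX_LINES);
    for line in 0..lines {
        // Safety: every prefetched address lies inside the slice, and
        // prefetch instructions never fault in any case.
        unsafe {
            _mm_prefetch::<HINT>(slice.as_ptr().add(line * CACHE_LINE).cast());
        }
    }
}

#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch_stride<const HINT: i32>(_slice: &[u8]) {}

fn main() {
    let buf = vec![0u8; 300];
    prefetch_stride::<3>(&buf); // _MM_HINT_T0: pull into L1
    prefetch_stride::<2>(&buf); // _MM_HINT_T1: pull into L2
    println!("issued prefetch hints for {} bytes", buf.len());
}
```

Callers select the hint at the call site (`prefetch_stride::<3>(...)`), so each monomorphized copy bakes the hint into the instruction with no runtime parameter.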

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8485e503-1db7-4a83-81f2-a12d9c9fdb38

📥 Commits

Reviewing files that changed from the base of the PR and between 1112571 and e61c7bf.

📒 Files selected for processing (4)
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs


Copilot AI left a comment


Pull request overview

Performance-focused updates to the zstd decode hot path: minimizing branches in offset-history handling, adding a more aggressive prefetch pipeline (literals + match source), and introducing a runtime-dispatched BMI2 path for triple bit extraction.

Changes:

  • Reworks do_offset_history() to a table-driven, branch-minimized implementation and adds regression tests for offset-history behavior.
  • Adds N+2 lookahead literal prefetching and introduces L1/L2 prefetch variants (including an AArch64 prfm path).
  • Adds a runtime-dispatched BMI2 (pext) extraction path in peek_bits_triple() on x86_64 (std builds), with AMD family gating.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File descriptions:
  • zstd/src/decoding/sequence_execution.rs — Branch-minimized offset-history resolution, N+2 literal prefetch, and new offset-history tests.
  • zstd/src/decoding/prefetch.rs — Adds stride-based multi-line prefetch and separate L1 (T0) vs L2 (T1) hint paths, incl. AArch64.
  • zstd/src/decoding/decode_buffer.rs — Prefetches the match source with an L2 hint, gated by match length.
  • zstd/src/bit_io/bit_reader_reverse.rs — Adds a runtime-dispatched BMI2 pext path for triple field extraction on x86_64 (std).

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 96.99571% with 7 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
zstd/src/bit_io/bit_reader_reverse.rs | 93.80% | 7 Missing ⚠️


polaz added 2 commits April 9, 2026 10:45
…raction

- make sequence offset-history updates table-driven with targeted regression tests
- add N+2 literal prefetch and multi-line stride prefetch with L1/L2 hints
- route match-source prefetch through T1 with size gating
- add runtime-dispatched BMI2 pext path for triple bit extraction with AMD Zen1/2 fallback

Refs #69
- remove unnecessary unsafe wrappers around __cpuid dispatch checks
- switch x86/x86_64 prefetch helpers to const-generic prefetch hints and fix fallback cfg gating
- preserve ZeroOffset error path in branchless offset history and add regression test
- sync PR/issue wording with current BMI2 3x pext implementation
@polaz polaz force-pushed the perf/#69-branchless-prefetch-pext branch from 1fd4266 to 18f32aa on April 9, 2026 07:45
@polaz polaz requested a review from Copilot April 9, 2026 07:47
@polaz polaz (Member Author) commented Apr 9, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Apr 9, 2026

@polaz: Sure! I'll re-review the PR now.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.
✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/prefetch.rs`:
- Around line 96-120: The loop currently matches on PrefetchHintAarch64 per
iteration causing a branch; change prefetch_stride_aarch64 to a const-generic
form so the hint is resolved at compile time (e.g. fn
prefetch_stride_aarch64<const L1: bool>(slice: &[u8], /* or no hint param */) or
if your PrefetchHintAarch64 is usable as a const, use fn
prefetch_stride_aarch64<const H: PrefetchHintAarch64>(slice: &[u8]) ), move the
conditional/asm selection out of the loop (use an if L1 { asm!("prfm
pldl1keep...") } else { asm!("prfm pldl2keep...") } inside the loop but with L1
const eliminates branching) and update callers to call the function with the
appropriate const generic (<true> or <false> or <H>) instead of passing a
runtime hint; keep the rest of the logic (CACHE_LINE, MAX_LINES, bounds)
unchanged and retain #[inline(always)].
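A const-generic AArch64 variant along the lines the comment suggests might look like this. Names are illustrative; a const bool picks `pldl1keep` vs `pldl2keep` so the selection folds away at monomorphization, and a no-op fallback keeps the sketch building off aarch64:

```rust
// Sketch only: the PR's real helper lives in zstd/src/decoding/prefetch.rs.
const CACHE_LINE: usize = 64;
const MAX_LINES: usize = 4;

#[cfg(target_arch = "aarch64")]
#[inline(always)]
fn prefetch_stride_aarch64<const L1: bool>(slice: &[u8]) {
    let lines = slice.len().div_ceil(CACHE_LINE).min(MAX_LINES);
    for line in 0..lines {
        // Safety: address is within the slice; prfm never faults.
        let addr = unsafe { slice.as_ptr().add(line * CACHE_LINE) };
        // L1 is a const generic, so this branch is resolved at compile
        // time and each instantiation contains a single prfm form.
        unsafe {
            if L1 {
                core::arch::asm!("prfm pldl1keep, [{0}]", in(reg) addr,
                                 options(nostack, readonly));
            } else {
                core::arch::asm!("prfm pldl2keep, [{0}]", in(reg) addr,
                                 options(nostack, readonly));
            }
        }
    }
}

#[cfg(not(target_arch = "aarch64"))]
#[inline(always)]
fn prefetch_stride_aarch64<const L1: bool>(_slice: &[u8]) {}

fn main() {
    let data = vec![1u8; 200];
    prefetch_stride_aarch64::<true>(&data);  // keep in L1
    prefetch_stride_aarch64::<false>(&data); // keep in L2
    println!("prefetch hints issued (no-op off aarch64)");
}
```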

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 294ed149-289e-41c5-9b53-412ff74a2905

📥 Commits

Reviewing files that changed from the base of the PR and between 1b396b2 and 18f32aa.

📒 Files selected for processing (4)
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

- make aarch64 prefetch hint compile-time via const generics and use readonly asm options
- make N+2 literal prefetch range computation overflow-safe on usize targets
- remove lit_len branch from do_offset_history via a single indexed rule table
- add BMI2 triple-extract regression coverage against scalar reference
@polaz polaz (Member Author) commented Apr 9, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Apr 9, 2026

@polaz Sure! I'll re-review the PR now.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T06:17:40.408Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), both `copy_with_checks` and `copy_with_nobranch_check` thread aggregate source/destination capacities (`m1_src_cap`, `m2_src_cap`, `f1_dst_cap`, `f2_dst_cap`) into `simd_copy::copy_bytes_overshooting` so the SIMD eligibility check (`min(src_len, dst_len) >= copy_multiple`) is applied uniformly. Do not flag the capacity parameters in `copy_with_nobranch_check` as unnecessary or inconsistent with `copy_with_checks`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-06T01:40:24.378Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T22:29:06.406Z
Learning: In `zstd/src/dictionary/fastcover.rs` (structured-world/structured-zstd), `FastCoverTuned.accel` is correctly populated in `train_fastcover_internal` using `accel: params.accel`, and `normalize_fastcover_params` clamps `accel` to `1..=10`. Do not flag `FastCoverTuned.accel` as unclamped or incorrectly set.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T16:00:30.438Z
Learning: In `zstd/src/encoding/match_generator.rs`, the `RowMatchGenerator` (used for `CompressionLevel::Level(4)` / `MatcherBackend::Row`) uses a stable 4-byte hash key, consistent with the 4-byte lookahead constraint shared by `HcMatchGenerator`. Previous-block tail positions are backfilled into the row tables before matching/skip begins (analogous to `backfill_boundary_positions` for the HC backend), and a regression test covers cross-boundary tail reuse. Do not flag missing backfill or key-width issues for the Row backend.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.
✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 65-97: The CPUID-dependent policy in
detect_triple_extract_dispatch is hard to unit-test; extract the vendor/family
predicate into a pure helper function (e.g., should_use_pext(vendor: [u8;12],
family: u32) -> bool) and have detect_triple_extract_dispatch call that helper
to decide TripleExtractDispatch.use_pext. Keep existing CPUID reads in
detect_triple_extract_dispatch but move the logic that computes is_amd and the
family != 0x17 check into should_use_pext, and add table-driven unit tests for
should_use_pext covering AuthenticAMD+0x17 => false, AuthenticAMD+0x19 => true,
and a non-AMD vendor => true.
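The pure predicate the comment proposes can be sketched as below. The vendor string and family come from CPUID in the caller; `should_use_pext` is the comment's own suggested name, and the family 0x17 guard reflects that PEXT is microcoded (and slow) on AMD Zen 1/Zen 2:

```rust
/// Pure, testable dispatch policy: given the CPUID vendor string and
/// CPU family, decide whether the BMI2 PEXT path should be used.
/// The unsafe __cpuid reads stay in the caller.
fn should_use_pext(vendor: &[u8; 12], family: u32) -> bool {
    let is_amd = vendor == b"AuthenticAMD";
    // AMD family 0x17 (Zen 1/Zen 2) implements PEXT in microcode with
    // very high latency; everywhere else with BMI2 it is fast.
    !(is_amd && family == 0x17)
}

fn main() {
    assert!(!should_use_pext(b"AuthenticAMD", 0x17)); // Zen1/Zen2: scalar path
    assert!(should_use_pext(b"AuthenticAMD", 0x19));  // Zen3+: pext ok
    assert!(should_use_pext(b"GenuineIntel", 0x06));  // Intel: pext ok
    println!("dispatch policy checks passed");
}
```

Note that the family fed into this predicate should be the effective family (base family plus extended family from CPUID leaf 1 EAX), so that 0x17 and 0x19 compare correctly.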

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80849646-c9d0-4a4a-b4b3-211a6719e801

📥 Commits

Reviewing files that changed from the base of the PR and between 1b396b2 and 09e93b9.

📒 Files selected for processing (4)
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


@sw-release-bot sw-release-bot bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

Benchmark suite | Current: 9d8f51e | Previous: 1b396b2 | Ratio
compress/default/small-4k-log-lines/matrix/pure_rust | 6.611 ms | 4.594 ms | 1.44
compress/best/decodecorpus-z000033/matrix/pure_rust | 100.615 ms | 64.284 ms | 1.57
compress/best/low-entropy-1m/matrix/c_ffi | 1.589 ms | 1.159 ms | 1.37

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

- extract should_use_pext(vendor, family) from cpuid dispatch path
- add table-driven tests for AMD Zen1/2 guard and non-AMD behavior
- keep runtime dispatch semantics unchanged while improving branch coverage

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@polaz polaz merged commit e101e88 into main Apr 9, 2026
17 checks passed
@polaz polaz deleted the perf/#69-branchless-prefetch-pext branch April 9, 2026 11:34
@sw-release-bot sw-release-bot bot mentioned this pull request Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

perf(decoding): branchless offset history, stride prefetch, BMI2 pext

2 participants