
chore(bench): split specifier microbenches into a separate bench binary #247

Merged
stormslowly merged 2 commits into main from bench/stable-short-specifier
May 28, 2026

Conversation

@stormslowly
Collaborator

@stormslowly stormslowly commented May 27, 2026

Why

specifier/realistic[rw/hash-only] (~340 instructions) and specifier/branches[fragment/short] (~170 instructions) are short enough that a single cold instruction-cache fill on the CodSpeed measurement iteration dominates the result. ~200 cycles of fixed overhead per case translates to +5% to +10% deltas under any unrelated binary-layout shift, even when the parser itself is unchanged or faster. CodSpeed surfaces these as false-positive regressions.
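
The arithmetic behind that claim can be sketched as follows. The fixed overhead of 200 cycles comes from the description above; the baseline cycle counts are hypothetical values chosen only to show how the same fixed cost shrinks as a benchmark gets longer, not numbers measured in this PR.

```rust
/// Relative change, in percent, caused by adding a fixed overhead to a baseline.
fn relative_delta_pct(baseline_cycles: f64, fixed_overhead_cycles: f64) -> f64 {
    fixed_overhead_cycles / baseline_cycles * 100.0
}

/// Prints the relative delta for a few HYPOTHETICAL baselines.
/// Only the 200-cycle overhead is taken from the PR description.
fn print_overhead_table() {
    let fixed_overhead = 200.0;
    for baseline in [2_000.0, 4_000.0, 400_000.0] {
        println!(
            "baseline {:>9.0} cycles -> +{:.2}%",
            baseline,
            relative_delta_pct(baseline, fixed_overhead)
        );
    }
}
```

Under these assumed baselines, a short case in the low thousands of cycles lands in the 5% to 10% range, while a case a hundred times longer would see the same overhead as well under 0.1%.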

Concrete signature from the #246 investigation: every "regressed" short specifier bench showed Ir Δ = 0, cycles Δ ≈ +105, I1mr Δ = +1, ILmr Δ = +1 — same value across all cases, a cache-line shift artifact, not a real perf change.
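
A minimal sketch of how that signature could be checked mechanically. `CounterDelta`, `looks_like_layout_artifact`, and the tolerance of 5 cycles are hypothetical names and values introduced for illustration; they are not part of CodSpeed or of this repository.

```rust
/// Per-benchmark change between two Callgrind runs (HEAD minus BASE).
#[derive(Debug, Clone, Copy)]
struct CounterDelta {
    ir: i64,     // instructions executed
    cycles: i64, // estimated cycles
    i1mr: i64,   // L1 instruction-cache read misses
    ilmr: i64,   // last-level instruction read misses
}

/// Assumed slack for "approximately the same" cycle delta across cases.
const CYCLE_TOLERANCE: i64 = 5;

/// Returns true when every case shows the signature described above:
/// no change in instruction count, and the same non-zero instruction-cache
/// miss delta (and roughly the same cycle delta) in every case.
fn looks_like_layout_artifact(deltas: &[CounterDelta]) -> bool {
    let first = match deltas.first() {
        Some(first) => *first,
        None => return false,
    };
    // At least one instruction-cache miss counter must have moved.
    if first.i1mr == 0 && first.ilmr == 0 {
        return false;
    }
    deltas.iter().all(|d| {
        d.ir == 0
            && d.i1mr == first.i1mr
            && d.ilmr == first.ilmr
            && (d.cycles - first.cycles).abs() <= CYCLE_TOLERANCE
    })
}
```

A real regression in the parser would normally change `ir` in at least one case, which is why the check insists on `ir == 0` everywhere.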

What

Stabilize the short-specifier microbenches against binary-layout noise in two complementary ways:

1. Separate [[bench]] binary

Move the four specifier/* bench groups into their own benches/specifier.rs and register it as a second [[bench]] in Cargo.toml. Each [[bench]] runs in its own process, so the specifier binary gets a fresh, much smaller instruction-cache footprint instead of competing with the large bench_resolver code for cache lines.

| | Before | After |
|---|---|---|
| Bench binaries | 1 (resolver) | 2 (resolver, specifier) |
| Specifier code shares I-cache with | resolver setup + symlink fixtures + tokio runtime + … | only the 4 specifier bench groups |
| Cold-start misses per short case | unpredictable (shifts with any unrelated change) | scoped to the specifier binary |

  • benches/specifier.rs: new file. Allocator wrapper mirrors the one in bench_resolver so allocation costs are measured identically across binaries.
  • benches/resolver.rs: drops the specifier/* groups and their helpers; only bench_resolver remains.
  • Cargo.toml: adds [[bench]] name = "specifier".
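
The allocator wrapper itself is not shown in this conversation. As an illustration of what "the same wrapper in both binaries" can look like, here is a minimal sketch that assumes a counting wrapper around the system allocator. `CountingAlloc`, `ALLOC_CALLS`, `ALLOC_BYTES`, and `alloc_snapshot` are hypothetical names, and the real wrapper in `bench_resolver` may behave differently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical counters; the wrapper in the PR may track other things or nothing.
static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Thin wrapper that forwards to the system allocator and counts requests.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Registering the same wrapper in each bench binary keeps allocation
// behavior comparable between them.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Returns (number of allocation calls, total bytes requested) so far.
fn alloc_snapshot() -> (usize, usize) {
    (
        ALLOC_CALLS.load(Ordering::Relaxed),
        ALLOC_BYTES.load(Ordering::Relaxed),
    )
}
```

Because `#[global_allocator]` is per binary, each `[[bench]]` target has to declare it separately, which is presumably why the wrapper is mirrored rather than shared implicitly.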

2. Warm parse case in setup, outside Callgrind window

Each bench_with_input now runs 32 parses of the actual input before b.iter, via a warm_parse helper. CodSpeed's internal WARMUP_RUNS=5 happens inside b.iter and is intended to prime the harness, not absorb cold cache misses from a freshly-relayouted binary. The setup warmup runs before CodSpeed flips on Callgrind instrumentation, so it pages in parse code, lazy-inits the allocator, and trains the branch predictor on the actual input without inflating the measured counters.

| Phase | When | Instrumented? | Purpose |
|---|---|---|---|
| warm_parse(s) (NEW) | before b.iter | no | absorb single cold I-fetch miss caused by layout shifts |
| WARMUP_RUNS=5 | inside b.iter, before measurement | no | harness priming |
| measurement iter | inside b.iter | yes | the one number CodSpeed records |

No code in Specifier::parse is touched; this is purely test-infrastructure stabilization.
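
For readers who want a picture of the warmup pass, a minimal std-only sketch follows. The count of 32 and the name `warm_parse` come from the description above; the signature, the `WARM_PARSES` constant, and `parse_stand_in` (a toy replacement for `Specifier::parse`, whose real signature is not shown here) are assumptions made for illustration.

```rust
use std::hint::black_box;

/// Number of warmup parses, as stated in the PR description.
const WARM_PARSES: usize = 32;

/// Runs `parse` on `input` repeatedly so code pages, allocator state, and
/// branch history are warm before the measured section starts.
/// `black_box` keeps the compiler from optimizing the calls away.
fn warm_parse<'a, T>(input: &'a str, mut parse: impl FnMut(&'a str) -> T) {
    for _ in 0..WARM_PARSES {
        black_box(parse(black_box(input)));
    }
}

/// HYPOTHETICAL stand-in for `Specifier::parse`: splits a specifier into
/// (path, query, fragment), keeping the leading `?` and `#`.
fn parse_stand_in(spec: &str) -> (&str, &str, &str) {
    let (rest, fragment) = match spec.find('#') {
        Some(i) => (&spec[..i], &spec[i..]),
        None => (spec, ""),
    };
    let (path, query) = match rest.find('?') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    (path, query, fragment)
}
```

In the bench file, a call such as `warm_parse(input, parse_stand_in)` would sit inside the `bench_with_input` closure but before `b.iter`, so the 32 parses run during setup rather than inside the instrumented iteration.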

Test plan

  • cargo build --benches --features __internal_bench — both binaries build
  • cargo clippy --all-features -- -D warnings — clean
  • CodSpeed benchmark workflow on this PR shows specifier deltas stable across follow-up unrelated commits (verified by landing a no-op binary-layout change after this and confirming the specifier deltas don't move)

Note

The first CodSpeed run after this lands will show a one-shot reset for the specifier benches (different binary → different baseline). Subsequent runs are the actual stable measurements.

@codspeed-hq

codspeed-hq Bot commented May 27, 2026

Merging this PR will improve performance by 4.25%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.


⚡ 1 improved benchmark
✅ 9 untouched benchmarks
🆕 80 new benchmarks
⏩ 84 skipped benchmarks (see footnote 1)

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Memory | both-tail[len_64] | N/A | 0 B | N/A |
| 🆕 | Memory | query-tail[len_8] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[none/short] | N/A | 0 B | N/A |
| 🆕 | Memory | frag-tail[len_64] | N/A | 0 B | N/A |
| 🆕 | Memory | path-only[len_8] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[bare-module] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[query+fragment/medium] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[multi-question] | N/A | 0 B | N/A |
| 🆕 | Memory | both-tail[len_1536] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[query/short] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[fragment/medium] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[query+fragment/short] | N/A | 0 B | N/A |
| 🆕 | Memory | both-tail[len_256] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/realistic[rw/css-modules] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/branches[query/medium] | N/A | 0 B | N/A |
| 🆕 | Memory | frag-tail[len_1536] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/realistic[rw/loader-chain] | N/A | 0 B | N/A |
| 🆕 | Memory | path-only[len_1536] | N/A | 0 B | N/A |
| 🆕 | Memory | both-tail[len_8] | N/A | 0 B | N/A |
| 🆕 | Memory | specifier/escapes[escapes_64] | N/A | 1 KB | N/A |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.



Comparing bench/stable-short-specifier (68da90c) with main (9dd63ca)


Footnotes

  1. 84 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@stormslowly stormslowly force-pushed the bench/stable-short-specifier branch from 6198697 to 18984c2 Compare May 27, 2026 03:36
`specifier/realistic[rw/hash-only]` (~340 instructions) and
`specifier/branches[fragment/short]` (~170 instructions) are short
enough that the cold instruction-cache fill on each CodSpeed
measurement iteration dominates the result: ~200 cycles of fixed
overhead translates to +5% to +10% deltas under any unrelated
binary-layout shift, even when the parser itself is unchanged or
faster. CodSpeed surfaces these as false-positive regressions.

Move the four `specifier/*` bench groups into their own
`benches/specifier.rs` and register it as a second `[[bench]]` in
Cargo.toml. Each `[[bench]]` runs in its own process, so the
specifier binary gets a fresh, much smaller instruction-cache
footprint instead of competing with the large `bench_resolver` code
for cache lines. The per-case Ir is unchanged — what changes is the
working-set the kernel and the L1/LL caches see before measurement
starts, which makes cold-start misses predictable across runs.

- `benches/specifier.rs`: new file. Allocator wrapper mirrors the one
  in `bench_resolver` so allocation costs are measured identically.
- `benches/resolver.rs`: drops `specifier/*` groups, helpers,
  unused imports.
- `Cargo.toml`: adds `[[bench]] name = "specifier"`.

No code paths in `Specifier::parse` are touched; this is purely
test-infrastructure stabilization.
@stormslowly stormslowly force-pushed the bench/stable-short-specifier branch from 18984c2 to 4e33856 Compare May 27, 2026 05:46
@stormslowly stormslowly changed the title from "chore(bench): amortize cold-start cache noise on short specifier benches" to "chore(bench): split specifier microbenches into a separate bench binary" May 27, 2026
@stormslowly stormslowly force-pushed the bench/stable-short-specifier branch 2 times, most recently from f5fd078 to 4e33856 Compare May 27, 2026 06:42
CodSpeed's `WARMUP_RUNS=5` inside `b.iter` primes the harness but does
not absorb the single cold I-fetch miss (~105 estimated cycles) that a
binary-layout shift can introduce on short cases like
`specifier/realistic[rw/hash-only]`. Add a per-input `warm_parse` setup
pass that runs 32 parses outside the Callgrind instrumentation window,
paging in parse code, lazy-initializing the allocator, and training the
branch predictor on the actual input before measurement begins.
@stormslowly stormslowly requested a review from hardfist May 27, 2026 15:29
@stormslowly stormslowly marked this pull request as ready for review May 27, 2026 15:29
Copilot AI review requested due to automatic review settings May 27, 2026 15:29
Contributor

Copilot AI left a comment


Pull request overview

This PR stabilizes CodSpeed measurements for very short specifier/* microbenchmarks by isolating them from unrelated binary-layout changes and by pre-warming the parse hot path outside the instrumented benchmark window.

Changes:

  • Add a second Criterion bench binary (specifier) dedicated to Specifier::parse microbenches.
  • Move all specifier/* benchmarks from benches/resolver.rs into the new benches/specifier.rs.
  • Add a warm_parse pre-pass before b.iter to reduce cold instruction-cache effects on short cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| Cargo.toml | Registers a new [[bench]] (specifier) gated by __internal_bench. |
| benches/specifier.rs | New dedicated specifier benchmark binary, including allocator wrapper + pre-b.iter warmup. |
| benches/resolver.rs | Removes the specifier/* benchmark groups and associated helpers/imports from the resolver bench binary. |


Comment thread benches/specifier.rs
@stormslowly stormslowly merged commit 2808986 into main May 28, 2026
22 checks passed
@stormslowly stormslowly deleted the bench/stable-short-specifier branch May 28, 2026 03:18
3 participants