chore(bench): split specifier microbenches into a separate bench binary #247
Merging this PR will improve performance by 4.25%
6198697 to
18984c2
Compare
`specifier/realistic[rw/hash-only]` (~340 instructions) and `specifier/branches[fragment/short]` (~170 instructions) are short enough that the cold instruction-cache fill on each CodSpeed measurement iteration dominates the result: ~200 cycles of fixed overhead translates to +5% to +10% deltas under any unrelated binary-layout shift, even when the parser itself is unchanged or faster. CodSpeed surfaces these as false-positive regressions.

Move the four `specifier/*` bench groups into their own `benches/specifier.rs` and register it as a second `[[bench]]` in Cargo.toml. Each `[[bench]]` runs in its own process, so the specifier binary gets a fresh, much smaller instruction-cache footprint instead of competing with the large `bench_resolver` code for cache lines. The per-case Ir is unchanged; what changes is the working set that the kernel and the L1/LL caches see before measurement starts, which makes cold-start misses predictable across runs.

- `benches/specifier.rs`: new file. Allocator wrapper mirrors the one in `bench_resolver` so allocation costs are measured identically.
- `benches/resolver.rs`: drops `specifier/*` groups, helpers, unused imports.
- `Cargo.toml`: adds `[[bench]] name = "specifier"`.

No code paths in `Specifier::parse` are touched; this is purely test-infrastructure stabilization.
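The arithmetic behind the "+5% to +10%" figure can be sketched as follows. This is only an illustration: the function names are hypothetical, and the baseline cycle counts are derived from the ~200-cycle overhead quoted above, not measured values from this repository.

```rust
/// Relative delta, in percent, that a fixed cycle overhead adds to a baseline.
fn relative_delta_pct(baseline_cycles: f64, fixed_overhead_cycles: f64) -> f64 {
    fixed_overhead_cycles / baseline_cycles * 100.0
}

/// Baseline cycle count at which a fixed overhead shows up as `delta_pct` percent.
/// (Hypothetical helper, used only to invert the relation above.)
fn implied_baseline_cycles(fixed_overhead_cycles: f64, delta_pct: f64) -> f64 {
    fixed_overhead_cycles / (delta_pct / 100.0)
}
```

Under these assumptions, a fixed 200-cycle overhead reads as 5% on a 4,000-cycle case and 10% on a 2,000-cycle case, while the same 200 cycles on a 1,000,000-cycle benchmark would be 0.02%. That is why only the very short cases are affected by layout shifts.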
18984c2 to
4e33856
Compare
f5fd078 to
4e33856
Compare
CodSpeed's `WARMUP_RUNS=5` inside `b.iter` primes the harness but does not absorb the single cold I-fetch miss (~105 estimated cycles) that a binary-layout shift can introduce on short cases like `specifier/realistic[rw/hash-only]`. Add a per-input `warm_parse` setup pass that runs 32 parses outside the Callgrind instrumentation window, paging in parse code, lazy-initializing the allocator, and training the branch predictor on the actual input before measurement begins.
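A minimal sketch of what such a setup pass might look like, assuming a parser that takes a `&str`. The PR does not show the real `warm_parse` body or the signature of `Specifier::parse`, so `parse_stand_in` and the return value below are hypothetical and exist only to keep the example self-contained.

```rust
use std::hint::black_box;

/// Hypothetical stand-in for the crate's `Specifier::parse`.
/// It counts '/' bytes so that each call does observable work.
fn parse_stand_in(input: &str) -> usize {
    input.bytes().filter(|&b| b == b'/').count()
}

/// Number of setup parses, as stated in the commit message.
const WARM_PARSES: usize = 32;

/// Parse the actual benchmark input repeatedly before measurement starts.
/// Returns the number of parses performed (an assumption made here so the
/// sketch can be checked; the real helper may return nothing).
fn warm_parse(input: &str) -> usize {
    let mut runs = 0;
    for _ in 0..WARM_PARSES {
        // `black_box` keeps the optimizer from discarding or hoisting the call.
        black_box(parse_stand_in(black_box(input)));
        runs += 1;
    }
    runs
}
```

In a Criterion bench, a call such as `warm_parse(input)` would sit inside the `bench_with_input` closure just before `b.iter(...)`, so that it runs during setup rather than in the measured closure.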
Pull request overview
This PR stabilizes CodSpeed measurements for very short specifier/* microbenchmarks by isolating them from unrelated binary-layout changes and by pre-warming the parse hot path outside the instrumented benchmark window.
Changes:
- Add a second Criterion bench binary (`specifier`) dedicated to `Specifier::parse` microbenches.
- Move all `specifier/*` benchmarks from `benches/resolver.rs` into the new `benches/specifier.rs`.
- Add a `warm_parse` pre-pass before `b.iter` to reduce cold instruction-cache effects on short cases.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Cargo.toml | Registers a new [[bench]] (specifier) gated by __internal_bench. |
| benches/specifier.rs | New dedicated specifier benchmark binary, including allocator wrapper + pre-b.iter warmup. |
| benches/resolver.rs | Removes the specifier/* benchmark groups and associated helpers/imports from the resolver bench binary. |
Why
`specifier/realistic[rw/hash-only]` (~340 instructions) and `specifier/branches[fragment/short]` (~170 instructions) are short enough that a single cold instruction-cache fill on the CodSpeed measurement iteration dominates the result. ~200 cycles of fixed overhead per case translates to +5% to +10% deltas under any unrelated binary-layout shift, even when the parser itself is unchanged or faster. CodSpeed surfaces these as false-positive regressions.

Concrete signature from the #246 investigation: every "regressed" short specifier bench showed `Ir Δ = 0`, `cycles Δ ≈ +105`, `I1mr Δ = +1`, `ILmr Δ = +1`. The values were the same across all cases, which points to a cache-line shift artifact rather than a real performance change.

What
Stabilize the short-specifier microbenches against binary-layout noise in two complementary ways:
1. Separate `[[bench]]` binary
Move the four `specifier/*` bench groups into their own `benches/specifier.rs` and register it as a second `[[bench]]` in `Cargo.toml`. Each `[[bench]]` runs in its own process, so the specifier binary gets a fresh, much smaller instruction-cache footprint instead of competing with the large `bench_resolver` code for cache lines.
Bench binaries before: `resolver`. Bench binaries after: `resolver`, `specifier`.
- `benches/specifier.rs`: new file. Allocator wrapper mirrors the one in `bench_resolver` so allocation costs are measured identically across binaries.
- `benches/resolver.rs`: drops the `specifier/*` groups and their helpers; only `bench_resolver` remains.
- `Cargo.toml`: adds `[[bench]] name = "specifier"`.
2. Warm parse case in setup, outside Callgrind window
Each `bench_with_input` now runs 32 parses of the actual input before `b.iter`, via a `warm_parse` helper. CodSpeed's internal `WARMUP_RUNS=5` happens inside `b.iter` and is intended to prime the harness, not absorb cold cache misses from a freshly-relayouted binary. The setup warmup runs before CodSpeed flips on Callgrind instrumentation, so it pages in parse code, lazy-inits the allocator, and trains the branch predictor on the actual input without inflating the measured counters.
- `warm_parse(s)` (new): runs before `b.iter`.
- `WARMUP_RUNS=5`: runs inside `b.iter`, before measurement.
- Measurement: runs inside `b.iter`.
No code in `Specifier::parse` is touched; this is purely test-infrastructure stabilization.

Test plan
- `cargo build --benches --features __internal_bench`: both binaries build
- `cargo clippy --all-features -- -D warnings`: clean

Note
The first CodSpeed run after this lands will show a one-shot reset for the specifier benches (different binary → different baseline). Subsequent runs are the actual stable measurements.