Optimize dict array validation to use true_count() #8263
Merged
Replace the bit-by-bit `referenced_mask.iter().all(|v| v)` scan in `validate_all_values_referenced` with `true_count() == len()`, which dispatches to the SIMD (AVX2/AVX-512) popcount path in `count_ones` instead of iterating each bit with a per-bit branch.

Signed-off-by: Claude <noreply@anthropic.com>
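To make the difference concrete, here is a minimal sketch on a hand-rolled bitmap. `Bitmap`, `all_set_per_bit`, and `all_set_popcount` are hypothetical names used only for illustration; they are not the `BitBuffer` API this PR touches. The only real call relied on is the standard library's `u64::count_ones`.

```rust
/// Hypothetical stand-in for a packed bit buffer: `len` bits stored
/// LSB-first in 64-bit words. Bits at positions >= `len` are always zero.
pub struct Bitmap {
    words: Vec<u64>,
    len: usize,
}

impl Bitmap {
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &bit) in bits.iter().enumerate() {
            if bit {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: bits.len() }
    }

    /// Per-bit scan, analogous to `iter().all(|v| v)`: one test and
    /// one branch for every bit.
    pub fn all_set_per_bit(&self) -> bool {
        (0..self.len).all(|i| (self.words[i / 64] >> (i % 64)) & 1 == 1)
    }

    /// Word-at-a-time popcount: one `count_ones` per 64 bits.
    pub fn true_count(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// Every bit is set exactly when the popcount equals the length.
    pub fn all_set_popcount(&self) -> bool {
        self.true_count() == self.len
    }
}
```

Both checks return the same answer; the popcount form simply does 64 bits of work per loop iteration and has no data-dependent branch inside the loop.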
`compute_referenced_values_mask` previously scattered every code into a `Vec<bool>` and then packed the whole thing into a `BitBuffer`. Track how many distinct values remain unreferenced and stop scanning the moment all values have been seen; when that happens the mask is constant so we return a filled `BitBuffer` directly and skip the pack entirely.

Dictionaries are commonly referenced by far more codes than they have values, so this skips the bulk of the scatter in the typical case. The store is still skipped for already-seen values, which keeps the sparse-coverage case (few distinct values, many repeats) branch-friendly and avoids a read-modify-write storm on a handful of hot bytes.

Benchmark (divan, dict_unreferenced_mask, median):

    many_codes_few_values/1024   34.0us -> 5.1us    (6.7x)
    many_codes_few_values/2048   34.2us -> 12.4us   (2.8x)
    many_codes_few_values/4096   35.8us -> 38.0us   (-6%)
    many_nulls/0.5               39.9us -> 6.6us    (6.0x)
    many_nulls/0.9               67.0us -> 8.0us    (8.4x)
    sparse_coverage/*            ~31us  -> ~30us    (flat)

The 4096 case regresses slightly: the dictionary is only fully covered near the end of the codes, so the early exit saves little while the per-code seen-check costs a few mispredicts. This is outweighed by the multi-x wins when coverage completes early.

Signed-off-by: Claude <noreply@anthropic.com>
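The early-exit scan described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the function name `referenced_mask_early_exit` is hypothetical, codes are assumed to be non-null `u32` indices that are in bounds, and a `Vec<bool>` stands in for the `BitBuffer` result.

```rust
/// Hypothetical sketch: returns one flag per dictionary value, `true` if
/// at least one code references it. Assumes every code is < `num_values`.
pub fn referenced_mask_early_exit(codes: &[u32], num_values: usize) -> Vec<bool> {
    let mut seen = vec![false; num_values];
    // Number of dictionary values that no code has referenced yet.
    let mut remaining = num_values;

    for &code in codes {
        let slot = &mut seen[code as usize];
        // Skip the store for values that were already seen.
        if !*slot {
            *slot = true;
            remaining -= 1;
            if remaining == 0 {
                // Every value is referenced, so the mask is constant:
                // return a filled mask and stop scanning the codes.
                return vec![true; num_values];
            }
        }
    }
    seen
}
```

When the codes outnumber the values and cover them early, the loop touches only a prefix of `codes`; when coverage never completes, the loop still visits every code and pays for the seen-check on each one.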
`compute_referenced_values_mask` scattered references into `seen` with a data-dependent "skip if already seen" branch. For dictionaries that are not fully referenced -- the `min_max` / `is_constant` path -- that branch mispredicts heavily at high cardinality, and the early exit it enables never fires (the dictionary is never fully covered).

Dispatch on `has_all_values_referenced`: keep the early-exit scan when the dictionary is expected to be fully referenced (e.g. validation), and use a branchless blind store otherwise. The blind store writes every reference unconditionally, avoiding both the misprediction and the read-modify-write contention a counted store would incur.

On 1k-16k value dictionaries with ~50% of values referenced and 65k codes this is 35-56% faster; fully-referenced dictionaries keep their early-exit fast path.

Flag fully-referenced benchmark fixtures so they exercise the early-exit path as they would in practice, and add `bench_partial_coverage` for the partially-referenced shape.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
Profiling `compute_referenced_values_mask` on the partially-referenced shape (1k-16k values, ~50% referenced, 65k codes) that `min_max` / `is_constant` hit showed the work splitting into the unavoidable O(codes) scatter (~56%) and the `BitBuffer::collect_bool` pack (~38%). The pack was a scalar shift-per-bit loop, so its cost grew with the dictionary size.

Replace it with `pack_seen`, which folds eight `seen` bytes into one bitmap byte using a single multiply: masking each byte to its low bit and multiplying by `0x0102_0408_1020_4080` gathers those eight bits into the top byte, LSB-first. The scatter target becomes a byte slice (a blind store, as before) so the fold can read eight values at a time. This is branchless, needs no target features (portable across architectures, no unsafe), and replaces the ~19us scalar pack at 16k values with ~1.7us.

On `bench_partial_coverage` this makes the mask flat at ~28us instead of growing 40us -> 52us with size: -35% at 1024 values up to -45% at 16384.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
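The multiply bit-gather described in this commit message can be sketched as below. This reconstructs the technique from the description only: the bodies of `pack8`, `pack_seen`, and `scatter_blind` are assumptions, the helper names other than `pack_seen` are hypothetical, and the output is a plain `Vec<u8>` rather than a `BitBuffer`.

```rust
/// Multiplier with a single set bit at each of positions 7, 14, ..., 56.
const GATHER: u64 = 0x0102_0408_1020_4080;
/// Keeps only the low bit of each of the eight byte lanes.
const LOW_BITS: u64 = 0x0101_0101_0101_0101;

/// Folds eight flag bytes into one bitmap byte, LSB-first.
/// Lane `i` holds its flag at bit `8 * i`; the multiplier term at bit
/// `7 * (8 - i)` moves it to bit `56 + i`. No two partial products land
/// on the same bit, so no carries disturb the top byte.
pub fn pack8(chunk: [u8; 8]) -> u8 {
    let lanes = u64::from_le_bytes(chunk) & LOW_BITS;
    (lanes.wrapping_mul(GATHER) >> 56) as u8
}

/// Hypothetical `pack_seen`: packs a slice of 0/1 bytes into bitmap
/// bytes, zero-padding the final partial group.
pub fn pack_seen(seen: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity((seen.len() + 7) / 8);
    let mut chunks = seen.chunks_exact(8);
    for chunk in &mut chunks {
        let mut lanes = [0u8; 8];
        lanes.copy_from_slice(chunk);
        out.push(pack8(lanes));
    }
    let tail = chunks.remainder();
    if !tail.is_empty() {
        let mut lanes = [0u8; 8];
        lanes[..tail.len()].copy_from_slice(tail);
        out.push(pack8(lanes));
    }
    out
}

/// Blind store: writes 1 for every code unconditionally, with no
/// "already seen" branch. Assumes every code is < `num_values`.
pub fn scatter_blind(codes: &[u32], num_values: usize) -> Vec<u8> {
    let mut seen = vec![0u8; num_values];
    for &code in codes {
        seen[code as usize] = 1;
    }
    seen
}
```

For example, `pack_seen(&scatter_blind(&[0, 9, 9, 3], 10))` yields two bytes: `0b0000_1001` for values 0 and 3, and `0b0000_0010` for value 9.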
Reverts the early-exit scan, blind-store dispatch, and multiply bit-gather pack work on `compute_referenced_values_mask` (and the accompanying benchmark changes), restoring the original mask computation. The only retained change is using a vectorized popcount (`true_count() == len()`) for the all-values-referenced validation check.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
Merging this PR will improve performance by 25.88%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | `chunked_bool_canonical_into[(1000, 10)]` | 46.6 µs | 31.7 µs | +46.98% |
| ⚡ | Simulation | `chunked_varbinview_into_canonical[(1000, 10)]` | 213.2 µs | 177.1 µs | +20.41% |
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 309.6 µs | 274.7 µs | +12.71% |
Comparing claude/set-bits-optimization-aimji (f139171) with develop (d97d2bd)
myrrc
approved these changes
Jun 5, 2026
Summary
Replace an inefficient `iter().all()` check with a direct `true_count()` comparison when validating that all values in a dictionary array are referenced.