
Optimize dict array validation to use true_count() #8263

Merged: joseph-isaacs merged 6 commits into develop from claude/set-bits-optimization-aimji on Jun 5, 2026


@joseph-isaacs joseph-isaacs commented Jun 5, 2026

Summary

Replace the bit-by-bit `iter().all()` check with a direct `true_count() == len()` comparison when validating that all values in a dictionary array are referenced.

claude and others added 5 commits June 4, 2026 17:23
Replace the bit-by-bit `referenced_mask.iter().all(|v| v)` scan in
`validate_all_values_referenced` with `true_count() == len()`, which
dispatches to the SIMD (AVX2/AVX-512) popcount path in `count_ones`
instead of iterating each bit with a per-bit branch.

Signed-off-by: Claude <noreply@anthropic.com>
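The difference between the two checks can be sketched in isolation. The `Bits` type and its methods below are hypothetical stand-ins for a packed bitmap, not the project's `BitBuffer`, and whether `count_ones` lowers to a SIMD popcount depends on the target and compiler. The sketch only contrasts a per-bit scan with a word-wise count.

```rust
/// Hypothetical packed bitmap used only for illustration.
struct Bits {
    words: Vec<u64>,
    len: usize,
}

impl Bits {
    /// Packs a slice of bools LSB-first into 64-bit words.
    /// Padding bits in the last word stay zero.
    fn from_bools(bools: &[bool]) -> Self {
        let mut words = vec![0u64; bools.len().div_ceil(64)];
        for (i, &b) in bools.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Bits { words, len: bools.len() }
    }

    /// Per-bit scan: tests one bit at a time and stops at the first unset bit.
    fn all_set_by_iter(&self) -> bool {
        (0..self.len).all(|i| (self.words[i / 64] >> (i % 64)) & 1 == 1)
    }

    /// Word-wise population count over the whole bitmap.
    fn true_count(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// Every bit is set exactly when the number of set bits equals the length.
    fn all_set_by_count(&self) -> bool {
        self.true_count() == self.len
    }
}
```

Both predicates agree on every input; the counting form trades the early exit of the scan for a branch-free pass over whole words.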
`compute_referenced_values_mask` previously scattered every code into a
`Vec<bool>` and then packed the whole thing into a `BitBuffer`. Track how
many distinct values remain unreferenced and stop scanning the moment all
values have been seen; when that happens the mask is constant so we return
a filled `BitBuffer` directly and skip the pack entirely.

Dictionaries are commonly referenced by far more codes than they have
values, so this skips the bulk of the scatter in the typical case. The
store is still skipped for already-seen values, which keeps the
sparse-coverage case (few distinct values, many repeats) branch-friendly
and avoids a read-modify-write storm on a handful of hot bytes.
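A minimal sketch of the early-exit idea described above, under simplifying assumptions: codes are plain `usize` indices that are all in range, nulls are ignored, and the mask is returned as a `Vec<bool>` rather than a `BitBuffer`. The function name is hypothetical and this is not the commit's actual code.

```rust
/// Hypothetical sketch: marks which dictionary values are referenced by `codes`,
/// stopping as soon as every value has been seen.
/// Assumes every code is smaller than `num_values`.
fn referenced_mask_early_exit(codes: &[usize], num_values: usize) -> Vec<bool> {
    let mut seen = vec![false; num_values];
    // Number of distinct values that no code has referenced yet.
    let mut remaining = num_values;
    if remaining == 0 {
        return seen;
    }
    for &code in codes {
        // Skip the store for values that were already seen.
        if !seen[code] {
            seen[code] = true;
            remaining -= 1;
            if remaining == 0 {
                // The mask is constant from here on, so return a filled mask
                // without scanning the rest of the codes.
                return vec![true; num_values];
            }
        }
    }
    seen
}
```

When the dictionary is fully covered early, the loop touches only a prefix of `codes`; when it is never fully covered, the per-code check is pure overhead, which matches the regression noted for late coverage.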

Benchmark (divan, dict_unreferenced_mask, median):

  many_codes_few_values/1024   34.0us -> 5.1us   (6.7x)
  many_codes_few_values/2048   34.2us -> 12.4us  (2.8x)
  many_codes_few_values/4096   35.8us -> 38.0us  (-6%)
  many_nulls/0.5               39.9us -> 6.6us   (6.0x)
  many_nulls/0.9               67.0us -> 8.0us   (8.4x)
  sparse_coverage/*            ~31us  -> ~30us   (flat)

The 4096 case regresses slightly: the dictionary is only fully covered
near the end of the codes, so the early exit saves little while the
per-code seen-check costs a few mispredicts. This is outweighed by the
multi-x wins when coverage completes early.

Signed-off-by: Claude <noreply@anthropic.com>
compute_referenced_values_mask scattered references into `seen` with a
data-dependent "skip if already seen" branch. For dictionaries that are
not fully referenced -- the min_max / is_constant path -- that branch
mispredicts heavily at high cardinality, and the early exit it enables
never fires (the dictionary is never fully covered).

Dispatch on has_all_values_referenced: keep the early-exit scan when the
dictionary is expected to be fully referenced (e.g. validation), and use
a branchless blind store otherwise. The blind store writes every
reference unconditionally, avoiding both the misprediction and the
read-modify-write contention a counted store would incur. On 1k-16k value
dictionaries with ~50% of values referenced and 65k codes this is 35-56%
faster; fully-referenced dictionaries keep their early-exit fast path.
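The dispatch can be pictured roughly as follows. This is an illustrative sketch, not the commit's code: the function names are hypothetical, the `expect_all_referenced` parameter stands in for the `has_all_values_referenced` flag named above, codes are assumed to be in-range `usize` indices, and the result is a byte-per-value vector rather than a packed mask.

```rust
/// Early-exit scan: checks each code against `seen` and stops once all values are covered.
fn scan_early_exit(codes: &[usize], num_values: usize) -> Vec<u8> {
    let mut seen = vec![0u8; num_values];
    let mut remaining = num_values;
    for &code in codes {
        if remaining == 0 {
            break; // every value is already referenced
        }
        if seen[code] == 0 {
            seen[code] = 1;
            remaining -= 1;
        }
    }
    seen
}

/// Blind store: writes every reference unconditionally, with no data-dependent branch.
fn scan_blind_store(codes: &[usize], num_values: usize) -> Vec<u8> {
    let mut seen = vec![0u8; num_values];
    for &code in codes {
        seen[code] = 1;
    }
    seen
}

/// Picks a strategy from a hint about whether the dictionary is fully referenced.
/// Both strategies produce the same bytes; only the cost profile differs.
fn referenced_bytes(codes: &[usize], num_values: usize, expect_all_referenced: bool) -> Vec<u8> {
    if expect_all_referenced {
        scan_early_exit(codes, num_values)
    } else {
        scan_blind_store(codes, num_values)
    }
}
```

Because the hint only selects a strategy, a wrong hint affects speed but not the result.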

Flag fully-referenced benchmark fixtures so they exercise the early-exit
path as they would in practice, and add bench_partial_coverage for the
partially-referenced shape.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
Profiling compute_referenced_values_mask on the partially-referenced shape
(1k-16k values, ~50% referenced, 65k codes) that min_max / is_constant hit
showed the work splitting into the unavoidable O(codes) scatter (~56%) and
the BitBuffer::collect_bool pack (~38%). The pack was a scalar
shift-per-bit loop, so its cost grew with the dictionary size.

Replace it with pack_seen, which folds eight `seen` bytes into one bitmap
byte using a single multiply: masking each byte to its low bit and
multiplying by 0x0102_0408_1020_4080 gathers those eight bits into the top
byte, LSB-first. The scatter target becomes a byte slice (a blind store, as
before) so the fold can read eight values at a time. This is branchless,
needs no target features (portable across architectures, no unsafe), and
replaces the ~19us scalar pack at 16k values with ~1.7us.
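The multiply trick can be reproduced in a few lines. The sketch below is a hypothetical reconstruction from the description above, not the commit's `pack_seen`: the function name, the little-endian load, and the scalar handling of a trailing partial chunk are assumptions.

```rust
/// Hypothetical sketch: packs a byte-per-value slice into an LSB-first bitmap,
/// folding eight bytes into one bitmap byte with a single multiply.
fn pack_seen_sketch(seen: &[u8]) -> Vec<u8> {
    // Keeps only the low bit of each of the eight bytes.
    const LOW_BITS: u64 = 0x0101_0101_0101_0101;
    // Shifts the low bit of byte i up to bit 56 + i; no two partial products
    // land on the same bit, so there are no carries into the top byte.
    const GATHER: u64 = 0x0102_0408_1020_4080;

    let mut out = Vec::with_capacity(seen.len().div_ceil(8));
    let mut chunks = seen.chunks_exact(8);
    for chunk in &mut chunks {
        // Little-endian load puts seen[0] in the lowest byte of the word.
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        let packed = (word & LOW_BITS).wrapping_mul(GATHER) >> 56;
        out.push(packed as u8);
    }
    // Scalar fallback for a final chunk shorter than eight bytes.
    let rem = chunks.remainder();
    if !rem.is_empty() {
        let mut byte = 0u8;
        for (i, &b) in rem.iter().enumerate() {
            byte |= (b & 1) << i;
        }
        out.push(byte);
    }
    out
}
```

For example, the ten bytes `[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]` pack to `[0x8D, 0x03]`: bits 0, 2, 3 and 7 of the first output byte, then bits 0 and 1 of the second.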

On bench_partial_coverage this makes the mask flat at ~28us instead of
growing 40us -> 52us with size: -35% at 1024 values up to -45% at 16384.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
Reverts the early-exit scan, blind-store dispatch, and multiply bit-gather
pack work on compute_referenced_values_mask (and the accompanying benchmark
changes), restoring the original mask computation. The only retained change
is using a vectorized popcount (true_count() == len()) for the
all-values-referenced validation check.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 5, 2026
@joseph-isaacs joseph-isaacs enabled auto-merge (squash) June 5, 2026 10:15
@joseph-isaacs joseph-isaacs requested a review from myrrc June 5, 2026 10:15
codspeed-hq Bot commented Jun 5, 2026

Merging this PR will improve performance by 25.88%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
✅ 1504 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | `chunked_bool_canonical_into[(1000, 10)]` | 46.6 µs | 31.7 µs | +46.98% |
| Simulation | `chunked_varbinview_into_canonical[(1000, 10)]` | 213.2 µs | 177.1 µs | +20.41% |
| Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 309.6 µs | 274.7 µs | +12.71% |



Comparing claude/set-bits-optimization-aimji (f139171) with develop (d97d2bd)


@joseph-isaacs joseph-isaacs merged commit 1e29b32 into develop Jun 5, 2026
64 checks passed
@joseph-isaacs joseph-isaacs deleted the claude/set-bits-optimization-aimji branch June 5, 2026 11:01