Optimize dict array validation to use true_count() by joseph-isaacs · Pull Request #8263 · vortex-data/vortex

joseph-isaacs · 2026-06-05T10:15:00Z

Summary

Replace an inefficient iter().all() check with a direct true_count() comparison when validating that all values in a dictionary array are referenced.

Replace the bit-by-bit `referenced_mask.iter().all(|v| v)` scan in `validate_all_values_referenced` with `true_count() == len()`, which dispatches to the SIMD (AVX2/AVX-512) popcount path in `count_ones` instead of iterating each bit with a per-bit branch.

Signed-off-by: Claude <noreply@anthropic.com>
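To make the difference concrete, here is a minimal sketch of the two ways to check that every bit is set in a packed bitmap. `PackedBits` is a hypothetical stand-in written for this illustration, not Vortex's `BitBuffer`; the only real API it relies on is the standard library's `u64::count_ones`.

```rust
/// Hypothetical packed bitmap, LSB-first within each 64-bit word.
/// Illustrative stand-in only; this is not Vortex's `BitBuffer`.
struct PackedBits {
    words: Vec<u64>,
    len: usize, // number of valid bits
}

impl PackedBits {
    fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; bits.len().div_ceil(64)];
        for (i, &b) in bits.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: bits.len() }
    }

    /// Per-bit scan: one shift, mask, and branch for every bit.
    fn all_set_bitwise(&self) -> bool {
        (0..self.len).all(|i| (self.words[i / 64] >> (i % 64)) & 1 == 1)
    }

    /// Word-at-a-time popcount. Padding bits past `len` are never set
    /// by `from_bools`, so they do not inflate the count.
    fn true_count(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// The shape of the check this PR adopts: count set bits, compare to length.
    fn all_set_popcount(&self) -> bool {
        self.true_count() == self.len
    }
}
```

One trade-off worth noting: the per-bit scan short-circuits at the first unset bit, while the popcount form always reads the whole bitmap. For validation, where the mask is expected to be fully set, the scan would have to visit every bit anyway.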

`compute_referenced_values_mask` previously scattered every code into a `Vec<bool>` and then packed the whole thing into a `BitBuffer`. Track how many distinct values remain unreferenced and stop scanning the moment all values have been seen; when that happens the mask is constant, so we return a filled `BitBuffer` directly and skip the pack entirely. Dictionaries are commonly referenced by far more codes than they have values, so this skips the bulk of the scatter in the typical case.

The store is still skipped for already-seen values, which keeps the sparse-coverage case (few distinct values, many repeats) branch-friendly and avoids a read-modify-write storm on a handful of hot bytes.

Benchmark (divan, dict_unreferenced_mask, median):

    many_codes_few_values/1024   34.0us -> 5.1us    (6.7x)
    many_codes_few_values/2048   34.2us -> 12.4us   (2.8x)
    many_codes_few_values/4096   35.8us -> 38.0us   (-6%)
    many_nulls/0.5               39.9us -> 6.6us    (6.0x)
    many_nulls/0.9               67.0us -> 8.0us    (8.4x)
    sparse_coverage/*            ~31us  -> ~30us    (flat)

The 4096 case regresses slightly: the dictionary is only fully covered near the end of the codes, so the early exit saves little while the per-code seen-check costs a few mispredicts. This is outweighed by the multi-x wins when coverage completes early.

Signed-off-by: Claude <noreply@anthropic.com>
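The early-exit idea described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the function name, the `u32` code type, and the `Vec<bool>` return type are all hypothetical, and every code is assumed to be in range.

```rust
/// Returns one flag per dictionary value: `true` if some code references it.
/// Hypothetical signature; assumes every code is `< num_values`.
fn referenced_mask_early_exit(codes: &[u32], num_values: usize) -> Vec<bool> {
    let mut seen = vec![false; num_values];
    let mut remaining = num_values; // distinct values not yet referenced

    for &code in codes {
        let slot = &mut seen[code as usize];
        if !*slot {
            // Store only on first sight, so repeated codes never write.
            *slot = true;
            remaining -= 1;
            if remaining == 0 {
                // Fully covered: the mask is constant, so stop scanning
                // and return an all-true mask directly.
                return vec![true; num_values];
            }
        }
    }
    seen
}
```

With codes `[0, 1, 2, 0, 1]` and three values, the loop returns after the third code and never looks at the trailing repeats, which is where the savings come from when codes far outnumber values.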

`compute_referenced_values_mask` scattered references into `seen` with a data-dependent "skip if already seen" branch. For dictionaries that are not fully referenced (the min_max / is_constant path) that branch mispredicts heavily at high cardinality, and the early exit it enables never fires, because the dictionary is never fully covered.

Dispatch on `has_all_values_referenced`: keep the early-exit scan when the dictionary is expected to be fully referenced (e.g. validation), and use a branchless blind store otherwise. The blind store writes every reference unconditionally, avoiding both the misprediction and the read-modify-write contention a counted store would incur. On 1k-16k value dictionaries with ~50% of values referenced and 65k codes this is 35-56% faster; fully-referenced dictionaries keep their early-exit fast path.

Flag fully-referenced benchmark fixtures so they exercise the early-exit path as they would in practice, and add bench_partial_coverage for the partially-referenced shape.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
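A rough sketch of the dispatch described above, assuming a byte-per-value `seen` buffer. The function names and signatures are hypothetical, and the `expect_fully_referenced` flag stands in for whatever `has_all_values_referenced` reports in the real code.

```rust
/// Scatter with a per-code "already seen?" branch and an early exit.
/// Assumes every code is `< seen.len()`.
fn scan_early_exit(codes: &[u32], seen: &mut [u8]) {
    let mut remaining = seen.len();
    for &code in codes {
        let slot = &mut seen[code as usize];
        if *slot == 0 {
            *slot = 1;
            remaining -= 1;
            if remaining == 0 {
                return; // every value has been referenced
            }
        }
    }
}

/// Blind store: write every reference unconditionally, with no
/// data-dependent branch on the contents of `seen`.
fn scan_blind_store(codes: &[u32], seen: &mut [u8]) {
    for &code in codes {
        seen[code as usize] = 1;
    }
}

/// Picks a scan strategy from a hint about expected coverage.
fn referenced_bytes(codes: &[u32], num_values: usize, expect_fully_referenced: bool) -> Vec<u8> {
    let mut seen = vec![0u8; num_values];
    if expect_fully_referenced {
        scan_early_exit(codes, &mut seen);
    } else {
        scan_blind_store(codes, &mut seen);
    }
    seen
}
```

Both strategies produce the same bytes; the hint only changes how much work is done and how predictable the inner loop is. The blind store still carries the slice bounds check, so "branchless" here refers to the absence of a branch on previously written data.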

Profiling `compute_referenced_values_mask` on the partially-referenced shape (1k-16k values, ~50% referenced, 65k codes) that min_max / is_constant hit showed the work splitting into the unavoidable O(codes) scatter (~56%) and the `BitBuffer::collect_bool` pack (~38%). The pack was a scalar shift-per-bit loop, so its cost grew with the dictionary size.

Replace it with `pack_seen`, which folds eight `seen` bytes into one bitmap byte using a single multiply: masking each byte to its low bit and multiplying by 0x0102_0408_1020_4080 gathers those eight bits into the top byte, LSB-first. The scatter target becomes a byte slice (a blind store, as before) so the fold can read eight values at a time. This is branchless, needs no target features (portable across architectures, no unsafe), and replaces the ~19us scalar pack at 16k values with ~1.7us.

On bench_partial_coverage this makes the mask flat at ~28us instead of growing 40us -> 52us with size: -35% at 1024 values up to -45% at 16384.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY
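The multiply bit-gather can be sketched as below. The constant 0x0102_0408_1020_4080 and the name `pack_seen` come from the commit message; the helper `fold8`, the signatures, and the zero-padded tail handling are assumptions made for this illustration. The trick works because, after a little-endian load, input byte i contributes a single bit at position 8i, and the multiplier shifts it by 7(8 - i) so it lands at position 56 + i; all partial products hit distinct bit positions, so no carries disturb the top byte.

```rust
const LOW_BITS: u64 = 0x0101_0101_0101_0101; // keep bit 0 of every byte
const GATHER: u64 = 0x0102_0408_1020_4080; // constant quoted in the commit message

/// Folds eight bytes into one bitmap byte, LSB-first:
/// output bit `i` is the low bit of `chunk[i]`.
fn fold8(chunk: [u8; 8]) -> u8 {
    let word = u64::from_le_bytes(chunk) & LOW_BITS;
    // Bits that overflow past bit 63 are discarded by the wrapping multiply.
    (word.wrapping_mul(GATHER) >> 56) as u8
}

/// Hypothetical `pack_seen`-style pack: one output byte per eight inputs,
/// with the final partial chunk padded with zeros.
fn pack_seen(seen: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(seen.len().div_ceil(8));
    let mut chunks = seen.chunks_exact(8);
    for chunk in &mut chunks {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(chunk);
        out.push(fold8(bytes));
    }
    let tail = chunks.remainder();
    if !tail.is_empty() {
        let mut bytes = [0u8; 8];
        bytes[..tail.len()].copy_from_slice(tail);
        out.push(fold8(bytes));
    }
    out
}
```

For example, `pack_seen(&[1, 0, 1, 1, 0, 0, 0, 0, 1, 1])` yields the two bytes `0b0000_1101` and `0b0000_0011`.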

Reverts the early-exit scan, blind-store dispatch, and multiply bit-gather pack work on `compute_referenced_values_mask` (and the accompanying benchmark changes), restoring the original mask computation. The only retained change is using a vectorized popcount (`true_count() == len()`) for the all-values-referenced validation check.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01HXcDcBz5VDmgA9FaaR1UcY

codspeed-hq · 2026-06-05T10:24:24Z

Merging this PR will improve performance by 25.88%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
✅ 1504 untouched benchmarks

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`chunked_bool_canonical_into[(1000, 10)]`	46.6 µs	31.7 µs	+46.98%
⚡	Simulation	`chunked_varbinview_into_canonical[(1000, 10)]`	213.2 µs	177.1 µs	+20.41%
⚡	Simulation	`chunked_varbinview_canonical_into[(100, 100)]`	309.6 µs	274.7 µs	+12.71%

Tip

Curious why this is faster? Comment `@codspeedbot explain why this is faster` on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/set-bits-optimization-aimji (f139171) with develop (d97d2bd)

claude and others added 5 commits June 4, 2026 17:23

joseph-isaacs added the changelog/performance (A performance improvement) label Jun 5, 2026

joseph-isaacs enabled auto-merge (squash) June 5, 2026 10:15

joseph-isaacs requested a review from myrrc June 5, 2026 10:15

Merge branch 'develop' into claude/set-bits-optimization-aimji

f139171

myrrc approved these changes Jun 5, 2026

joseph-isaacs merged commit 1e29b32 into develop Jun 5, 2026
64 checks passed

joseph-isaacs deleted the claude/set-bits-optimization-aimji branch June 5, 2026 11:01

joseph-isaacs merged 6 commits into develop from claude/set-bits-optimization-aimji

3 participants
