@geoffreyclaude geoffreyclaude commented Dec 8, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

The IN list evaluation is a hot path in query execution. Profiling revealed two optimization opportunities:

  1. Hashing overhead dominates for small lists: For lists with ≤8 elements, the cost of hashing exceeds the benefit. Binary search is faster in this regime.

  2. String comparison is expensive: For Utf8View arrays, Arrow stores short strings (≤12 bytes) inline as a 128-bit view. We can compare these views directly as integers instead of doing byte-by-byte string comparison.

What changes are included in this PR?

Small list optimization (≤8 elements):

  • Introduces SortedLookup<T> using binary search instead of HashedLookup<T> for primitive types when list size ≤8
  • Separate filter structs ensure static dispatch (no runtime branch in the hot loop)
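
As a rough illustration of the two lookup strategies and of static dispatch, here is a minimal std-only sketch. The `Lookup` trait, the field names, and the `filter` helper are hypothetical and not taken from the PR; only the names `SortedLookup` and `HashedLookup` come from the description. It assumes `T: Ord`, which does not hold for floats, so the real code presumably handles those differently.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical membership trait. Each implementor is a distinct concrete type,
/// so a generic caller is monomorphized once per strategy.
trait Lookup<T> {
    fn contains(&self, value: &T) -> bool;
}

/// Sorted, deduplicated values probed with binary search (small lists).
struct SortedLookup<T> {
    values: Vec<T>,
}

impl<T: Ord> SortedLookup<T> {
    fn new(mut values: Vec<T>) -> Self {
        values.sort_unstable();
        values.dedup();
        Self { values }
    }
}

impl<T: Ord> Lookup<T> for SortedLookup<T> {
    fn contains(&self, value: &T) -> bool {
        self.values.binary_search(value).is_ok()
    }
}

/// Hash-set based lookup (larger lists).
struct HashedLookup<T> {
    values: HashSet<T>,
}

impl<T: Eq + Hash> HashedLookup<T> {
    fn new(values: Vec<T>) -> Self {
        Self { values: values.into_iter().collect() }
    }
}

impl<T: Eq + Hash> Lookup<T> for HashedLookup<T> {
    fn contains(&self, value: &T) -> bool {
        self.values.contains(value)
    }
}

/// Generic over the lookup type: the strategy is fixed at compile time,
/// so the loop body carries no per-row "which strategy?" branch.
fn filter<T, L: Lookup<T>>(lookup: &L, column: &[T]) -> Vec<bool> {
    column.iter().map(|v| lookup.contains(v)).collect()
}
```

The caller picks `SortedLookup` or `HashedLookup` once, when the IN list is built, and then calls `filter` with that concrete type.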

Utf8View short-string optimization:

  • New Utf8ViewSortedFilter and Utf8ViewHashedFilter that convert strings to their raw u128 view representation
  • Strings ≤12 bytes become fast integer comparisons
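
To make the view comparison concrete, the sketch below packs a short string into a `u128` using the inline view layout (a 4-byte little-endian length followed by up to 12 data bytes, zero padded). The helper names `inline_view`, `sorted_views`, and `contains` are hypothetical and written with std only; the PR itself works on Arrow's existing view buffers rather than re-encoding strings like this.

```rust
/// Maximum payload that fits inline in a 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: pack a short string into a u128 following the inline
/// view layout. Returns None for strings that do not fit inline.
fn inline_view(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_INLINE_LEN {
        return None;
    }
    let mut raw = [0u8; 16];
    // Bytes 0..4: length. Bytes 4..16: data, zero padded.
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(raw))
}

/// Build a sorted, deduplicated list of views.
/// Returns None if any string is too long for the fast path.
fn sorted_views(list: &[&str]) -> Option<Vec<u128>> {
    let mut views = list
        .iter()
        .map(|s| inline_view(s))
        .collect::<Option<Vec<u128>>>()?;
    views.sort_unstable();
    views.dedup();
    Some(views)
}

/// Membership test: a single integer binary search per needle.
fn contains(views: &[u128], needle: &str) -> bool {
    match inline_view(needle) {
        Some(v) => views.binary_search(&v).is_ok(),
        // A long needle can never equal an all-short list entry.
        None => false,
    }
}
```

The integer order of the views is not the lexicographic order of the strings, but membership tests only need a consistent total order and exact equality, which the length prefix plus zero padding provides.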

Are these changes tested?

Yes, covered by existing in_list tests which exercise all data types, list sizes, and null percentages.

Are there any user-facing changes?

No API changes. Queries using IN lists will execute faster, especially for:

  • Utf8View columns with short strings
  • Primitive columns with small IN lists
  • Columns with null values

Benchmark Results

Benchmarks run on 1024-row arrays with varying list sizes, null percentages, and string lengths.

Results reflect only the last commit of the PR.

Utf8View Short Strings (≤12 bytes) — 60-70% faster

| Benchmark | Before | After | Change |
|---|---|---|---|
| list=3/nulls=0%/str=3 | 6.87 µs | 2.18 µs | -68% |
| list=3/nulls=0%/str=12 | 7.00 µs | 2.19 µs | -69% |
| list=8/nulls=0%/str=3 | 7.30 µs | 2.79 µs | -62% |
| list=8/nulls=20%/str=12 | 6.73 µs | 2.92 µs | -57% |
| list=100/nulls=0%/str=3 | 7.25 µs | 2.47 µs | -66% |
| list=100/nulls=20%/str=12 | 7.38 µs | 2.57 µs | -66% |

Primitives with Small Lists — 30-65% faster

| Benchmark | Before | After | Change |
|---|---|---|---|
| Float32/list=3/nulls=0% | 5.40 µs | 2.20 µs | -59% |
| Float32/list=8/nulls=0% | 5.36 µs | 2.94 µs | -46% |
| Int32/list=3/nulls=0% | 2.05 µs | 1.41 µs | -32% |
| Int32/list=3/nulls=20% | 4.30 µs | 1.56 µs | -64% |
| Int32/list=8/nulls=20% | 4.35 µs | 2.25 µs | -48% |
| Int32/list=100/nulls=20% | 4.56 µs | 2.04 µs | -56% |

Regressions (large lists, long strings)

| Benchmark | Before | After | Change |
|---|---|---|---|
| Utf8View/list=100/str=100 | 12.89 µs | 13.95 µs | +8% |
| Float32/list=100/nulls=0% | 5.55 µs | 5.69 µs | +3% |

These regressions occur in less common patterns (large lists, particularly with long strings) and are outweighed by the gains in typical use cases.

- Add LargeStringArray benchmarks alongside existing StringArray benchmarks
- Use explicit ScalarValue::Utf8 for StringArray (was using ScalarValue::from which creates Utf8View)
… collect_bool

The previous implementation used BooleanArray::from_iter and BooleanBufferBuilder
with element-by-element appends, which incur iterator overhead and prevent
vectorization.

This commit switches to BooleanBuffer::collect_bool, a batch operation that
pre-allocates the exact buffer size and enables SIMD optimization. Since
collect_bool guarantees the index is always in bounds, we can safely use
unchecked array access (value_unchecked, get_unchecked) to eliminate bounds
checks in the hot loop.

The null-handling match is also simplified from a 3-way tuple to a 2-way
check by pre-combining needle and haystack null flags.
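
The batch pattern described above can be sketched without Arrow. `PackedBools` below is a hypothetical std-only stand-in for a packed boolean buffer, and its `collect_bool` only mimics the shape of `BooleanBuffer::collect_bool` (a length plus an index-to-bool closure); it is not the arrow-rs implementation, and null handling is left out.

```rust
/// Hypothetical stand-in for a packed boolean buffer:
/// one bit per row, least significant bit first.
struct PackedBools {
    words: Vec<u64>,
    len: usize,
}

impl PackedBools {
    /// Allocate the exact number of words up front, then fill 64 rows per word.
    /// `f` is only ever called with indices in `0..len`.
    fn collect_bool(len: usize, mut f: impl FnMut(usize) -> bool) -> Self {
        let mut words = Vec::with_capacity((len + 63) / 64);
        for chunk_start in (0..len).step_by(64) {
            let chunk_end = (chunk_start + 64).min(len);
            let mut word = 0u64;
            for i in chunk_start..chunk_end {
                word |= (f(i) as u64) << (i - chunk_start);
            }
            words.push(word);
        }
        Self { words, len }
    }

    fn get(&self, i: usize) -> bool {
        assert!(i < self.len);
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }
}

/// IN-list evaluation over a non-null i32 column with a pre-sorted list.
fn in_list(column: &[i32], sorted_list: &[i32]) -> PackedBools {
    PackedBools::collect_bool(column.len(), |i| {
        // SAFETY: collect_bool only passes indices in 0..column.len().
        let v = unsafe { *column.get_unchecked(i) };
        sorted_list.binary_search(&v).is_ok()
    })
}
```

Because the closure is invoked once per index in a tight inner loop with a known trip count, the compiler has a better chance of unrolling or vectorizing it than with element-by-element appends to a builder.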
…ings

For small IN lists (≤8 elements), hashing overhead dominates execution time.
This commit uses binary search instead, which is faster for small lists.

Utf8View gains a short-string filter that compares raw u128 views directly -
the same layout Arrow uses for inline storage (≤12 bytes). This turns string
comparison into fast integer comparison. Lists with long strings fall through
to the generic hash-based filter.

Benchmarks show significant improvement for Utf8View short strings and
primitives with small lists.
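
One way to picture the fall-through is a selection step that runs once when the IN list is built. The sketch below is an assumption about how the pieces fit together, not the PR's code: the `Strategy` enum and `choose_strategy` function are hypothetical, and applying the ≤8 threshold to the Utf8View filters is inferred from the description rather than confirmed.

```rust
/// Thresholds taken from the description above.
const SMALL_LIST_MAX: usize = 8;
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical labels for the three paths.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// Binary search over raw u128 views (short strings, small list).
    SortedViews,
    /// Hash set of raw u128 views (short strings, larger list).
    HashedViews,
    /// Generic hash-based filter on the string bytes (any long string).
    GenericHashed,
}

/// Decide once per IN list, outside the per-row loop.
fn choose_strategy(list: &[&str]) -> Strategy {
    if list.iter().any(|s| s.len() > MAX_INLINE_LEN) {
        Strategy::GenericHashed
    } else if list.len() <= SMALL_LIST_MAX {
        Strategy::SortedViews
    } else {
        Strategy::HashedViews
    }
}
```

A single string longer than 12 bytes is enough to disable the view-based paths, since its view holds a prefix and a buffer reference rather than the full contents.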
// specific language governing permissions and limitations
// under the License.

use arrow::array::{Array, ArrayRef, Float32Array, Int32Array, StringArray};
Member

Could you open the benchmarks as a PR against datafusion/main so that we can merge them and have them in main for comparison?

Member

Nvm I see this is apache#19211 😄

Author

I opened it here already :) apache#19211

Member

@adriangb adriangb left a comment

Looks good to me! Great work. Let's merge this into my PR whenever you are ready. If you have the time, could you review apache#19050? It has more tests + fixes bugs in the current implementation. I'd also like to merge that before we do more perf optimization.

@geoffreyclaude
Author

I'll give it a look tomorrow morning. Seems you fixed quite a few bugs! Nulls in arrays/lists are always tricky to get right...

Feel free to merge my PR whenever you want. Once we've merged the updated benchmark, you can rebase over main to have a clean history and bench baseline.

@adriangb adriangb merged commit d299a91 into pydantic:specialize Dec 8, 2025
4 checks passed
adriangb added a commit that referenced this pull request Dec 9, 2025