-
Notifications
You must be signed in to change notification settings - Fork 2
Short InList Optimization #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add LargeStringArray benchmarks alongside existing StringArray benchmarks - Use explicit ScalarValue::Utf8 for StringArray (was using ScalarValue::from which creates Utf8View)
… collect_bool The previous implementation used BooleanArray::from_iter and BooleanBufferBuilder with element-by-element appends, which incur iterator overhead and prevent vectorization. This commit switches to BooleanBuffer::collect_bool, a batch operation that pre-allocates the exact buffer size and enables SIMD optimization. Since collect_bool guarantees the index is always in bounds, we can safely use unchecked array access (value_unchecked, get_unchecked) to eliminate bounds checks in the hot loop. The null-handling match is also simplified from a 3-way tuple to a 2-way check by pre-combining needle and haystack null flags.
…ings For small IN lists (≤8 elements), hashing overhead dominates execution time. This commit uses binary search instead, which is faster for small lists. Utf8View gains a short-string filter that compares raw u128 views directly - the same layout Arrow uses for inline storage (≤12 bytes). This turns string comparison into fast integer comparison. Lists with long strings fall through to the generic hash-based filter. Benchmarks show significant improvement for Utf8View short strings and primitives with small lists.
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| use arrow::array::{Array, ArrayRef, Float32Array, Int32Array, StringArray}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make the benchmarks as a PR to datafusion/main so that we can merge them and have them in main for comparsion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm I see this is apache#19211 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened it here already :) apache#19211
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Great work. Let's merge this into my PR whenever you are ready. If you have the time, could you review apache#19050? It has more tests + fixes bugs in the current implementation. I'd also like to merge that before we do more perf optimization.
I'll give it a look tomorrow morning. Seems you fixed quite a few bugs! Nulls in arrays/lists are always tricky to get right... Feel free to merge my PR whenever you want. Once we've merged the updated benchmark, you can rebase over main to have a clean history and bench baseline. |
Which issue does this PR close?
Rationale for this change
The
INlist evaluation is a hot path in query execution. Profiling revealed two optimization opportunities:Hashing overhead dominates for small lists: For lists with ≤8 elements, the cost of hashing exceeds the benefit. Binary search is faster in this regime.
String comparison is expensive: For
Utf8Viewarrays, Arrow stores short strings (≤12 bytes) inline as a 128-bit view. We can compare these views directly as integers instead of doing byte-by-byte string comparison.What changes are included in this PR?
Small list optimization (≤8 elements):
SortedLookup<T>using binary search instead ofHashedLookup<T>for primitive types when list size ≤8Utf8View short-string optimization:
Utf8ViewSortedFilterandUtf8ViewHashedFilterthat convert strings to their rawu128view representationAre these changes tested?
Yes, covered by existing
in_listtests which exercise all data types, list sizes, and null percentages.Are there any user-facing changes?
No API changes. Queries using
INlists will execute faster, especially for:Utf8Viewcolumns with short stringsBenchmark Results
Benchmarks run on 1024-row arrays with varying list sizes, null percentages, and string lengths.
Results are only on the last commit of the PR.
Utf8View Short Strings (≤12 bytes) — 60-70% faster
Primitives with Small Lists — 30-65% faster
Regressions (large lists, long strings)
These regressions are in less common patterns (large lists with long strings) and are outweighed by the gains in typical use cases.