Single-pass multibyte_split #11500
Conversation
This still needs some work, since it duplicates the existing kernel and could use a few more tests, but I think it's already good for a first review.
Codecov Report

@@           Coverage Diff            @@
##           branch-22.10    #11500   +/- ##
===============================================
  Coverage              ?    86.41%
===============================================
  Files                 ?       145
  Lines                 ?     22993
  Branches              ?         0
===============================================
  Hits                  ?     19869
  Misses                ?      3124
  Partials              ?         0
According to the recent discussion, if this needs more work then add
I realized today that the behavior I wanted to provide at the high level is not supported by the existing kernel, so
Force-pushed from 5138705 to a96bcd8
Implementation looks good, and the benchmarks are looking great too. Cherry picking some here...
Though peak memory usage has doubled, which is interesting. Do we know why that is? Maybe there's some tuning we can do with the `output_chunks` class?
Existing implementation:
MultibyteSplitBenchmark/multibyte_split_simple/0/7/25/1073741824/manual_time 194 ms 194 ms 4 bytes_per_second=4.93718G/s peak_memory_usage=1.48489G
MultibyteSplitBenchmark/multibyte_split_simple/1/7/25/1073741824/manual_time 521 ms 521 ms 1 bytes_per_second=1.83643G/s peak_memory_usage=1.48908G
MultibyteSplitBenchmark/multibyte_split_simple/2/7/25/1073741824/manual_time 1328 ms 1326 ms 1 bytes_per_second=738.056M/s peak_memory_usage=1032.22M
This PR's improvements:
MultibyteSplitBenchmark/multibyte_split_simple/0/7/25/1073741824/manual_time 129 ms 129 ms 5 bytes_per_second=7.44079G/s peak_memory_usage=3.15422G
MultibyteSplitBenchmark/multibyte_split_simple/1/7/25/1073741824/manual_time 153 ms 153 ms 5 bytes_per_second=6.25106G/s peak_memory_usage=3.15422G
MultibyteSplitBenchmark/multibyte_split_simple/2/7/25/1073741824/manual_time 420 ms 420 ms 2 bytes_per_second=2.27932G/s peak_memory_usage=2.10296G
I have some comments, which are optional to address.
Requesting changes because we need some benchmarks that exemplify the `byte_range` improvements. I never got around to adding benchmarks for that case because the perf would be on par with full file reads. Now that we have byte range optimizations, we have a chance to demonstrate some even bigger improvements than "just" the ~2-3x we see in the existing benchmarks.
Force-pushed from a96bcd8 to 221c216
Changed the title: `multibyte_split` for non-overlapping delimiters → `multibyte_split`
@cwharris on the increased memory usage: With the exponential growth of the chunks, at worst we overestimate the amount of memory by the growth factor (2x in this case). A smaller growth factor might also make sense. What we could do alternatively is set a size limit for the chunks, because at some point the allocation overhead amortized over all kernel launches until the chunk is full is negligible.
Force-pushed from 7ceb82f to 814c28c
Force-pushed from 814c28c to 9ac5474
Force-pushed from 9ac5474 to 918e45e
rerun tests
Partial review, mostly minor suggestions.
Lots of cool stuff to unpack here :)
Looks good. Just some very minor suggestions/questions.
@gpucibot merge
This improves the `multibyte_split` kernel by

* Reducing register pressure: instead of storing `ITEMS_PER_THREAD` individual states, store only the initial `multistate` for the thread and recompute the individual states on the fly
* Eliminating local memory usage: manipulate the `multistate` via shifts instead of array random access
* Eliminating trie overhead: since we have only a single delimiter, the trie is a path, so we can do the traversal implicitly
* Memoizing which chars were a match: we don't need to recompute this information, but can store it in a bitmask
* Changing the block load algorithm: `BLOCK_LOAD_VECTORIZE` was slightly less efficient than `BLOCK_LOAD_WARP_TRANSPOSE`
* Reducing the allocation overhead by limiting the `output_builder` max allocation size
* Tuning the parameters: `ITEMS_PER_THREAD = 64` works better, and we can improve performance further by operating on larger chunks

Overall, this gives a roughly 2x speedup in my benchmarks.

Based on #11500

Authors:
- Tobias Ribizel (https://github.com/upsj)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Nghia Truong (https://github.com/ttnghia)

URL: #11587
Description

This adds a new `multibyte_split` implementation that needs to scan the input only once, and takes full advantage of `byte_range`.

To accomplish this, I introduce a new data structure `output_builder` (naming bikeshedding welcome 😄) for pre-allocation of unknown, but bounded outputs. The structure contains a vector of exponentially growing `device_uvector`s, such that either the last vector has size 0 and the second-to-last vector has `size() < capacity()`, or all vectors but the last are full (`size() == capacity()`). It provides the following operations:

* `next_output(stream)` returns a `split_device_span` of at least the `worst_case_size` provided at construction, pointing to the next free entries from the last two vectors.
* `advance_output(actual_size)` marks the first `actual_size` entries of the previously returned `split_device_span` as filled.
* `split_device_span` takes care of writing to the smaller `device_uvector` first, and the larger `device_uvector` second, if the first one is full.
* `gather` copies all elements that were previously written into a single `device_uvector` of the correct size.

This data structure should provide a good balance between allocation overheads and memory usage.
I only modified the actual `multibyte_split` kernel slightly, to stop writing offsets once it passes the end of the `byte_range`. This way, we can determine all required offsets from a single scan, regardless of whether we need to provide a range or not.

Benchmark results: `multibyte_split` on [0] Tesla T4 (results table not captured)
TODO

* `byte_range` edge cases
* ~~Extend to overlapping delimiters by providing `previous_chunk` support for `data_chunk_source`~~ That will be another PR
Closes #11197