
[FEA] Story - Improve performance with long strings #13048

Open
GregoryKimball opened this issue Apr 3, 2023 · 5 comments
Labels
2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue strings strings issues (C++ and Python)

Comments

@GregoryKimball
Contributor

GregoryKimball commented Apr 3, 2023

Many strings APIs in libcudf use thread-per-string parallelism in their implementation. This approach works great for processing smaller strings of relatively consistent length. However, for long strings (roughly 256 bytes and above) the performance of thread-per-string algorithms begins to degrade. Some strings APIs are compatible with data-parallel algorithms and can be refactored to improve performance for long strings, while other strings APIs are difficult to refactor with data-parallel algorithms.

Let's use this issue to track the progression:
✅ - this API works well with long strings
🟢 - we think this API will be straightforward to refactor
🟡 - we have some ideas on how to refactor this API, and we'll need to experiment
🔴 - we think this will be very difficult to refactor!
⚪ - long string support is not a priority for this API

| Module | Function | Status | Notes |
|---|---|---|---|
| Case | capitalize | 🟡 | |
| | title | 🟡 | |
| | is_title | 🟡 | |
| | to_lower | #13142 | |
| | to_upper | #13142 | |
| | swapcase | #13142 | |
| Character Types | all_characters_of_type | #13259 | |
| | filter_characters_of_type | 🔴 | |
| Combining | join_strings | #13283 | |
| | concatenate | 🟡 | |
| | join_list_elements | 🔴 | |
| Searching | contains_re | 🟡 | |
| | matches_re | | |
| | count_re | 🔴 | |
| | like | 🟢 #13594 | |
| | find_all | 🔴 | |
| Converting | to_XXXX | | these are rarely long strings |
| | from_XXXX | | |
| Copying | repeat_string | | One overload is an exception |
| | repeat_strings | | |
| Slicing | slice_strings | #13057 | One overload allows for skipping characters. Long string support is not a priority for step > 1 or step < 0 |
| Finding | find | #13226 | |
| | rfind | #13226 | |
| | contains | #10739 | |
| | starts_with | | |
| | ends_with | | |
| | find_multiple | 🟢 | |
| Modifying | pad | 🟡 | |
| | zfill | | |
| | reverse | 🟡 | |
| | strip | | |
| | translate | 🟡 | |
| | filter_characters | 🔴 | |
| | wrap | 🔴 | |
| Replacing | replace | #12858 | |
| | replace_slice | 🟡 | |
| | replace_re | 🔴 | |
| | replace_with_backrefs | 🔴 | |
| Splitting | partition | 🟡 | |
| | split | #4922 #13680 | |
| | split_record | #12729 | |
| | split_re | 🔴 | |
| | split_record_re | 🔴 | |
| Other | count_characters | #12779 | |
| | count_bytes | 🟢 | |

Libcudf also includes NVText APIs that will benefit from improvements in performance when processing long strings. Generally long string performance is even more important for our text APIs, where each row could represent a sentence, paragraph or document.

| Module | Function | Status | Notes |
|---|---|---|---|
| NGrams | generate_ngrams | | these are generally not long strings |
| | generate_character_ngrams | 🟢 | |
| | ngrams_tokenize | 🟢 #13480 | |
| Normalizing | normalize_characters | 🟢 | |
| | normalize_spaces | 🟢 #13480 | |
| Stemming | is_letter | 🟢 | |
| | porter_stemmer_measure | 🟢 | |
| Edit Distance | edit_distance | | these are generally not long strings |
| | edit_distance_matrix | | |
| Tokenizing | byte_pair_encoding | 🟡 | |
| | subword_tokenize | 🟡 | |
| | tokenize | 🟢 #13480 | |
| | count_tokens | 🟢 #13480 | |
| | character_tokenize | 🟡 | |
| | detokenize | 🟡 | |
| Replacing | replace_tokens | 🟢 #13480 | |
| | filter_tokens | 🟢 #13480 | |
| MinHashing | minhash | #13333 | |
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue strings strings issues (C++ and Python) labels Apr 3, 2023
@bdice
Contributor

bdice commented Apr 3, 2023

This is a great use of a "Story" issue for documenting this work. I have some very high level comments / ideas:

  • Some of these algorithms might benefit from using Cooperative Groups and/or studying implementations of related algorithms in CUB.
  • For some of these cases, I suspect that "prefix strings" as described in the Velox paper could be useful. That would use a (non-Arrow) layout with a fixed-width prefix that could be computed as a temporary column used to quickly eliminate possible matches. Filtering and ordering are specifically called out in that paper.

@GregoryKimball
Contributor Author

After studying the results of the strings microbenchmarks in libcudf, I'd like to share some results and early thoughts on the performance we should target with long strings.

First, here is data throughput versus data size for the split_record API. It shows about 15-35 GB/s throughput at the ~100 MB data size for string lengths between 32 and 8192 bytes.
[figure: throughput vs. data size for split_record]

Next, here is the same plot for filter_characters_of_type. It shows 20 GB/s throughput for ~100 MB data size and string lengths <100 bytes. For longer strings the throughput falls to the 1-6 GB/s range.
[figure: throughput vs. data size for filter_characters_of_type]

So perhaps we should target long-string data throughput at better than 80% of the data throughput for short strings.

rapids-bot bot pushed a commit that referenced this issue Apr 18, 2023
Improves performance for longer strings with the `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters, and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for both small and large strings.

Additionally, the code was refactored so the common case of `step==1 and start < stop`  can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better.

The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings.

Reference #13048 and #12445

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Elias Stehle (https://github.com/elstehle)

URL: #13057
shwina pushed a commit to shwina/cudf that referenced this issue Apr 18, 2023
rapids-bot bot pushed a commit that referenced this issue May 8, 2023
…#13226)

Improves performance for longer strings with `cudf::strings::find()` and `cudf::strings::rfind()` APIs.

The current implementation works well with smaller strings, so this new implementation switches to a long-string algorithm when the average number of bytes per string is 64 or greater. The new implementation searches for the target string using a warp-per-string kernel. Additionally, the special logic needed for matching an empty target (easily checked in host code) is factored out into its own transform functor.
For longer strings, the performance improvement is about 2-6x.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - MithunR (https://github.com/mythrocks)

URL: #13226
rapids-bot bot pushed a commit that referenced this issue May 9, 2023
…trings (#13142)

Improves performance for longer strings with the `cudf::strings::to_lower()`, `cudf::strings::to_upper()`, and `cudf::strings::swapcase()` APIs.

The current implementation works well with smaller strings, so this new implementation switches to a long-string algorithm when the average number of bytes per string is 64 or greater. The new implementation is similar but computes the output size of each string with a warp-per-string function. In addition, a check is added for long strings testing whether all bytes are ASCII, in which case a faster kernel can be run.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13142
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
)

Improves performance for `cudf::strings::all_characters_of_type()` API which covers many cudf `is_X` functions. The solution improves performance for all string lengths as measured by the new benchmark included in this PR.
Additionally, the code was cleaned up to help with maintenance and clarity.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13259
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
…gs (#13283)

Improves performance for longer strings with `cudf::strings::join_strings()` API.

The new implementation is inspired by recent changes in json/writer_impl.cu, which uses the `make_strings_children` factory that accepts an iterator of string pointers to gather the results into a single column of data. This approach loses some performance for smaller strings (less than 32 bytes), so the original code is retained for those.
Also, a benchmark implementation is added in this PR to test against various sized input columns.
Also, a benchmark implementation is added in this PR to test against various sized input columns.
The performance improvement is up to 100x for very long strings (>512 bytes).

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13283
rapids-bot bot pushed a commit that referenced this issue May 24, 2023
Moves some of the nvtext benchmarks to nvbench to help provide a more useful baseline when making improvements to performance. These run with varying parameters for column size and string length. The remaining benchmarks are more involved and may be updated in follow-on PRs.

Reference: #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #13368
rapids-bot bot pushed a commit that referenced this issue Jun 23, 2023
Minimizes character counting in the kernel logic for `cudf::strings::like` to improve overall performance especially for longer strings.
The nvbench benchmark is updated to include measurements for various strings sizes.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13594
rapids-bot bot pushed a commit that referenced this issue Jun 29, 2023
Improves performance for nvtext tokenize functions by minimizing character counting in the `characters_tokenize` utility functor in `src/text/utilities/tokenize_ops.cuh`.

Functions this change affects are:
- [`nvtext::count_tokens`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__tokenize.html#ga5323d94dac99bf42f0cbb07c4fcd7242) (single delimiter and whitespace)
- [`nvtext::tokenize`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__tokenize.html#ga64c2806c398ce476fa5174f3155ea0fb) (single delimiter and whitespace)
- [`nvtext::replace_tokens`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__replace.html#ga66219b7db6155c4e14bf6f6147e1fc81)
- [`nvtext::normalize_spaces`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__normalize.html#ga9104dffc71baf77e710bc63e5e2a8837)
- [`nvtext::ngrams_tokenize`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__ngrams.html#gace17045b4ee5a3b10157ed40f9575298)

This change improved performance by at least 10% for all string lengths for most of these functions.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)

URL: #13480
@GregoryKimball GregoryKimball changed the title Story - Improve performance with long strings [FEA] Story - Improve performance with long strings Aug 2, 2023
@GregoryKimball
Copy link
Contributor Author

Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

@davidwendt
Copy link
Contributor

> Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

There are several parts of the strings (and nvtext) API implementations that essentially resolve into a fragmented set of (pointer, size) pairs, which are passed to a factory that performs an optimized gather to build a contiguous byte array for an output strings column. So a significant improvement here may come from accelerating that gather function with DeviceBatchMemcpy.

The factory function that uses the gather is illustrated in my blog post, which uses this custom kernel:

```cpp
auto result = cudf::make_strings_column(str_ptrs, cudf::string_view{nullptr, 0}, stream);
```

I'll locate other places that libcudf uses this factory function that also include benchmarks.

@GregoryKimball
Copy link
Contributor Author

Spark-RAPIDS identified long strings performance issues in find (#15405) as well as upper/lower (#15406)
