Rework logic in cudf::strings::split_record to improve performance #12729

Merged
merged 31 commits into from
Feb 21, 2023

Conversation

davidwendt
Contributor

@davidwendt davidwendt commented Feb 8, 2023

Description

Updates the cudf::strings::split_record logic to match the more optimized code in cudf::strings::split.
The optimized code performs much better for longer strings (>64 bytes) by parallelizing over the character bytes to find delimiters before determining split tokens.
This led to refactoring the code so that both APIs can share the optimized code.
Also fixes a bug found when using overlapped delimiters.
Additional tests were added for multi-byte delimiters which can overlap and span multiple adjacent strings.
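
The approach described above can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the libcudf implementation: `StringsColumn`, `make_column`, `find_candidates`, and `split_record_sketch` are hypothetical names, the loop over byte positions stands in for what would be one GPU thread per byte, and an empty delimiter is simply treated as "no split" here. The second pass shows the two cases called out in the description: a candidate that overlaps a delimiter already accepted is dropped, and a candidate that runs past the end of its row (spanning into the adjacent string) is dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flattened layout: every row's bytes in one buffer plus row offsets.
struct StringsColumn {
  std::string chars;                 // concatenated bytes of all rows
  std::vector<std::size_t> offsets;  // row i occupies [offsets[i], offsets[i + 1])
};

StringsColumn make_column(const std::vector<std::string>& rows) {
  StringsColumn col;
  col.offsets.push_back(0);
  for (const auto& r : rows) {
    col.chars += r;
    col.offsets.push_back(col.chars.size());
  }
  return col;
}

// Pass 1: test every byte position independently for a delimiter match.
// Each iteration is independent of the others, so on a GPU this could be
// one thread per byte. Row boundaries are ignored at this stage.
std::vector<std::size_t> find_candidates(const StringsColumn& col, const std::string& delim) {
  std::vector<std::size_t> hits;
  if (delim.empty()) return hits;  // simplification: empty delimiter means no split
  for (std::size_t pos = 0; pos + delim.size() <= col.chars.size(); ++pos) {
    if (col.chars.compare(pos, delim.size(), delim) == 0) hits.push_back(pos);
  }
  return hits;
}

// Pass 2: per row, walk that row's candidates in order and build the tokens.
std::vector<std::vector<std::string>> split_record_sketch(const std::vector<std::string>& rows,
                                                          const std::string& delim) {
  const StringsColumn col = make_column(rows);
  assert(col.offsets.size() == rows.size() + 1);
  const std::vector<std::size_t> hits = find_candidates(col, delim);

  std::vector<std::vector<std::string>> out(rows.size());
  std::size_t h = 0;  // index of the next unconsumed candidate
  for (std::size_t row = 0; row < rows.size(); ++row) {
    const std::size_t end = col.offsets[row + 1];
    std::size_t token_start = col.offsets[row];
    for (; h < hits.size() && hits[h] < end; ++h) {
      const std::size_t pos = hits[h];
      if (pos < token_start) continue;         // overlaps the previously accepted delimiter
      if (pos + delim.size() > end) continue;  // match spans into the next row
      out[row].push_back(col.chars.substr(token_start, pos - token_start));
      token_start = pos + delim.size();
    }
    out[row].push_back(col.chars.substr(token_start, end - token_start));
  }
  return out;
}
```

With the delimiter "::", the row "a:::b" has candidates at byte 1 and byte 2; the second overlaps the first and is skipped, giving the tokens "a" and ":b". For the adjacent rows "ab:" and ":cd", the flattened buffer contains "::" across the row boundary, but that candidate is rejected and neither row is split.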

Closes #12694

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added the bug, 2 - In Progress, libcudf, strings, and non-breaking labels Feb 8, 2023
@davidwendt davidwendt self-assigned this Feb 8, 2023
@codecov

codecov bot commented Feb 8, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@ec8704a). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head d7dcb2a differs from pull request most recent head 524e038. Consider uploading reports for the commit 524e038 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12729   +/-   ##
===============================================
  Coverage                ?   85.85%           
===============================================
  Files                   ?      158           
  Lines                   ?    25204           
  Branches                ?        0           
===============================================
  Hits                    ?    21638           
  Misses                  ?     3566           
  Partials                ?        0           


@davidwendt davidwendt removed the 2 - In Progress label Feb 13, 2023
@davidwendt davidwendt added the 3 - Ready for Review label Feb 13, 2023
@davidwendt
Contributor Author

Performance numbers for cudf::strings::split_record (columns: benchmark, time before this change, time after this change, speedup)

/record/4096/32          0.186         0.262      0.71x
/record/4096/64          0.254         0.270      0.94x
/record/4096/128         0.408         0.281      1.45x
/record/4096/256         0.845         0.307      2.75x
/record/4096/512          2.25         0.280      8.04x
/record/4096/1024         7.89         0.307     25.70x
/record/4096/2048         29.3         0.376     77.93x
/record/4096/4096          115         0.498    230.92x
/record/4096/8192          464         0.752    617.02x
/record/32768/32         0.204         0.274      0.74x
/record/32768/64         0.279         0.297      0.94x
/record/32768/128        0.506         0.349      1.45x
/record/32768/256         1.05         0.444      2.36x
/record/32768/512         2.71         0.413      6.56x
/record/32768/1024        8.36         0.561     14.90x
/record/32768/2048        36.5         0.861     42.39x
/record/32768/4096         164          1.46    112.33x
/record/32768/8192         706          2.70    261.48x
/record/262144/32        0.328         0.419      0.78x
/record/262144/64        0.576         0.581      0.99x
/record/262144/128        1.47          1.16      1.27x
/record/262144/256        5.65          3.62      1.56x
/record/262144/512        12.3          1.51      8.15x
/record/262144/1024       48.3          2.71     17.82x
/record/262144/2048        216          5.59     38.64x
/record/262144/4096       1066          11.8     90.34x
/record/2097152/32        1.24          1.47      0.84x
/record/2097152/64        2.85          2.73      1.04x
/record/2097152/128       8.27          6.51      1.27x
/record/2097152/256       31.9          23.1      1.38x
/record/2097152/512       55.6          10.5      5.30x
/record/16777216/32       8.79          10.1      0.87x
/record/16777216/64       21.3          20.2      1.05x
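
As a reading aid for the table above: the last column appears to be the first timing column divided by the second, rounded to two decimals, so values below 1.00x mean the new code is slower for that case. A minimal sketch of that arithmetic follows; the function names `speedup` and `round2` are hypothetical and the timing unit is not stated in the comment.

```cpp
#include <cmath>

// Ratio of the old time to the new time; greater than 1 means the new code is faster.
double speedup(double before, double after) { return before / after; }

// Round to two decimal places, matching the precision shown in the table.
double round2(double x) { return std::round(x * 100.0) / 100.0; }
```

For example, the /record/4096/32 row gives 0.186 / 0.262, which rounds to 0.71, and the /record/4096/8192 row gives 464 / 0.752, which rounds to 617.02.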

@davidwendt davidwendt marked this pull request as ready for review February 13, 2023 21:31
@davidwendt davidwendt requested a review from a team as a code owner February 13, 2023 21:31
Review thread on cpp/src/strings/split/split.cuh (outdated, resolved)
Review thread on cpp/src/strings/split/split.cuh (resolved)
Review thread on cpp/src/strings/split/split.cuh (outdated, resolved)
Member

@PointKernel PointKernel left a comment

LGTM with some nits (not necessarily a change request though)

Review thread on cpp/benchmarks/string/split.cpp (outdated, resolved)
Review thread on cpp/src/strings/split/split.cu (outdated, resolved)
Review thread on cpp/src/strings/split/split.cuh (outdated, resolved)
Review thread on cpp/src/strings/split/split.cuh (outdated, resolved)
@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 7da233b into rapidsai:branch-23.04 Feb 21, 2023
@davidwendt davidwendt deleted the perf-split-record branch February 21, 2023 16:36
Labels
3 - Ready for Review, bug, libcudf, non-breaking, strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cudf::strings::split_record can be over 15x slower than a single thread on the CPU for some cases
4 participants