[BUG] concatenate on many string columns causes storm of synchronized HtoD copies #6465
Comments
I believe these tiny transfers are triggered by the construction of the column device views. Ideally this would build a single block of host memory for all necessary device views, not just the top-level views. This could then be allocated on the device and transferred with a single HtoD memcpy, reducing both the overhead of RMM allocations and the overhead of this operation.
Since you already have it, why not attach the code and trace to this issue to make it easier to work on?
I think the single allocate-copy step could be performed by using a …
Doesn't a …?
Ah yes, you are right.
Detailed profile of the issue: see both profiles. With the fix from #6605, …
Co-authored-by: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Co-authored-by: Mark Harris <mharris@nvidia.com>

Closes #6465

- Add utility `cudf::detail::align_ptr_for_type`
- Add `contiguous_copy_column_device_views`
- Reduce multiple HtoD copies in `cudf::concatenate` by adding `create_contiguous_device_views` for a list of `column_view`s
Describe the bug
While investigating nsys profile traces I noticed table concatenation occasionally taking far longer than expected. It appears to be localized to tables containing string columns. When concatenating string columns, the traces show a flurry of many tiny host-to-device transfers (96 bytes each) with a stream synchronization after each transfer. Often more time is spent doing these transfers than it takes to do the actual concatenation kernels.
Steps/Code to reproduce bug
Take an Nsight Systems trace of code performing a concatenate of 20 string columns. Note the excessive tiny transfers and stream synchronizations that occur before the first kernel is launched.
Expected behavior
Host data batched into fewer, larger transfers to minimize both the cost of transfers and any related synchronization.