
[BUG] concatenate on many string columns causes storm of synchronized HtoD copies #6465

Closed · jlowe opened this issue on Oct 7, 2020 · 7 comments · Fixed by #6605
Labels: bug (Something isn't working) · libcudf (Affects libcudf (C++/CUDA) code) · Performance (Performance related issue) · Spark (Functionality that helps Spark RAPIDS) · strings (strings issues (C++ and Python))

Comments

jlowe (Member) commented on Oct 7, 2020

Describe the bug
While investigating nsys profile traces I noticed table concatenation occasionally taking far longer than expected. It appears to be localized to tables containing string columns. When concatenating string columns, the traces show a flurry of many tiny host-to-device transfers (96 bytes each) with a stream synchronization after each transfer. Often more time is spent doing these transfers than it takes to do the actual concatenation kernels.

Steps/Code to reproduce bug
Take an nsight systems trace of code performing a concatenate of 20 string columns. Note the excessive tiny transfers and stream synchronization that occurs before the first kernel is launched.
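
For reference, a minimal sketch of such a repro, assuming libcudf's test utility `cudf::test::strings_column_wrapper` and arbitrary short string data (the actual repro case is attached in a comment below):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/concatenate.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <memory>
#include <vector>

int main()
{
  // Build 20 small string columns and collect views of them.
  std::vector<std::unique_ptr<cudf::column>> columns;
  std::vector<cudf::column_view> views;
  for (int i = 0; i < 20; ++i) {
    cudf::test::strings_column_wrapper wrapper({"eee", "bb", "", "aa", "bbb"});
    columns.push_back(wrapper.release());
  }
  for (auto const& col : columns) { views.push_back(col->view()); }

  // Profile this call with `nsys profile`: the tiny HtoD copies and per-copy
  // stream synchronizations appear before the first concatenate kernel launches.
  auto result = cudf::concatenate(views);
  return 0;
}
```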

Expected behavior
Host data batched into fewer, larger transfers to minimize both the cost of transfers and any related synchronization.

jlowe added the bug, libcudf, Performance, and Spark labels on Oct 7, 2020
jlowe (Member, Author) commented on Oct 7, 2020

I believe these tiny transfers are triggered by create_strings_device_views. It builds a vector of column_device_view from the vector of columns to concatenate, in preparation for packing them all into a contiguous block of host memory to transfer to the device. However, creating each device view causes a tiny HtoD transfer and synchronization for its child column view. With many input string columns, this triggers many tiny, synchronized host-to-device transfers.

Ideally this would build a single block of host memory for all necessary device views, not just the top-level views. That block could then be allocated on the device and transferred with a single HtoD memcpy, reducing both the number of RMM allocations and the overall overhead of this operation.
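
A minimal sketch of that batching idea, using a hypothetical ViewRecord struct as a stand-in for the ~96-byte device-view objects rather than libcudf's real types:

```cpp
#include <rmm/device_buffer.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Placeholder for the ~96-byte device-view payload described above; not
// libcudf's actual column_device_view.
struct ViewRecord {
  char bytes[96];
};

// Pack every record into one contiguous host block, then issue a single device
// allocation and a single HtoD copy instead of one synchronized copy per column.
rmm::device_buffer copy_views_in_one_shot(std::vector<ViewRecord> const& host_views,
                                          cudaStream_t stream)
{
  std::size_t const bytes = host_views.size() * sizeof(ViewRecord);
  rmm::device_buffer d_views(bytes, stream);
  cudaMemcpyAsync(d_views.data(), host_views.data(), bytes,
                  cudaMemcpyHostToDevice, stream);
  // One synchronization for the whole batch (only needed here because the host
  // vector may be freed or reused before the asynchronous copy completes).
  cudaStreamSynchronize(stream);
  return d_views;
}
```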

harrism added the strings label on Oct 9, 2020
harrism (Member) commented on Oct 13, 2020

> Take an nsight systems trace of code performing a concatenate of 20 string columns. Note the excessive tiny transfers and stream synchronization that occurs before the first kernel is launched.

Since you already have it, why not attach the code and trace to this issue to make it easier to work on?

jlowe (Member, Author) commented on Oct 13, 2020

> Since you already have it, why not attach the code and trace to this issue to make it easier to work on?

I didn't attach the trace because it's a full trace of a Spark query and quite large, containing much more activity than just column concatenation. That's also why I didn't post code, as I didn't have a nice, focused repro case to post at the time.

But since you asked, I created a repro case and the corresponding nsys trace for it. This repro uses 200 columns, which matches Spark's default partition count. Here's a screenshot showing just the libcudf concatenate portion of the trace:
[screenshot: nsys timeline of the libcudf concatenate portion of the trace]

The dense red and green lines at the beginning are the 200+ tiny HtoD copies, each followed by a stream synchronize. The short green lines at the end are many cudaEventRecord calls, which I assume come from RMM as all the tiny device buffers are freed after processing completes.

The attached zipfile contains the sample repro code along with the trace file. The repro code is derived from StringColumnTest in concatenate_tests.cu.

concat-test.zip

davidwendt (Contributor) commented:

I think the single allocate-copy step could be performed by using a table_view inside create_strings_device_views: build a table_view from the input vector of columns and then call table_device_view::create() on it.
This create() function creates a single memory buffer to hold the device views for all of its columns (and their children). This should be about the same size as the original memory (which is very small) but incurs only a single copy and sync.
Since the table_device_view owns this memory, it must exist for the lifetime of its children, so the code logic would need to be reworked appropriately.
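
For illustration, a rough sketch of what that suggestion might look like (note the caveat raised in the next comment about row counts):

```cpp
#include <cudf/table/table_device_view.cuh>
#include <cudf/table/table_view.hpp>

#include <vector>

// Pack device views for all columns (and their children) with one allocation
// and one HtoD copy. Caveat: a table_view requires all columns to have the
// same number of rows, which concatenate's inputs generally do not.
void pack_views_via_table(std::vector<cudf::column_view> const& cols)
{
  cudf::table_view tv{cols};
  auto d_table = cudf::table_device_view::create(tv);
  // d_table owns the packed device memory, so it must outlive any kernel that
  // dereferences the contained column_device_views.
}
```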

jlowe (Member, Author) commented on Oct 13, 2020

> Build a table_view from the input vector of columns

Doesn't a table_view require all columns to be the same number of rows?

davidwendt (Contributor) commented:

Ah yes, you are right.

karthikeyann (Contributor) commented:

Detailed profile of the issue:

[screenshot: detailed nsys profile of the issue]

See both device_view_owners and the corresponding "CUDA API" activity.

With the fix from #6605:

[screenshot: nsys profile of the same workload with the fix from #6605]

karthikeyann added a commit that referenced this issue Nov 28, 2020
Co-authored-by: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Co-authored-by: Mark Harris <mharris@nvidia.com>

closes #6465

- Add utility `cudf::detail::align_ptr_for_type`
- Add `contiguous_copy_column_device_views`
- Reduce multiple HtoD copies in `cudf::concatenate` by adding `create_contiguous_device_views` for a list of column_views
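
For illustration, a rough sketch of the kind of alignment helper the first bullet describes (the actual `cudf::detail::align_ptr_for_type` added in #6605 may differ):

```cpp
#include <cstdint>

// Illustration only: bump a raw pointer up to the next address aligned for T,
// so differently-typed device-view objects can be laid out back-to-back in one
// contiguous buffer before a single bulk HtoD copy.
template <typename T>
T* align_ptr_for_type(void* ptr)
{
  auto const addr      = reinterpret_cast<std::uintptr_t>(ptr);
  auto const alignment = static_cast<std::uintptr_t>(alignof(T));
  return reinterpret_cast<T*>((addr + alignment - 1) & ~(alignment - 1));
}
```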