
[BUG] concatenate on many string columns causes storm of synchronized HtoD copies #6465

Closed · jlowe opened this issue on Oct 7, 2020 · 7 comments · Fixed by #6605
Labels: bug (Something isn't working) · libcudf (Affects libcudf (C++/CUDA) code) · Performance (Performance related issue) · Spark (Functionality that helps Spark RAPIDS) · strings (strings issues (C++ and Python))

Comments

jlowe (Member) commented on Oct 7, 2020

Describe the bug
While investigating nsys profile traces I noticed table concatenation occasionally taking far longer than expected. It appears to be localized to tables containing string columns. When concatenating string columns, the traces show a flurry of many tiny host-to-device transfers (96 bytes each) with a stream synchronization after each transfer. Often more time is spent doing these transfers than it takes to do the actual concatenation kernels.

Steps/Code to reproduce bug
Take an nsight systems trace of code performing a concatenate of 20 string columns. Note the excessive tiny transfers and stream synchronization that occurs before the first kernel is launched.
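
For reference, a minimal sketch of such a repro, assuming libcudf's test utility `cudf::test::strings_column_wrapper` and arbitrary short string data (the actual repro case is attached in a comment below):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/concatenate.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <memory>
#include <vector>

int main()
{
  // Build 20 small string columns and collect views of them.
  std::vector<std::unique_ptr<cudf::column>> columns;
  std::vector<cudf::column_view> views;
  for (int i = 0; i < 20; ++i) {
    cudf::test::strings_column_wrapper wrapper({"eee", "bb", "", "aa", "bbb"});
    columns.push_back(wrapper.release());
  }
  for (auto const& col : columns) { views.push_back(col->view()); }

  // Profile this call with `nsys profile`: the tiny HtoD copies and per-copy
  // stream synchronizations appear before the first concatenate kernel launches.
  auto result = cudf::concatenate(views);
  return 0;
}
```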

Expected behavior
Host data batched into fewer, larger transfers to minimize both the cost of transfers and any related synchronization.

jlowe added the bug, libcudf, Performance, and Spark labels on Oct 7, 2020
jlowe (Member, Author) commented on Oct 7, 2020

I believe these tiny transfers are triggered by create_strings_device_views. It builds a vector of column_device_view from the vector of columns to concatenate, in preparation for packing them all into a contiguous block of host memory to transfer to the device. However, creating each device view causes a tiny HtoD transfer and synchronization for its child column view. With many input string columns, this triggers many tiny, synchronized host-to-device transfers.

Ideally this would build a single block of host memory for all necessary device views, not just the top-level views. That block could then be allocated on the device and transferred with a single HtoD memcpy, reducing both the number of RMM allocations and the overall overhead of this operation.
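
A minimal sketch of that batching idea, using a hypothetical ViewRecord struct as a stand-in for the ~96-byte device-view objects rather than libcudf's real types:

```cpp
#include <rmm/device_buffer.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Placeholder for the ~96-byte device-view payload described above; not
// libcudf's actual column_device_view.
struct ViewRecord {
  char bytes[96];
};

// Pack every record into one contiguous host block, then issue a single device
// allocation and a single HtoD copy instead of one synchronized copy per column.
rmm::device_buffer copy_views_in_one_shot(std::vector<ViewRecord> const& host_views,
                                          cudaStream_t stream)
{
  std::size_t const bytes = host_views.size() * sizeof(ViewRecord);
  rmm::device_buffer d_views(bytes, stream);
  cudaMemcpyAsync(d_views.data(), host_views.data(), bytes,
                  cudaMemcpyHostToDevice, stream);
  // One synchronization for the whole batch (only needed here because the host
  // vector may be freed or reused before the asynchronous copy completes).
  cudaStreamSynchronize(stream);
  return d_views;
}
```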

harrism added the strings label on Oct 9, 2020
harrism (Member) commented on Oct 13, 2020

> Take an nsight systems trace of code performing a concatenate of 20 string columns. Note the excessive tiny transfers and stream synchronization that occurs before the first kernel is launched.

Since you already have it, why not attach the code and trace to this issue to make it easier to work on?

jlowe (Member, Author) commented on Oct 13, 2020

> Since you already have it, why not attach the code and trace to this issue to make it easier to work on?

I didn't attach the trace because it's a full trace of a Spark query and quite large, containing much more activity than just column concatenation. That's also why I didn't post code, as I didn't have a nice, focused repro case to post at the time.

But since you asked, I created a repro case and the corresponding nsys trace for it. This repro uses 200 columns, which matches Spark's default partition count. Here's a screenshot showing just the libcudf concatenate portion of the trace:
[screenshot: nsys timeline of the libcudf concatenate portion of the trace]

The dense red and green lines at the beginning are the 200+ tiny HtoD copies, each followed by a stream synchronize. The short green lines at the end are many cudaEventRecord calls, which I assume come from RMM as all the tiny device buffers are freed after processing completes.

The attached zipfile contains the sample repro code along with the trace file. The repro code is derived from StringColumnTest in concatenate_tests.cu.

concat-test.zip

davidwendt (Contributor) commented:

I think the single allocate-copy step could be performed by using a table_view inside create_strings_device_views: build a table_view from the input vector of columns and then call table_device_view::create() on it.
This create() function creates a single memory buffer to hold the device views for all of its columns (and their children). This should be about the same size as the original memory (which is very small) but incurs only a single copy and sync.
Since the table_device_view owns this memory, it must exist for the lifetime of its children, so the code logic would need to be reworked appropriately.
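
For illustration, a rough sketch of what that suggestion might look like (note the caveat raised in the next comment about row counts):

```cpp
#include <cudf/table/table_device_view.cuh>
#include <cudf/table/table_view.hpp>

#include <vector>

// Pack device views for all columns (and their children) with one allocation
// and one HtoD copy. Caveat: a table_view requires all columns to have the
// same number of rows, which concatenate's inputs generally do not.
void pack_views_via_table(std::vector<cudf::column_view> const& cols)
{
  cudf::table_view tv{cols};
  auto d_table = cudf::table_device_view::create(tv);
  // d_table owns the packed device memory, so it must outlive any kernel that
  // dereferences the contained column_device_views.
}
```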

jlowe (Member, Author) commented on Oct 13, 2020

> Build a table_view from the input vector of columns

Doesn't a table_view require all columns to be the same number of rows?

davidwendt (Contributor) commented:

Ah yes, you are right.

karthikeyann (Contributor) commented:

Detailed profile of the issue:

[screenshot: detailed nsys profile of the issue]

See both device_view_owners and the corresponding "CUDA API" activity.

With the fix from #6605:

[screenshot: nsys profile of the same workload with the fix from #6605]

karthikeyann added a commit that referenced this issue Nov 28, 2020
Co-authored-by: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Co-authored-by: Mark Harris <mharris@nvidia.com>

closes #6465

- Add utility `cudf::detail::align_ptr_for_type`
- Add `contiguous_copy_column_device_views`
- Reduce multiple HtoD copies in `cudf::concatenate` by adding `create_contiguous_device_views` for a list of column_views
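
For illustration, a rough sketch of the kind of alignment helper the first bullet describes (the actual `cudf::detail::align_ptr_for_type` added in #6605 may differ):

```cpp
#include <cstdint>

// Illustration only: bump a raw pointer up to the next address aligned for T,
// so differently-typed device-view objects can be laid out back-to-back in one
// contiguous buffer before a single bulk HtoD copy.
template <typename T>
T* align_ptr_for_type(void* ptr)
{
  auto const addr      = reinterpret_cast<std::uintptr_t>(ptr);
  auto const alignment = static_cast<std::uintptr_t>(alignof(T));
  return reinterpret_cast<T*>((addr + alignment - 1) & ~(alignment - 1));
}
```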