
[BUG] String column copy performs a full gather #6803

Closed
jlowe opened this issue Nov 19, 2020 · 1 comment · Fixed by #6837

Labels
bug (Something isn't working) · libcudf (Affects libcudf (C++/CUDA) code) · Performance (Performance related issue) · Spark (Functionality that helps Spark RAPIDS) · strings (strings issues (C++ and Python))

Comments

jlowe (Member) commented Nov 19, 2020

Describe the bug
While analyzing an Nsight trace of a Spark query, I noticed 10ms being spent on a filter. Digging deeper, the filter time was spent in a cudf::table copy constructor, with most of that time going to copying a single string column within the table.

Spark queries often filter input data to remove nulls before further processing, and often nothing is actually filtered out. cudf::apply_boolean_mask, used to implement the Spark filter, has short-circuit logic to avoid a full gather: it copy-constructs the output table if it determines nothing will be filtered. However, copy-constructing a string column performs a full gather, which can be much more expensive than simply copying the input column buffers.
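
A minimal sketch of that short-circuit idea (not the actual libcudf implementation; the helper name, the use of thrust::count, and the exact constructor defaults here are assumptions for illustration):

#include <cudf/column/column_view.hpp>
#include <cudf/stream_compaction.hpp>
#include <cudf/table/table.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <thrust/count.h>
#include <thrust/execution_policy.h>
#include <memory>

// Hypothetical helper: if every mask element is valid and true, nothing
// will be filtered, so a table copy suffices instead of a gather.
std::unique_ptr<cudf::table> filter_if_needed(cudf::table_view const& input,
                                              cudf::column_view const& mask,
                                              rmm::cuda_stream_view stream)
{
  auto const num_true = thrust::count(thrust::cuda::par.on(stream.value()),
                                      mask.begin<bool>(), mask.end<bool>(), true);
  if (mask.null_count() == 0 && num_true == mask.size()) {
    // Copy-construct the output table -- but as this issue reports, the
    // strings copy constructor currently gathers instead of memcpy-ing.
    return std::make_unique<cudf::table>(input);
  }
  return cudf::apply_boolean_mask(input, mask);  // genuine filtering path
}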

In this specific case, the string column consists of 178,200 rows of roughly 200 characters each. The string copy constructor took almost 10 milliseconds on the GPU even though the column occupies only about 30MB of device memory, which should take a small fraction of a millisecond to copy at typical device memory bandwidth.

Steps/Code to reproduce bug
Perform a copy of a string column containing many relatively long strings (e.g., 200+ characters per row). I've attached a gzipped Parquet file containing the string column from the specific filter case mentioned above for reference; a repro sketch follows below.

filtertest.parquet.gz
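
A minimal repro sketch, assuming the attachment has been decompressed to filtertest.parquet (the timing comment reflects the report above):

#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

int main()
{
  auto const opts = cudf::io::parquet_reader_options::builder(
                      cudf::io::source_info{"filtertest.parquet"})
                      .build();
  auto const result = cudf::io::read_parquet(opts);

  // Copy-constructing the table triggers the strings gather path;
  // profile this line under Nsight to see the ~10ms of kernel time.
  cudf::table copy{result.tbl->view()};
  return 0;
}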

Expected behavior
Copying a strings column view that starts at base offset 0 should run at device memory speed, copying the underlying buffers rather than performing the more complicated gather computation (which also requires synchronizing and a device-to-host transfer).

jlowe added the bug, libcudf, Performance, and Spark labels on Nov 19, 2020
davidwendt added the strings label on Nov 19, 2020
jrhemstad (Contributor) commented

In addition, even when this operation does need to perform a gather, it is inefficient because it materializes the gather map:

rmm::device_vector<size_type> indices(strings_count);
thrust::sequence(execpol->on(stream), indices.begin(), indices.end(), start, step);
// create a column_view as a wrapper of these indices
column_view indices_view(
  data_type{type_id::INT32}, strings_count, indices.data().get(), nullptr, 0);
// build a new strings column from the indices
auto sliced_table = cudf::detail::gather(table_view{{strings.parent()}},
                                         indices_view,
                                         cudf::detail::out_of_bounds_policy::NULLIFY,
                                         cudf::detail::negative_index_policy::NOT_ALLOWED,
                                         mr,
                                         stream)
                      ->release();

This should just use a counting iterator passed to the detail version of gather that uses an iterator for the map.
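
A sketch of that suggestion, assuming the iterator-based overload of cudf::detail::gather in cudf/detail/gather.cuh takes a begin/end map-iterator pair (exact parameter order may differ across versions):

#include <cudf/detail/gather.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Generate the map lazily as start + i * step: no device allocation and
// no thrust::sequence kernel launch. The device lambda requires compiling
// with nvcc --extended-lambda.
auto map_begin = thrust::make_transform_iterator(
  thrust::make_counting_iterator<cudf::size_type>(0),
  [start, step] __device__(cudf::size_type i) { return start + i * step; });

auto sliced_table = cudf::detail::gather(table_view{{strings.parent()}},
                                         map_begin,
                                         map_begin + strings_count,
                                         cudf::detail::out_of_bounds_policy::NULLIFY,
                                         stream,
                                         mr)
                      ->release();

For the common step == 1 case, a plain counting iterator starting at start would suffice without the transform wrapper.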

harrism pushed a commit that referenced this issue Nov 26, 2020
…6837)

Fixes #6803

This optimizes string slice copying in the case where the string column view starts at offset 0. In that case the offset values do not need to be modified, and all of the column buffers can be copied directly.
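
For context, a sketch of the assumed shape of that fast path (not necessarily the exact code in #6837, and assuming the 2020-era API where strings_column_view exposes offsets() and chars() children): when the view's offset is 0, the offsets child is already zero-based, so every buffer can be duplicated verbatim.

#include <cudf/column/column_factories.hpp>
#include <cudf/null_mask.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <memory>

std::unique_ptr<cudf::column> copy_strings(cudf::strings_column_view const& strings,
                                           rmm::cuda_stream_view stream,
                                           rmm::mr::device_memory_resource* mr)
{
  if (strings.offset() == 0) {
    // Fixed-width children copy at device memory speed.
    auto offsets = std::make_unique<cudf::column>(strings.offsets(), stream, mr);
    auto chars   = std::make_unique<cudf::column>(strings.chars(), stream, mr);
    return cudf::make_strings_column(strings.size(),
                                     std::move(offsets),
                                     std::move(chars),
                                     strings.null_count(),
                                     cudf::copy_bitmask(strings.parent(), stream, mr));
  }
  // Sliced views (offset != 0) still need offset adjustment / gather.
  return nullptr;  // placeholder: fall back to the gather-based path
}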