[DISCUSSION] libcudf should not introspect input data to perform error checking #5505

jrhemstad · 2020-06-18T15:55:23Z

Is your feature request related to a problem? Please describe.

A few functions in libcudf optionally validate that input data is valid before performing the operation. For example,

Lines 62 to 66 in 31d7466

    
    std::unique_ptr<table> gather(
      table_view const& source_table,
      column_view const& gather_map,
      bool check_bounds                   = false,
      rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

cudf::gather has a check_bounds bool parameter that enables verifying if the values in the gather_map are within bounds. This requires launching a kernel to introspect the gather map data:

cudf/cpp/src/copying/gather.cu

Lines 30 to 39 in 31d7466

    
    if (bounds == out_of_bounds_policy::FAIL) {
      cudf::size_type begin =
        neg_indices == negative_index_policy::ALLOWED ? -source_table.num_rows() : 0;
      CUDF_EXPECTS(num_destination_rows ==
                     thrust::count_if(rmm::exec_policy()->on(0),
                                      gather_map.begin<map_type>(),
                                      gather_map.end<map_type>(),
                                      bounds_checker<map_type>{begin, source_table.num_rows()}),
                   "Index out of bounds.");
    }

The reason for this verification is that cuDF Python expects to throw an exception if any of the values are out of bounds.

However, there is no reason for libcudf to be performing this verification directly inside of the gather implementation. The cuDF Python bindings for gather can easily add this bounds checking as a pre-processing step.

cudf::repeat is a similar example of this behavior:

cudf/cpp/include/cudf/filling.hpp

Lines 118 to 122 in b72e647

    
    std::unique_ptr<table> repeat(
      table_view const& input_table,
      column_view const& count,
      bool check_count                    = false,
      rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

Having this bounds checking inside the libcudf function is detrimental for a number of reasons:

- It complicates the libcudf implementation. The gather implementation is quite complicated because of all the branches inside, including whether or not we need to check bounds. It's not as simple as a single if/else, because the check_bounds flag interacts with other internal flags such as allow_negative_indices. Moving this check outside of gather would significantly simplify gather's implementation.
- It violates the single responsibility principle. Verifying the validity of input data should be independent of performing the actual operation.
- It is not in line with the semantics of other C++ libraries. For example, thrust::gather does not perform any validation of the map data.

Describe the solution you'd like

Optional input data validation like in gather or repeat should be eliminated from libcudf features.

Python features that rely on this bounds checking should implement it as a pre-processing step using existing libcudf primitives, or identify new primitives that would enable the necessary validation.
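As a rough illustration of that split, here is a minimal pure-Python sketch. Plain lists stand in for device columns, and `unchecked_gather`, `check_gather_map`, and `checked_gather` are hypothetical names, not libcudf or cuDF APIs. The only point is that validation lives in the caller while the gather itself trusts its input.

```python
def unchecked_gather(source, gather_map):
    """Gather without validating the map (stand-in for the libcudf call).

    The caller must guarantee every index lies in [-n, n). Out-of-range
    indices silently wrap here, standing in for undefined behavior on a GPU.
    """
    n = len(source)
    return [source[i % n] for i in gather_map]


def check_gather_map(gather_map, num_rows):
    """Pre-processing step: raise if any index is outside [-num_rows, num_rows)."""
    if any(i < -num_rows or i >= num_rows for i in gather_map):
        raise IndexError("Index out of bounds.")


def checked_gather(source, gather_map):
    """What a Python-level binding could do: validate first, then gather."""
    check_gather_map(gather_map, len(source))
    return unchecked_gather(source, gather_map)
```

Callers that already know their map is valid (for example, one they generated themselves) can call the unchecked function directly and skip the extra pass over the data.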

Additional context

To be clear, I am not suggesting libcudf functions do not perform any error checking whatsoever. I am suggesting we remove any error checking that requires introspection of input data (i.e., launching a kernel). Checks that data has the right data type, size, etc. must still be preserved. An obvious indication that a function does this today is an optional boolean flag like the ones in gather or repeat.
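To make the distinction concrete, here is a small sketch of the kind of check that would stay, written in plain Python. `ColumnMeta`, `INTEGRAL_DTYPES`, and `validate_repeat_args` are hypothetical names invented for this example, and the specific rules (integral type, one count per row) are assumptions chosen for illustration rather than a statement of what libcudf enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical host-side description of a column; holds no element values."""
    dtype: str
    size: int


INTEGRAL_DTYPES = frozenset({"int8", "int16", "int32", "int64"})


def validate_repeat_args(num_input_rows: int, count: ColumnMeta) -> None:
    """Checks that need only metadata, so no pass over the data is required."""
    if count.dtype not in INTEGRAL_DTYPES:
        raise TypeError("count must have an integral type")
    if count.size != num_input_rows:
        raise ValueError("count must have one entry per input row")
    # Deliberately absent: "are all counts non-negative?". Answering that
    # means reading the data, which this proposal leaves to the caller.
```

Both checks run in constant time on the host, which is what separates them from the optional flag-controlled validation discussed above.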

harrism · 2020-07-20T02:05:44Z

👍

shwina · 2020-07-22T19:38:27Z

I'm fine with this, but it's worth bringing up that gather performance is critical to performance in some applications, and the pre-processing associated with negative indexes was previously found to be a bottleneck: #2675

shwina · 2020-07-22T19:44:58Z

Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 )

jrhemstad · 2020-07-22T19:46:33Z

> Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 )

Whether we check bounds in libcudf or Python, either way it's a kernel launch. I wouldn't expect it to be substantively more expensive to do the bounds check outside of the gather call.

kkraus14 · 2020-07-22T19:47:19Z

@shwina I think we can add this in the Cython as opposed to in the Python layer to amortize some of the typical Python overheads.

shwina · 2020-07-22T19:49:20Z

Sounds good -- I'll benchmark and report here, and we can decide based on that?

shwina · 2020-07-23T14:33:13Z

So I ran a quick benchmark. Here are the results:

With libcudf bounds checking:

size: 100 :: time: 0.007442206988343969
size: 1000 :: time: 0.006695960997603834
size: 10000 :: time: 0.006788923987187445
size: 100000 :: time: 0.009929071005899459
size: 1000000 :: time: 0.028811413008952513
size: 10000000 :: time: 0.06236411599093117

Without any bounds checking:

size: 100 :: time: 0.006547413009684533
size: 1000 :: time: 0.005977320979582146
size: 10000 :: time: 0.005807141977129504
size: 100000 :: time: 0.008778560993960127
size: 1000000 :: time: 0.020219083002302796
size: 10000000 :: time: 0.0580876559833996

With cudf bounds checking:

size: 100 :: time: 0.010622989007970318
size: 1000 :: time: 0.010242449003271759
size: 10000 :: time: 0.009886790998280048
size: 100000 :: time: 0.01461042498704046
size: 1000000 :: time: 0.026591391011606902
size: 10000000 :: time: 0.0630068660248071

Benchmark used (basically "reversing" a column by performing a gather):

import timeit
import cupy as cp
import cudf

for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    a = cudf.Series(cp.arange(size))

    start = timeit.default_timer()
    for i in range(10):
        result = a.iloc[cp.arange(size-1, -1, -1)]
    end = timeit.default_timer()

    print(f"size: {size} :: time: {end-start}")

The "cudf bounds checking" is implemented in Cython with as little overhead as possible. Basically it is a max reduction (max(gather_map)).

    cdef data_type c_dtype = data_type(tid)
    cdef unique_ptr[aggregation] c_agg = move(make_aggregation("max"))
    cdef unique_ptr[scalar] c_reduce_result
    cdef Scalar sc = as_scalar(source_table._num_rows)
    cdef scalar* c_sc = sc.c_value.get()

    # Reduce the gather map to its maximum value without holding the GIL.
    with nogil:

        c_reduce_result = move(
            cpp_reduce(
                gather_map_view,
                c_agg,
                c_dtype
            )
        )

    py_reduce_result = Scalar.from_unique_ptr(move(c_reduce_result))

    # The largest index must be strictly less than the number of rows.
    if py_reduce_result.value >= source_table._num_rows:
        raise RuntimeError("Index out of bounds")

shwina · 2020-07-23T14:39:34Z

I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease.

I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side.

jrhemstad · 2020-07-23T14:46:35Z

> I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease.

Agreed. The performance difference is minimal enough to not be concerning to me.

> I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side.

I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

shwina · 2020-07-23T14:52:27Z

> I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

How silly of me :) Edited with numbers for doing just a max. The difference is even less concerning now (especially for larger problem sizes).

shwina · 2020-07-23T14:54:13Z

I'll put in a PR to do bounds checking in Cython for both scatter and gather. I'm happy to also throw in bounds checking removal in C++.

jrhemstad · 2020-07-23T14:55:06Z

> > I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

> How silly of me :) Edited with numbers for doing just a max. The difference is even less concerning now (especially for larger problem sizes).

That said, I think you actually need to do both a min and a max reduction to account for the "negative values wrap" logic. I.e., you need to check whether any value in the gather map is outside the bounds [-n, n), where n is the number of rows in the source table.

shwina · 2020-07-23T14:56:38Z

Are you also considering removing support for negative index values?

jrhemstad · 2020-07-23T14:57:23Z

> Are you also considering removing support for negative index values?

No. I'm just saying that if you want to keep the same bounds checking logic that exists in libcudf, you need to check for [-n, n).

shwina · 2020-07-23T14:59:32Z

Got it -- yup makes sense.

jrhemstad · 2020-07-23T15:06:19Z

We could even add a minmax reduction to do this in a single operation.
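A minimal sketch of that idea in plain Python, with a list standing in for the gather map. `minmax` and `gather_map_in_bounds` are hypothetical helpers, not existing libcudf or cuDF functions, and the half-open range (lower bound inclusive, row count exclusive) is an assumption based on how `begin` and `num_rows()` are passed to `bounds_checker` in the snippet quoted at the top of this issue.

```python
def minmax(values):
    """Return (min, max) of a non-empty iterable in a single pass."""
    it = iter(values)
    try:
        lo = hi = next(it)
    except StopIteration:
        raise ValueError("minmax of an empty sequence") from None
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi


def gather_map_in_bounds(gather_map, num_rows, allow_negative=True):
    """True if every index lies in [begin, num_rows).

    begin is -num_rows when negative indices are allowed to wrap, else 0.
    An empty map is trivially in bounds.
    """
    if len(gather_map) == 0:
        return True
    lo, hi = minmax(gather_map)
    begin = -num_rows if allow_negative else 0
    return begin <= lo and hi < num_rows
```

Only the two extreme values need to be compared against the bounds, so the check costs one pass over the map plus two scalar comparisons on the host.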

github-actions · 2021-03-14T19:12:54Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions · 2021-03-14T19:13:06Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vyasr · 2022-07-20T23:18:25Z

In the interest of making this issue actionable, I would like to collect all remaining instances of this pattern in libcudf so that we know what needs to be done to address this issue:

vyasr · 2022-07-20T23:19:30Z

@jrhemstad @davidwendt any other instances that you're aware of? Please feel free to add to the list above. In case we find some cases where this is happening without even an option to turn it off, we can always rip those out later. For now I just looked through public headers and didn't find any obvious examples other than these.

jrhemstad · 2022-07-21T00:15:32Z

I don't know of other instances. I didn't even know these existed :)

The other actionable item I'd suggest is having something about this guidance in the dev docs somewhere.

vyasr · 2022-07-21T20:38:59Z

That's a good call. I'll make a note to add that to our dev docs.

vyasr · 2022-10-17T22:40:08Z

We can close this once #11853 and #11938 are merged.

This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481. Partly addresses #5505, #1781. Resolves #4511.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Bradley Dice (https://github.com/bdice)
- David Wendt (https://github.com/davidwendt)

URL: #11853

This PR removes optional validation for some APIs. Performing these validations requires data introspection, which we do not want. This PR resolves #5505.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Matthew Roeschke (https://github.com/mroeschke)
- David Wendt (https://github.com/davidwendt)
- Nghia Truong (https://github.com/ttnghia)
- Jason Lowe (https://github.com/jlowe)

URL: #11938

jrhemstad added the feature request, proposal, and libcudf labels Jun 18, 2020

harrism added this to Needs prioritizing in Bug Squashing via automation Jul 20, 2020

shwina mentioned this issue Jul 22, 2020

[FEA] extract item from a list by index #5742

Closed

kkraus14 added the Python label Jul 22, 2020

davidwendt mentioned this issue Jul 23, 2020

[REVIEW] Add nvtext::detokenize API #5739

Merged

jrhemstad mentioned this issue Jul 23, 2020

[FEA] Add a minmax reduction #5751

Closed

shwina mentioned this issue Oct 9, 2020

[FEA] Remove bounds checking from cudf::gather #6478

Closed

shwina mentioned this issue Dec 2, 2020

[REVIEW] Remove bounds check for cudf::gather #6875

Merged

github-actions bot added the inactive-90d label Mar 14, 2021

github-actions bot added the inactive-30d label Mar 14, 2021

shwina mentioned this issue Mar 15, 2021

Add Python bindings for lists::extract_lists_element #7505

Merged

jrhemstad mentioned this issue Jun 3, 2021

Implement strings::repeat_strings #8423

Merged

davidwendt mentioned this issue Jun 21, 2021

[FEA] Bounds check for string processing APIs #8574

Closed

jrhemstad mentioned this issue Feb 2, 2022

Support segmented reductions and null mask reductions #9621

Merged

davidwendt mentioned this issue Aug 24, 2022

Add strings 'like' function #11558

Merged

davidwendt mentioned this issue Sep 28, 2022

[BUG] from_durations seem to be overflowing while converting to a string #11794

Closed

vyasr mentioned this issue Oct 3, 2022

Initial draft of policies and guidelines for libcudf usage. #11853

Merged

vyasr mentioned this issue Oct 17, 2022

Remove validation that requires introspection #11938

Merged

rapids-bot bot closed this as completed in #11938 Oct 20, 2022

Bug Squashing automation moved this from Needs prioritizing to Closed Oct 20, 2022

jrhemstad mentioned this issue Nov 7, 2022

[BUG] strings::concatenate can overflow and cause data corruption #12087

Closed
