Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] libcudf must customize the Thrust/CUB namespace #11368

Closed
jrhemstad opened this issue Jul 27, 2022 · 22 comments · Fixed by #11665
Closed

[BUG] libcudf must customize the Thrust/CUB namespace #11368

jrhemstad opened this issue Jul 27, 2022 · 22 comments · Fixed by #11665
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

jrhemstad commented Jul 27, 2022

Describe the bug

For header-only libraries like Thrust/CUB, it can be problematic when an application inadvertently ends up with multiple or mismatched versions of the headers.

Thrust/CUB provide a way to universally customize the namespace such that a symbol like thrust::reduce will instead be thrust::CUSTOM::reduce.

This prevents inadvertent symbol collisions and version mismatches by forcing libcudf to use Thrust/CUB symbols from within its custom namespace.

Solution

libcudf should set THRUST_CUB_WRAPPED_NAMESPACE=cudf as part of the cmake configuration step for Thrust/CUB.

@jrhemstad jrhemstad added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. labels Jul 27, 2022
@bdice
Copy link
Contributor

bdice commented Jul 27, 2022

Yes, we should do this. It was on my list of things to investigate after I upgrade RAPIDS to Thrust 1.17. (Also I think it’s an outer namespace like CUSTOM::thrust::reduce.)

@bdice bdice self-assigned this Jul 29, 2022
@robertmaynard
Copy link
Contributor

robertmaynard commented Aug 2, 2022

I agree we need to find a way to ensure that two distinct libraries that have no interaction can safely use Thrust without
accidently causing issues due to ABI changes.

The problem I see with proposed approach is that it makes using Thrust types harder when two projects call into each other and have Thrust types in the public API like all of RAPIDS.

For example here is a super simplified example from cudf:

 #include <rmm/exec_policy.hpp>

 #include <thrust/optional.h>
 using namespace cudf;

 namespace cudf {
 namespace detail {
  ...
 }
 }

This code fails to compile since the thrust types used in <rmm/exec_policy.hpp> don't exist ( they are cudf::thrust ) and therefore the code fails. Likewise moving the using namespace above the <rmm/exec_policy.hpp> doesn't work as the namespace hasn't been created yet.

This means that at a minimum the namespace that we import thrust into has to be consistent across all of RAPIDS. To make that robust our logic should look like:

 #include <thrust/optional.h>
 #ifdef THRUST_WRAPPED_NAMESPACE
   using namespace THRUST_WRAPPED_NAMESPACE;
 #endif

 namespace cudf {
  ...
 }

Now that we have solved internal RAPIDS dependencies we should explore the impacts for callers of libcudf

  • I believe that we should treat this as a full ABI break. I see that in a couple of locations use thrust types as data members or parameters which might cause a break.
  • While the above ifdef guards will ensure that cudf headers always match the thrust namespace convention of the caller, I believe that to use to libcudf API that has Thrust types exposed they will also need to bring Thrust in as THRUST_WRAPPED_NAMESPACE::thrust and add the above using logic.
  • Everyones needs to be aware that this won't stop runtime ODR / ABI breaks due to compiling with different Thrust versions. It would be entirely possible to build libcudf with Thrust 1.17 and build a direct libcudf consumer with Thrust 1.15 as they both support the THRUST_WRAPPED_NAMESPACE syntax.

This isn't to say I oppose this work, we just need to make sure we understand the impact. To help with this effort here are my current draft PR's for rmm and cudf:

@jrhemstad
Copy link
Contributor Author

have Thrust types in the public API like all of RAPIDS.

We should just get away from doing this. I don't know of cases where we use anything other than thrust::optional in a public API, and that can be replaced with std::optional now that we use C++17.

Everyones needs to be aware that this won't stop runtime ODR / ABI breaks due to compiling with different Thrust versions.

I don't think we're ever concerned with ABI breaks in RAPIDS libraries.

What kind of ODR issues do you foresee? If two versions of Thrust are present, each in different namespaces, then they are different symbols so far as the program is concerned and therefore distinct definitions of the symbol won't be a problem.

@bdice
Copy link
Contributor

bdice commented Aug 3, 2022

If I understand correctly, every RAPIDS library would have a different definition of rmm::device_vector because it’s reused straight from Thrust. Would those types be interchangeable among RAPIDS libraries using rmm headers if the Thrust namespaces (and thus symbols) are different among each RAPIDS library? (Apologies if this explanation is too short, I can provide a longer form explanation of what I mean later if this is unclear)

@robertmaynard
Copy link
Contributor

We should just get away from doing this.

I agree. I will look at changing the offending code to not expose thrust

I don't think we're ever concerned with ABI breaks in RAPIDS libraries.

What kind of ODR issues do you foresee? If two versions of Thrust are present, each in different namespaces, then they are different symbols so far as the program is concerned and therefore distinct definitions of the symbol won't be a problem.

This doesn't stop two different versions of Thrust being compiled using the same namespace. So cudf and a consumer could use differing versions of thrust and place it in the same namespace. This might happen due to a leakage of the THRUST_WRAPPED_NAMESPACE value to the consumer.

Would those types be interchangeable among RAPIDS libraries using rmm headers if the Thrust namespaces (and thus symbols) are different among each RAPIDS library?

The won't be interchangable. We can see that when looking at the current cudf PR. The STRINGS_TEST fail to link due to the fact that it thinks cudf::make_strings_column(cudf::device_span<thrust::pair ...) should exist but the signature is actually cudf::make_strings_column(cudf::device_span<rapids::thrust::pair ...).

One of the reasons I am proposing a common prefix is that interop between libraries ( cudf / raft / cuml / cugraph ) is possible when they are used within the same TU. If each project goes with a unique thrust namespace, no interop that includes thrust types will be possible,

@bdice
Copy link
Contributor

bdice commented Aug 3, 2022

Excellent, the common namespace prefix seems to be essential here. However, it makes me wonder if this truly solves any problems (compared with no namespace) because some rmm types (for example) will not be compatible with non-RAPIDS libraries. It’s a slight improvement but also partially shifts the problem to require that non-RAPIDS libraries match the Thrust namespace prefix and version OR avoid using such Thrust-derived types from RAPIDS libraries. I suppose we want to remove Thrust APIs from our public interfaces so perhaps this is solved in that way.

@robertmaynard
Copy link
Contributor

However, it makes me wonder if this truly solves any problems (compared with no namespace) because some rmm types (for example) will not be compatible with non-RAPIDS libraries.

It does solve a real problem which is that end user programs bring RAPIDS projects together with other C++ libraries that use thrust. Since these projects are built ignorant of each other they use different thrust versions and cause weird runtime failures.

To be clear the proposal for rmm won't impact the ability for non RAPIDS projects to use it. The rmm headers will adapt based on the value of THRUST_WRAPPED_NAMESPACE. What they lose is the ability to use something like raft + rmm + some thirdparty library ( non-header only ) that uses thrust in the same file ( TU ).

@alliepiper
Copy link
Contributor

it thinks cudf::make_strings_column(cudf::device_span<thrust::pair ...) should exist but the signature is actually cudf::make_strings_column(cudf::device_span<rapids::thrust::pair ...).

The THRUST_NS_QUALIFIER macro may simplify things here:

cudf::make_strings_column(cudf::device_span<THRUST_NS_QUALIFIER::pair ...)

It expands to ::thrust by default, but rapids::thrust when THRUST_CUB_WRAPPED_NAMESPACE=rapids.

I suppose we want to remove Thrust APIs from our public interfaces so perhaps this is solved in that way.

I strongly support this. The namespace macros are only robust when Thrust isn't exposed publicly.

@bdice
Copy link
Contributor

bdice commented Aug 3, 2022

(Thanks for the clarification @robertmaynard / @allisonvacanti! My previous message may have sounded negative on this idea - poor writing on my part.)

@jrhemstad
Copy link
Contributor Author

fwiw, the original impetus for this issue was due to mixing CUB versions in a C++ code using both libcudf and CUB headers.

We definitely don't have any CUB types as part of any public interface, so if it's easier to just custom scope CUB and not Thrust in the short term then I think that would be valuable to do.

rapids-bot bot pushed a commit that referenced this issue Aug 4, 2022
As noted in #11368 we should strive towards not having thrust types in our 'public' API. 
This removes occurences of using `thrust::optional` from cudf/io host classes in preference of `std::optional`.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)

URL: #11455
@alliepiper
Copy link
Contributor

if it's easier to just custom scope CUB and not Thrust in the short term then I think that would be valuable to do.

This can be done in the current macro system by just changing CUB_WRAPPED_NAMESPACE instead of THRUST_CUB_WRAPPED_NAMESPACE. If the conflicts are just in CUB and CUB is not exposed in any cudf headers, this should work as a temporary fix.

I'm not sure how realistic that "CUB is not exposed" bit is, though. Thrust's headers indirectly pull in CUB headers often, so there may still be issues.

@robertmaynard
Copy link
Contributor

I am working on a pr draft of cudf using cub in a namespace.

@github-actions
Copy link

github-actions bot commented Sep 3, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@robertmaynard
Copy link
Contributor

With a solution to NVIDIA/cub#545 being merged, I think the best option for cudf is to patch our version of thrust.

@jrhemstad
Copy link
Contributor Author

With a solution to NVIDIA/cub#545 being merged, I think the best option for cudf is to patch our version of thrust.

Can't we just use a version of Thrust that makes the same changes? NVIDIA/thrust#1788

@robertmaynard
Copy link
Contributor

Can't we just use a version of Thrust that makes the same changes? NVIDIA/thrust#1788

No released version of Thrust currently exists have the required changes. Once Thrust 2.1 has been released ( or the patches are back ported ) we can stop patching our Thrust.

@jrhemstad
Copy link
Contributor Author

No released version of Thrust currently exists have the required changes. Once Thrust 2.1 has been released ( or the patches are back ported ) we can stop patching our Thrust.

If we're going to go through the work of making the changes locally for a patch, why not just contribute the changes to Thrust and then update our version of Thrust?

@robertmaynard
Copy link
Contributor

Sounds reasonable we can use the 1.17.X branch once the following PR's are merged

@gevtushenko
Copy link

@robertmaynard, @jrhemstad, mentioned PRs are merged.

@bdice
Copy link
Contributor

bdice commented Sep 12, 2022

@senior-zero Will a tag be made for these changes? Our CPM configuration in cuDF relies on a tag right now. I'd use branch 1.17.X but I would prefer for the version to be "fixed" like a tag or commit hash, and I don't think commit hashes are enough for us to shallow clone the repo (which saves network time and disk space).

@gevtushenko
Copy link

@bdice we'll create a tag soon.

@gevtushenko
Copy link

@bdice @allisonvacanti has just created 1.17.2 tag.

rapids-bot bot pushed a commit that referenced this issue Sep 19, 2022
Updates libcudf to use Thrust 1.17.2. This version contains a newer cub util_namespace.h with the upstream fix needed to correct ODR issues inside thrust-cub.
Fixes #11368.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11665
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants