
Revert sum/product aggregation to always produce int64_t type #14907

Merged — 1 commit merged into rapidsai:branch-24.02 on Jan 29, 2024

Conversation

@SurajAralihalli (Contributor) commented Jan 26, 2024

Description

This pull request reverts the changes to the sum/product aggregation target type made in PR #14679, ensuring it always produces int64_t. Those changes degraded performance when the aggregation column had an unsigned type. Additional details can be found in issue #14886.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@SurajAralihalli requested a review from a team as a code owner on January 26, 2024 23:31
@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label on Jan 26, 2024
@SurajAralihalli changed the base branch from branch-24.04 to branch-24.02 on January 26, 2024 23:32
@ttnghia (Contributor) commented Jan 26, 2024

I still don't understand what causes the performance issue here. Why can't we fix that issue instead of reverting the code like this?

@GregoryKimball (Contributor) commented Jan 26, 2024

Thank you @ttnghia for your message. I would also prefer to solve the root cause of the issue rather than revert the change. It appears that the performance degradation is happening in a thrust::for_each_n call in groupby.cu. I'm concerned that there isn't an obvious libcudf fix, and the root cause could be in CCCL. If the fix isn't a simple ~1-10 line change in libcudf, I think it's going to be too high risk for 24.02.

@GregoryKimball changed the title from "Revert sum/product aggregation to alway produce int64_t type" to "Revert sum/product aggregation to always produce int64_t type" on Jan 26, 2024
@davidwendt added the bug (Something isn't working), 3 - Ready for Review (Ready for review by team), and non-breaking (Non-breaking change) labels on Jan 27, 2024
@ttnghia (Contributor) left a comment

I'm fine with this going into 24.02, but we should investigate further afterwards. I suspect it could be something interfering with the device sum operator when it works on different operand types.

@karthikeyann (Contributor) commented Jan 29, 2024

@ttnghia The atomicAdd<T> in libcudf has no specialization for uint32_t and uint64_t, so it falls back to the generic CAS-while-loop atomic implementation, which slows the groupby sum operation on unsigned columns.
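For context, a generic CAS-based fallback looks roughly like the following minimal sketch (illustrative, not the exact cudf::detail implementation); the loop retries until no other thread has modified the value in between, which is far slower under contention than a single native hardware atomic:

```cpp
#include <cstdint>

// Minimal sketch of a generic CAS-loop atomicAdd fallback for 64-bit types.
__device__ uint64_t atomic_add_cas_fallback(uint64_t* address, uint64_t val)
{
  auto* address_as_ull = reinterpret_cast<unsigned long long int*>(address);
  unsigned long long int old = *address_as_ull;
  unsigned long long int assumed;
  do {
    assumed = old;
    // Retry until no other thread changed the value between our read and the CAS.
    old = atomicCAS(address_as_ull,
                    assumed,
                    assumed + static_cast<unsigned long long int>(val));
  } while (assumed != old);
  return static_cast<uint64_t>(old);
}
```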

@ttnghia (Contributor) commented Jan 29, 2024

> @ttnghia The atomicAdd<T> in libcudf has no specialization for uint32_t and uint64_t, so it falls back to the generic CAS-while-loop atomic implementation, which slows the groupby sum operation on unsigned columns.

Wait, from the CUDA documentation I see that unsigned int/long long are supported:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd

So there may be something wrong with the groupby code: it doesn't call the native CUDA atomicAdd but instead falls back to the generic CAS-while loop.
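For reference, the unsigned overloads listed in that documentation are:

```cpp
// Native atomicAdd overloads for unsigned types, per the CUDA C++
// Programming Guide linked above:
unsigned int atomicAdd(unsigned int* address, unsigned int val);
unsigned long long int atomicAdd(unsigned long long int* address,
                                 unsigned long long int val);
```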

@GregoryKimball (Contributor) commented Jan 29, 2024

Thank you @karthikeyann and @ttnghia for this investigation. We would love your help making the groupby code work correctly with uint64 in 24.04. If this PR is ready to go, would you please request an admin merge in 24.02?

@raydouglass merged commit 7cd3834 into rapidsai:branch-24.02 on Jan 29, 2024
68 of 69 checks passed
@bdice (Contributor) commented Jan 29, 2024

@karthikeyann's analysis is correct, from what I can tell. I think he was referring to device_atomics.cuh, which has limited type coverage:

```cpp
// specialized functions for operators
// `atomicAdd` supports int32, float, double (signed int64 is not supported.)
// `atomicMin`, `atomicMax` support int32_t, int64_t
```

That file only implements DeviceSum for float, double, int32_t, and int64_t. It seems reasonable to add uint32_t and uint64_t implementations, which could call the native CUDA atomicAdd for their respective types. We would want to add static assertions to ensure that type aliases like unsigned long long int are equivalent to uint64_t.
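A minimal sketch of such a pass-through specialization, with a hypothetical name (not the actual device_atomics.cuh code):

```cpp
#include <cstdint>

// Guard the assumption that unsigned long long int and uint64_t have the
// same size, so the reinterpret_cast below is sound.
static_assert(sizeof(unsigned long long int) == sizeof(uint64_t),
              "unsigned long long int must be 64 bits");

__device__ inline uint64_t atomic_add_passthrough(uint64_t* address, uint64_t val)
{
  // Forward directly to the native CUDA overload instead of a CAS loop.
  return static_cast<uint64_t>(
    ::atomicAdd(reinterpret_cast<unsigned long long int*>(address),
                static_cast<unsigned long long int>(val)));
}
```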

@davidwendt (Contributor) commented

Maybe cuda::atomic_ref could be used instead of the device_atomics.cuh functions, though I'm not sure if it is supported for unsigned types.
Reference: #13583
cc @PointKernel
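For illustration, a minimal sketch of what a 64-bit add via cuda::atomic_ref from libcu++ could look like (assuming uint64_t turns out to be supported; the function name is illustrative):

```cpp
#include <cuda/atomic>
#include <cstdint>

__device__ void add_with_atomic_ref(uint64_t* address, uint64_t val)
{
  // Wrap the existing object in an atomic_ref with device-wide scope,
  // then perform a relaxed atomic add on it.
  cuda::atomic_ref<uint64_t, cuda::thread_scope_device> ref{*address};
  ref.fetch_add(val, cuda::std::memory_order_relaxed);
}
```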

@ttnghia (Contributor) commented Jan 29, 2024

From what I understand:

  • device_atomics.cuh exists to specialize for types that are not natively supported by the CUDA atomicXXX functions. As such, when atomicAdd is called for uint64_t, the compiler should resolve to the native atomicAdd function, not the cudf::detail::atomicAdd function.
  • For uint64_t, if cudf::detail::atomicAdd is called, which in turn calls the generic CAS-while loop, then something is wrong in the call chain.
  • The caller is in the cudf::detail:: namespace and just calls atomicAdd unqualified, so it is difficult to tell which atomicAdd is being executed (see the host-side sketch after this list).
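As a host-side sketch of the lookup hazard described in the last bullet (stand-in names, not the actual cudf code): an unqualified call inside a detail namespace finds the namespace-local overload first and never even considers the global one.

```cpp
#include <cstdio>

// Stand-in for the native CUDA overload in the global namespace.
unsigned long long int atomicAdd(unsigned long long int* address,
                                 unsigned long long int val)
{
  std::printf("native overload\n");
  unsigned long long int old = *address;
  *address += val;
  return old;
}

namespace lib { namespace detail {

// Stand-in for the generic cudf::detail::atomicAdd fallback.
template <typename T>
T atomicAdd(T* address, T val)
{
  std::printf("generic fallback\n");
  T old = *address;
  *address += val;
  return old;
}

void caller(unsigned long long int* p)
{
  // Unqualified lookup finds lib::detail::atomicAdd first and stops,
  // so the global overload above is never considered.
  atomicAdd(p, 1ull);  // prints "generic fallback"
}

}}  // namespace lib::detail

int main()
{
  unsigned long long int x = 0;
  lib::detail::caller(&x);
}
```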

@bdice (Contributor) commented Jan 29, 2024

> device_atomics.cuh exists to specialize for types that are not natively supported by the CUDA atomicXXX functions. As such, when atomicAdd is called for uint64_t, the compiler should resolve to the native atomicAdd function, not the cudf::detail::atomicAdd function.

The overloads for float and double just pass through. We should do the same here for uint32_t and uint64_t, if my analysis is correct. We should always use cudf's internal implementation instead of calling the CUDA atomicAdd directly.

> For uint64_t, if cudf::detail::atomicAdd is called, which in turn calls the generic CAS-while loop, then something is wrong in the call chain.

I'm tracing the call chain myself. I think it goes from aggregation::SUM, which calls atomicAdd and falls back to the generic case, which calls the CAS loop.

> The caller is in the cudf::detail:: namespace and just calls atomicAdd unqualified, so it is difficult to tell which atomicAdd is being executed.

I'd be okay with adding an explicit namespace where we intend this to be called. I agree the name conflict is not ideal.

@PointKernel (Member) commented

> I'm not sure if it is supported for unsigned types.

cuda::atomic_ref works fine for 4-byte and 8-byte types but doesn't support bool or 2-byte types like int16_t. We can get rid of device_atomics.cuh once NVIDIA/cccl#1024 is resolved.

> I'd be okay with adding an explicit namespace where we intend this to be called. I agree the name conflict is not ideal.

👍 An explicit namespace or snake_case naming is probably the best temporary workaround before the CCCL fix (see the sketch below).
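A sketch of what the snake_case workaround could look like (hypothetical names, not the real cudf API):

```cpp
namespace cudf {
namespace detail {

// Hypothetical snake_case rename: atomic_add cannot collide with the
// global ::atomicAdd, so every call site is unambiguous by construction.
template <typename T>
__device__ T atomic_add(T* address, T val);

}  // namespace detail
}  // namespace cudf

// Call sites then spell out exactly which implementation they want:
//   cudf::detail::atomic_add(ptr, value);  // cudf wrapper
//   ::atomicAdd(ptr, value);               // native CUDA overload
```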

@GregoryKimball (Contributor) commented Jan 31, 2024

OK, thank you everyone for this discussion. If I understand the consensus solution correctly, we should:

  • add specializations for uint32_t and uint64_t as per the comment above
  • update the namespace for atomicAdd in aggregation.cuh to specify the cudf::detail:: namespace
  • extend the namespace updates to atomicMul, atomicMin, and atomicMax in aggregation.cuh

rapids-bot pushed a commit that referenced this pull request on Mar 12, 2024:
…operators to detail namespace. (#14962)

This PR does a thorough refactoring of `device_atomics.cuh`.

- I moved all atomic-related functions to `cudf::detail::` (making this an API-breaking change, but most likely a low-impact break)
- I added all missing operators for natively supported types to `atomicAdd`, `atomicMin`, `atomicMax`, etc. as discussed in #10149 and #14907.
  - This should prevent fallback to the `atomicCAS` path for types that are natively supported for those atomic operators, which we suspect as the root cause of the performance regression in #14886.
- I kept `atomicAdd` rather than `cudf::detail::atomic_add` in locations where a native CUDA overload exists, and the same for min/max/CAS operations. Aggregations are the only place where we use the special overloads. We were previously calling the native CUDA function rather than our special overloads in many cases, so I retained the previous behavior. This avoids including the additional headers that implement an unnecessary level of wrapping for natively supported overloads.
- I enabled native 2-byte CAS operations (on `unsigned short int`) that eliminate the do-while loop and extra alignment-checking logic
  - The CUDA docs don't state this, but some forum posts claim this is only supported by compute capability 7.0+. We now have 7.0 as a lower bound for RAPIDS so I'm not concerned by this as long as builds/tests pass.
- I improved/cleaned the documentation and moved around some code so that the operators were in a logical order.
- I assessed the existing tests and it looks like all the types are being covered. I'm not sure if there is a good way to enforce that certain types (like `uint64_t`) are passing through native `atomicAdd` calls.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Suraj Aralihalli (https://github.com/SurajAralihalli)

URL: #14962
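For reference, the native 2-byte CAS mentioned in #14962 can be wrapped along these lines (a sketch; the function name is illustrative, and the native unsigned short overload requires compute capability 7.0+):

```cpp
#include <cstdint>

// Sketch: a 16-bit atomicCAS using the native unsigned short int overload,
// avoiding the old approach of emulating it through an aligned 32-bit CAS
// with shift/mask and alignment-checking logic.
__device__ int16_t atomic_cas_16(int16_t* address, int16_t compare, int16_t val)
{
  auto* addr = reinterpret_cast<unsigned short int*>(address);
  return static_cast<int16_t>(
    atomicCAS(addr,
              static_cast<unsigned short int>(compare),
              static_cast<unsigned short int>(val)));
}
```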