Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalizing offsets iterator #14234

Merged
merged 47 commits into from
Nov 13, 2023

Conversation

davidwendt
Copy link
Contributor

Description

Creates a normalizing offsets iterator that returns an int64 value given either a int32 or int64 column data.
Depends on #14206

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 28, 2023
@davidwendt davidwendt self-assigned this Sep 28, 2023
@DaceT
Copy link

DaceT commented Oct 11, 2023

benchmark bot, please test this PR

rapids-bot bot pushed a commit that referenced this pull request Oct 17, 2023
Enables indexalator to be instantiated from device code.
Also add gtests for the output indexalator.
This change helps enable for the offset-normalizing-iterator #14234

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14206
@davidwendt
Copy link
Contributor Author

This definitely increases compile time but I was able to minimize the impact somewhat. Here are the top 10 offenders:

File Compile time Difference
src/reductions/scan/scan_inclusive.cu.o 30:29 min 6:14 min
src/join/mixed_join_size_kernel_nulls.cu.o 28:15 min 15.970 s
src/join/mixed_join_kernel_nulls.cu.o 25:10 min 9.061 s
src/groupby/sort/group_argmin.cu.o 21:05 min 112.269 s
src/groupby/sort/group_argmax.cu.o 21:00 min 106.078 s
tests/table/row_operator_tests_utilities.cu.o 20:53 min 91.883 s
src/join/mixed_join_size_kernel.cu.o 20:31 min 40.165 s
src/sort/sort_column.cu.o 20:25 min -10.141 s
src/sort/stable_sort_column.cu.o 20:19 min -17.539 s
src/groupby/hash/groupby.cu.o 20:13 min 65.311 s

Negative numbers mean those 2 files compiled faster. The scan_inclusive jumped from 3rd place to 1st place.
I will look at improving some of these (by splitting out specializations) in a separate PR.

@davidwendt davidwendt marked this pull request as ready for review November 2, 2023 23:35
cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
*/
struct normalize_type {
template <typename T, std::enable_if_t<cudf::is_index_type<T>()>* = nullptr>
__device__ cudf::size_type operator()(void const* tp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not T const* tp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that is not the type that is being passed to the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to understand the impact of type_dispatcher on the reworked design, it seems to me like we are still using it but there's no cascading calls to type_dispatcher and it's only called exactly once. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We only call the type-dispatcher in the factory now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and once when setting up the non-templated class input_indexelator::normalize_input. If you use a normal if-else dispatch there instead of type_dispatcher, are you able to see any benefits? Especially in src/reductions/scan/scan_inclusive.cu.o where there's a 6 minute compile-time increase

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you are correct, the base class's type-dispatcher is still called inside every element() call.
I think that is worth considering here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to study the assembly here. Is the type_dispatcher expanded only once when the class is compiled (so when the header is included) or is it expanded every time element() is called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a separate ctor that just passes the width instead of type-dispatching to resolve it.
This did improved the compile time: https://downloads.rapids.ai/ci/cudf/pull-request/14234/ab0edb7/cuda12_x86_64.ninja_log.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we halved the compile time increment in scan_inclusive? That is good!

cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/normalizing_iterator.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/offsets_iterator.cuh Show resolved Hide resolved
cpp/include/cudf/detail/offsets_iterator.cuh Show resolved Hide resolved
cpp/include/cudf/detail/offsets_iterator.cuh Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some doc nits. LGTM

cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/indexalator.cuh Outdated Show resolved Hide resolved
rapids-bot bot pushed a commit that referenced this pull request Nov 9, 2023
Splits out the `strings` and `struct` specializations in `scan_inclusive.cu` into separate source files to improve compile time.
Each specialization is unique code with limited aggregation types.
No functional changes. Just code moved around.
Found while working on #14234

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14358
@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 04d13d8 into rapidsai:branch-23.12 Nov 13, 2023
61 checks passed
@davidwendt davidwendt deleted the offsets-iterator branch November 13, 2023 14:05
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants