Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround groupby aggregate thrust::copy_if overflow #12079

Merged
merged 7 commits into from Nov 10, 2022

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Nov 4, 2022

Description

Workaround for limitation in thrust::copy_if which fails if the input-iterator spans more than int-max.
The thrust::copy_if hardcodes the iterator distance type to be an int
https://github.com/NVIDIA/thrust/blob/dbd144ed543b60c4ff9d456edd19869e82fe8873/thrust/system/cuda/detail/copy_if.h#L699-L708

Found existing thrust issue: NVIDIA/cccl#747

This calls the copy_if in chunks if the iterator can span greater than int-max.

Closes #12058

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Nov 4, 2022
@davidwendt davidwendt added this to PR-WIP in v22.12 Release via automation Nov 4, 2022
@davidwendt davidwendt self-assigned this Nov 4, 2022
@codecov
Copy link

codecov bot commented Nov 5, 2022

Codecov Report

Base: 87.47% // Head: 88.10% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (0602fad) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12079      +/-   ##
================================================
+ Coverage         87.47%   88.10%   +0.62%     
================================================
  Files               133      135       +2     
  Lines             21826    22072     +246     
================================================
+ Hits              19093    19447     +354     
+ Misses             2733     2625     -108     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <0.00%> (-0.39%) ⬇️
... and 42 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, I will be running this today (building things) but I do not think my run is a blocker. I'll just report here or on the original issue.

@abellina
Copy link
Contributor

abellina commented Nov 7, 2022

Update: I ran with this patch on an 80GB GPU, and while on the aggregate I see a cudaErrorIllegalAddress. I am in the precess of running under compute-sanitizer.

@abellina
Copy link
Contributor

abellina commented Nov 7, 2022

OK, @davidwendt by itself this seems to be working fine in the 80GB gpu. I believe I am hitting a second issue that is not related to this one for the join that comes later. I see issues with gather that will have to be investigated separately. +1 still

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 7, 2022
@davidwendt davidwendt moved this from PR-WIP to PR-Needs review in v22.12 Release Nov 7, 2022
@davidwendt davidwendt marked this pull request as ready for review November 7, 2022 22:57
@davidwendt davidwendt requested a review from a team as a code owner November 7, 2022 22:57
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this bug manifest in other places thrust::copy_if is called?

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved
@davidwendt
Copy link
Contributor Author

Could this bug manifest in other places thrust::copy_if is called?

Yes, I found a few other places this could be a problem. I will open PRs for those as well.

@abellina
Copy link
Contributor

abellina commented Nov 8, 2022

Yes, I found a few other places this could be a problem. I will open PRs for those as well.

Does it make sense for someone to start working on the underlying fix as well? (as you linked NVIDIA/cccl#747)

@PointKernel
Copy link
Member

Do we have a test exercising this overflow issue?
This algorithm will be refactored by using cuco hash maps and thrust::copy_if may not be needed at all. I just want to make sure that we have tests covering this thus we won't reintroduce the same issue into the code during refactoring.

@davidwendt
Copy link
Contributor Author

Do we have a test exercising this overflow issue? This algorithm will be refactored by using cuco hash maps and thrust::copy_if may not be needed at all. I just want to make sure that we have tests covering this thus we won't reintroduce the same issue into the code during refactoring.

A test would not be practical because it would require too much memory for the CI machines. Even with the fix, this runs out of memory on my 48GB GPU. This is really a thrust bug so I feel like the test should go there.

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved
cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved
v22.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Nov 9, 2022
@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4497ed6 into rapidsai:branch-22.12 Nov 10, 2022
v22.12 Release automation moved this from PR-Reviewer approved to Done Nov 10, 2022
@davidwendt davidwendt deleted the bug-pop-keys-overflow branch November 10, 2022 10:38
rapids-bot bot pushed a commit that referenced this pull request Nov 28, 2022
Workaround in nvtext's wordpiece-tokenizer due to limitation in `thrust::copy_if` which fails if the input-iterator spans more than int-max.

Found existing thrust issue: https://github.com/NVIDIA/thrust/issues/1302
This calls the `thrust::copy_if` in chunks if the iterator can span greater than int-max.

Found while working on #12079

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12168
rapids-bot bot pushed a commit that referenced this pull request Dec 6, 2022
Workaround in json's get_tree_representation due to limitation in `thrust::copy_if` which fails if the input-iterator spans more than int-max.

Found existing thrust issue: https://github.com/NVIDIA/thrust/issues/1302
This calls the thrust::copy_if in chunks if the iterator can span greater than int-max.

Found while working on #12079

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Elias Stehle (https://github.com/elstehle)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #12190
rapids-bot bot pushed a commit that referenced this pull request Jan 4, 2023
This extracts `thrust_copy_if` out of `json_tree.cu` and puts it into `cudf/detail/utilities/algorithm.cuh` with a new name `cudf::detail::copy_if_safe`. Such utility is useful not just for use in `json_tree.cu` but potentially in many other places. The next immediate use case will be to implement `from_json` in NVIDIA/spark-rapids-jni#844.

This also changes the work in #12079 to adopt the new function `cudf::detail::copy_if_safe`.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: #12455
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[BUG] size_type overflow in cudf::groupby::detail::hash::extract_populated_keys
6 participants