Workaround groupby aggregate thrust::copy_if overflow #12079

davidwendt · 2022-11-04T22:36:11Z

Description

Workaround for limitation in thrust::copy_if which fails if the input-iterator spans more than int-max.
The thrust::copy_if hardcodes the iterator distance type to be an int
https://github.com/NVIDIA/thrust/blob/dbd144ed543b60c4ff9d456edd19869e82fe8873/thrust/system/cuda/detail/copy_if.h#L699-L708

Found existing thrust issue: NVIDIA/cccl#747

This calls the copy_if in chunks if the iterator can span greater than int-max.

Closes #12058

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/src/groupby/hash/groupby.cu

codecov · 2022-11-05T01:02:44Z

Codecov Report

Base: 87.47% // Head: 88.10% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (0602fad) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #12079      +/-   ##
================================================
+ Coverage         87.47%   88.10%   +0.62%     
================================================
  Files               133      135       +2     
  Lines             21826    22072     +246     
================================================
+ Hits              19093    19447     +354     
+ Misses             2733     2625     -108

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/interval.py	`85.45% <0.00%> (-9.10%)`	⬇️
python/cudf/cudf/io/text.py	`91.66% <0.00%> (-8.34%)`	⬇️
python/cudf/cudf/core/_base_index.py	`81.28% <0.00%> (-4.27%)`	⬇️
python/cudf/cudf/io/json.py	`92.06% <0.00%> (-2.68%)`	⬇️
python/cudf/cudf/utils/utils.py	`89.91% <0.00%> (-0.69%)`	⬇️
python/cudf/cudf/core/column/timedelta.py	`90.17% <0.00%> (-0.58%)`	⬇️
python/cudf/cudf/core/column/datetime.py	`89.21% <0.00%> (-0.51%)`	⬇️
python/cudf/cudf/core/column/column.py	`87.96% <0.00%> (-0.46%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`73.72% <0.00%> (-0.41%)`	⬇️
python/cudf/cudf/io/parquet.py	`90.45% <0.00%> (-0.39%)`	⬇️
... and 42 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

cpp/src/groupby/hash/groupby.cu

abellina

This LGTM, I will be running this today (building things) but I do not think my run is a blocker. I'll just report here or on the original issue.

abellina · 2022-11-07T19:43:09Z

Update: I ran with this patch on an 80GB GPU, and while on the aggregate I see a cudaErrorIllegalAddress. I am in the precess of running under compute-sanitizer.

abellina · 2022-11-07T21:25:09Z

OK, @davidwendt by itself this seems to be working fine in the 80GB gpu. I believe I am hitting a second issue that is not related to this one for the join that comes later. I see issues with gather that will have to be investigated separately. +1 still

bdice

Could this bug manifest in other places thrust::copy_if is called?

cpp/src/groupby/hash/groupby.cu

davidwendt · 2022-11-08T14:12:45Z

Could this bug manifest in other places thrust::copy_if is called?

Yes, I found a few other places this could be a problem. I will open PRs for those as well.

abellina · 2022-11-08T16:34:37Z

Yes, I found a few other places this could be a problem. I will open PRs for those as well.

Does it make sense for someone to start working on the underlying fix as well? (as you linked NVIDIA/cccl#747)

PointKernel · 2022-11-08T22:19:27Z

Do we have a test exercising this overflow issue?
This algorithm will be refactored by using cuco hash maps and thrust::copy_if may not be needed at all. I just want to make sure that we have tests covering this thus we won't reintroduce the same issue into the code during refactoring.

davidwendt · 2022-11-08T22:28:10Z

Do we have a test exercising this overflow issue? This algorithm will be refactored by using cuco hash maps and thrust::copy_if may not be needed at all. I just want to make sure that we have tests covering this thus we won't reintroduce the same issue into the code during refactoring.

A test would not be practical because it would require too much memory for the CI machines. Even with the fix, this runs out of memory on my 48GB GPU. This is really a thrust bug so I feel like the test should go there.

cpp/src/groupby/hash/groupby.cu

davidwendt · 2022-11-10T10:38:02Z

@gpucibot merge

Workaround in nvtext's wordpiece-tokenizer due to limitation in `thrust::copy_if` which fails if the input-iterator spans more than int-max. Found existing thrust issue: https://github.com/NVIDIA/thrust/issues/1302 This calls the `thrust::copy_if` in chunks if the iterator can span greater than int-max. Found while working on #12079 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vyas Ramasubramani (https://github.com/vyasr) URL: #12168

Workaround in json's get_tree_representation due to limitation in `thrust::copy_if` which fails if the input-iterator spans more than int-max. Found existing thrust issue: https://github.com/NVIDIA/thrust/issues/1302 This calls the thrust::copy_if in chunks if the iterator can span greater than int-max. Found while working on #12079 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Elias Stehle (https://github.com/elstehle) - Mike Wilson (https://github.com/hyperbolic2346) URL: #12190

This extracts `thrust_copy_if` out of `json_tree.cu` and puts it into `cudf/detail/utilities/algorithm.cuh` with a new name `cudf::detail::copy_if_safe`. Such utility is useful not just for use in `json_tree.cu` but potentially in many other places. The next immediate use case will be to implement `from_json` in NVIDIA/spark-rapids-jni#844. This also changes the work in #12079 to adopt the new function `cudf::detail::copy_if_safe`. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Yunsong Wang (https://github.com/PointKernel) - David Wendt (https://github.com/davidwendt) URL: #12455

Workaround groupby aggregate thrust::copy_if overflow

bb36527

davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Nov 4, 2022

davidwendt added this to PR-WIP in v22.12 Release via automation Nov 4, 2022

davidwendt self-assigned this Nov 4, 2022

abellina reviewed Nov 4, 2022

View reviewed changes

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved

ttnghia reviewed Nov 6, 2022

View reviewed changes

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved

ttnghia reviewed Nov 6, 2022

View reviewed changes

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved

fix key_itr/key_end check

a8b635d

abellina approved these changes Nov 7, 2022

View reviewed changes

davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 7, 2022

davidwendt moved this from PR-WIP to PR-Needs review in v22.12 Release Nov 7, 2022

davidwendt marked this pull request as ready for review November 7, 2022 22:57

davidwendt requested a review from a team as a code owner November 7, 2022 22:57

davidwendt requested review from vyasr and bdice November 7, 2022 22:57

Merge branch 'branch-22.12' into bug-pop-keys-overflow

5ac6f27

bdice reviewed Nov 8, 2022

View reviewed changes

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved

Merge branch 'branch-22.12' into bug-pop-keys-overflow

9be531e

robertmaynard reviewed Nov 9, 2022

View reviewed changes

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved

cpp/src/groupby/hash/groupby.cu Outdated Show resolved Hide resolved

davidwendt added 2 commits November 9, 2022 09:07

use limits<int>

65e0984

Merge branch 'branch-22.12' into bug-pop-keys-overflow

491a9c6

davidwendt requested a review from robertmaynard November 9, 2022 14:11

v22.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Nov 9, 2022

robertmaynard approved these changes Nov 9, 2022

View reviewed changes

This was referenced Nov 9, 2022

[BUG] query95 @ 30TB negative allocation from BaseHashJoinIterator.countGroups with default 200 partitions NVIDIA/spark-rapids#6983

Closed

[BUG] 30TB query95 fails on the join with illegal memory access with 200 partitions NVIDIA/spark-rapids#7036

Closed

add const to some vars

0602fad

ttnghia approved these changes Nov 10, 2022

View reviewed changes

rapids-bot bot merged commit 4497ed6 into rapidsai:branch-22.12 Nov 10, 2022

v22.12 Release automation moved this from PR-Reviewer approved to Done Nov 10, 2022

davidwendt deleted the bug-pop-keys-overflow branch November 10, 2022 10:38

This was referenced Nov 16, 2022

Workaround thrust-copy-if limit in wordpiece-tokenizer #12168

Merged

Workaround thrust-copy-if limit in json get_tree_representation #12190

Merged

ttnghia mentioned this pull request Dec 30, 2022

Refactor thrust_copy_if into cudf::detail::copy_if_safe #12455

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workaround groupby aggregate thrust::copy_if overflow #12079

Workaround groupby aggregate thrust::copy_if overflow #12079

davidwendt commented Nov 4, 2022 •

edited

codecov bot commented Nov 5, 2022 •

edited

abellina left a comment •

edited

abellina commented Nov 7, 2022

abellina commented Nov 7, 2022

bdice left a comment

davidwendt commented Nov 8, 2022

abellina commented Nov 8, 2022

PointKernel commented Nov 8, 2022

davidwendt commented Nov 8, 2022

davidwendt commented Nov 10, 2022

Workaround groupby aggregate thrust::copy_if overflow #12079

Workaround groupby aggregate thrust::copy_if overflow #12079

Conversation

davidwendt commented Nov 4, 2022 • edited

Description

Checklist

codecov bot commented Nov 5, 2022 • edited

Codecov Report

abellina left a comment • edited

Choose a reason for hiding this comment

abellina commented Nov 7, 2022

abellina commented Nov 7, 2022

bdice left a comment

Choose a reason for hiding this comment

davidwendt commented Nov 8, 2022

abellina commented Nov 8, 2022

PointKernel commented Nov 8, 2022

davidwendt commented Nov 8, 2022

davidwendt commented Nov 10, 2022

davidwendt commented Nov 4, 2022 •

edited

codecov bot commented Nov 5, 2022 •

edited

abellina left a comment •

edited