[BUG] Arena allocator doesn't handle per thread default streams and other streams allocating at the same time #1393
tgravescs added the `? - Needs Triage` (Need team to review and classify) and `bug` (Something isn't working) labels on Nov 28, 2023.
harrism added the `cpp` (Pertains to C++ code) label and removed the `? - Needs Triage` label on Nov 28, 2023.
rapids-bot pushed a commit that referenced this issue on Nov 29, 2023:

…#1395) Replaces #1394; this is targeted for 24.02. Fixes #1393.

In Spark with the Spark RAPIDS accelerator, using a cudf 23.12 snapshot, we have an application that reads ORC files, does some light processing, and then writes ORC files. It consistently fails during the ORC write with:

```
terminate called after throwing an instance of 'rmm::logic_error'
  what(): RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
```

The underlying issue arises because Spark with the RAPIDS accelerator uses the ARENA allocator with per-thread default streams enabled, and cuDF recently added its own stream pool that is used in addition to the per-thread default streams. It is now possible to use per-thread default streams alongside another pool of streams. This means memory can move from a thread or stream arena back into the global arena during a defragmentation and then down into another arena type, for instance thread arena -> global arena -> stream arena. If this happens and an allocation was made while the memory was in a thread arena, we now have to check whether the allocation is part of a stream arena.

I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.

Authors:
- Thomas Graves (https://github.com/tgravescs)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Bradley Dice (https://github.com/bdice)
- Rong Ou (https://github.com/rongou)
- Mark Harris (https://github.com/harrism)

URL: #1395
bdice pushed a commit to bdice/rmm that referenced this issue on Dec 1, 2023 (same commit message as above).
Describe the bug
In Spark with the Spark RAPIDS accelerator, using a cudf 23.12 snapshot, we have an application that reads ORC files, does some light processing, and then writes ORC files. It consistently fails during the ORC write with the rmm::logic_error "allocation not found" quoted in the commit message above.
I believe the above issue stems from the change to the ORC writer to use the stream pool: rapidsai/cudf@cb06c20
Spark is running with per-thread default streams turned on, and the ORC writer now has its own stream pool whose streams are not per-thread default streams; a sketch of this mix follows.
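For illustration, here is a minimal standalone sketch of that mixed-stream setup against the RMM arena resource. It is a hypothetical reduction, not the actual Spark or cuDF code, and assumes the program is compiled with per-thread default streams enabled (e.g. `-DCUDA_API_PER_THREAD_DEFAULT_STREAM`):

```cpp
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

int main()
{
  rmm::mr::cuda_memory_resource upstream;
  rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> arena{&upstream};

  // Allocations on the per-thread default stream are served by a thread arena.
  void* p = arena.allocate(1024, rmm::cuda_stream_per_thread);

  // Allocations on pooled (non-default) streams, like the ones the ORC writer
  // now uses, are served by stream arenas.
  rmm::cuda_stream_pool pool{2};
  void* q = arena.allocate(1024, pool.get_stream());

  // After a defragmentation, memory freed from a thread arena can migrate
  // through the global arena into a stream arena, which is what leads to the
  // "allocation not found" error described in this issue.
  arena.deallocate(p, 1024, rmm::cuda_stream_per_thread);
  arena.deallocate(q, 1024, pool.get_stream());
  return 0;
}
```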
Debugging the RMM code, it appears the following is happening: during a defragmentation, memory can move from a thread arena back into the global arena and then down into a stream arena (thread arena -> global arena -> stream arena). If an allocation was made while that memory was part of a thread arena, the later deallocation does not find it there, and RMM fails with "allocation not found".
I believe the fix for this is that, since we are now mixing per-thread default streams with normal streams, deallocation has to check all of the arenas - both the thread arenas and the stream arenas (a sketch of this fallback is below). I have tested the above fix and can no longer reproduce the issue.
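For concreteness, here is a minimal sketch of the kind of deallocation fallback described above. The type and function names (`arena`, `deallocate_from_other_arenas`) are hypothetical stand-ins, not the actual `arena_memory_resource` internals:

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for an arena; the real RMM arena type differs.
struct arena {
  // Returns true if this arena owned ptr and released it (stubbed here).
  bool deallocate(void* /*ptr*/, std::size_t /*bytes*/) { return false; }
};

// Sketch of the fix's idea: when the calling thread's arena does not own an
// allocation, search the other thread arenas, then the stream arenas (the
// memory may have migrated there via the global arena during defragmentation),
// and finally the global arena, instead of failing immediately.
bool deallocate_from_other_arenas(void* ptr, std::size_t bytes,
                                  std::unordered_map<std::size_t, arena>& thread_arenas,
                                  std::unordered_map<std::size_t, arena>& stream_arenas,
                                  arena& global_arena)
{
  for (auto& [id, a] : thread_arenas) {
    (void)id;
    if (a.deallocate(ptr, bytes)) { return true; }
  }
  for (auto& [id, a] : stream_arenas) {
    (void)id;
    if (a.deallocate(ptr, bytes)) { return true; }
  }
  return global_arena.deallocate(ptr, bytes);
}
```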
Steps/Code to reproduce bug
I have not come up with a minimal repro outside Spark yet, but I'm working on it.
Expected behavior
It doesn't crash.
Environment details (please complete the following information):
- Provide the `docker pull` & `docker run` commands used
- Run the `rmm/print_env.sh` script to gather relevant environment details

Additional context
Add any other context about the problem here.