[BUG] Arena allocator doesn't handle per thread default streams and other streams allocating at the same time #1393

Closed
tgravescs opened this issue Nov 28, 2023 · 0 comments · Fixed by #1395
Labels
bug: Something isn't working · cpp: Pertains to C++ code

Comments


tgravescs commented Nov 28, 2023

Describe the bug
In Spark with the Spark Rapids accelerator using cudf 23.12 snapshot we have an application that is reading ORC files, doing some light processing and then writing ORC files. It consistently fails while doing the ORC write with:

```
terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
```

I believe the above issue stems from the change to the ORC writer to use a stream pool: rapidsai/cudf@cb06c20

Spark is running with per-thread default streams turned on, and the ORC writer now has its own stream pool whose streams are not per-thread default streams.
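
The mixed-stream usage pattern looks roughly like the following minimal sketch against the RMM 23.12 public API (the resource setup and sizes are illustrative only, not the actual Spark/cudf configuration, and this snippet by itself does not reproduce the crash):

```
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

int main()
{
  rmm::mr::cuda_memory_resource upstream;
  // A single arena resource shared by everything, as when Spark uses the ARENA allocator.
  rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> arena{&upstream};

  // Spark worker threads allocate on the per-thread default stream (PTDS enabled).
  void* spark_buf = arena.allocate(1024, rmm::cuda_stream_per_thread);

  // Meanwhile the ORC writer allocates on ordinary streams from its own pool
  // (analogous to the pool added in rapidsai/cudf@cb06c20).
  rmm::cuda_stream_pool writer_streams{2};
  auto writer_stream = writer_streams.get_stream();
  void* orc_buf = arena.allocate(2048, writer_stream);

  arena.deallocate(orc_buf, 2048, writer_stream);
  // The deallocation below is the path that can throw "allocation not found"
  // once superblocks have migrated between arenas (see the steps below).
  arena.deallocate(spark_buf, 1024, rmm::cuda_stream_per_thread);
  return 0;
}
```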

Debugging the RMM code, it appears that the following is happening:

  1. A thread using a per-thread default stream allocates memory, which creates a thread arena.
  2. Later, another thread tries to allocate memory, but the allocation fails and causes a defrag (https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/arena_memory_resource.hpp#L160), which calls clean() (https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/arena_memory_resource.hpp#L177).
  3. The defrag sends all the superblocks_ from each of the arenas, including the per-thread default stream arena from step 1, back to the global arena: https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/detail/arena.hpp#L862
  4. Now one of the new streams from the ORC writer pool allocates memory. This happens to move the superblock from the global arena into a stream arena.
  5. The original thread using the per-thread default stream tries to deallocate the memory (https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/arena_memory_resource.hpp#L194). The allocation isn't in the per-thread default arena anymore, so it falls into trying the other arenas (https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/arena_memory_resource.hpp#L214). That code, however, checks whether use_per_thread_arena is on for that stream (https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/arena_memory_resource.hpp#L228); it is, so only the thread arenas are searched for the deallocation. This fails, and it falls through to trying the global_arena (https://github.com/rapidsai/rmm/blob/branch-23.12/include/rmm/mr/device/arena_memory_resource.hpp#L238), which also fails since the allocation is in a stream arena, and it throws "allocation not found" (see the sketch below this list).
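
Put together, the lookup order described in step 5 behaves like the simplified, self-contained model below (placeholder names and plain pointer sets rather than RMM's actual superblock bookkeeping; it only illustrates the fall-through, it is not the real implementation):

```
#include <cstdio>
#include <unordered_set>
#include <vector>

// Toy model of an arena: just a set of outstanding allocations.
struct arena {
  std::unordered_set<void*> blocks;
  bool deallocate(void* p) { return blocks.erase(p) > 0; }
};

struct arena_mr_model {
  std::vector<arena> thread_arenas_;
  std::vector<arena> stream_arenas_;
  arena global_arena_;

  // stream_is_ptds mirrors use_per_thread_arena(stream) in the real code.
  bool deallocate(void* p, bool stream_is_ptds, arena& owning_arena) {
    if (owning_arena.deallocate(p)) { return true; }    // the stream's own arena
    auto& others = stream_is_ptds ? thread_arenas_       // for a PTDS stream only the
                                  : stream_arenas_;      // thread arenas are searched
    for (auto& a : others) {
      if (a.deallocate(p)) { return true; }
    }
    if (global_arena_.deallocate(p)) { return true; }    // last resort
    std::puts("allocation not found");                   // the observed failure
    return false;
  }
};

int main() {
  arena_mr_model mr;
  mr.thread_arenas_.resize(1);
  mr.stream_arenas_.resize(1);
  void* p = reinterpret_cast<void*>(0x1);

  // Steps 1-4: a block allocated from a thread arena ends up in a stream arena
  // after the defrag and a subsequent pool-stream allocation.
  mr.stream_arenas_[0].blocks.insert(p);

  // Step 5: the PTDS thread deallocates; the stream arenas are never checked,
  // so the lookup falls through to the global arena and fails.
  mr.deallocate(p, /*stream_is_ptds=*/true, mr.thread_arenas_[0]);
  return 0;
}
```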

I believe the fix for this is that, since we are now mixing per-thread default streams with normal streams, deallocation has to check all the arenas, both thread and stream arenas.
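
In terms of the toy model above, the change amounts to searching both arena lists in the fallback path. This is a sketch only (the helper name deallocate_from_other_arenas is a placeholder); the actual change is the one made in #1395:

```
// Fallback search covering both arena kinds (same toy model as above; the real
// fix is in arena_memory_resource.hpp via PR #1395).
bool deallocate_from_other_arenas(arena_mr_model& mr, void* p) {
  for (auto& a : mr.thread_arenas_) {
    if (a.deallocate(p)) { return true; }
  }
  for (auto& a : mr.stream_arenas_) {
    if (a.deallocate(p)) { return true; }
  }
  return false;
}
```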

I have tested the above fix and can no longer reproduce the issue.

Steps/Code to reproduce bug
I have not come up with a minimal repro outside of Spark yet, but I'm working on it.

Expected behavior
It doesn't crash.

Environment details (please complete the following information):

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of RMM install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used
  • Please run and attach the output of the rmm/print_env.sh script to gather relevant environment details

Additional context
Add any other context about the problem here.

@tgravescs tgravescs added ? - Needs Triage Need team to review and classify bug Something isn't working labels Nov 28, 2023
@harrism harrism added cpp Pertains to C++ code and removed ? - Needs Triage Need team to review and classify labels Nov 28, 2023
@harrism harrism moved this from Todo to In Progress in RMM Project Board Nov 28, 2023
rapids-bot bot pushed a commit that referenced this issue Nov 29, 2023
…#1395)

Replaces #1394; this is targeted for 24.02.

fixes #1393

In Spark with the Spark Rapids accelerator, using a cudf 23.12 snapshot, we have an application that is reading ORC files, doing some light processing, and then writing ORC files. It consistently fails while doing the ORC write with:

```
terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
```

The underlying issue arises because Spark with the Rapids accelerator uses the arena allocator with per-thread default streams enabled, while cuDF recently added its own stream pool that is used in addition to the per-thread default streams.
It's now possible to use per-thread default streams along with another pool of streams. This means that a superblock can move from a thread or stream arena back into the global arena during a defragmentation and then move down into another arena type, for instance thread arena -> global arena -> stream arena. If this happens and there was an allocation from it while it was a thread arena, we now have to check whether the allocation is part of a stream arena.

I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.

Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)
  - Rong Ou (https://github.com/rongou)
  - Mark Harris (https://github.com/harrism)

URL: #1395
@github-project-automation github-project-automation bot moved this from In Progress to Done in RMM Project Board Nov 29, 2023
bdice pushed a commit to bdice/rmm that referenced this issue Dec 1, 2023
…rapidsai#1395)
