
[DRAFT] Fix Arena allocator to work with both per-thread default streams and another pool of streams being used at the same time #1394

Closed
wants to merge 1 commit

Conversation

tgravescs
Contributor

Description

fixes #1393

Note: putting this up as a DRAFT as I don't have a test yet.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

another pool of streams being used at the same time

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

copy-pr-bot bot commented Nov 28, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp (Pertains to C++ code) label Nov 28, 2023
@harrism harrism added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 28, 2023
@harrism
Member

harrism commented Nov 28, 2023

/ok to test

@tgravescs
Contributor Author

Replaced by #1395, so closing this.

@tgravescs tgravescs closed this Nov 28, 2023
rapids-bot bot pushed a commit that referenced this pull request Nov 29, 2023
…#1395)

Replaces #1394; this is targeted for 24.02.

fixes #1393

In Spark, with the Spark Rapids accelerator using a cudf 23.12 snapshot, we have an application that reads ORC files, does some light processing, and then writes ORC files. It consistently fails during the ORC write with:

```
terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
```

The underlying issue arises because Spark with the Rapids accelerator uses the ARENA allocator with per-thread default streams enabled, while cudf recently added its own stream pool that is used in addition to the per-thread default streams.
It's now possible to use per-thread default streams along with another pool of streams. This means that an arena can move from a thread or stream arena back into the global arena during defragmentation, and then move down into another arena type, for instance: thread arena -> global arena -> stream arena. If this happens and there was an allocation from it while it was a thread arena, we now have to check whether the allocation is part of a stream arena.
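
As an illustration, the deallocation fallback can be sketched roughly as below. This is a hypothetical outline of a member function of the arena resource, not the code merged in #1395; the member names (thread_arenas_, stream_arenas_, global_arena_) and the bool-returning deallocate helpers are assumptions for illustration only:

```
// Hypothetical sketch -- member names and helper signatures are assumptions,
// not the actual rmm::mr::arena_memory_resource implementation.
void deallocate_from_other_arena(void* ptr, std::size_t bytes)
{
  // The allocation may have originated in another thread's arena.
  for (auto& [thread_id, arena] : thread_arenas_) {
    if (arena->deallocate(ptr, bytes)) { return; }
  }
  // Defragmentation can return a superblock from a thread arena to the
  // global arena, which may then hand it to a *stream* arena, so stream
  // arenas must be searched as well -- even when per-thread default
  // streams are enabled.
  for (auto& [stream, arena] : stream_arenas_) {
    if (arena.deallocate(ptr, bytes)) { return; }
  }
  // Finally, fall back to the global arena.
  if (!global_arena_.deallocate(ptr, bytes)) {
    RMM_FAIL("allocation not found");
  }
}
```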

I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.
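
For reference, a test for this scenario might look roughly like the sketch below. It only shows the shape of the setup (an arena resource over a plain CUDA resource, with allocations on the per-thread default stream mixed with a separate stream pool); the merged test drives the arena migration differently, and the sizes here are arbitrary:

```
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

#include <cstddef>

int main()
{
  rmm::mr::cuda_memory_resource upstream;
  rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> mr{&upstream};

  // The "other pool of streams" used alongside per-thread default streams.
  rmm::cuda_stream_pool pool{2};

  constexpr std::size_t size{4096};

  // Allocate on the per-thread default stream: served by a thread arena.
  void* thread_ptr = mr.allocate(size, rmm::cuda_stream_per_thread);

  // Touch the pool streams so that stream arenas exist and defragmentation
  // has somewhere to move superblocks.
  void* pool_ptr = mr.allocate(size, pool.get_stream(0));
  mr.deallocate(pool_ptr, size, pool.get_stream(0));

  // If the thread arena's superblock has migrated (thread arena -> global
  // arena -> stream arena), this deallocate must still find the block
  // instead of throwing "allocation not found".
  mr.deallocate(thread_ptr, size, rmm::cuda_stream_per_thread);
  return 0;
}
```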

Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)
  - Rong Ou (https://github.com/rongou)
  - Mark Harris (https://github.com/harrism)

URL: #1395
bdice pushed a commit to bdice/rmm that referenced this pull request Dec 1, 2023
…rapidsai#1395)

Labels
bug (Something isn't working), cpp (Pertains to C++ code), non-breaking (Non-breaking change)
Projects
Status: Done

2 participants