
Fix Arena MR to support simultaneous access by PTDS and other streams #1395

Merged
merged 12 commits into rapidsai:branch-24.02 on Nov 29, 2023

Conversation

tgravescs
Contributor

@tgravescs tgravescs commented Nov 28, 2023

Description

Replaces #1394, this is targeted for 24.02.

fixes #1393

In Spark with the Spark RAPIDS accelerator, using a cudf 23.12 snapshot, we have an application that reads ORC files, does some light processing, and then writes ORC files. It consistently fails during the ORC write with:

terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found

The underlying issue arises because Spark with the RAPIDS accelerator uses the ARENA allocator with per-thread default streams (PTDS) enabled. cuDF recently added its own stream pool, which is used in addition to the per-thread default streams.
It's now possible to use per-thread default streams along with another pool of streams. This means that it's possible for an allocation to move from a thread or stream arena back into the global arena during a defragmentation and then move down into another arena type. For instance, thread arena -> global arena -> stream arena. If this happens and the allocation was made while it was in a thread arena, we now have to check whether the allocation is part of a stream arena.
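The shape of the fix, as a minimal sketch (member names such as global_arena_ and stream_arenas_ are taken from the review diff further down; the actual code in arena_memory_resource.hpp differs in details such as alignment, locking, and where this helper is called from):

```cpp
// Sketch only, not the exact RMM implementation: when a deallocation is not
// found in the arena associated with the deallocating stream, fall back to the
// global arena and, with this fix, to the stream arenas as well.
void deallocate_from_other_arena(void* ptr, std::size_t bytes)
{
  if (!global_arena_.deallocate(ptr, bytes)) {
    // With PTDS plus a separate stream pool, the superblock owning this
    // allocation may have migrated: thread arena -> global arena -> stream
    // arena. So search every stream arena before giving up.
    for (auto& [stream, stream_arena] : stream_arenas_) {
      if (stream_arena.deallocate(ptr, bytes)) { return; }
    }
    RMM_FAIL("allocation not found");
  }
}
```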

I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

another pool of streams being used at the same time

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

copy-pr-bot bot commented Nov 28, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp (Pertains to C++ code) label Nov 28, 2023
@tgravescs
Contributor Author

/ok to test

@harrism harrism changed the title Fix Arena allocator to work with both per thread default streams and another pool of streams being used at the same time Fix Arena allocator to support simultaneous access by PTDS and other streams Nov 28, 2023
@harrism harrism changed the title Fix Arena allocator to support simultaneous access by PTDS and other streams Fix Arena MR to support simultaneous access by PTDS and other streams Nov 28, 2023
@harrism harrism added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 28, 2023
making sure deallocations work

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs tgravescs marked this pull request as ready for review November 29, 2023 01:23
@tgravescs tgravescs requested a review from a team as a code owner November 29, 2023 01:23
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Contributor Author

Note: I added a test here. If I remove my fix to arena_memory_resource, the test fails because it throws:

terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure /arena_memory_resource.hpp:238: allocation not found
  

If there are better ideas for a test, I'm happy to update or add to it.

Contributor

@wence- wence- left a comment


Mostly commentary so that I understand what the test is doing, and a few small questions.

arena_mr mr(rmm::mr::get_current_device_resource(), arena_size);
std::vector<std::thread> threads;
std::size_t num_threads{3};
auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
Contributor

nit: Does this need to be shared, or can you just use auto view = rmm::cuda_stream_per_thread and then replace view->value() with view below?
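For illustration, the simplification being suggested might look like this (a sketch; rmm::cuda_stream_view is a cheap, trivially copyable wrapper, so no shared_ptr is needed):

```cpp
// Before: a shared_ptr around the view, dereferenced at each call site.
// auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
// void* thread_ptr = mr.allocate(256, view->value());

// After: copy the view by value and pass it directly.
auto view = rmm::cuda_stream_per_thread;    // an rmm::cuda_stream_view
void* thread_ptr = mr.allocate(256, view);
```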

std::vector<std::thread> threads;
std::size_t num_threads{3};
auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
void* thread_ptr = mr.allocate(256, view->value());
Contributor

So this one allocates from a thread arena (because it's the PTDS).

Contributor Author

Yes, because it's cuda_stream_per_thread.
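In other words, the arena MR routes each request based on the stream it is given. Roughly (a sketch; the helper names are assumptions, and get_arena in arena_memory_resource.hpp is the authority):

```cpp
// Sketch: PTDS requests go to the calling thread's arena; everything else goes
// to a per-stream arena. Helper names are assumed, not the exact RMM code.
arena& get_arena(rmm::cuda_stream_view stream)
{
  if (stream.is_per_thread_default()) { return get_thread_arena(); }
  return get_stream_arena(stream);
}
```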

Comment on lines 560 to 562
auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr2 = mr.allocate(superblock::minimum_size);
auto* ptr3 = mr.allocate(superblock::minimum_size);
Contributor

These can't be satisfied by the existing defragmented superblock, because it only has 1 MiB - 256 bytes of space.

So now, the superblock that contains thread_ptr lives in the set maintained by the global arena.

Deallocating at this point was something that was already handled.

auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr2 = mr.allocate(superblock::minimum_size);
auto* ptr3 = mr.allocate(superblock::minimum_size);
auto* ptr4 = mr.allocate(32_KiB);
Contributor

This allocation can be satisfied by the superblock that contains the allocation of thread_ptr. So now, that superblock lives in the set maintained in stream_arenas_ (specifically, the one for the default stream).

auto* ptr3 = mr.allocate(superblock::minimum_size);
auto* ptr4 = mr.allocate(32_KiB);
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, view->value()));
Contributor

And now this deallocation will not find the superblock for this pointer in the thread arena it was initially allocated from, nor in the global arena, but rather in one of the stream arenas.

And previously this would fail. Now, your fix searches (after failing to find in the global arena) the stream arenas, and the deallocation succeeds.

Contributor Author

correct

mr.deallocate(ptr1, superblock::minimum_size);
mr.deallocate(ptr2, superblock::minimum_size);
mr.deallocate(ptr3, superblock::minimum_size);
mr.deallocate(ptr4, 32_KiB);
Contributor

And cleanup; these deallocations have "always" been fine.

Contributor Author

Yes, just cleanup.

}
for (auto& thread : threads) {
thread.join();
}
Contributor

question: It seems to me that you don't actually need more than one thread for this test, just this allocate/deallocate pattern using a combination of PTDS and non-default streams. Do I have that right?

Contributor

That is, I think that (prior to your fix), replacing this thread-based allocation with:

for (int i = 0; i < 3; i++) {
  cuda_stream stream{};
  void *ptr = mr.allocate(32_KiB, stream);
  mr.deallocate(ptr, 32_KiB, stream);
}

would also provoke the failure.

Contributor Author

Correct, and I don't really need 3, so I'm simplifying this.

tgravescs and others added 3 commits November 29, 2023 08:19
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Contributor Author

Pushed changes to the test to simplify it and add comments. @wence- let me know if I missed any comment or you have further questions.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Contributor

@wence- wence- left a comment


Thanks, I think this looks good! Top debugging work too.

I'll wait for @harrism to take a look as well (this was the first time I was reading the arena code...)

Contributor

@bdice bdice left a comment


This looks fine to me, from my (fairly limited) knowledge of the arena allocator.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
if (!global_arena_.deallocate(ptr, bytes)) { RMM_FAIL("allocation not found"); }
if (!global_arena_.deallocate(ptr, bytes)) {
// It's possible to use per thread default streams along with another pool of streams.
// This means that it's possible for an arena to move from a thread or stream arena back
Contributor

for an allocation to move ...

Contributor Author

updated

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Member

@harrism harrism left a comment


Praise: This looks great -- nice that a relatively simple test verifies it. Nice debugging and good explanatory comments.

I would just like to make the streams explicit in all allocations/deallocations, for clarity. There are a few nits that I marked "non-blocking"; I'll leave it up to you whether you change them.

Comment on lines 544 to 546
auto per_thread_stream = rmm::cuda_stream_per_thread;
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, per_thread_stream);
Member

Nit [non-blocking]: No need to store this constant in a variable.

Suggested change
auto per_thread_stream = rmm::cuda_stream_per_thread;
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, per_thread_stream);
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, rmm::cuda_stream_per_thread);

// The original thread ptr is now owned by a stream arena so make
// sure deallocation works.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, per_thread_stream));
Member

Nit [non-blocking]:

Suggested change
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, per_thread_stream));
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread));

Member

Nit [non-blocking]:
Tests will fail if there is an exception whether or not you use EXPECT_NO_THROW; not throwing is expected by default. So you can remove the macro and the NOLINTNEXTLINE.
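For example, the suggested simplification would be (a sketch):

```cpp
// Before: the macro needs the lint suppression because EXPECT_NO_THROW expands
// to code containing a goto.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread));

// After: a plain call is enough; an unexpected exception still fails the test.
mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread);
```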

// the next allocation causes defrag. Defrag causes all superblocks
// from the thread and stream arena allocated above to go back to
// global arena and it allocates one superblock to the stream arena.
auto* ptr1 = mr.allocate(superblock::minimum_size);
Member

Question: What stream will this use?
Suggestion: Can you make the stream explicit?

Member

e.g.

Suggested change
auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr1 = mr.allocate(superblock::minimum_size, rmm::cuda_stream_view{});

// Allocate again to make sure all superblocks from
// global arena are owned by a stream arena instead of a thread arena
// or the global arena.
auto* ptr2 = mr.allocate(32_KiB);
Member

Question: What stream will this use?
Suggestion: Can you make the stream explicit?

Comment on lines 565 to 566
mr.deallocate(ptr1, superblock::minimum_size);
mr.deallocate(ptr2, 32_KiB);
Member

Question: What stream will this use?
Suggestion: Can you make the stream explicit?
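One way to make those streams explicit, following the same pattern as the allocation suggestion above (a sketch):

```cpp
// Pass the default stream explicitly instead of relying on the default argument.
mr.deallocate(ptr1, superblock::minimum_size, rmm::cuda_stream_view{});
mr.deallocate(ptr2, 32_KiB, rmm::cuda_stream_view{});
```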

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Contributor Author

tgravescs commented Nov 29, 2023

Fixed all nits and requested changes. Reran the test without the fix in arena_memory_resource and verified that it fails; applied the patch, reran the test, and it passes.

Member

@harrism harrism left a comment


Very nice, clear test. Nice that it requires no threads or even loops or branches!
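For readers who don't want to piece the final test together from the threads above, here is a rough reconstruction based on the snippets quoted in this review. The arena size, the aliases, the 32 KiB literal spelled out as arithmetic, and the function name are assumptions; the merged test in tests/mr/device/arena_mr_tests.cpp is the authority.

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

using arena_mr = rmm::mr::arena_memory_resource<rmm::mr::device_memory_resource>;
using rmm::mr::detail::arena::superblock;

// Reconstruction under stated assumptions; the real test is a gtest TEST_F.
void per_thread_to_stream_dealloc()
{
  // Assumed: two superblocks total, so the superblock-sized request below
  // cannot be met without defragmentation.
  auto const arena_size = superblock::minimum_size * 2;
  arena_mr mr(rmm::mr::get_current_device_resource(), arena_size);

  // Allocate from a thread arena via the per-thread default stream.
  void* thread_ptr = mr.allocate(256, rmm::cuda_stream_per_thread);

  // Use up the remaining superblock from a separate stream arena, leaving the
  // global arena empty.
  rmm::cuda_stream stream{};
  void* stream_ptr = mr.allocate(32 * 1024, stream);  // 32 KiB
  mr.deallocate(stream_ptr, 32 * 1024, stream);

  // The next allocation forces defragmentation: the superblocks held by the
  // thread and stream arenas (including the one owning thread_ptr) go back to
  // the global arena, and one superblock is handed to the default stream's arena.
  auto* ptr1 = mr.allocate(superblock::minimum_size, rmm::cuda_stream_view{});

  // Allocate again so the superblock owning thread_ptr also ends up owned by a
  // stream arena rather than the global arena.
  auto* ptr2 = mr.allocate(32 * 1024, rmm::cuda_stream_view{});

  // Before this PR this threw "allocation not found": the owning superblock is
  // no longer in the thread arena or the global arena, but in a stream arena.
  mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread);

  mr.deallocate(ptr1, superblock::minimum_size, rmm::cuda_stream_view{});
  mr.deallocate(ptr2, 32 * 1024, rmm::cuda_stream_view{});
}
```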

@harrism
Member

harrism commented Nov 29, 2023

/merge

@rapids-bot rapids-bot bot merged commit da793c5 into rapidsai:branch-24.02 Nov 29, 2023
47 checks passed
bdice pushed a commit to bdice/rmm that referenced this pull request Dec 1, 2023
…rapidsai#1395)


Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)
  - Rong Ou (https://github.com/rongou)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#1395
raydouglass pushed a commit that referenced this pull request Dec 4, 2023
…#1395) (#1396)

This PR backports #1395 from 24.02 to 23.12. It contains an arena MR fix for simultaneous access by PTDS and other streams.

Backport requested by @sameerz @GregoryKimball.

Authors:
   - Thomas Graves (https://github.com/tgravescs)

Approvers:
   - Lawrence Mitchell (https://github.com/wence-)
   - Mark Harris (https://github.com/harrism)
Labels
bug (Something isn't working), cpp (Pertains to C++ code), non-breaking (Non-breaking change)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] Arena allocator doesn't handle per thread default streams and other streams allocating at the same time
5 participants