
Fix Arena MR to support simultaneous access by PTDS and other streams #1395

Merged
merged 12 commits into rapidsai:branch-24.02 on Nov 29, 2023

Conversation

tgravescs
Contributor

@tgravescs tgravescs commented Nov 28, 2023

Description

Replaces #1394, this is targeted for 24.02.

fixes #1393

In Spark with the Spark RAPIDS accelerator, using a cudf 23.12 snapshot, we have an application that reads ORC files, does some light processing, and then writes ORC files. It consistently fails during the ORC write with:

terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found

The underlying issue arises because Spark with the RAPIDS accelerator uses the ARENA allocator with per-thread default streams (PTDS) enabled. cuDF recently added its own stream pool, which is used in addition to the per-thread default streams.
It's now possible to use per-thread default streams along with another pool of streams. This means that it's possible for an allocation to move from a thread or stream arena back into the global arena during a defragmentation and then move down into another arena type. For instance, thread arena -> global arena -> stream arena. If this happens and the allocation was made while it was in a thread arena, we now have to check whether the allocation is part of a stream arena.
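The shape of the fix, as a minimal sketch (member names such as global_arena_ and stream_arenas_ are taken from the review diff further down; the actual code in arena_memory_resource.hpp differs in details such as alignment, locking, and where this helper is called from):

```cpp
// Sketch only, not the exact RMM implementation: when a deallocation is not
// found in the arena associated with the deallocating stream, fall back to the
// global arena and, with this fix, to the stream arenas as well.
void deallocate_from_other_arena(void* ptr, std::size_t bytes)
{
  if (!global_arena_.deallocate(ptr, bytes)) {
    // With PTDS plus a separate stream pool, the superblock owning this
    // allocation may have migrated: thread arena -> global arena -> stream
    // arena. So search every stream arena before giving up.
    for (auto& [stream, stream_arena] : stream_arenas_) {
      if (stream_arena.deallocate(ptr, bytes)) { return; }
    }
    RMM_FAIL("allocation not found");
  }
}
```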

I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

another pool of streams being used at the same time

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

copy-pr-bot bot commented Nov 28, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp (Pertains to C++ code) label Nov 28, 2023
@tgravescs
Contributor Author

/ok to test

@harrism harrism changed the title Fix Arena allocator to work with both per thread default streams and another pool of streams being used at the same time Fix Arena allocator to support simultaneous access by PTDS and other streams Nov 28, 2023
@harrism harrism changed the title Fix Arena allocator to support simultaneous access by PTDS and other streams Fix Arena MR to support simultaneous access by PTDS and other streams Nov 28, 2023
@harrism harrism added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 28, 2023
making sure deallocations work

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs tgravescs marked this pull request as ready for review November 29, 2023 01:23
@tgravescs tgravescs requested a review from a team as a code owner November 29, 2023 01:23
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Contributor Author

Note: I added a test here. If I remove my fix to arena_memory_resource, the test fails because it throws:

terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure /arena_memory_resource.hpp:238: allocation not found
  

If there are better ideas for a test, I'm happy to update or add to it.

Contributor

@wence- wence- left a comment


Mostly commentary so that I understand what the test is doing, and a few small questions.

arena_mr mr(rmm::mr::get_current_device_resource(), arena_size);
std::vector<std::thread> threads;
std::size_t num_threads{3};
auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
Contributor

nit: Does this need to be shared, or can you just use auto view = rmm::cuda_stream_per_thread and then replace view->value() with view below?
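For illustration, the simplification being suggested might look like this (a sketch; rmm::cuda_stream_view is a cheap, trivially copyable wrapper, so no shared_ptr is needed):

```cpp
// Before: a shared_ptr around the view, dereferenced at each call site.
// auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
// void* thread_ptr = mr.allocate(256, view->value());

// After: copy the view by value and pass it directly.
auto view = rmm::cuda_stream_per_thread;    // an rmm::cuda_stream_view
void* thread_ptr = mr.allocate(256, view);
```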

std::vector<std::thread> threads;
std::size_t num_threads{3};
auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
void* thread_ptr = mr.allocate(256, view->value());
Contributor

So this one allocates from a thread arena (because it's the PTDS).

Contributor Author

Yes, because it's cuda_stream_per_thread.
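In other words, the arena MR routes each request based on the stream it is given. Roughly (a sketch; the helper names are assumptions, and get_arena in arena_memory_resource.hpp is the authority):

```cpp
// Sketch: PTDS requests go to the calling thread's arena; everything else goes
// to a per-stream arena. Helper names are assumed, not the exact RMM code.
arena& get_arena(rmm::cuda_stream_view stream)
{
  if (stream.is_per_thread_default()) { return get_thread_arena(); }
  return get_stream_arena(stream);
}
```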

Comment on lines 560 to 562
auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr2 = mr.allocate(superblock::minimum_size);
auto* ptr3 = mr.allocate(superblock::minimum_size);
Contributor

These can't be satisfied by the existing defragmented superblock, because it only has 1 MiB - 256 bytes of space.

So now, the superblock that contains thread_ptr lives in the set maintained by the global arena.

Deallocating at this point was something that was already handled.

auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr2 = mr.allocate(superblock::minimum_size);
auto* ptr3 = mr.allocate(superblock::minimum_size);
auto* ptr4 = mr.allocate(32_KiB);
Contributor

This allocation can be satisfied by the superblock that contains the allocation of thread_ptr. So now, that superblock lives in the set maintained in stream_arenas_ (specifically, the one for the default stream).

auto* ptr3 = mr.allocate(superblock::minimum_size);
auto* ptr4 = mr.allocate(32_KiB);
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, view->value()));
Contributor

And now this deallocation will not find the superblock for this pointer in the thread arena it was initially allocated from, nor in the global arena, but rather in one of the stream arenas.

And previously this would fail. Now, your fix searches (after failing to find in the global arena) the stream arenas, and the deallocation succeeds.

Contributor Author

correct

mr.deallocate(ptr1, superblock::minimum_size);
mr.deallocate(ptr2, superblock::minimum_size);
mr.deallocate(ptr3, superblock::minimum_size);
mr.deallocate(ptr4, 32_KiB);
Contributor

And cleanup; these deallocations have "always" been fine.

Contributor Author

Yes, just cleanup.

}
for (auto& thread : threads) {
thread.join();
}
Contributor

question: It seems to me that you don't actually need more than one thread for this test, just this allocate/deallocate pattern using a combination of PTDS and non-default streams. Do I have that right?

Contributor

That is, I think that (prior to your fix), replacing this thread-based allocation with:

for (int i = 0; i < 3; i++) {
  cuda_stream stream{};
  void *ptr = mr.allocate(32_KiB, stream);
  mr.deallocate(ptr, 32_KiB, stream);
}

would also provoke the failure.

Contributor Author

Correct, and I don't really need 3, so I'm simplifying this.

tgravescs and others added 3 commits November 29, 2023 08:19
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Contributor Author

Pushed changes to the test to simplify it and add comments. @wence- let me know if I missed any comment or you have further questions.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Contributor

@wence- wence- left a comment


Thanks, I think this looks good! Top debugging work too.

I'll wait for @harrism to take a look as well (this was the first time I was reading the arena code...)

Contributor

@bdice bdice left a comment


This looks fine to me, from my (fairly limited) knowledge of the arena allocator.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
if (!global_arena_.deallocate(ptr, bytes)) { RMM_FAIL("allocation not found"); }
if (!global_arena_.deallocate(ptr, bytes)) {
// It's possible to use per thread default streams along with another pool of streams.
// This means that it's possible for an arena to move from a thread or stream arena back
Contributor

for an allocation to move ...

Contributor Author

updated

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Member

@harrism harrism left a comment


Praise: This looks great -- nice that a relatively simple test verifies it. Nice debugging and good explanatory comments.

I would just like to make the streams explicit in all allocations/deallocations, for clarity. There are a few nits that I marked "non-blocking"; I'll leave it up to you whether you change them.

Comment on lines 544 to 546
auto per_thread_stream = rmm::cuda_stream_per_thread;
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, per_thread_stream);
Member

Nit [non-blocking]: No need to store this constant in a variable.

Suggested change
auto per_thread_stream = rmm::cuda_stream_per_thread;
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, per_thread_stream);
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, rmm::cuda_stream_per_thread);

// The original thread ptr is now owned by a stream arena so make
// sure deallocation works.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, per_thread_stream));
Member

Nit [non-blocking]:

Suggested change
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, per_thread_stream));
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread));

Member

Nit [non-blocking]:
Tests will fail if there is an exception whether or not you use EXPECT_NO_THROW; not throwing is expected by default. So you can remove the macro and the NOLINTNEXTLINE.
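For example, the suggested simplification would be (a sketch):

```cpp
// Before: the macro needs the lint suppression because EXPECT_NO_THROW expands
// to code containing a goto.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread));

// After: a plain call is enough; an unexpected exception still fails the test.
mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread);
```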

// the next allocation causes defrag. Defrag causes all superblocks
// from the thread and stream arena allocated above to go back to
// global arena and it allocates one superblock to the stream arena.
auto* ptr1 = mr.allocate(superblock::minimum_size);
Member

Question: What stream will this use?
Suggestion: Can you make the stream explicit?

Member

e.g.

Suggested change
auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr1 = mr.allocate(superblock::minimum_size, rmm::cuda_stream_view{});

// Allocate again to make sure all superblocks from
// global arena are owned by a stream arena instead of a thread arena
// or the global arena.
auto* ptr2 = mr.allocate(32_KiB);
Member

Question: What stream will this use?
Suggestion: Can you make the stream explicit?

Comment on lines 565 to 566
mr.deallocate(ptr1, superblock::minimum_size);
mr.deallocate(ptr2, 32_KiB);
Member

Question: What stream will this use?
Suggestion: Can you make the stream explicit?
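One way to make those streams explicit, following the same pattern as the allocation suggestion above (a sketch):

```cpp
// Pass the default stream explicitly instead of relying on the default argument.
mr.deallocate(ptr1, superblock::minimum_size, rmm::cuda_stream_view{});
mr.deallocate(ptr2, 32_KiB, rmm::cuda_stream_view{});
```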

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs
Contributor Author

tgravescs commented Nov 29, 2023

Fixed all nits and requested changes. Reran the test without the fix in arena_memory_resource and verified that it fails; applied the patch, reran the test, and it passes.

Member

@harrism harrism left a comment


Very nice, clear test. Nice that it requires no threads or even loops or branches!
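For readers who don't want to piece the final test together from the threads above, here is a rough reconstruction based on the snippets quoted in this review. The arena size, the aliases, the 32 KiB literal spelled out as arithmetic, and the function name are assumptions; the merged test in tests/mr/device/arena_mr_tests.cpp is the authority.

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

using arena_mr = rmm::mr::arena_memory_resource<rmm::mr::device_memory_resource>;
using rmm::mr::detail::arena::superblock;

// Reconstruction under stated assumptions; the real test is a gtest TEST_F.
void per_thread_to_stream_dealloc()
{
  // Assumed: two superblocks total, so the superblock-sized request below
  // cannot be met without defragmentation.
  auto const arena_size = superblock::minimum_size * 2;
  arena_mr mr(rmm::mr::get_current_device_resource(), arena_size);

  // Allocate from a thread arena via the per-thread default stream.
  void* thread_ptr = mr.allocate(256, rmm::cuda_stream_per_thread);

  // Use up the remaining superblock from a separate stream arena, leaving the
  // global arena empty.
  rmm::cuda_stream stream{};
  void* stream_ptr = mr.allocate(32 * 1024, stream);  // 32 KiB
  mr.deallocate(stream_ptr, 32 * 1024, stream);

  // The next allocation forces defragmentation: the superblocks held by the
  // thread and stream arenas (including the one owning thread_ptr) go back to
  // the global arena, and one superblock is handed to the default stream's arena.
  auto* ptr1 = mr.allocate(superblock::minimum_size, rmm::cuda_stream_view{});

  // Allocate again so the superblock owning thread_ptr also ends up owned by a
  // stream arena rather than the global arena.
  auto* ptr2 = mr.allocate(32 * 1024, rmm::cuda_stream_view{});

  // Before this PR this threw "allocation not found": the owning superblock is
  // no longer in the thread arena or the global arena, but in a stream arena.
  mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread);

  mr.deallocate(ptr1, superblock::minimum_size, rmm::cuda_stream_view{});
  mr.deallocate(ptr2, 32 * 1024, rmm::cuda_stream_view{});
}
```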

@harrism
Member

harrism commented Nov 29, 2023

/merge

@rapids-bot rapids-bot bot merged commit da793c5 into rapidsai:branch-24.02 Nov 29, 2023
47 checks passed
bdice pushed a commit to bdice/rmm that referenced this pull request Dec 1, 2023
…rapidsai#1395)


Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)
  - Rong Ou (https://github.com/rongou)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#1395
raydouglass pushed a commit that referenced this pull request Dec 4, 2023
…#1395) (#1396)

This PR backports #1395 from 24.02 to 23.12. It contains an arena MR fix for simultaneous access by PTDS and other streams.

Backport requested by @sameerz @GregoryKimball.

Authors:
   - Thomas Graves (https://github.com/tgravescs)

Approvers:
   - Lawrence Mitchell (https://github.com/wence-)
   - Mark Harris (https://github.com/harrism)
Labels
bug (Something isn't working), cpp (Pertains to C++ code), non-breaking (Non-breaking change)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] Arena allocator doesn't handle per thread default streams and other streams allocating at the same time
5 participants