Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add system memory resource #1581

Merged
merged 23 commits into from
Jul 1, 2024
Merged

Add system memory resource #1581

merged 23 commits into from
Jul 1, 2024

Conversation

rongou
Copy link
Contributor

@rongou rongou commented Jun 11, 2024

Description

Adds a new device memory resource that uses system allocated memory. Works around some existing issues when GPU memory is oversubscribed.

closes #1580

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rongou rongou requested review from a team as code owners June 11, 2024 20:15
@rongou rongou requested review from harrism and wence- June 11, 2024 20:15
Copy link

copy-pr-bot bot commented Jun 11, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added CMake cpp Pertains to C++ code labels Jun 11, 2024
@rongou rongou added feature request New feature or request non-breaking Non-breaking change labels Jun 11, 2024
Copy link
Contributor

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved trivial CMake changes

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly documentation nits, but one unsigned integer arithmetic issue that wants fixed.


namespace detail {
struct sam {
static bool is_supported()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please document that this function returns whether something is supported on the currently active device (or, perhaps better, allow passing a cuda_device_id in)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added parameter.

namespace rmm::mr {

namespace detail {
struct sam {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: What is sam an abbreviation of? System accessible memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System allocated memory. Added a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should spell it out in the name of the struct. I like to avoid unnecessary abbreviation in code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the struct.

Comment on lines 74 to 75
* By default if no parameters are specified, this memory resource pass through to malloc/free
* for allocation/deallocation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* By default if no parameters are specified, this memory resource pass through to malloc/free
* for allocation/deallocation.
* By default if no parameters are specified, this memory resource passes through to malloc/free
* for allocation/deallocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed.


if (bytes >= threshold_size_) {
auto const free = rmm::available_device_memory().first;
auto const allocatable = std::max(free - headroom_size_, 0UL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: this subtraction is between two unsigned integers, and therefore wraps whenever free is less than headroom_size_. So exactly when you want to not allocate on the GPU, you will end up determining that you can allocate all of aligned bytes on the GPU.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed.

RMM_CUDA_TRY(cudaMemAdvise(static_cast<char*>(ptr) + gpu_portion,
cpu_portion,
cudaMemAdviseSetPreferredLocation,
cudaCpuDeviceId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: What do we want this interface to look like in a post cuda-12.2 world where cudaMemAdvice_v2 exists and we can specify numa affinity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. At least in GH200, there is only one NUMA node on the CPU side, so it doesn't really matter.

Comment on lines 77 to 81
* However, when GPU memory is over-subscribed, system allocated memory will migrate to the GPU
* and cannot migrate back, thus causing other CUDA calls to fail with out-of-memory errors. To
* work around this problem, we can reserve some GPU memory as headroom for other CUDA calls.
* Doing this check can be expensive, so only large buffer above the given threshold will be
* checked.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think I quite follow this documentation. The joining However conjunction doesn't seem to follow from the previous paragraph.

For example, what is "this check"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved, removed "however".

Comment on lines 166 to 167
* Two cuda_memory_resources always compare equal, because they can each
* deallocate memory allocated by the other.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This isn't a cuda_memory_resource though.

I think for equality, the requirement RMM tends to have is that pointers allocated by memory resource A can be deallocated by memory resource B. But here we are stricter, since we also require that the headroom and threshold are identical. Is this desired?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@wence-
Copy link
Contributor

wence- commented Jun 13, 2024

/ok to test

@rongou rongou requested a review from wence- June 13, 2024 19:56
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, two minor docstring suggestions from me.

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/sam_headroom_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/sam_headroom_resource_adaptor.hpp Outdated Show resolved Hide resolved
@wence-
Copy link
Contributor

wence- commented Jun 14, 2024

/ok to test

@harrism
Copy link
Member

harrism commented Jun 15, 2024

/ok to test

Does this mean @rongou doesnt have commit signing set up?

@wence-
Copy link
Contributor

wence- commented Jun 17, 2024

/ok to test

Does this mean @rongou doesnt have commit signing set up?

Looks like it, yes. @roungou: these docs describe how to set this up https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification

@wence-
Copy link
Contributor

wence- commented Jun 17, 2024

/ok to test

@rongou rongou requested review from a team as code owners June 17, 2024 19:34
@rongou rongou requested a review from raydouglass June 17, 2024 19:34
@github-actions github-actions bot added Python Related to RMM Python API ci labels Jun 17, 2024
@github-actions github-actions bot removed the ci label Jun 17, 2024
@rongou
Copy link
Contributor Author

rongou commented Jun 17, 2024

Ok, signed the commits.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. One design discussion we should have, and a bunch of details.

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp Outdated Show resolved Hide resolved
* be expensive, checking whether to set the preferred location is only done above the specified
* allocation threshold size.
*
* Note that if threshold size is non-zero, then an application making many small allocations can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this, I wonder if instead of a threshold parameter, you should using a binning_memory_resource, and encode the threshold as the maximum allocation size for the lower bin (or second highest, if more than two bins). All allocations in the larger bin (above the threshold) will do the headroom check. All allocations in the smaller bin(s) won't. You could even tune the smaller bins by using multiple bins, some or all of which might not even use SAM (so they won't eat into the headroom).

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think we can leave the threshold checking to the caller. At least in the current use case, we want to configure the cupy allocator to use this, and it'd only be used for large buffers.

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp Outdated Show resolved Hide resolved
*/
void* do_allocate(std::size_t bytes, [[maybe_unused]] cuda_stream_view stream) override
{
auto const aligned{rmm::align_up(bytes, rmm::CUDA_ALLOCATION_ALIGNMENT)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SAM, shouldn't we be using HOST alignment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and did some experiments with both HMM and ATS, looks like we don't actually need to do additional alignments. Removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got more recent advice from the CUDA team and they suggested we should use CUDA alignment if it will be accessed on the device. If nothing else, for cache line alignment this is important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we do this alignment? Do we need to add some padding as in aligned_host_allocate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to do host MR style alignment, but with CUDA alignment. PTAL

namespace rmm::mr {

namespace detail {
struct sam {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should spell it out in the name of the struct. I like to avoid unnecessary abbreviation in code.


namespace detail {
/** @brief Struct to check if system allocated memory (SAM) is supported. */
struct sam {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: does this need to be a struct? It looks like it could just be a static function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's just a single function for now. Removed the struct.

tests/CMakeLists.txt Show resolved Hide resolved
mr.deallocate(ptr, size_mb);
}

TEST(SAMHeadroomAdaptorTest, ThrowIfNotSupported)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: perhaps group this with the other ThrowIfNotSupported test above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

rongou and others added 3 commits June 18, 2024 11:56
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
@rongou rongou requested a review from harrism June 19, 2024 00:31
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One copy-pasto

friend void get_property(system_memory_resource const&, cuda::mr::device_accessible) noexcept {}

/**
* @brief Enables the `cuda::mr::device_accessible` property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Enables the `cuda::mr::device_accessible` property
* @brief Enables the `cuda::mr::host_accessible` property

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all of my feedback. Approving, with a couple of comments.

explicit sam_headroom_resource_adaptor(Upstream* upstream, std::size_t headroom)
: upstream_{upstream}, headroom_{headroom}
{
static_assert(std::is_same_v<system_memory_resource, Upstream>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this could work with any prefetchable upstream, including a SAM MR that is already wrapped with another adaptor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifically designed to address the current shortcomings of SAM, so I'm not sure if it's useful for other things (e.g. managed memory). You are right about the adaptors though, but I'm leaning towards being more strict to avoid misuse. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, OK. We can always loosen it if needed.

{
try {
void* ptr = ref.allocate(bytes);
EXPECT_NE(nullptr, ptr);
EXPECT_TRUE(is_properly_aligned(ptr));
EXPECT_TRUE(is_device_accessible_memory(ptr));
if (not is_system_mr) { EXPECT_TRUE(is_device_accessible_memory(ptr)); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not device accessible memory, you shouldn't even be running these tests on it. If it's device accessible but is_device_accessible_memory returns false, then we should investigate why it returns false and fix it. I don't really like the extra boolean parameter, the allocation tests should be generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

try {
return rmm::detail::aligned_host_allocate(
bytes, CUDA_ALLOCATION_ALIGNMENT, [](std::size_t size) { return ::operator new(size); });
} catch (std::bad_alloc const&) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this, is it just changing the exception type? I think you should try to incorporate the what() from the bad_alloc into the new exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* more information, see
* https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/.
*/
class system_memory_resource final : public device_memory_resource {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a use case where we want a system_memory_resource that is not limited to systems that support paging. So we don't want to limit it to is_system_memory_supported(device). Instead, we want to limit the device_accessible property to be true on systems where the memory is device accessible. I'm currently checking whether the device_accessible property can be used as a dynamic property. I don't think so. :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess technically you could have multiple GPUs on a system and only some of them support SAM (e.g. a Turing and a Pascal), so the dynamic property also needs a device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harrism can this be merged now or do you still want to get the device_accessible property resolved first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still waiting, have pinged @jrhemstad for help. I think since MRs are implicitly tied to the current device, it can query the current device, doesn't need to take a parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device_accessible is not a dynamic property.

For this kind of functionality, we'd need to add a new property like maybe_device_accessible where get_property(mr, cuda::maybe_device_accessible) returns a bool for if it is device accessible or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this now? We can always enhance it when we have a dynamic device_accessible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad that would make this MR unusable with a resource_ref that expects device_accessible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rongou Perhaps. I'm troubled that something called a system_memory_resource is only usable on systems that support paging memory to device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a system_memory_resource under the device namespace, so it's probably reasonable to assume it only works on environments that support accessing system allocated memory from a device.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no device namespace, and device_memory_resource and host_memory_resource base classes are planned for removal.

@@ -32,6 +32,7 @@ INSTANTIATE_TEST_SUITE_P(ResourceTests,
mr_factory{"CUDA_Async", &make_cuda_async},
#endif
mr_factory{"Managed", &make_managed},
mr_factory{"System", &make_system},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this will cause the tests to fail on systems that don't support pageable memory access?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are skipped if SAM is not supported.

@harrism
Copy link
Member

harrism commented Jul 1, 2024

/merge

@rapids-bot rapids-bot bot merged commit a727a03 into rapidsai:branch-24.08 Jul 1, 2024
57 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jul 25, 2024
Follow up on #1581 to add access to the system memory resource in python.

Fixes #1622

Authors:
  - Rong Ou (https://github.com/rongou)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1605
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp Pertains to C++ code feature request New feature or request non-breaking Non-breaking change
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] System Memory Resource
5 participants