Add system memory resource #1581

rongou · 2024-06-11T20:15:25Z

Description

Adds a new device memory resource that uses system allocated memory. Works around some existing issues when GPU memory is oversubscribed.

closes #1580

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-06-11T20:15:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

KyleFromNVIDIA

Approved trivial CMake changes

wence-

Mostly documentation nits, but one unsigned integer arithmetic issue that needs fixing.

wence- · 2024-06-13T14:00:56Z

include/rmm/mr/device/system_memory_resource.hpp

+
+namespace detail {
+struct sam {
+  static bool is_supported()


nit: Please document that this function returns whether something is supported on the currently active device (or, perhaps better, allow passing a cuda_device_id in)?

Added parameter.

wence- · 2024-06-13T14:01:17Z

include/rmm/mr/device/system_memory_resource.hpp

+namespace rmm::mr {
+
+namespace detail {
+struct sam {


question: What is sam an abbreviation of? System accessible memory?

System allocated memory. Added a comment.

I think we should spell it out in the name of the struct. I like to avoid unnecessary abbreviation in code.

Removed the struct.

wence- · 2024-06-13T14:01:37Z

include/rmm/mr/device/system_memory_resource.hpp

+   * By default if no parameters are specified, this memory resource pass through to malloc/free
+   * for allocation/deallocation.


Suggested change

-   * By default if no parameters are specified, this memory resource pass through to malloc/free
-   * for allocation/deallocation.
+   * By default if no parameters are specified, this memory resource passes through to malloc/free
+   * for allocation/deallocation.

No longer needed.

wence- · 2024-06-13T14:09:03Z

include/rmm/mr/device/system_memory_resource.hpp

+
+    if (bytes >= threshold_size_) {
+      auto const free        = rmm::available_device_memory().first;
+      auto const allocatable = std::max(free - headroom_size_, 0UL);


issue: this subtraction is between two unsigned integers, and therefore wraps whenever free is less than headroom_size_. So exactly when you want to not allocate on the GPU, you will end up determining that you can allocate all of aligned bytes on the GPU.

Good catch, fixed.
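
To make the wraparound concrete, here is a minimal sketch of the flagged pattern next to a saturating alternative. The helper names are hypothetical and this is not necessarily the fix that landed in the PR; it only illustrates why `std::max(free - headroom_size_, 0UL)` cannot clamp to zero.

```cpp
#include <cstddef>

// Hypothetical helper (not RMM's actual fix): bytes that may be placed on the
// GPU while keeping `headroom` bytes free. Saturates at zero instead of wrapping.
constexpr std::size_t allocatable_bytes(std::size_t free_bytes, std::size_t headroom) noexcept
{
  return free_bytes > headroom ? free_bytes - headroom : 0;
}

// The pattern flagged in the review: the subtraction is done in unsigned
// arithmetic, so it wraps before any comparison against zero can help.
constexpr std::size_t allocatable_bytes_buggy(std::size_t free_bytes,
                                              std::size_t headroom) noexcept
{
  std::size_t const diff = free_bytes - headroom;  // wraps modulo 2^N when free < headroom
  return diff > 0 ? diff : 0;                      // clamp has no effect on an unsigned value
}
```

With 30 bytes free and 100 bytes of headroom, the saturating version returns 0, while the wrapped version returns a huge value and the allocation would be treated as fitting entirely on the GPU.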

wence- · 2024-06-13T14:10:18Z

include/rmm/mr/device/system_memory_resource.hpp

+        RMM_CUDA_TRY(cudaMemAdvise(static_cast<char*>(ptr) + gpu_portion,
+                                   cpu_portion,
+                                   cudaMemAdviseSetPreferredLocation,
+                                   cudaCpuDeviceId));


question: What do we want this interface to look like in a post cuda-12.2 world where cudaMemAdvise_v2 exists and we can specify numa affinity?

Not sure. At least in GH200, there is only one NUMA node on the CPU side, so it doesn't really matter.

wence- · 2024-06-13T14:10:46Z

include/rmm/mr/device/system_memory_resource.hpp

+   * However, when GPU memory is over-subscribed, system allocated memory will migrate to the GPU
+   * and cannot migrate back, thus causing other CUDA calls to fail with out-of-memory errors. To
+   * work around this problem, we can reserve some GPU memory as headroom for other CUDA calls.
+   * Doing this check can be expensive, so only large buffer above the given threshold will be
+   * checked.


nit: I don't think I quite follow this documentation. The joining However conjunction doesn't seem to follow from the previous paragraph.

For example, what is "this check"?

This is moved, removed "however".

wence- · 2024-06-13T14:13:20Z

include/rmm/mr/device/system_memory_resource.hpp

+   * Two cuda_memory_resources always compare equal, because they can each
+   * deallocate memory allocated by the other.


issue: This isn't a cuda_memory_resource though.

I think for equality, the requirement RMM tends to have is that pointers allocated by memory resource A can be deallocated by memory resource B. But here we are stricter, since we also require that the headroom and threshold are identical. Is this desired?
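
The two equality rules being contrasted can be sketched as follows. `adaptor_config` and both function names are hypothetical, and the "always interchangeable" assumption is made only for illustration; this is not the PR's `do_is_equal`.

```cpp
#include <cstddef>

// Hypothetical stand-in for a resource carrying the two settings mentioned above.
struct adaptor_config {
  std::size_t headroom;
  std::size_t threshold;
};

// Looser rule: equal whenever either instance could free the other's memory.
// Assumed to always hold here, since both would sit on the same system allocator.
constexpr bool equal_if_interchangeable(adaptor_config const&, adaptor_config const&) noexcept
{
  return true;
}

// Stricter rule: additionally require identical settings.
constexpr bool equal_if_same_config(adaptor_config const& lhs,
                                    adaptor_config const& rhs) noexcept
{
  return lhs.headroom == rhs.headroom && lhs.threshold == rhs.threshold;
}
```

Under the looser rule, two resources with different headroom values still compare equal; under the stricter rule they do not, even though either could deallocate the other's pointers.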

wence- · 2024-06-13T16:23:30Z

/ok to test

wence-

Thanks for the updates, two minor docstring suggestions from me.

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp

wence- · 2024-06-14T08:31:48Z

/ok to test

harrism · 2024-06-15T21:57:40Z

/ok to test

Does this mean @rongou doesn't have commit signing set up?

wence- · 2024-06-17T08:17:47Z

/ok to test

Does this mean @rongou doesn't have commit signing set up?

Looks like it, yes. @rongou: these docs describe how to set this up: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification

wence- · 2024-06-17T08:19:06Z

/ok to test

rongou · 2024-06-17T19:46:54Z

Ok, signed the commits.

harrism

Nice work. One design discussion we should have, and a bunch of details.

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp

harrism · 2024-06-17T23:14:21Z

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp

+ * be expensive, checking whether to set the preferred location is only done above the specified
+ * allocation threshold size.
+ *
+ * Note that if threshold size is non-zero, then an application making many small allocations can


Because of this, I wonder if instead of a threshold parameter, you should use a binning_memory_resource, and encode the threshold as the maximum allocation size for the lower bin (or second highest, if more than two bins). All allocations in the larger bin (above the threshold) will do the headroom check. All allocations in the smaller bin(s) won't. You could even tune the smaller bins by using multiple bins, some or all of which might not even use SAM (so they won't eat into the headroom).

Thoughts?

Ok, I think we can leave the threshold checking to the caller. At least in the current use case, we want to configure the cupy allocator to use this, and it'd only be used for large buffers.
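
A caller-side version of the threshold decision could look roughly like the sketch below. The enum and function are hypothetical names used only to show the routing rule; they are not part of RMM or of this PR.

```cpp
#include <cstddef>

// Hypothetical tags naming the two paths discussed above.
enum class alloc_path { plain, headroom_checked };

// Hypothetical caller-side policy: only requests at or above `threshold` bytes
// take the path that performs the (comparatively expensive) headroom check.
constexpr alloc_path choose_path(std::size_t bytes, std::size_t threshold) noexcept
{
  return bytes >= threshold ? alloc_path::headroom_checked : alloc_path::plain;
}
```

A caller would then hand small requests to a plain resource and large ones to the headroom adaptor, which is the same split a binning resource would encode as bin sizes.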

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp

harrism · 2024-06-17T23:29:43Z

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp

+   */
+  void* do_allocate(std::size_t bytes, [[maybe_unused]] cuda_stream_view stream) override
+  {
+    auto const aligned{rmm::align_up(bytes, rmm::CUDA_ALLOCATION_ALIGNMENT)};


For SAM, shouldn't we be using HOST alignment?

I went back and did some experiments with both HMM and ATS, looks like we don't actually need to do additional alignments. Removed.

I got more recent advice from the CUDA team and they suggested we should use CUDA alignment if it will be accessed on the device. If nothing else, for cache line alignment this is important.

How do we do this alignment? Do we need to add some padding as in aligned_host_allocate?

Changed to do host MR style alignment, but with CUDA alignment. PTAL
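
For readers unfamiliar with the "host MR style" approach, the padding technique can be sketched as below: over-allocate, round the address up, and store the distance back to the original pointer just below the aligned address. The function names and the 256-byte constant are assumptions for illustration; RMM's own `aligned_host_allocate` and `CUDA_ALLOCATION_ALIGNMENT` may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Assumed value for illustration; RMM defines its own rmm::CUDA_ALLOCATION_ALIGNMENT.
constexpr std::size_t kAlignment = 256;

// Hypothetical helper: allocate `bytes` aligned to `alignment` (a power of two).
inline void* aligned_allocate(std::size_t bytes, std::size_t alignment = kAlignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  // Padding covers the worst-case round-up plus the stored offset.
  std::size_t const padded = bytes + alignment + sizeof(std::ptrdiff_t);
  char* const original     = static_cast<char*>(::operator new(padded));

  // Leave room for the offset, then round up to the next aligned address.
  auto const base    = reinterpret_cast<std::uintptr_t>(original);
  auto const start   = base + sizeof(std::ptrdiff_t);
  auto const rounded = (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
  char* const aligned = original + (rounded - base);

  // Remember how far we moved so deallocation can recover the original pointer.
  std::ptrdiff_t const offset = aligned - original;
  std::memcpy(aligned - sizeof(offset), &offset, sizeof(offset));
  return aligned;
}

// Hypothetical counterpart: read the stored offset and free the original block.
inline void aligned_deallocate(void* ptr) noexcept
{
  char* const aligned = static_cast<char*>(ptr);
  std::ptrdiff_t offset{};
  std::memcpy(&offset, aligned - sizeof(offset), sizeof(offset));
  ::operator delete(aligned - offset);
}
```

The cost is up to `alignment + sizeof(std::ptrdiff_t)` extra bytes per allocation, which is the padding asked about above.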

harrism · 2024-06-17T23:33:18Z

include/rmm/mr/device/system_memory_resource.hpp

+namespace rmm::mr {
+
+namespace detail {
+struct sam {


I think we should spell it out in the name of the struct. I like to avoid unnecessary abbreviation in code.

harrism · 2024-06-17T23:34:45Z

include/rmm/mr/device/system_memory_resource.hpp

+
+namespace detail {
+/** @brief Struct to check if system allocated memory (SAM) is supported. */
+struct sam {


question: does this need to be a struct? It looks like it could just be a static function?

Yeah it's just a single function for now. Removed the struct.

include/rmm/mr/device/system_memory_resource.hpp

tests/CMakeLists.txt

harrism · 2024-06-18T00:41:27Z

tests/mr/device/system_mr_tests.cu

+  mr.deallocate(ptr, size_mb);
+}
+
+TEST(SAMHeadroomAdaptorTest, ThrowIfNotSupported)


Nit: perhaps group this with the other ThrowIfNotSupported test above.

Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>

harrism

One copy-pasto

harrism · 2024-06-24T21:00:30Z

include/rmm/mr/device/system_memory_resource.hpp

+  friend void get_property(system_memory_resource const&, cuda::mr::device_accessible) noexcept {}
+
+  /**
+   * @brief Enables the `cuda::mr::device_accessible` property


Suggested change

-   * @brief Enables the `cuda::mr::device_accessible` property
+   * @brief Enables the `cuda::mr::host_accessible` property

harrism

Thanks for addressing all of my feedback. Approving, with a couple of comments.

harrism · 2024-06-24T22:41:59Z

include/rmm/mr/device/sam_headroom_resource_adaptor.hpp

+  explicit sam_headroom_resource_adaptor(Upstream* upstream, std::size_t headroom)
+    : upstream_{upstream}, headroom_{headroom}
+  {
+    static_assert(std::is_same_v<system_memory_resource, Upstream>,


Seems like this could work with any prefetchable upstream, including a SAM MR that is already wrapped with another adaptor.

This is specifically designed to address the current shortcomings of SAM, so I'm not sure if it's useful for other things (e.g. managed memory). You are right about the adaptors though, but I'm leaning towards being more strict to avoid misuse. What do you think?

Yeah, OK. We can always loosen it if needed.

include/rmm/mr/device/system_memory_resource.hpp

harrism · 2024-06-25T03:10:40Z

tests/mr/device/mr_ref_test.hpp

 {
  try {
    void* ptr = ref.allocate(bytes);
    EXPECT_NE(nullptr, ptr);
    EXPECT_TRUE(is_properly_aligned(ptr));
-    EXPECT_TRUE(is_device_accessible_memory(ptr));
+    if (not is_system_mr) { EXPECT_TRUE(is_device_accessible_memory(ptr)); }


If it's not device accessible memory, you shouldn't even be running these tests on it. If it's device accessible but is_device_accessible_memory returns false, then we should investigate why it returns false and fix it. I don't really like the extra boolean parameter, the allocation tests should be generic.

harrism · 2024-06-25T03:12:39Z

include/rmm/mr/device/system_memory_resource.hpp

+    try {
+      return rmm::detail::aligned_host_allocate(
+        bytes, CUDA_ALLOCATION_ALIGNMENT, [](std::size_t size) { return ::operator new(size); });
+    } catch (std::bad_alloc const&) {


What is the purpose of this, is it just changing the exception type? I think you should try to incorporate the what() from the bad_alloc into the new exception.
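
One way to carry the original message across the exception translation is sketched below. `out_of_memory` and `allocate_or_throw` are hypothetical stand-ins, not RMM's exception type or the code that was eventually committed.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library-specific out-of-memory exception.
struct out_of_memory : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical wrapper: run an allocation function and translate std::bad_alloc,
// keeping the original what() text in the new message.
template <typename Alloc>
void* allocate_or_throw(std::size_t bytes, Alloc alloc)
{
  try {
    return alloc(bytes);
  } catch (std::bad_alloc const& e) {
    throw out_of_memory{"failed to allocate " + std::to_string(bytes) +
                        " bytes: " + e.what()};
  }
}
```

Callers that catch the library exception then still see whatever detail the standard library put into `bad_alloc::what()`, along with the requested size.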

harrism · 2024-06-26T01:36:26Z

include/rmm/mr/device/system_memory_resource.hpp

+ *  more information, see
+ *  https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/.
+ */
+class system_memory_resource final : public device_memory_resource {


We have a use case where we want a system_memory_resource that is not limited to systems that support paging. So we don't want to limit it to is_system_memory_supported(device). Instead, we want to limit the device_accessible property to be true on systems where the memory is device accessible. I'm currently checking whether the device_accessible property can be used as a dynamic property. I don't think so. :(

I guess technically you could have multiple GPUs on a system and only some of them support SAM (e.g. a Turing and a Pascal), so the dynamic property also needs a device.

@harrism can this be merged now or do you still want to get the device_accessible property resolved first?

Still waiting, have pinged @jrhemstad for help. I think since MRs are implicitly tied to the current device, it can query the current device, doesn't need to take a parameter.

device_accessible is not a dynamic property.

For this kind of functionality, we'd need to add a new property like maybe_device_accessible where get_property(mr, cuda::maybe_device_accessible) returns a bool indicating whether it is device accessible.

Can we merge this now? We can always enhance it when we have a dynamic device_accessible.

@jrhemstad that would make this MR unusable with a resource_ref that expects device_accessible.

@rongou Perhaps. I'm troubled that something called a system_memory_resource is only usable on systems that support paging memory to device.

It's a system_memory_resource under the device namespace, so it's probably reasonable to assume it only works on environments that support accessing system allocated memory from a device.

There is no device namespace, and device_memory_resource and host_memory_resource base classes are planned for removal.

harrism · 2024-07-01T20:38:15Z

tests/mr/device/mr_ref_tests.cpp

@@ -32,6 +32,7 @@ INSTANTIATE_TEST_SUITE_P(ResourceTests,
                                           mr_factory{"CUDA_Async", &make_cuda_async},
 #endif
                                           mr_factory{"Managed", &make_managed},
+                                           mr_factory{"System", &make_system},


I guess this will cause the tests to fail on systems that don't support pageable memory access?

Tests are skipped if SAM is not supported.

harrism · 2024-07-01T20:50:15Z

/merge

Follow up on #1581 to add access to the system memory resource in Python. Fixes #1622

Authors:
- Rong Ou (https://github.com/rongou)

Approvers:
- Mark Harris (https://github.com/harrism)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #1605

rongou requested review from a team as code owners June 11, 2024 20:15

rongou requested review from harrism and wence- June 11, 2024 20:15

github-actions bot added CMake cpp Pertains to C++ code labels Jun 11, 2024

rongou added feature request New feature or request non-breaking Non-breaking change labels Jun 11, 2024

KyleFromNVIDIA approved these changes Jun 11, 2024


wence- requested changes Jun 13, 2024


rongou requested a review from wence- June 13, 2024 19:56

wence- approved these changes Jun 14, 2024


rongou force-pushed the sam-mr branch from b7be4a9 to 06ae5ce Compare June 17, 2024 19:34

rongou requested review from a team as code owners June 17, 2024 19:34

rongou requested a review from raydouglass June 17, 2024 19:34

github-actions bot added Python Related to RMM Python API ci labels Jun 17, 2024

rongou added 5 commits June 17, 2024 12:43

initial implementation

1393cc8

check is supported

ef0f245

test passthrough

30f7f8b

more tests and docs

3bc7ef1

revert async mr test

0478446

github-actions bot removed the ci label Jun 17, 2024

fix docs

eaa3fbd

harrism requested changes Jun 18, 2024


rongou and others added 3 commits June 18, 2024 11:56

Apply suggestions from code review

323254e

Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>

fix compile

575fea8

review comments

d85cfd3

rongou requested a review from harrism June 19, 2024 00:31

rongou added 3 commits June 24, 2024 11:35

Merge remote-tracking branch 'upstream/branch-24.08' into sam-mr

23a8f26

fix copyright

5ca8db0

fix formatting

9e1432f

harrism reviewed Jun 24, 2024


rongou added 2 commits June 24, 2024 14:39

fix doc on cuda::mr::host_accessible

7b5e92b

align system allocated memory with cuda alignment

0f6bd42

harrism approved these changes Jun 24, 2024


rongou added 2 commits June 24, 2024 16:45

fix tests on grace hopper

7c360b1

fix copyright in mr_ref_multithreaded_tests.cpp

343d0d6

harrism requested changes Jun 25, 2024


rongou added 2 commits June 25, 2024 14:00

sam is device accessible

af0157e

incorporate bad_alloc message

d06645b

harrism reviewed Jun 26, 2024


Merge remote-tracking branch 'upstream/branch-24.08' into sam-mr

aa8c868

harrism reviewed Jul 1, 2024


harrism approved these changes Jul 1, 2024


rapids-bot bot merged commit a727a03 into rapidsai:branch-24.08 Jul 1, 2024
57 checks passed

rongou mentioned this pull request Jul 1, 2024

[FEA] CI should support HMM or Grace Hopper #1599

Open

rongou mentioned this pull request Jul 10, 2024

Add python wrapper for system memory resource #1605

Merged

3 tasks

harrism mentioned this pull request Jul 22, 2024

[FEA] Python bindings for system_memory_resource and sam_headroom_memory_resource #1622

Closed
