
cuda_async_memory_resource built on cudaMallocAsync #676

Merged: 13 commits merged into rapidsai:branch-0.18 on Jan 27, 2021

Conversation

@harrism (Member) commented on Jan 19, 2021:

This PR adds a new device memory resource, cuda_async_memory_resource, which uses cudaMallocAsync.

Closes #671

Merging this also depends on CI support for CUDA 11.2

TODO:

  • Extend tests and benchmarks to exercise the new resource
  • Implement get_mem_info correctly.
  • Consider a constructor which takes a CUDA memory pool handle to use (currently uses the default pool) Edit: leave this for a followup because pools have multiple parameters and requirements aren't clear.
  • Test on a system without cudaMallocAsync support to verify that compiling with CUDA 11.2 but running on an earlier version fails gracefully

@harrism added the "feature request (New feature or request)" and "3 - Ready for review (Ready for review by team)" labels on Jan 19, 2021
@harrism requested a review from a team as a code owner on Jan 19, 2021
@harrism self-assigned this on Jan 19, 2021
@harrism added this to PR-WIP in v0.18 Release via automation on Jan 19, 2021
@harrism moved this from PR-WIP to PR-Needs review in v0.18 Release on Jan 19, 2021
@harrism added the "non-breaking (Non-breaking change)" label on Jan 19, 2021

#include <rmm/mr/device/cuda_memory_resource.hpp>

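// From the discussion below: fallback alias used when cudaMallocAsync support is unavailable at compile time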
using cuda_async_memory_resource = rmm::mr::cuda_memory_resource;
@kkraus14 (Contributor) commented:
Are we sure we want to do this? Things like supports_streams would return differently, which means someone would potentially need to special-case all usage of cuda_async_memory_resource. Should we just have it throw at runtime if someone tries to create a cuda_async_memory_resource?

@harrism (Member, Author) replied:

We can't just throw at runtime; we also have to disable it at compile time. I can just remove this alias, but then I have to modify all of my tests and benchmarks to also check at compile time.

supports_streams is going away. Nobody uses it and we are redesigning the base interface.

@kkraus14 (Contributor) replied:

> I can just remove this alias, but then I have to modify all of my tests and benchmarks to also check at compile time.

Everyone who builds against RMM, wants to support multiple CUDA versions in a single binary, and might use cuda_async_memory_resource would then have to do the same thing, which is arguably a bit burdensome?

My thought is that throwing at runtime would allow someone to choose the best allocator available to them at runtime.

@harrism (Member, Author) replied:

I don't fully understand. Are you suggesting that, in the non-CUDA-11.2 compilation path, we define a class cuda_async_memory_resource that just throws in the constructor?

This does have the problem that tests and benchmarks also have to check for compatibility before trying to construct it, but OK.

@kkraus14 (Contributor) commented on Jan 19, 2021:

> Are you suggesting that, in the non-CUDA-11.2 compilation path, we define a class cuda_async_memory_resource that just throws in the constructor?

Yes, but in my ideal world it's a runtime path as opposed to a compilation path so that a user can upgrade their CUDA toolkit runtime version without needing to rebuild the entire dependency tree below RMM.

Then at runtime they could do something like:

try:
    mr = cuda_async_memory_resource()
except (UnsupportedCUDARuntimeVersion, UnsupportedCUDADriverVersion):
    mr = some_other_memory_resource()
    
...

Then we end up with the following scenarios:

  • Built with CUDA 11.2+ runtime
    • Running with CUDA 11.2+ runtime and driver: cuda_async_memory_resource is used
    • Running with < CUDA 11.2 runtime OR driver: some_other_memory_resource is used
  • Built with CUDA 11.0 runtime
    • Running with CUDA 11.2+ runtime and driver: cuda_async_memory_resource is used
    • Running with < CUDA 11.2 runtime or driver: some_other_memory_resource is used

@harrism (Member, Author) replied:

OK, so I changed it to throw a runtime exception in all cases where cudaMallocAsync was not available at compile time, and added a test for this. That said, this is the behavior you will be able to achieve (3/4 of what you requested):

  • Built with CUDA 11.2+ runtime
    • Running with CUDA 11.2+ runtime and driver: cuda_async_memory_resource is used
    • Running with < CUDA 11.2 runtime OR driver: some_other_memory_resource is used
  • Built with CUDA 11.0 runtime
    • some_other_memory_resource is used
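For illustration, a minimal sketch of this constructor behavior (the class name is hypothetical; CUDA_MALLOC_ASYNC_SUPPORT is the macro from the snippet quoted later in this thread, and the exception type and driver check are assumptions, not necessarily the merged code):

// Sketch only; the merged implementation may differ.
#include <cuda_runtime_api.h>
#include <stdexcept>

class cuda_async_memory_resource_sketch {
 public:
  cuda_async_memory_resource_sketch()
  {
#ifdef CUDA_MALLOC_ASYNC_SUPPORT
    // Compiled with CUDA 11.2+: still verify the *driver* supports memory pools.
    int device{};
    cudaGetDevice(&device);
    int pools_supported{};
    cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);
    if (!pools_supported) {
      throw std::runtime_error("cudaMallocAsync is not supported by the current driver");
    }
#else
    // Compiled with an older toolkit: the cudaMallocAsync APIs don't exist, so always throw.
    throw std::runtime_error("cuda_async_memory_resource requires building with CUDA 11.2+");
#endif
  }
};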

@kkraus14 (Contributor) replied:

> This one is not possible. It's not possible for cuda_async_memory_resource to function if it was not built with CUDA 11.2+ because the APIs aren't there for it to call.

Isn't it possible with the dlopen / dlsym approach, since symbols are resolved at runtime as opposed to compile time? The development cost is that we'd likely have to declare our own signature for cudaMallocAsync, and the runtime cost is that we'd dynamically load the library and have to version-check somewhere.
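For illustration, a rough sketch of the dlopen / dlsym idea (assumptions: POSIX dlfcn, the CUDA 11.x runtime soname libcudart.so.11.0, and a hand-declared function signature; this is not what the PR implements):

// Sketch of runtime symbol resolution; not part of this PR.
#include <dlfcn.h>
#include <cstddef>

// Hand-declared signature mirroring cudaMallocAsync. cudaError_t is an enum
// (int-compatible) and cudaStream_t is an opaque pointer, so these stand-ins work.
using cudaMallocAsync_fn = int (*)(void** ptr, std::size_t size, void* stream);

cudaMallocAsync_fn load_cuda_malloc_async()
{
  // Soname of the CUDA 11.x runtime; adjust for the deployment environment.
  if (void* handle = dlopen("libcudart.so.11.0", RTLD_LAZY)) {
    return reinterpret_cast<cudaMallocAsync_fn>(dlsym(handle, "cudaMallocAsync"));
  }
  return nullptr;  // library or symbol unavailable: caller falls back to another resource
}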

@kkraus14 (Contributor) added:

Also, I'm perfectly fine if the answer is that this is something we don't want to support, but I want to make sure we have a proper discussion before we arrive at that conclusion.

@harrism (Member, Author) commented on Jan 19, 2021:

Maybe. My goal was to avoid that. With this simpler approach, by building our CUDA 11.x artifacts going forward using CUDA 11.2, the majority of our users get the behavior you desire. Only those who build from source with an older CUDA 11.x version don't get it.

@kkraus14 (Contributor) replied on Jan 19, 2021:

Fair enough. Would welcome other thoughts here before we merge, but that seems like a completely reasonable tradeoff to me.

cc @rongou @germasch @jrhemstad @leofang for thoughts

@harrism requested a review from a team as a code owner on Jan 19, 2021
@leofang (Member) left a comment:
Hi, just a couple of naive questions for me to learn rmm internals better 🙂

cuda_async_memory_resource()
{
#ifdef CUDA_MALLOC_ASYNC_SUPPORT
// Check if cudaMallocAsync memory pools are supported
Contributor:
I think we want to create our own cudaMemPool_t with cudaMemPoolCreate and set it as the default pool here. I believe this will be necessary so we can enable IPC.

@harrism (Member, Author) replied:

Hmmm, that's debatable... I think the default should be to use the default pool. Then we should provide other constructors with parameters to specify creating a new pool (with options), and/or using a passed pool handle.

Member:

I like the idea of defaulting to the default pool, though if non-default pools are in use, we can't call cudaMallocAsync in do_allocate but should instead call cudaMallocFromPoolAsync, so I think it's better to just use the latter and always attempt to get & pass the pool handle.

do_deallocate is fine, though, because there is no "cudaFreeFromPoolAsync", just cudaFreeAsync.

Member:

I was revisiting the doc and began wondering: if I call cudaDeviceSetMemPool first, will subsequent calls to cudaMallocAsync draw memory from the specified pool? The doc isn't clear about this 😢

Member:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g02bb50d2e2af83e5f24d0b2ab9dadc82

> The memory pool must be local to the specified device. Unless a mempool is specified in the cudaMallocAsync call, cudaMallocAsync allocates from the current mempool of the provided stream's device. By default, a device's current memory pool is its default memory pool.

So it looks like it's OK...
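To make the quoted behavior concrete, here is a small sketch (CUDA 11.2+ APIs; the function name is illustrative and error handling is omitted):

#include <cuda_runtime_api.h>

void current_pool_example(cudaStream_t stream)
{
  int device{};
  cudaGetDevice(&device);

  // Create a pool local to this device (zero-initialized props, then the required fields).
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.handleTypes   = cudaMemHandleTypeNone;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id   = device;
  cudaMemPool_t pool{};
  cudaMemPoolCreate(&pool, &props);

  // Make it the device's *current* pool...
  cudaDeviceSetMemPool(device, pool);

  // ...after which a plain cudaMallocAsync (no pool argument) draws from `pool`.
  void* ptr{};
  cudaMallocAsync(&ptr, 1024, stream);
  cudaFreeAsync(ptr, stream);
}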

Contributor:

This is going to require more thought in future PRs, because I'm still not sure about this. My current thinking is that we should create a new cudaMemPool_t held by this resource, not set it as the default, and instead just use cudaMallocFromPoolAsync.

Member:

Regardless of whether you need to create a new pool (I think that's a design choice up to RMM), my current understanding is that it is indeed best to keep a cudaMemPool_t handle and always call cudaMallocFromPoolAsync, if you're concerned that other applications might overwrite the device's current pool via cudaDeviceSetMemPool.
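As an illustration of that suggestion, here is a minimal sketch of a resource that holds its own pool handle and always allocates from it, so a later cudaDeviceSetMemPool by another component cannot redirect its allocations (the class name and structure are hypothetical, not the code merged in this PR):

// Sketch only; not the code merged in this PR.
#include <cuda_runtime_api.h>
#include <cstddef>

class pool_pinned_async_resource {
 public:
  explicit pool_pinned_async_resource(cudaMemPool_t pool) : pool_{pool} {}

  void* allocate(std::size_t bytes, cudaStream_t stream)
  {
    void* ptr{};
    // Always allocate from our held pool, ignoring the device's
    // "current" pool, which another component may have changed.
    cudaMallocFromPoolAsync(&ptr, bytes, pool_, stream);
    return ptr;
  }

  void deallocate(void* ptr, cudaStream_t stream)
  {
    // There is no cudaFreeFromPoolAsync; cudaFreeAsync routes the
    // memory back to the pool it was allocated from.
    cudaFreeAsync(ptr, stream);
  }

 private:
  cudaMemPool_t pool_;
};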

v0.18 Release automation moved this from PR-Needs review to PR-Reviewer approved Jan 27, 2021
@kkraus14 (Contributor) left a comment:
cmake lgtm

@leofang (Member) left a comment:

cuda_async_memory_resource.hpp LGTM. I learned a lot, thanks 🙏

@rongou (Contributor) left a comment:

Just curious, do you have any performance numbers? How does this compare with the event synchronized pool?

@kkraus14 (Contributor) commented:
@gpucibot merge

@rapids-bot (bot) merged commit afe237c into rapidsai:branch-0.18 on Jan 27, 2021
v0.18 Release automation moved this from PR-Reviewer approved to Done Jan 27, 2021
@harrism (Member, Author) commented on Jan 27, 2021:

> Just curious, do you have any performance numbers? How does this compare with the event synchronized pool?

2-3x slower than the binning_memory_resource with fixed_size and pool bins. With optimisations in another open PR the pool becomes even faster.
