
cuda_async_memory_resource built on cudaMallocAsync #676

Merged: 13 commits merged into rapidsai:branch-0.18 on Jan 27, 2021

Conversation

@harrism (Member) commented on Jan 19, 2021:

This PR adds a new device memory resource, cuda_async_memory_resource, which uses cudaMallocAsync.

Closes #671

Merging this also depends on CI support for CUDA 11.2

TODO:

  • Extend tests and benchmarks to exercise the new resource
  • Implement get_mem_info correctly.
  • Consider a constructor which takes a CUDA memory pool handle to use (currently uses the default pool) Edit: leave this for a followup because pools have multiple parameters and requirements aren't clear.
  • Test on a system without cudaMallocAsync support to verify that compiling with CUDA 11.2 but running on an earlier version fails gracefully

@harrism added the "feature request (New feature or request)" and "3 - Ready for review (Ready for review by team)" labels on Jan 19, 2021
@harrism requested a review from a team as a code owner on Jan 19, 2021
@harrism self-assigned this on Jan 19, 2021
@harrism added this to PR-WIP in v0.18 Release via automation on Jan 19, 2021
@harrism moved this from PR-WIP to PR-Needs review in v0.18 Release on Jan 19, 2021
@harrism added the "non-breaking (Non-breaking change)" label on Jan 19, 2021

#include <rmm/mr/device/cuda_memory_resource.hpp>

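// From the discussion below: fallback alias used when cudaMallocAsync support is unavailable at compile time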
using cuda_async_memory_resource = rmm::mr::cuda_memory_resource;
@kkraus14 (Contributor) commented:
Are we sure we want to do this? Things like supports_streams would return differently, which means someone would potentially need to special-case all usage of cuda_async_memory_resource. Should we just have it throw at runtime if someone tries to create a cuda_async_memory_resource?

@harrism (Member, Author) replied:

We can't just throw at runtime; we also have to disable it at compile time. I can just remove this alias, but then I have to modify all of my tests and benchmarks to also check at compile time.

supports_streams is going away. Nobody uses it and we are redesigning the base interface.

@kkraus14 (Contributor) replied:

> I can just remove this alias, but then I have to modify all of my tests and benchmarks to also check at compile time.

Everyone who builds against RMM, wants to support multiple CUDA versions in a single binary, and might use cuda_async_memory_resource would then have to do the same thing, which is arguably a bit burdensome?

My thought is that throwing at runtime would allow someone to choose the best allocator available to them at runtime.

@harrism (Member, Author) replied:

I don't fully understand. Are you suggesting that, in the non-CUDA-11.2 compilation path, we define a class cuda_async_memory_resource that just throws in the constructor?

This does have the problem that tests and benchmarks also have to check for compatibility before trying to construct it, but OK.

@kkraus14 (Contributor) commented on Jan 19, 2021:

> Are you suggesting that, in the non-CUDA-11.2 compilation path, we define a class cuda_async_memory_resource that just throws in the constructor?

Yes, but in my ideal world it's a runtime path as opposed to a compilation path so that a user can upgrade their CUDA toolkit runtime version without needing to rebuild the entire dependency tree below RMM.

Then at runtime they could do something like:

try:
    mr = cuda_async_memory_resource()
except (UnsupportedCUDARuntimeVersion, UnsupportedCUDADriverVersion):
    mr = some_other_memory_resource()
    
...

Then we end up with the following scenarios:

  • Built with CUDA 11.2+ runtime
    • Running with CUDA 11.2+ runtime and driver: cuda_async_memory_resource is used
    • Running with < CUDA 11.2 runtime OR driver: some_other_memory_resource is used
  • Built with CUDA 11.0 runtime
    • Running with CUDA 11.2+ runtime and driver: cuda_async_memory_resource is used
    • Running with < CUDA 11.2 runtime or driver: some_other_memory_resource is used

@harrism (Member, Author) replied:

OK, so I changed it to throw a runtime exception in all cases where cudaMallocAsync was not available at compile time, and added a test for this. That said, this is the behavior you will be able to achieve (3/4 of what you requested):

  • Built with CUDA 11.2+ runtime
    • Running with CUDA 11.2+ runtime and driver: cuda_async_memory_resource is used
    • Running with < CUDA 11.2 runtime OR driver: some_other_memory_resource is used
  • Built with CUDA 11.0 runtime
    • some_other_memory_resource is used
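For illustration, a minimal sketch of this constructor behavior (the class name is hypothetical; CUDA_MALLOC_ASYNC_SUPPORT is the macro from the snippet quoted later in this thread, and the exception type and driver check are assumptions, not necessarily the merged code):

// Sketch only; the merged implementation may differ.
#include <cuda_runtime_api.h>
#include <stdexcept>

class cuda_async_memory_resource_sketch {
 public:
  cuda_async_memory_resource_sketch()
  {
#ifdef CUDA_MALLOC_ASYNC_SUPPORT
    // Compiled with CUDA 11.2+: still verify the *driver* supports memory pools.
    int device{};
    cudaGetDevice(&device);
    int pools_supported{};
    cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);
    if (!pools_supported) {
      throw std::runtime_error("cudaMallocAsync is not supported by the current driver");
    }
#else
    // Compiled with an older toolkit: the cudaMallocAsync APIs don't exist, so always throw.
    throw std::runtime_error("cuda_async_memory_resource requires building with CUDA 11.2+");
#endif
  }
};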

@kkraus14 (Contributor) replied:

> This one is not possible. It's not possible for cuda_async_memory_resource to function if it was not built with CUDA 11.2+ because the APIs aren't there for it to call.

Isn't it possible with the dlopen / dlsym approach, since symbols are resolved at runtime as opposed to compile time? The development cost is that we'd likely have to declare our own signature for cudaMallocAsync, and the runtime cost is that we'd dynamically load the library and have to version-check somewhere.
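For illustration, a rough sketch of the dlopen / dlsym idea (assumptions: POSIX dlfcn, the CUDA 11.x runtime soname libcudart.so.11.0, and a hand-declared function signature; this is not what the PR implements):

// Sketch of runtime symbol resolution; not part of this PR.
#include <dlfcn.h>
#include <cstddef>

// Hand-declared signature mirroring cudaMallocAsync. cudaError_t is an enum
// (int-compatible) and cudaStream_t is an opaque pointer, so these stand-ins work.
using cudaMallocAsync_fn = int (*)(void** ptr, std::size_t size, void* stream);

cudaMallocAsync_fn load_cuda_malloc_async()
{
  // Soname of the CUDA 11.x runtime; adjust for the deployment environment.
  if (void* handle = dlopen("libcudart.so.11.0", RTLD_LAZY)) {
    return reinterpret_cast<cudaMallocAsync_fn>(dlsym(handle, "cudaMallocAsync"));
  }
  return nullptr;  // library or symbol unavailable: caller falls back to another resource
}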

@kkraus14 (Contributor) added:

Also, I'm perfectly fine if the answer is that this is something we don't want to support, but I want to make sure we have a proper discussion before we arrive at that conclusion.

@harrism (Member, Author) commented on Jan 19, 2021:

Maybe. My goal was to avoid that. With this simpler approach, by building our CUDA 11.x artifacts going forward using CUDA 11.2, the majority of our users get the behavior you desire. Only those who build from source with an older CUDA 11.x version don't get it.

@kkraus14 (Contributor) replied on Jan 19, 2021:

Fair enough. Would welcome other thoughts here before we merge, but that seems like a completely reasonable tradeoff to me.

cc @rongou @germasch @jrhemstad @leofang for thoughts

@harrism requested a review from a team as a code owner on Jan 19, 2021
@leofang (Member) left a comment:
Hi, just a couple of naive questions for me to learn rmm internals better 🙂

cuda_async_memory_resource()
{
#ifdef CUDA_MALLOC_ASYNC_SUPPORT
// Check if cudaMallocAsync memory pools are supported
Contributor:
I think we want to create our own cudaMemPool_t with cudaMemPoolCreate and set it as the default pool here. I believe this will be necessary so we can enable IPC.

@harrism (Member, Author) replied:

Hmmm, that's debatable... I think the default should be to use the default pool. Then we should provide other constructors with parameters to specify creating a new pool (with options), and/or using a passed pool handle.

Member:

I like the idea of defaulting to the default pool, though if non-default pools are in use, we can't call cudaMallocAsync in do_allocate but should instead call cudaMallocFromPoolAsync, so I think it's better to just use the latter and always attempt to get & pass the pool handle.

do_deallocate is fine, though, because there is no "cudaFreeFromPoolAsync", just cudaFreeAsync.

Member:

I was revisiting the doc and began wondering: if I call cudaDeviceSetMemPool first, will subsequent calls to cudaMallocAsync draw memory from the specified pool? The doc isn't clear about this 😢

Member:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g02bb50d2e2af83e5f24d0b2ab9dadc82

> The memory pool must be local to the specified device. Unless a mempool is specified in the cudaMallocAsync call, cudaMallocAsync allocates from the current mempool of the provided stream's device. By default, a device's current memory pool is its default memory pool.

So it looks like it's OK...
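To make the quoted behavior concrete, here is a small sketch (CUDA 11.2+ APIs; the function name is illustrative and error handling is omitted):

#include <cuda_runtime_api.h>

void current_pool_example(cudaStream_t stream)
{
  int device{};
  cudaGetDevice(&device);

  // Create a pool local to this device (zero-initialized props, then the required fields).
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.handleTypes   = cudaMemHandleTypeNone;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id   = device;
  cudaMemPool_t pool{};
  cudaMemPoolCreate(&pool, &props);

  // Make it the device's *current* pool...
  cudaDeviceSetMemPool(device, pool);

  // ...after which a plain cudaMallocAsync (no pool argument) draws from `pool`.
  void* ptr{};
  cudaMallocAsync(&ptr, 1024, stream);
  cudaFreeAsync(ptr, stream);
}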

Contributor:

This is going to require more thought in future PRs, because I'm still not sure about this. My current thinking is that we should create a new cudaMemPool_t held by this resource, not set it as the default, and instead just use cudaMallocFromPoolAsync.

Member:

Regardless of whether you need to create a new pool (I think that's a design choice up to RMM), my current understanding is that it is indeed best to keep a cudaMemPool_t handle and always call cudaMallocFromPoolAsync, if you're concerned that other applications might overwrite the device's current pool via cudaDeviceSetMemPool.
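As an illustration of that suggestion, here is a minimal sketch of a resource that holds its own pool handle and always allocates from it, so a later cudaDeviceSetMemPool by another component cannot redirect its allocations (the class name and structure are hypothetical, not the code merged in this PR):

// Sketch only; not the code merged in this PR.
#include <cuda_runtime_api.h>
#include <cstddef>

class pool_pinned_async_resource {
 public:
  explicit pool_pinned_async_resource(cudaMemPool_t pool) : pool_{pool} {}

  void* allocate(std::size_t bytes, cudaStream_t stream)
  {
    void* ptr{};
    // Always allocate from our held pool, ignoring the device's
    // "current" pool, which another component may have changed.
    cudaMallocFromPoolAsync(&ptr, bytes, pool_, stream);
    return ptr;
  }

  void deallocate(void* ptr, cudaStream_t stream)
  {
    // There is no cudaFreeFromPoolAsync; cudaFreeAsync routes the
    // memory back to the pool it was allocated from.
    cudaFreeAsync(ptr, stream);
  }

 private:
  cudaMemPool_t pool_;
};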

v0.18 Release automation moved this from PR-Needs review to PR-Reviewer approved Jan 27, 2021
@kkraus14 (Contributor) left a comment:
cmake lgtm

@leofang (Member) left a comment:

cuda_async_memory_resource.hpp LGTM. I learned a lot, thanks 🙏

@rongou (Contributor) left a comment:

Just curious, do you have any performance numbers? How does this compare with the event synchronized pool?

@kkraus14 (Contributor) commented:
@gpucibot merge

@rapids-bot (bot) merged commit afe237c into rapidsai:branch-0.18 on Jan 27, 2021
v0.18 Release automation moved this from PR-Reviewer approved to Done Jan 27, 2021
@harrism (Member, Author) commented on Jan 27, 2021:

> Just curious, do you have any performance numbers? How does this compare with the event synchronized pool?

2-3x slower than the binning_memory_resource with fixed_size and pool bins. With optimisations in another open PR the pool becomes even faster.
