Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Out-of-memory callback resource adaptor #892

Merged
merged 20 commits into from
Oct 26, 2021

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Oct 14, 2021

This PR implements a new resource adaptor that calls a callback function when an allocation fails. The idea being that the callback function can free up memory (e.g. by spilling) and ask rmm to retry the allocation.

/**
 * @brief Resource that uses `Upstream` to allocate memory and calls `callback`
 * when allocations throws `std::bad_alloc`.
 *
 * An instance of this resource can be constructed with an existing, upstream
 * resource in order to satisfy allocation requests.
 *
 * The callback function takes an allocation size and a closure and returns
 * whether to retry the allocation or throw `std::bad_alloc`.
 *
 * @tparam Upstream Type of the upstream resource used for
 * allocation/deallocation.
 */
template <typename Upstream>
class oom_callback_resource_adaptor final : public device_memory_resource

This is motivated by the fairly primitive spilling in Dask. Currently, Dask and Dask-CUDA has no way of handling OOM errors other then restarting tasks or workers. Instead they spill preemptively based on some very conservative memory thresholds. For instance, most workflows in Dask-CUDA starts spilling when half the GPU memory is in use.
This PR makes it possible for projects like Dask and Dask-CUDA to trigger spilling on demand instead of preemptively.

cc. @jrhemstad @shwina

@github-actions github-actions bot added cpp Pertains to C++ code Python Related to RMM Python API labels Oct 14, 2021
@madsbk madsbk marked this pull request as ready for review October 14, 2021 15:04
@madsbk madsbk requested review from a team as code owners October 14, 2021 15:04
@madsbk madsbk requested review from rongou and harrism October 14, 2021 15:04
@madsbk madsbk changed the title [WIP] Out-of-memory callback resource adaptor [REVIEW] Out-of-memory callback resource adaptor Oct 14, 2021
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython looks fantastic!

@jrhemstad
Copy link
Contributor

Wow, that was fast.

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also get a test written in C++?

@rongou
Copy link
Contributor

rongou commented Oct 14, 2021

We do this for Spark as well, but from Scala. Perhaps we can reuse this. @jlowe @revans2 @abellina

@abellina
Copy link

Yeap @rongou we do it here: https://github.com/rapidsai/cudf/blob/branch-21.12/java/src/main/native/src/RmmJni.cpp#L139. We hook at the mr level, and call a JNI function in our case.

Here's where we handle an oom: https://github.com/rapidsai/cudf/blob/branch-21.12/java/src/main/native/src/RmmJni.cpp#L212

Our resource had handling for threshold-based OOM (i.e. not real OOM from RMM, but instead a low/high watermark for some preemptive spilling). We are not using the low/high watermark at the moment. I am sure part of that could be refactored into its own memory resource.

@madsbk madsbk requested a review from a team as a code owner October 14, 2021 19:54
@github-actions github-actions bot added the CMake label Oct 14, 2021
@madsbk
Copy link
Member Author

madsbk commented Oct 14, 2021

Can we also get a test written in C++?

@jrhemstad thanks for the review, I have added a C++ test, renamed the closure argument, and added some more doc.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming should be more general and not use an acronym. Other than that and a few doc improvements, looks like a great contribution. Thanks!

include/rmm/mr/device/oom_callback_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/oom_callback_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/oom_callback_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/oom_callback_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/oom_callback_resource_adaptor.hpp Outdated Show resolved Hide resolved
python/rmm/mr.py Show resolved Hide resolved
python/rmm/_lib/memory_resource.pyx Show resolved Hide resolved
python/rmm/_lib/memory_resource.pxd Outdated Show resolved Hide resolved
python/rmm/tests/test_rmm.py Show resolved Hide resolved
tests/CMakeLists.txt Show resolved Hide resolved
@caryr35 caryr35 added this to PR-WIP in v21.12 Release via automation Oct 19, 2021
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v21.12 Release Oct 19, 2021
@madsbk madsbk requested a review from harrism October 25, 2021 10:06
@madsbk
Copy link
Member Author

madsbk commented Oct 25, 2021

I think the naming should be more general and not use an acronym. Other than that and a few doc improvements, looks like a great contribution. Thanks!

Thanks for the review @harrism, I think I have addressed all of your suggestions.

@jrhemstad jrhemstad added feature request New feature or request non-breaking Non-breaking change labels Oct 25, 2021
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nearly there.

python/rmm/_lib/__init__.py Outdated Show resolved Hide resolved
python/rmm/mr.py Show resolved Hide resolved
tests/mr/device/failure_callback_mr_tests.cpp Outdated Show resolved Hide resolved
madsbk and others added 3 commits October 26, 2021 14:27
Co-authored-by: Mark Harris <mharris@nvidia.com>
@madsbk madsbk requested a review from harrism October 26, 2021 13:59
v21.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Oct 26, 2021
@harrism
Copy link
Member

harrism commented Oct 26, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0fbe357 into rapidsai:branch-21.12 Oct 26, 2021
v21.12 Release automation moved this from PR-Reviewer approved to Done Oct 26, 2021
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Oct 29, 2021
Use rapidsai/rmm#892 to implement spilling on demand. Requires use of [RMM](https://github.com/rapidsai/rmm) and JIT-unspill enabled.

The `device_memory_limit` still works as usual -- when known allocations gets to `device_memory_limit`, Dask-CUDA starts spilling preemptively. However, with this PR it is should be possible to increase `device_memory_limit` significantly since memory spikes will be handled by spilling on demand.

Closes #755

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #756
rapids-bot bot pushed a commit that referenced this pull request Nov 9, 2021
…or (#898)

#892 added `failure_callback_resource_adaptor` which provides the ability to respond to memory allocation failures. However, it was hard-coded to catch (and rethrow) `std::bad_alloc` exceptions.  This PR makes the type of exception the adaptor catches a template parameter, to provide greater flexibility. The default exception type is now `rmm::out_of_memory` since we expect this to be the common use case.

Also a few changes to fix clang-tidy warnings.

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #898
@madsbk madsbk deleted the oom_callback_resource_adaptor branch September 9, 2022 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp Pertains to C++ code feature request New feature or request non-breaking Non-breaking change Python Related to RMM Python API
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

6 participants