[REVIEW] Out-of-memory callback resource adaptor #892

madsbk · 2021-10-14T14:11:53Z

This PR implements a new resource adaptor that calls a callback function when an allocation fails. The idea is that the callback can free up memory (e.g. by spilling) and then ask RMM to retry the allocation.

/**
 * @brief Resource that uses `Upstream` to allocate memory and calls `callback`
 * when an allocation throws `std::bad_alloc`.
 *
 * An instance of this resource can be constructed with an existing, upstream
 * resource in order to satisfy allocation requests.
 *
 * The callback function takes an allocation size and a closure and returns
 * whether to retry the allocation or throw `std::bad_alloc`.
 *
 * @tparam Upstream Type of the upstream resource used for
 * allocation/deallocation.
 */
template <typename Upstream>
class oom_callback_resource_adaptor final : public device_memory_resource
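
A minimal, self-contained sketch of the retry loop the doc comment describes. It does not use RMM: `toy_upstream`, `toy_callback_adaptor`, `spill_state`, and `spill_cache` are hypothetical names chosen for illustration, and details of the real adaptor (deriving from `device_memory_resource`, taking a CUDA stream) are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <utility>

// Hypothetical upstream (not an RMM type): a fixed byte budget that throws
// std::bad_alloc when a request does not fit.
class toy_upstream {
 public:
  explicit toy_upstream(std::size_t capacity) : free_bytes_{capacity} {}

  void* allocate(std::size_t bytes)
  {
    if (bytes > free_bytes_) { throw std::bad_alloc{}; }
    free_bytes_ -= bytes;
    return ::operator new(bytes);
  }

  void deallocate(void* ptr, std::size_t bytes)
  {
    ::operator delete(ptr);
    free_bytes_ += bytes;
  }

 private:
  std::size_t free_bytes_;
};

// Callback: receives the failed size and an opaque argument; returns true to
// retry the allocation, false to let std::bad_alloc propagate.
using toy_callback_t = std::function<bool(std::size_t, void*)>;

template <typename Upstream>
class toy_callback_adaptor {
 public:
  toy_callback_adaptor(Upstream* upstream, toy_callback_t callback, void* callback_arg)
    : upstream_{upstream}, callback_{std::move(callback)}, callback_arg_{callback_arg}
  {
    assert(upstream_ != nullptr);
  }

  void* allocate(std::size_t bytes)
  {
    while (true) {
      try {
        return upstream_->allocate(bytes);
      } catch (std::bad_alloc const&) {
        if (!callback_(bytes, callback_arg_)) { throw; }  // give up: rethrow
      }
    }
  }

  void deallocate(void* ptr, std::size_t bytes) { upstream_->deallocate(ptr, bytes); }

 private:
  Upstream* upstream_;
  toy_callback_t callback_;
  void* callback_arg_;
};

// State handed to the callback through the opaque argument.
struct spill_state {
  toy_upstream* mr;
  void* cache;
  std::size_t cache_bytes;
  int calls;
};

// "Spills" by freeing the cached buffer once; refuses to retry after that.
inline bool spill_cache(std::size_t /*bytes*/, void* arg)
{
  auto* state = static_cast<spill_state*>(arg);
  ++state->calls;
  if (state->cache == nullptr) { return false; }
  state->mr->deallocate(state->cache, state->cache_bytes);
  state->cache = nullptr;
  return true;
}

// 100-byte budget with an 80-byte cache: a 50-byte request fails once, the
// callback frees the cache, and the retry succeeds. Returns the callback count.
inline int demo_callback_calls()
{
  toy_upstream upstream{100};
  spill_state state{&upstream, upstream.allocate(80), 80, 0};
  toy_callback_adaptor<toy_upstream> mr{&upstream, spill_cache, &state};
  void* ptr = mr.allocate(50);
  mr.deallocate(ptr, 50);
  return state.calls;
}
```

The loop keeps retrying for as long as the callback returns true, so in this sketch the callback is responsible for eventually returning false when it has nothing left to free.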

This is motivated by the fairly primitive spilling in Dask. Currently, Dask and Dask-CUDA have no way of handling OOM errors other than restarting tasks or workers. Instead, they spill preemptively based on very conservative memory thresholds. For instance, most workflows in Dask-CUDA start spilling when half the GPU memory is in use.
This PR makes it possible for projects like Dask and Dask-CUDA to trigger spilling on demand instead of preemptively.

cc. @jrhemstad @shwina

shwina

Cython looks fantastic!

jrhemstad · 2021-10-14T15:41:25Z

Wow, that was fast.

include/rmm/mr/device/oom_callback_resource_adaptor.hpp

jrhemstad

Can we also get a test written in C++?

rongou · 2021-10-14T16:22:48Z

We do this for Spark as well, but from Scala. Perhaps we can reuse this. @jlowe @revans2 @abellina

abellina · 2021-10-14T17:01:00Z

Yeap @rongou we do it here: https://github.com/rapidsai/cudf/blob/branch-21.12/java/src/main/native/src/RmmJni.cpp#L139. We hook at the mr level, and call a JNI function in our case.

Here's where we handle an oom: https://github.com/rapidsai/cudf/blob/branch-21.12/java/src/main/native/src/RmmJni.cpp#L212

Our resource had handling for threshold-based OOM (i.e. not real OOM from RMM, but instead a low/high watermark for some preemptive spilling). We are not using the low/high watermark at the moment. I am sure part of that could be refactored into its own memory resource.
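
For readers unfamiliar with the watermark idea mentioned above, here is one possible reading of it as a small C++ sketch. It is not taken from `RmmJni.cpp`: `toy_watermark_tracker` and its callbacks are hypothetical, and the assumption is that a notification fires only when tracked usage crosses the high threshold upward or the low threshold downward.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical tracker (an assumption, not the Spark/JNI implementation):
// counts allocated bytes and notifies on threshold crossings.
class toy_watermark_tracker {
 public:
  using notify_t = std::function<void(std::size_t)>;

  toy_watermark_tracker(std::size_t low, std::size_t high, notify_t on_high, notify_t on_low)
    : low_{low}, high_{high}, on_high_{std::move(on_high)}, on_low_{std::move(on_low)}
  {
    assert(low_ <= high_);
  }

  // Fires `on_high` only when usage crosses the high watermark upward.
  void record_allocate(std::size_t bytes)
  {
    std::size_t const before = used_;
    used_ += bytes;
    if (before < high_ && used_ >= high_) { on_high_(used_); }
  }

  // Fires `on_low` only when usage crosses the low watermark downward.
  void record_deallocate(std::size_t bytes)
  {
    assert(bytes <= used_);
    std::size_t const before = used_;
    used_ -= bytes;
    if (before > low_ && used_ <= low_) { on_low_(used_); }
  }

 private:
  std::size_t low_;
  std::size_t high_;
  notify_t on_high_;
  notify_t on_low_;
  std::size_t used_{0};
};

// Returns {high-watermark events, low-watermark events} for a short trace.
inline std::pair<int, int> demo_watermark_events()
{
  int high_events = 0;
  int low_events  = 0;
  toy_watermark_tracker tracker{40,
                                80,
                                [&high_events](std::size_t) { ++high_events; },
                                [&low_events](std::size_t) { ++low_events; }};
  tracker.record_allocate(50);    // usage 50: below the high watermark
  tracker.record_allocate(40);    // usage 90: crosses the high watermark
  tracker.record_allocate(5);     // usage 95: already above, no new event
  tracker.record_deallocate(60);  // usage 35: crosses the low watermark
  return {high_events, low_events};
}
```

Unlike the failure callback in this PR, nothing here reacts to a failed allocation; the notification is purely preemptive.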

madsbk · 2021-10-14T20:06:43Z

> Can we also get a test written in C++?

@jrhemstad thanks for the review, I have added a C++ test, renamed the closure argument, and added some more doc.

harrism

I think the naming should be more general and not use an acronym. Other than that and a few doc improvements, looks like a great contribution. Thanks!

include/rmm/mr/device/oom_callback_resource_adaptor.hpp

python/rmm/mr.py

python/rmm/_lib/memory_resource.pyx

python/rmm/_lib/memory_resource.pxd

python/rmm/tests/test_rmm.py

tests/CMakeLists.txt

Co-authored-by: Mark Harris <mharris@nvidia.com>

madsbk · 2021-10-25T10:07:38Z

> I think the naming should be more general and not use an acronym. Other than that and a few doc improvements, looks like a great contribution. Thanks!

Thanks for the review @harrism, I think I have addressed all of your suggestions.

harrism

Nearly there.

include/rmm/mr/device/failure_callback_resource_adaptor.hpp

python/rmm/_lib/__init__.py

python/rmm/mr.py

tests/mr/device/failure_callback_mr_tests.cpp

Co-authored-by: Mark Harris <mharris@nvidia.com>

harrism · 2021-10-26T21:19:15Z

@gpucibot merge

Use rapidsai/rmm#892 to implement spilling on demand. Requires [RMM](https://github.com/rapidsai/rmm) and JIT-unspill to be enabled.

The `device_memory_limit` still works as usual: when known allocations reach `device_memory_limit`, Dask-CUDA starts spilling preemptively. However, with this PR it should be possible to increase `device_memory_limit` significantly, since memory spikes will be handled by spilling on demand.

Closes #755

Authors:
- Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #756

…or (#898)

#892 added `failure_callback_resource_adaptor`, which provides the ability to respond to memory allocation failures. However, it was hard-coded to catch (and rethrow) `std::bad_alloc` exceptions. This PR makes the type of exception the adaptor catches a template parameter, to provide greater flexibility. The default exception type is now `rmm::out_of_memory`, since we expect this to be the common use case. It also includes a few changes to fix clang-tidy warnings.

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Rong Ou (https://github.com/rongou)
- Mads R. B. Kristensen (https://github.com/madsbk)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #898
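
The effect of making the caught exception type a template parameter can be sketched without RMM. Everything below is hypothetical and for illustration only: `toy_out_of_memory` stands in for a library-specific out-of-memory exception derived from `std::bad_alloc`, and `failing_upstream` and `toy_failure_adaptor` are not RMM types.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <utility>

// Hypothetical stand-in for a library-specific out-of-memory exception.
struct toy_out_of_memory : std::bad_alloc {};

// Hypothetical upstream that always fails by throwing `Thrown`.
template <typename Thrown>
struct failing_upstream {
  void* allocate(std::size_t /*bytes*/) { throw Thrown{}; }
};

// The adaptor only intercepts `ExceptionType` (and types derived from it);
// any other exception propagates without the callback being invoked.
template <typename Upstream, typename ExceptionType = toy_out_of_memory>
class toy_failure_adaptor {
 public:
  using callback_t = std::function<bool(std::size_t)>;

  toy_failure_adaptor(Upstream* upstream, callback_t callback)
    : upstream_{upstream}, callback_{std::move(callback)}
  {
    assert(upstream_ != nullptr);
  }

  void* allocate(std::size_t bytes)
  {
    while (true) {
      try {
        return upstream_->allocate(bytes);
      } catch (ExceptionType const&) {
        if (!callback_(bytes)) { throw; }  // callback declined: rethrow
      }
    }
  }

 private:
  Upstream* upstream_;
  callback_t callback_;
};

// Counts callback invocations when the upstream throws `Thrown` and the
// adaptor is configured to catch `Caught`. The callback never asks to retry.
template <typename Thrown, typename Caught>
int count_callback_calls()
{
  failing_upstream<Thrown> upstream;
  int calls = 0;
  toy_failure_adaptor<failing_upstream<Thrown>, Caught> mr{
    &upstream, [&calls](std::size_t) {
      ++calls;
      return false;
    }};
  try {
    mr.allocate(1);
  } catch (std::bad_alloc const&) {
    // Swallowed on purpose: only the callback count matters here.
  }
  return calls;
}
```

In this sketch, an adaptor that catches `toy_out_of_memory` ignores a plain `std::bad_alloc`, while an adaptor that catches `std::bad_alloc` also sees the derived type, which is the usual C++ handler-matching rule rather than anything specific to the adaptor.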

Implement oom_callback_resource_adaptor

e21c0b2

github-actions bot added the cpp (Pertains to C++ code) and Python (Related to RMM Python API) labels Oct 14, 2021

madsbk mentioned this pull request Oct 14, 2021

[FEA] On demand memory spilling rapidsai/dask-cuda#755

Closed

madsbk added 2 commits October 14, 2021 16:22

style: clang-format

3bdf7ce

style: flake8 & isort

3f3c502

madsbk marked this pull request as ready for review October 14, 2021 15:04

madsbk requested review from a team as code owners October 14, 2021 15:04

madsbk requested review from rongou and harrism October 14, 2021 15:04

madsbk changed the title ~~[WIP] Out-of-memory callback resource adaptor~~ [REVIEW] Out-of-memory callback resource adaptor Oct 14, 2021

shwina approved these changes Oct 14, 2021

jrhemstad reviewed Oct 14, 2021

include/rmm/mr/device/oom_callback_resource_adaptor.hpp (outdated, resolved)

jrhemstad reviewed Oct 14, 2021

include/rmm/mr/device/oom_callback_resource_adaptor.hpp (outdated, resolved)

jrhemstad requested changes Oct 14, 2021

madsbk added 3 commits October 14, 2021 20:29

doc and renaming closure => callback_arg

7d38f4b

Added c++ test

3e874ca

more rename

6e24c5b

madsbk requested a review from a team as a code owner October 14, 2021 19:54

github-actions bot added the CMake label Oct 14, 2021

madsbk requested a review from jrhemstad October 14, 2021 20:15

madsbk mentioned this pull request Oct 15, 2021

Spilling on demand rapidsai/dask-cuda#756

Merged

harrism requested changes Oct 19, 2021

caryr35 added this to PR-WIP in v21.12 Release via automation Oct 19, 2021

caryr35 moved this from PR-WIP to PR-Needs review in v21.12 Release Oct 19, 2021

madsbk and others added 11 commits October 25, 2021 08:30

Doc and renaming

8e59908

Co-authored-by: Mark Harris <mharris@nvidia.com>

more renaming

6a43fd5

style

4057496

Copyright

044f164

even more renaming

883e6e3

copyright

d6ec85f

revert temp clang version check

4d270dd

renaming

a3ac627

docs

546ba04

cmake: renaming

df84723

using std:function

3045cee

madsbk requested a review from harrism October 25, 2021 10:06

jrhemstad approved these changes Oct 25, 2021

jrhemstad added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Oct 25, 2021

harrism requested changes Oct 25, 2021

madsbk and others added 3 commits October 26, 2021 14:27

Docs

a1ed413

Co-authored-by: Mark Harris <mharris@nvidia.com>

copyright

f69d299

Avoid the cb_arg struct

1da3dd0

madsbk requested a review from harrism October 26, 2021 13:59

harrism approved these changes Oct 26, 2021

v21.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Oct 26, 2021

rapids-bot bot merged commit 0fbe357 into rapidsai:branch-21.12 Oct 26, 2021

v21.12 Release automation moved this from PR-Reviewer approved to Done Oct 26, 2021

harrism mentioned this pull request Oct 27, 2021

Parameterize exception type caught by failure_callback_resource_adaptor #898

Merged

madsbk deleted the oom_callback_resource_adaptor branch September 9, 2022 08:58
