Throw rmm::out_of_memory when we know for sure #894
Conversation
include/rmm/detail/error.hpp
Outdated
#define RMM_CUDA_TRY(...) \
  GET_RMM_CUDA_TRY_MACRO(__VA_ARGS__, RMM_CUDA_TRY_4, INVALID, RMM_CUDA_TRY_2, RMM_CUDA_TRY_1) \
  (__VA_ARGS__)
#define GET_RMM_CUDA_TRY_MACRO(_1, _2, NAME, ...) NAME
#define RMM_CUDA_TRY_2(_call, _exception_type) \
  do { \
    cudaError_t const error = (_call); \
    if (cudaSuccess != error) { \
      cudaGetLastError(); \
      /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/ \
      throw _exception_type{std::string{"CUDA error at: "} + __FILE__ + ":" + \
                            RMM_STRINGIFY(__LINE__) + ": " + cudaGetErrorName(error) + " " + \
                            cudaGetErrorString(error)}; \
    } \
#define GET_RMM_CUDA_TRY_MACRO(_1, _2, _3, _4, NAME, ...) NAME
#define RMM_CUDA_TRY_4(_call, _exception_type, _custom_error, _custom_exception_type) \
  do { \
    cudaError_t const error = (_call); \
    if (cudaSuccess != error) { \
      cudaGetLastError(); \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" + RMM_STRINGIFY(__LINE__) + \
                       ": " + cudaGetErrorName(error) + " " + cudaGetErrorString(error); \
      if ((_custom_error) == error) { \
        /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/ \
        throw _custom_exception_type{msg}; \
      } else { \
        /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/ \
        throw _exception_type{msg}; \
      } \
    } \
  } while (0)
#define RMM_CUDA_TRY_2(_call, _exception_type) \
  RMM_CUDA_TRY_4(_call, _exception_type, cudaSuccess, rmm::cuda_error)
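
For reference, a hypothetical call site of the four-argument form shown above (illustrative only; the CUDA call and exception choices here are not taken from this PR):

void* ptr{nullptr};
std::size_t bytes{1024};
// Default exception is rmm::bad_alloc, but cudaErrorMemoryAllocation maps to
// rmm::out_of_memory. The argument roles follow the diff above; the review
// below notes that these roles are hard to read at the call site.
RMM_CUDA_TRY(cudaMallocManaged(&ptr, bytes),
             rmm::bad_alloc,
             cudaErrorMemoryAllocation,
             rmm::out_of_memory);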
Hm, this macro is smelly now.
First, it's non-intuitive at the call site what the different arguments mean.
Second, if we allow customizing the exception for one error, why not allow customizing the exception for an arbitrary number of errors?
If we're going to go this route, I think I'd like to see something more generic. Maybe a macro that accepts some kind of trait that maps a cudaError_t to a particular exception that defaults to rmm::bad_alloc.
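
A rough sketch of what such a trait-based mapping could look like; the names here (error_mapping, throw_mapped) are assumptions for illustration, not part of RMM:

#include <cuda_runtime_api.h>

#include <string>

// Hypothetical trait: pairs a CUDA error code with the exception type to throw for it.
template <cudaError_t Error, typename Exception>
struct error_mapping {
  static constexpr cudaError_t error = Error;
  using exception_type               = Exception;
};

// Checks each mapping in order and throws its exception type on a match;
// falls back to the default exception type otherwise.
template <typename DefaultException, typename... Mappings>
[[noreturn]] void throw_mapped(cudaError_t result, std::string const& msg)
{
  ([&] {
     if (result == Mappings::error) { throw typename Mappings::exception_type{msg}; }
   }(),
   ...);
  throw DefaultException{msg};
}

// Usage sketch: OOM maps to rmm::out_of_memory, everything else to rmm::bad_alloc.
//   if (cudaSuccess != error) {
//     throw_mapped<rmm::bad_alloc,
//                  error_mapping<cudaErrorMemoryAllocation, rmm::out_of_memory>>(error, msg);
//   }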
Added a new macro for allocation calls. Seems cleaner, let me know what you think.
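
A minimal sketch of what such an allocation-specific wrapper could look like, built from the pieces already in this diff; the macro name is an assumption, not necessarily what the PR ended up calling it:

// Assumed name; the real macro in the PR may differ.
#define SKETCH_CUDA_TRY_ALLOC(_call)                                                    \
  do {                                                                                  \
    cudaError_t const error = (_call);                                                  \
    if (cudaSuccess != error) {                                                         \
      cudaGetLastError(); /* clear the sticky error state */                            \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" +                \
                       RMM_STRINGIFY(__LINE__) + ": " + cudaGetErrorName(error) + " " + \
                       cudaGetErrorString(error);                                       \
      if (cudaErrorMemoryAllocation == error) {                                         \
        throw rmm::out_of_memory{msg}; /* we know for sure this is OOM */               \
      }                                                                                 \
      throw rmm::bad_alloc{msg}; /* allocation failed for some other reason */          \
    }                                                                                   \
  } while (0)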
I guess it solves the first issue of the call site being confusing, but it doesn't solve the second issue of customizing exceptions for one or more cudaError_t values.
include/rmm/detail/error.hpp
Outdated
out_of_memory(const char* msg) : bad_alloc{msg} {}
out_of_memory(std::string const& msg) : bad_alloc{msg} {}
out_of_memory(const char* msg) : bad_alloc{msg} {}
out_of_memory(std::string const& msg) : bad_alloc{msg} {}
using bad_alloc::bad_alloc;
Done.
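
For context, with the suggested change the class body would look roughly like this (constructor signatures inferred from the lines the suggestion replaces):

class out_of_memory : public bad_alloc {
 public:
  // Inherits bad_alloc(const char*) and bad_alloc(std::string const&) directly,
  // instead of re-declaring forwarding constructors.
  using bad_alloc::bad_alloc;
};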
include/rmm/detail/error.hpp
Outdated
/**
 * @brief Exception thrown when RMM runs out of memory
 *
 */
class out_of_memory : public bad_alloc {
We should be very precise about defining what "out of memory" means and what the (non)guarantees are about the behavior of this exception vs. a more generic bad_alloc. I.e., what exactly is it that this exception conveys that a normal bad_alloc does not?
Done.
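
One practical consequence of deriving out_of_memory from bad_alloc (a sketch, not code from this PR): handlers that already catch rmm::bad_alloc still see OOM failures, while new code can opt in to the narrower type by catching it first:

#include <rmm/detail/error.hpp>

#include <iostream>

template <typename AllocFn>
void report_failure(AllocFn alloc_fn)
{
  try {
    alloc_fn();
  } catch (rmm::out_of_memory const& e) {
    // Narrow handler: only definite out-of-memory conditions land here.
    std::cerr << "out of memory: " << e.what() << '\n';
  } catch (rmm::bad_alloc const& e) {
    // Broad handler: all other RMM allocation failures. Existing code that
    // only catches rmm::bad_alloc keeps working because out_of_memory
    // derives from it.
    std::cerr << "allocation failed: " << e.what() << '\n';
  }
}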
include/rmm/detail/error.hpp
Outdated
/**
 * @brief Exception thrown when RMM runs out of memory
 *
 * This is thrown under the following conditions:
I don't like having a list in a comment that we have to maintain. I think instead we should make it very clear that this error should only be thrown when we know for sure a resource is out of memory.
I don't know for sure that cudaErrorMemoryAllocation always means OOM, BTW. Is this documented somewhere?
Done.
According to the CUDA Runtime API doc:
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
@gpucibot merge
When RMM fails to allocate a buffer, it currently throws a rmm::bad_alloc exception, which a user might want to catch, spill some GPU buffers, and try again. But that exception covers all error conditions, so catching it blindly may hide other, more serious CUDA errors and make the code hard to debug. This PR adds a more specific rmm::out_of_memory exception and throws it when we are certain we are running out of memory, so that it can be caught to trigger spilling.
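
A minimal sketch of the caller-side pattern this enables; spill_some_buffers() is a hypothetical application hook, not part of RMM:

#include <rmm/cuda_stream_view.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>

void spill_some_buffers(std::size_t bytes);  // hypothetical: evicts at least `bytes` of GPU memory

rmm::device_buffer allocate_with_spill(std::size_t bytes, rmm::cuda_stream_view stream)
{
  try {
    return rmm::device_buffer{bytes, stream};
  } catch (rmm::out_of_memory const&) {
    // A definite OOM: spill and retry once. Any other rmm::bad_alloc or CUDA
    // error propagates unchanged, so unrelated failures are not hidden.
    spill_some_buffers(bytes);
    return rmm::device_buffer{bytes, stream};
  }
}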