Throw rmm::out_of_memory when we know for sure #894
Conversation
include/rmm/detail/error.hpp
Outdated
#define RMM_CUDA_TRY(...) \
  GET_RMM_CUDA_TRY_MACRO(__VA_ARGS__, RMM_CUDA_TRY_4, INVALID, RMM_CUDA_TRY_2, RMM_CUDA_TRY_1) \
  (__VA_ARGS__)
#define GET_RMM_CUDA_TRY_MACRO(_1, _2, NAME, ...) NAME
#define RMM_CUDA_TRY_2(_call, _exception_type) \
  do { \
    cudaError_t const error = (_call); \
    if (cudaSuccess != error) { \
      cudaGetLastError(); \
      /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/ \
      throw _exception_type{std::string{"CUDA error at: "} + __FILE__ + ":" + \
                            RMM_STRINGIFY(__LINE__) + ": " + cudaGetErrorName(error) + " " + \
                            cudaGetErrorString(error)}; \
    } \
#define GET_RMM_CUDA_TRY_MACRO(_1, _2, _3, _4, NAME, ...) NAME
#define RMM_CUDA_TRY_4(_call, _exception_type, _custom_error, _custom_exception_type) \
  do { \
    cudaError_t const error = (_call); \
    if (cudaSuccess != error) { \
      cudaGetLastError(); \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" + RMM_STRINGIFY(__LINE__) + \
                       ": " + cudaGetErrorName(error) + " " + cudaGetErrorString(error); \
      if ((_custom_error) == error) { \
        /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/ \
        throw _custom_exception_type{msg}; \
      } else { \
        /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/ \
        throw _exception_type{msg}; \
      } \
    } \
  } while (0)
#define RMM_CUDA_TRY_2(_call, _exception_type) \
  RMM_CUDA_TRY_4(_call, _exception_type, cudaSuccess, rmm::cuda_error)
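
For reference, a hypothetical call site of the four-argument form shown above (illustrative only; the CUDA call and exception choices here are not taken from this PR):

void* ptr{nullptr};
std::size_t bytes{1024};
// Default exception is rmm::bad_alloc, but cudaErrorMemoryAllocation maps to
// rmm::out_of_memory. The argument roles follow the diff above; the review
// below notes that these roles are hard to read at the call site.
RMM_CUDA_TRY(cudaMallocManaged(&ptr, bytes),
             rmm::bad_alloc,
             cudaErrorMemoryAllocation,
             rmm::out_of_memory);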
Hm, this macro is smelly now.
First, it's non-intuitive at the call site what the different arguments mean.
Second, if we allow customizing the exception for one error, why not allow customizing the exception for an arbitrary number of errors?
If we're going to go this route, I think I'd like to see something more generic. Maybe a macro that accepts some kind of trait that maps a cudaError_t to a particular exception that defaults to rmm::bad_alloc.
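
A rough sketch of what such a trait-based mapping could look like; the names here (error_mapping, throw_mapped) are assumptions for illustration, not part of RMM:

#include <cuda_runtime_api.h>

#include <string>

// Hypothetical trait: pairs a CUDA error code with the exception type to throw for it.
template <cudaError_t Error, typename Exception>
struct error_mapping {
  static constexpr cudaError_t error = Error;
  using exception_type               = Exception;
};

// Checks each mapping in order and throws its exception type on a match;
// falls back to the default exception type otherwise.
template <typename DefaultException, typename... Mappings>
[[noreturn]] void throw_mapped(cudaError_t result, std::string const& msg)
{
  ([&] {
     if (result == Mappings::error) { throw typename Mappings::exception_type{msg}; }
   }(),
   ...);
  throw DefaultException{msg};
}

// Usage sketch: OOM maps to rmm::out_of_memory, everything else to rmm::bad_alloc.
//   if (cudaSuccess != error) {
//     throw_mapped<rmm::bad_alloc,
//                  error_mapping<cudaErrorMemoryAllocation, rmm::out_of_memory>>(error, msg);
//   }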
Added a new macro for allocation calls. Seems cleaner, let me know what you think.
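
A minimal sketch of what such an allocation-specific wrapper could look like, built from the pieces already in this diff; the macro name is an assumption, not necessarily what the PR ended up calling it:

// Assumed name; the real macro in the PR may differ.
#define SKETCH_CUDA_TRY_ALLOC(_call)                                                    \
  do {                                                                                  \
    cudaError_t const error = (_call);                                                  \
    if (cudaSuccess != error) {                                                         \
      cudaGetLastError(); /* clear the sticky error state */                            \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" +                \
                       RMM_STRINGIFY(__LINE__) + ": " + cudaGetErrorName(error) + " " + \
                       cudaGetErrorString(error);                                       \
      if (cudaErrorMemoryAllocation == error) {                                         \
        throw rmm::out_of_memory{msg}; /* we know for sure this is OOM */               \
      }                                                                                 \
      throw rmm::bad_alloc{msg}; /* allocation failed for some other reason */          \
    }                                                                                   \
  } while (0)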
I guess it solves the first issue of the call site being confusing, but it doesn't solve the second issue of customizing exceptions for one or more cudaError_t values.
include/rmm/detail/error.hpp
Outdated
out_of_memory(const char* msg) : bad_alloc{msg} {}
out_of_memory(std::string const& msg) : bad_alloc{msg} {}
out_of_memory(const char* msg) : bad_alloc{msg} {}
out_of_memory(std::string const& msg) : bad_alloc{msg} {}
using bad_alloc::bad_alloc;
Done.
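
For context, with the suggested change the class body would look roughly like this (constructor signatures inferred from the lines the suggestion replaces):

class out_of_memory : public bad_alloc {
 public:
  // Inherits bad_alloc(const char*) and bad_alloc(std::string const&) directly,
  // instead of re-declaring forwarding constructors.
  using bad_alloc::bad_alloc;
};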
include/rmm/detail/error.hpp
Outdated
/**
 * @brief Exception thrown when RMM runs out of memory
 *
 */
class out_of_memory : public bad_alloc {
We should be very precise about defining what "out of memory" means and what the (non)guarantees are about the behavior of this exception vs. a more generic bad_alloc. I.e., what exactly is it that this exception conveys that a normal bad_alloc does not?
Done.
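
One practical consequence of deriving out_of_memory from bad_alloc (a sketch, not code from this PR): handlers that already catch rmm::bad_alloc still see OOM failures, while new code can opt in to the narrower type by catching it first:

#include <rmm/detail/error.hpp>

#include <iostream>

template <typename AllocFn>
void report_failure(AllocFn alloc_fn)
{
  try {
    alloc_fn();
  } catch (rmm::out_of_memory const& e) {
    // Narrow handler: only definite out-of-memory conditions land here.
    std::cerr << "out of memory: " << e.what() << '\n';
  } catch (rmm::bad_alloc const& e) {
    // Broad handler: all other RMM allocation failures. Existing code that
    // only catches rmm::bad_alloc keeps working because out_of_memory
    // derives from it.
    std::cerr << "allocation failed: " << e.what() << '\n';
  }
}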
include/rmm/detail/error.hpp
Outdated
/**
 * @brief Exception thrown when RMM runs out of memory
 *
 * This is thrown under the following conditions:
I don't like having a list in a comment that we have to maintain. I think instead we should make it very clear that this error should only be thrown when we know for sure a resource is out of memory.
I don't know for sure that cudaErrorMemoryAllocation always means OOM, BTW. Is this documented somewhere?
Done.
According to the CUDA Runtime API doc:
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
@gpucibot merge
When RMM fails to allocate a buffer, it currently throws a rmm::bad_alloc exception, which a user might want to catch, spill some GPU buffers, and try again. But that exception covers all error conditions, so catching it blindly may hide other, more serious CUDA errors and make the code hard to debug. This PR adds a more specific rmm::out_of_memory exception and throws it when we are certain we are running out of memory, so that it can be caught to trigger spilling.
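
A minimal sketch of the caller-side pattern this enables; spill_some_buffers() is a hypothetical application hook, not part of RMM:

#include <rmm/cuda_stream_view.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>

void spill_some_buffers(std::size_t bytes);  // hypothetical: evicts at least `bytes` of GPU memory

rmm::device_buffer allocate_with_spill(std::size_t bytes, rmm::cuda_stream_view stream)
{
  try {
    return rmm::device_buffer{bytes, stream};
  } catch (rmm::out_of_memory const&) {
    // A definite OOM: spill and retry once. Any other rmm::bad_alloc or CUDA
    // error propagates unchanged, so unrelated failures are not hidden.
    spill_some_buffers(bytes);
    return rmm::device_buffer{bytes, stream};
  }
}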