Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify error handling in failure mode or during finalization #269

Open
hidmic opened this issue Jul 27, 2020 · 7 comments
Open

Simplify error handling in failure mode or during finalization #269

hidmic opened this issue Jul 27, 2020 · 7 comments
Labels
backlog enhancement New feature or request

Comments

@hidmic
Copy link
Contributor

hidmic commented Jul 27, 2020

Rationale

Error handling support in rcutils (and by extension, in every ROS 2 C library implementation or client code) falls short when it comes to error propagation when in failure mode or during finalization. As discussed in ros2/rmw_fastrtps#414 and ros2/rmw_cyclonedds#210, ensuring the initial error gets propagated, while no subsequent error goes unnoticed, clutters client code significantly.

Proposal

Introduce a mechanism for rcutils to distinguish between an error overwrite and a nested error that should be handled differently (e.g. logging to stderr directly). For that matter, have a thread local integer to track nested error handling scopes and a boolean in local storage to ensure a single error handling scope per function.

With an API like:

typedef struct rcutils_error_handling_scope_t {
    bool active;
} rcutils_error_handling_scope_t;

void rcutils_error_handling_scope_init(rcutils_error_handling_scope_t * scope);
void rcutils_error_handling_scope_enter(rcutils_error_handling_scope_t * scope);
void rcutils_error_handling_scope_leave(rcutils_error_handling_scope_t * scope);

a function may request rcutils_set_error_state to behave differently in between _enter and _leave calls, for its own code and the call chain that follows after it. In C++, RAII may be used to simplify API use even further.

@hidmic hidmic added the enhancement New feature or request label Jul 27, 2020
@hidmic hidmic added this to To do in Galactic via automation Jul 27, 2020
@hidmic
Copy link
Contributor Author

hidmic commented Jul 27, 2020

FYI @clalancette @ivanpauno

@wjwwood
Copy link
Member

wjwwood commented Aug 18, 2020

My problem with a call chain is that you cannot know how deep it needs to be. If we have one then, in my opinion, it should be preallocated per thread and have a fixed depth, similar to how the error message itself has a fixed length and is preallocated per thread.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 18, 2020

I'm not sure I follow. The purpose of rcutils_error_handling_scope_t is not to store anything, but to aid changing error handling macros (e.g. RCUTILS_SET_ERROR_MSG) behavior whenever errors occur while handling another error. I guess we could generate full backtraces within each error handling "scope", but we also can start small and just print to stderr (like we do explicitly in quite a few places).

@gbiggs
Copy link
Member

gbiggs commented Aug 19, 2020

My problem with a call chain is that you cannot know how deep it needs to be

Wouldn't it be possible to make a reasonable guess for how much we need, at least inside the ROS API? We can take a look at our call graph and see how deep it goes.

@wjwwood
Copy link
Member

wjwwood commented Aug 19, 2020

The purpose of rcutils_error_handling_scope_t is not to store anything, but to aid changing error handling macros (e.g. RCUTILS_SET_ERROR_MSG) behavior whenever errors occur while handling another error.

I think the goal should be to avoid that in the first place.

The error state is trivially copied, so I'd just recommend storing the error state locally before starting cleanup that might set the error state again. For example:

rcutils_ret_t rcutils_ret = rcutils_failing_function(...);
if (RCUTILS_RET_OK != rcutils_ret) {
  rcutils_error_state_t error_state = *rcutils_get_error_state();
  rcutils_reset_error();  // cannot fail
  rcl_ret_t ret = rcl_cleanup_function(...);
  if (RCL_RET_OK != ret) {
    RCUTILS_SAFE_FWRITE_TO_STDERR("[some context:42] failed to clean up foo while handling error: ");
    new_rcutils_print_error_state_to_stderr(&error_state);  // cannot fail, this is the first error state
  } else {
    rcutils_set_error_state(error_state.message, error_state.file, error_state.line_number);
  }
}

Now, that's obviously awful to write, and I think we should try to improve that workflow, but I don't think we should be pushing behavior into the error handling code, to be honest.

Maybe I would be more convinced to see your proposed API additions in action, and see how it makes a use case before and after better.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 19, 2020

The error state is trivially copied, so I'd just recommend storing the error state locally before starting cleanup that might set the error state again.

That's precisely the kind of code that I'd like to avoid. Code gets cluttered very quickly, particularly for procedures that have to keep going after an error while not losing track of it (e.g. finalization functions).

Now, that's obviously awful to write, and I think we should try to improve that workflow, but I don't think we should be pushing behavior into the error handling code, to be honest.

That's fair. Thinking out loud, push/pop/print APIs for error states could be a more verbose alternative. Having an easy way to pop the current error state into a local variable OR to print it to stderr if it has already been popped into that local variable before would keep LOC count low for _fini() functions too.

@wjwwood
Copy link
Member

wjwwood commented Aug 19, 2020

Thinking out loud, push/pop/print APIs for error states could be a more verbose alternative.

Print makes sense to me, but push/pop implies a stack and a stack has to have a size, and that either has to be fixed or dynamically allocated. If it is fixed it can be exhausted, which may be ok, and if it's dynamic then that's a non-started in my opinion because these error handling functions need to be used in places where memory allocations are not allowed (or well desired, but not allowed if we want to use them in real-time situations).

Having an easy way to pop the current error state into a local variable OR to print it to stderr if it has already been popped into that local variable before would keep LOC count low for _fini() functions too.

I have a hard time imagining how this would be used.

Like I said, having a specific example where the code is clunky and showing how it could be improved with new API would be helpful for me to understand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants