-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft topology: handle abort exceptions better in fence_previous_coordinator #15948
raft topology: handle abort exceptions better in fence_previous_coordinator #15948
Conversation
🟢 CI State: SUCCESS✅ - Build Build Details:
|
It is noexcept since dcaaa74. Is this PR is still relevant? |
Yes, the PR is still relevant. It turns out that in C++ But, even if I wasn't sure whether to keep the |
...and replace it with an explicit try..catch + terminate to achieve the intended behavior |
There should be a way to mark a function as not capable of returning an exceptional future.
Yes. |
} catch (abort_requested_exception&) { | ||
// Abort was requested. Break the loop | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it throws something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it throws something unexpected, e.g. std::bad_alloc
then I think it's fine to abort as the situation is already bad or something is broken inside sleep_abortable
. This is how it is now handled in v2.
The only alternative I see is to continue the loop, but then we risk loop spinning if sleep_abortable
is broken, so I'm not sure this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if we do want topology_coordinator::run
to handle all the exceptions internally and never cause a crash then we should continue
here. It's how it is already done in cdc_generation_publisher_fiber
. I'll adjust the code here for consistency.
v2: added an explicit try..catch + on_fatal_internal_error to |
service/storage_service.cc
Outdated
@@ -2530,6 +2539,9 @@ future<> topology_coordinator::run() noexcept { | |||
|
|||
co_await _async_gate.close(); | |||
co_await std::move(cdc_generation_publisher); | |||
} catch (...) { | |||
on_fatal_internal_error(slogger, format("raft topology: unhandled exception in topology_coordinator::run: {}", std::current_exception())); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like this approach. The code above already handled all exceptions. I is fine to use simple noexcep
C++ keyword to mark a function as non throwing more for the documentation purposes then to actually terminate the program. Re-implementing noexcep
semantics manually just add noise the the code. If you want to asser you can do it in raft_state_monitor_fiber
when waiting for _topology_change_coordinator
feature which suppose to be non exceptional.
…s_coordinator The fence_previous_coordinator function has a retry loop: if it fails to perform a group0 operation, it will try again after a 1 second delay. However, if the topology coordinator is aborted while it waits, an exception will be thrown and will be propagated out of the function. The function is supposed to handle all exceptions internally, so this is not desired. Fix this by catching the abort_requested_exception and returning from the function if the exception is caught.
… is aborted An attempt to fence the previous coordinator may fail because the current coordinator is aborted. It's not a critical error and it can happen during normal operations, so lower the verbosity used to print a message about this error to 'debug'. Return from the function immediately in that case - the sleep_aborted that happens as a next step would fail on abort_requested_exception anyway, so make it more explicit.
…n function as noexcept" This reverts commit dcaaa74. The `noexcept` specifier that it added is only relevant to the function and not the coroutine returned from that function. This was not the intention and it looks confusing now, so remove it.
a8d5c6a
to
8829697
Compare
v3:
|
service/storage_service.cc
Outdated
try { | ||
co_await coordinator->run(); | ||
} catch (...) { | ||
on_fatal_internal_error(slogger, format("raft topology: unhandled exception in topology_coordinator_run: {}", std::current_exception())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we initiate a raft leader step down in this case?
@gleb-cloudius
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea indeed. The question is if this should be done everywhere we assert() or in a generic code that handles the exit because of an assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertion failure means there's a bug we need to fix, not do a leader stepdown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you tell me why it would be beneficial to step down in this case? This catch
block is supposed to be triggered on unhandled exceptions, wouldn't it make more sense to abort ASAP as we encountered a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertion failure means there's a bug we need to fix, not do a leader stepdown
You are not wrong, but if a node crashes it would be beneficial for the cluster to find out about it faster. If there is a safe way to do it (safe in a way that it cannot cause harm even if it did not work properly) then why not do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put it into a broader context. We don't know here whether or not it is a bug - all we know is we caught an exception we can't handle. I think there is a set of reasonable cases when the topology coordinator may fail with an uncaught exception: be it a memory issue or an error from the operating system. In this case we can either fail the node gracefully, or just abort it and disrupt cluster availability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also Ok with a follow up patch, but looks like it's a call for an error injection test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know here whether or not it is a bug - all we know is we caught an exception we can't handle. I think there is a set of reasonable cases when the topology coordinator may fail with an uncaught exception
The point of this try..catch is to simulate noexcept
but for coroutines; noexcept
immediately aborts if the function throws. Any exception flying out of topology coordinator code indicates a bug in the code, because the topology coordinator fiber is supposed to handle all exceptions internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also Ok with a follow up patch, but looks like it's a call for an error injection test.
That would be like asserting that a boolean is true at some point in the code, and then adding an injection just before that assertion that sets the bool to false, to check that the assertion indeed fires.
This try..catch is only there because of a defect of the C++ language, we shouldn't be having this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am missing some context, but what does make you think that coordinator->run() can not fail for a good, valid reason that is not abort_requested?
🔴 CI State: FAILURE✅ - Build Failed Tests (2/21259):Build Details:
|
…r::run The `topology_coordinator` function is supposed to handle all of the exceptions internally. Assert, in runtime, that this is the case by wrapping the `run` invocation with a try..catch; in case of an exception, step down as a leader first and then abort.
Strange failure, the test complains that a message didn't appear in the nodes' logs:
but it does appear in the logs:
Perhaps a race condition between writing to logs and grepping them? In any case, I don't think it's related, but I'll run this test multiple times before submitting v4. |
8829697
to
85516c9
Compare
I ran I'll create an issue for it. |
v4: added stepdown before |
I saw it yesterday here without your patch: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4548/. |
|
🟢 CI State: SUCCESS✅ - Build Build Details:
|
…s_coordinator' from Piotr Dulikowski When topology coordinator tries to fence the previous coordinator it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which will result in a `raft::request_aborted` exception being thrown. After the fix to #15728 was merged, the exception is caught, but then `sleep_abortable` is called which immediately throws `abort_requested_exception` as it uses the same abort source as the group0 operation. The `fence_previous_coordinator` function which does all those things is not supposed to throw exceptions, if it does - it causes `raft_state_monitor_fiber` to exit, completely disabling the topology coordinator functionality on that node. Modify the code in the following way: - Catch `abort_requested_exception` thrown from `sleep_abortable` and exit the function if it happens. In addition to the described issue, it will also handle the case when abort is requested while `sleep_abortable` happens, - Catch `raft::request_aborted` thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly. Finally, wrap both `fence_previous_coordinator` and `run` functions in a `try` block with `on_fatal_internal_error` in the catch handler in order to implement the behavior that adding `noexcept` was originally supposed to introduce. Fixes: #15747 Closes #15948 * github.com:scylladb/scylladb: raft topology: catch and abort on exceptions from topology_coordinator::run Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept" raft topology: don't print an error when fencing previous coordinator is aborted raft topology: handle abort exceptions from sleeping in fence_previous_coordinator (cherry picked from commit 07e9522)
When topology coordinator tries to fence the previous coordinator it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which will result in a
raft::request_aborted
exception being thrown. After the fix to #15728 was merged, the exception is caught, but thensleep_abortable
is called which immediately throwsabort_requested_exception
as it uses the same abort source as the group0 operation. Thefence_previous_coordinator
function which does all those things is not supposed to throw exceptions, if it does - it causesraft_state_monitor_fiber
to exit, completely disabling the topology coordinator functionality on that node.Modify the code in the following way:
abort_requested_exception
thrown fromsleep_abortable
and exit the function if it happens. In addition to the described issue, it will also handle the case when abort is requested whilesleep_abortable
happens,raft::request_aborted
thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.Finally, wrap both
fence_previous_coordinator
andrun
functions in atry
block withon_fatal_internal_error
in the catch handler in order to implement the behavior that addingnoexcept
was originally supposed to introduce.Fixes: #15747