
raft topology: handle abort exceptions better in fence_previous_coordinator #15948

Conversation

@piodul (Contributor) commented Nov 4, 2023

When the topology coordinator tries to fence the previous coordinator, it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which results in a raft::request_aborted exception being thrown. After the fix to #15728 was merged, the exception is caught, but then sleep_abortable is called, which immediately throws abort_requested_exception because it uses the same abort source as the group0 operation. The fence_previous_coordinator function, which does all of this, is not supposed to throw exceptions; if it does, it causes raft_state_monitor_fiber to exit, completely disabling the topology coordinator functionality on that node.

Modify the code in the following way:

  • Catch abort_requested_exception thrown from sleep_abortable and exit the function if it happens. In addition to the described issue, this also handles the case when abort is requested while sleep_abortable is in progress.
  • Catch raft::request_aborted thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.

Finally, wrap both fence_previous_coordinator and run functions in a try block with on_fatal_internal_error in the catch handler in order to implement the behavior that adding noexcept was originally supposed to introduce.

Fixes: #15747
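
To illustrate, here is a simplified sketch of the control flow after the change (not the actual ScyllaDB code; group0_fence_operation() is a hypothetical stand-in for the real group0 call, and the abort_source parameter stands in for the coordinator's own abort source):

#include <chrono>
#include <exception>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

// Stand-ins so the sketch is self-contained; the real code lives in
// topology_coordinator and uses its own members and logger.
namespace raft { struct request_aborted : public std::exception {}; }
seastar::future<> group0_fence_operation(seastar::abort_source&); // hypothetical

seastar::future<> fence_previous_coordinator_sketch(seastar::abort_source& as) {
    while (!as.abort_requested()) {
        try {
            co_await group0_fence_operation(as);
            co_return;  // previous coordinator fenced successfully
        } catch (raft::request_aborted&) {
            // We were aborted while fencing: expected during shutdown,
            // so log at debug verbosity and leave instead of retrying.
            co_return;
        } catch (...) {
            // Genuine failure: fall through and retry after a short delay.
        }
        try {
            co_await seastar::sleep_abortable(std::chrono::seconds(1), as);
        } catch (seastar::abort_requested_exception&) {
            // The sleep uses the same abort source as the group0 operation,
            // so it throws immediately once abort is requested; exit rather
            // than letting the exception escape the fiber.
            co_return;
        } catch (...) {
            // Unexpected sleep failure: keep looping (as later refined in v3 below).
        }
    }
}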

@scylladb-promoter (Contributor)

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@gleb-cloudius (Contributor)

> The fence_previous_coordinator function, which does all of this, is not supposed to throw exceptions

It has been noexcept since dcaaa74. Is this PR still relevant?

@piodul (Contributor, Author) commented Nov 6, 2023

Yes, the PR is still relevant. It turns out that in C++, noexcept doesn't work for coroutines (the function and the coroutine returned by the function are two different things, and noexcept only applies to the former: https://devblogs.microsoft.com/oldnewthing/20210426-00/?p=105153).

But even if noexcept worked as we expected, there would still be an issue: there is an unhandled abort exception in the topology coordinator fiber, so we would be looking at a node crash instead of a silent breakage.

I wasn't sure whether to keep the noexcept in the signatures or not, but now I think I should remove it so as not to give the false impression that noexcept works with coroutines.
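
For illustration, a minimal self-contained sketch of the pitfall described in the linked article (assuming Seastar's coroutine support; not code from this PR):

#include <stdexcept>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Looks non-throwing, but the noexcept applies only to the ramp function
// that sets up the coroutine frame. The exception thrown in the body is
// captured by the coroutine machinery and stored in the returned future,
// so the caller sees a failed future rather than std::terminate.
seastar::future<> looks_noexcept_but_is_not() noexcept {
    throw std::runtime_error("stored in the returned future");
    co_return;  // co_return makes this function a coroutine
}

This is why terminate-on-escape has to be re-created explicitly, for example with a try..catch around the body or an assertion at the point where the returned future is awaited.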

@piodul (Contributor, Author) commented Nov 6, 2023

> I wasn't sure whether to keep the noexcept in the signatures or not, but now I think I should remove it so as not to give the false impression that noexcept works with coroutines.

...and replace it with an explicit try..catch + terminate to achieve the intended behavior

@gleb-cloudius (Contributor)

> Yes, the PR is still relevant. It turns out that in C++, noexcept doesn't work for coroutines (the function and the coroutine returned by the function are two different things, and noexcept only applies to the former: https://devblogs.microsoft.com/oldnewthing/20210426-00/?p=105153).

There should be a way to mark a function as not capable of returning an exceptional future.

> But even if noexcept worked as we expected, there would still be an issue: there is an unhandled abort exception in the topology coordinator fiber, so we would be looking at a node crash instead of a silent breakage.

> I wasn't sure whether to keep the noexcept in the signatures or not, but now I think I should remove it so as not to give the false impression that noexcept works with coroutines.

Yes.

    } catch (abort_requested_exception&) {
        // Abort was requested. Break the loop
        break;
    }
Contributor

What if it throws something else?

Contributor Author

If it throws something unexpected, e.g. std::bad_alloc, then I think it's fine to abort, as the situation is already bad or something is broken inside sleep_abortable. This is how it is now handled in v2.

The only alternative I see is to continue the loop, but then we risk loop spinning if sleep_abortable is broken, so I'm not sure this is fine.

Contributor Author

I guess that if we do want topology_coordinator::run to handle all exceptions internally and never cause a crash, then we should continue here. That's how it is already done in cdc_generation_publisher_fiber. I'll adjust the code here for consistency.

@piodul (Contributor, Author) commented Nov 6, 2023

v2: added an explicit try..catch + on_fatal_internal_error to fence_previous_coordinator and run functions

@@ -2530,6 +2539,9 @@ future<> topology_coordinator::run() noexcept {

        co_await _async_gate.close();
        co_await std::move(cdc_generation_publisher);
    } catch (...) {
        on_fatal_internal_error(slogger, format("raft topology: unhandled exception in topology_coordinator::run: {}", std::current_exception()));
    }
Contributor

I do not like this approach. The code above already handled all exceptions. It is fine to use the plain noexcept C++ keyword to mark a function as non-throwing, more for documentation purposes than to actually terminate the program. Re-implementing noexcept semantics manually just adds noise to the code. If you want to assert, you can do it in raft_state_monitor_fiber when waiting for the _topology_change_coordinator future, which is supposed to be non-exceptional.

…s_coordinator

The fence_previous_coordinator function has a retry loop: if it fails to
perform a group0 operation, it will try again after a 1 second delay.
However, if the topology coordinator is aborted while it waits, an
exception will be thrown and will be propagated out of the function. The
function is supposed to handle all exceptions internally, so this is not
desired.

Fix this by catching the abort_requested_exception and returning from
the function if the exception is caught.
… is aborted

An attempt to fence the previous coordinator may fail because the
current coordinator is aborted. It's not a critical error and it can
happen during normal operations, so lower the verbosity used to print a
message about this error to 'debug'.

Return from the function immediately in that case - the sleep_abortable
call that happens as the next step would fail with
abort_requested_exception anyway, so make it more explicit.
…n function as noexcept"

This reverts commit dcaaa74. The
`noexcept` specifier that it added is only relevant to the function and
not the coroutine returned from that function. This was not the
intention and it looks confusing now, so remove it.
@piodul piodul force-pushed the dont-throw-from-fence-previous-coordinator branch from a8d5c6a to 8829697 on November 6, 2023 at 11:03
@piodul (Contributor, Author) commented Nov 6, 2023

v3:

  • removed noexcept and the code that simulates intended semantics from signatures of run and fence_previous_coordinator
  • assert in raft_state_monitor_fiber that the future returned from topology_coordinator::run is not exceptional
  • in fence_previous_coordinator, continue the loop if sleeping fails due to exceptions other than abort_requested_exception

    try {
        co_await coordinator->run();
    } catch (...) {
        on_fatal_internal_error(slogger, format("raft topology: unhandled exception in topology_coordinator_run: {}", std::current_exception()));
@kostja (Contributor) commented Nov 6, 2023

Shouldn't we initiate a raft leader step down in this case?
@gleb-cloudius (Contributor)

Good idea indeed. The question is whether this should be done everywhere we assert(), or in generic code that handles the exit caused by an assertion.

Contributor

Assertion failure means there's a bug we need to fix, not do a leader stepdown

Contributor Author

Could you tell me why it would be beneficial to step down in this case? This catch block is supposed to be triggered on unhandled exceptions; wouldn't it make more sense to abort ASAP, as we have encountered a bug?

Contributor

> Assertion failure means there's a bug we need to fix, not do a leader stepdown

You are not wrong, but if a node crashes it would be beneficial for the cluster to find out about it faster. If there is a safe way to do it (safe in the sense that it cannot cause harm even if it does not work properly), then why not do it?

Contributor

I'd put it into a broader context. We don't know here whether or not it is a bug - all we know is we caught an exception we can't handle. I think there is a set of reasonable cases when the topology coordinator may fail with an uncaught exception: be it a memory issue or an error from the operating system. In this case we can either fail the node gracefully, or just abort it and disrupt cluster availability.

Contributor

I am also OK with a follow-up patch, but it looks like a call for an error injection test.

Contributor

> We don't know here whether or not it is a bug - all we know is we caught an exception we can't handle. I think there is a set of reasonable cases when the topology coordinator may fail with an uncaught exception

The point of this try..catch is to simulate noexcept but for coroutines; noexcept immediately aborts if the function throws. Any exception flying out of topology coordinator code indicates a bug in the code, because the topology coordinator fiber is supposed to handle all exceptions internally.

Contributor

> I am also OK with a follow-up patch, but it looks like a call for an error injection test.

That would be like asserting that a boolean is true at some point in the code, and then adding an injection just before that assertion that sets the bool to false, to check that the assertion indeed fires.

This try..catch is only there because of a defect of the C++ language, we shouldn't be having this discussion.

Contributor

Maybe I am missing some context, but what makes you think that coordinator->run() cannot fail for a good, valid reason other than abort_requested?

@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

✅ - Build
✅ - Sanity Tests
❌ - Unit Tests

Failed Tests (2/21259):

Build Details:

…r::run

The `topology_coordinator` function is supposed to handle all of the
exceptions internally. Assert, at runtime, that this is the case by
wrapping the `run` invocation with a try..catch; in case of an
exception, step down as a leader first and then abort.
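
A rough sketch of that flow (not the merged code; step_down_as_leader() is a hypothetical placeholder for whatever stepdown call the raft server exposes, and slogger/on_fatal_internal_error are the helpers already used above):

    try {
        co_await coordinator->run();
    } catch (...) {
        auto ex = std::current_exception();
        try {
            // Give up raft leadership first so the rest of the cluster can
            // elect a new topology coordinator quickly...
            co_await step_down_as_leader();   // hypothetical stepdown call
        } catch (...) {
            // ...but stepping down is best effort and must not mask the
            // original error.
        }
        on_fatal_internal_error(slogger, format(
            "raft topology: unhandled exception in topology_coordinator::run: {}", ex));
    }
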
@piodul (Contributor, Author) commented Nov 6, 2023

> 🔴 CI State: FAILURE
>
> ✅ - Build ✅ - Sanity Tests ❌ - Unit Tests
>
> Failed Tests (2/21259):
>
> Build Details:

Strange failure, the test complains that a message didn't appear in the nodes' logs:


=================================== FAILURES ===================================
_______________________ test_topology_streaming_failure ________________________

request = <FixtureRequest for <Function test_topology_streaming_failure>>
manager = <test.pylib.manager_client.ManagerClient object at 0x7fc04d366a90>

    @pytest.mark.asyncio
    async def test_topology_streaming_failure(request, manager: ManagerClient):
        """Fail streaming while doing a topology operation"""
        # decommission failure
        servers = await manager.running_servers()
        await manager.api.enable_injection(servers[2].ip_addr, 'stream_ranges_fail', one_shot=True)
        await manager.decommission_node(servers[2].server_id, expected_error="Decommission failed. See earlier errors")
        servers = await manager.running_servers()
        assert len(servers) == 3
        logs = [await manager.server_open_log(srv.server_id) for srv in servers]
        matches = [await log.grep("storage_service - rollback.*after decommissioning failure to state normal") for log in logs]
>       assert sum(len(x) for x in matches) == 1
E       assert 0 == 1
E        +  where 0 = sum(<generator object test_topology_streaming_failure.<locals>.<genexpr> at 0x7fc04d29ece0>)

test/topology/test_topology_failure_recovery.py:25: AssertionError

but it does appear in the logs:

INFO  2023-11-06 14:59:41,878 [shard 0:strm] load_balancer - Prepared 0 migrations in DC datacenter1
INFO  2023-11-06 14:59:41,878 [shard 0:strm] load_balancer - Prepared 0 migrations
ERROR 2023-11-06 14:59:41,971 [shard 0:strm] storage_service - raft topology: send_raft_topology_cmd(stream_ranges) failed with exception (node state is decommissioning): std::runtime_error (failed status returned from 9593501a-a909-4625-840f-d86e12bb4e87/127.182.241.49)
INFO  2023-11-06 14:59:41,976 [shard 0:strm] storage_service - raft topology: start rolling back topology change
INFO  2023-11-06 14:59:42,076 [shard 0:strm] storage_service - rollback 9593501a-a909-4625-840f-d86e12bb4e87 after decommissioning failure to state normal
INFO  2023-11-06 14:59:42,109 [shard 0:strm] load_balancer - Examining DC datacenter1

Perhaps a race condition between writing to logs and grepping them?

In any case, I don't think it's related, but I'll run this test multiple times before submitting v4.

@piodul piodul force-pushed the dont-throw-from-fence-previous-coordinator branch from 8829697 to 85516c9 on November 7, 2023 at 09:33
@piodul (Contributor, Author) commented Nov 7, 2023

I ran test_topology_streaming_failure ~100 times with my PR and ~100 times without. I managed to reproduce it only once with my PR, and zero times without it. Unless it happens in the CI again, I don't think this should block merging the PR; I really don't think it's related.

I'll create an issue for it.

@piodul (Contributor, Author) commented Nov 7, 2023

v4: added stepdown before on_fatal_internal_error

@gleb-cloudius (Contributor)

> I ran test_topology_streaming_failure ~100 times with my PR and ~100 times without. I managed to reproduce it only once with my PR, and zero times without it. Unless it happens in the CI again, I don't think this should block merging the PR; I really don't think it's related.
>
> I'll create an issue for it.

I saw it yesterday here without your patch: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4548/.

@piodul (Contributor, Author) commented Nov 7, 2023

> I'll create an issue for it.

#15980

@scylladb-promoter (Contributor)

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@scylladb-promoter scylladb-promoter merged commit 07e9522 into scylladb:master Nov 7, 2023
3 checks passed
kbr-scylla added a commit that referenced this pull request Nov 16, 2023
…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it
performs a group0 operation. The current topology coordinator might be
aborted in the meantime, which will result in a `raft::request_aborted`
exception being thrown. After the fix to #15728 was
merged, the exception is caught, but then `sleep_abortable` is called
which immediately throws `abort_requested_exception` as it uses the same
abort source as the group0 operation. The `fence_previous_coordinator`
function, which does all of this, is not supposed to throw
exceptions; if it does, it causes `raft_state_monitor_fiber` to exit,
completely disabling the topology coordinator functionality on that
node.

Modify the code in the following way:

- Catch `abort_requested_exception` thrown from `sleep_abortable` and
  exit the function if it happens. In addition to the described issue,
it will also handle the case when abort is requested while
`sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the
  exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a
`try` block with `on_fatal_internal_error` in the catch handler in order
to implement the behavior that adding `noexcept` was originally supposed
to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator

(cherry picked from commit 07e9522)