
raft topology: handle abort exceptions better in fence_previous_coordinator #15948

Conversation

@piodul (Contributor) commented Nov 4, 2023

When the topology coordinator tries to fence the previous coordinator, it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which results in a raft::request_aborted exception being thrown. After the fix to #15728 was merged, the exception is caught, but then sleep_abortable is called, which immediately throws abort_requested_exception because it uses the same abort source as the group0 operation. The fence_previous_coordinator function, which does all of this, is not supposed to throw exceptions; if it does, it causes raft_state_monitor_fiber to exit, completely disabling the topology coordinator functionality on that node.

Modify the code in the following way:

  • Catch abort_requested_exception thrown from sleep_abortable and exit the function if it happens. In addition to the described issue, this also handles the case when abort is requested while sleep_abortable is in progress.
  • Catch raft::request_aborted thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.

Finally, wrap both fence_previous_coordinator and run functions in a try block with on_fatal_internal_error in the catch handler in order to implement the behavior that adding noexcept was originally supposed to introduce.

Fixes: #15747
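
To illustrate, here is a simplified sketch of the control flow after the change (not the actual ScyllaDB code; group0_fence_operation() is a hypothetical stand-in for the real group0 call, and the abort_source parameter stands in for the coordinator's own abort source):

#include <chrono>
#include <exception>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

// Stand-ins so the sketch is self-contained; the real code lives in
// topology_coordinator and uses its own members and logger.
namespace raft { struct request_aborted : public std::exception {}; }
seastar::future<> group0_fence_operation(seastar::abort_source&); // hypothetical

seastar::future<> fence_previous_coordinator_sketch(seastar::abort_source& as) {
    while (!as.abort_requested()) {
        try {
            co_await group0_fence_operation(as);
            co_return;  // previous coordinator fenced successfully
        } catch (raft::request_aborted&) {
            // We were aborted while fencing: expected during shutdown,
            // so log at debug verbosity and leave instead of retrying.
            co_return;
        } catch (...) {
            // Genuine failure: fall through and retry after a short delay.
        }
        try {
            co_await seastar::sleep_abortable(std::chrono::seconds(1), as);
        } catch (seastar::abort_requested_exception&) {
            // The sleep uses the same abort source as the group0 operation,
            // so it throws immediately once abort is requested; exit rather
            // than letting the exception escape the fiber.
            co_return;
        } catch (...) {
            // Unexpected sleep failure: keep looping (as later refined in v3 below).
        }
    }
}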

@scylladb-promoter (Contributor)

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@gleb-cloudius (Contributor)

> The fence_previous_coordinator function, which does all of this, is not supposed to throw exceptions

It has been noexcept since dcaaa74. Is this PR still relevant?

@piodul (Contributor, Author) commented Nov 6, 2023

Yes, the PR is still relevant. It turns out that in C++, noexcept doesn't work for coroutines (the function and the coroutine returned by the function are two different things, and noexcept only applies to the former: https://devblogs.microsoft.com/oldnewthing/20210426-00/?p=105153).

But even if noexcept worked as we expected, there would still be an issue: there is an unhandled abort exception in the topology coordinator fiber, so we would be looking at a node crash instead of a silent breakage.

I wasn't sure whether to keep the noexcept in the signatures or not, but now I think I should remove it so as not to give the false impression that noexcept works with coroutines.
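
For illustration, a minimal self-contained sketch of the pitfall described in the linked article (assuming Seastar's coroutine support; not code from this PR):

#include <stdexcept>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Looks non-throwing, but the noexcept applies only to the ramp function
// that sets up the coroutine frame. The exception thrown in the body is
// captured by the coroutine machinery and stored in the returned future,
// so the caller sees a failed future rather than std::terminate.
seastar::future<> looks_noexcept_but_is_not() noexcept {
    throw std::runtime_error("stored in the returned future");
    co_return;  // co_return makes this function a coroutine
}

This is why terminate-on-escape has to be re-created explicitly, for example with a try..catch around the body or an assertion at the point where the returned future is awaited.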

@piodul (Contributor, Author) commented Nov 6, 2023

> I wasn't sure whether to keep the noexcept in the signatures or not, but now I think I should remove it so as not to give the false impression that noexcept works with coroutines.

...and replace it with an explicit try..catch + terminate to achieve the intended behavior

@gleb-cloudius (Contributor)

> Yes, the PR is still relevant. It turns out that in C++, noexcept doesn't work for coroutines (the function and the coroutine returned by the function are two different things, and noexcept only applies to the former: https://devblogs.microsoft.com/oldnewthing/20210426-00/?p=105153).

There should be a way to mark a function as not capable of returning an exceptional future.

> But even if noexcept worked as we expected, there would still be an issue: there is an unhandled abort exception in the topology coordinator fiber, so we would be looking at a node crash instead of a silent breakage.

> I wasn't sure whether to keep the noexcept in the signatures or not, but now I think I should remove it so as not to give the false impression that noexcept works with coroutines.

Yes.

    } catch (abort_requested_exception&) {
        // Abort was requested. Break the loop
        break;
    }
Contributor

What if it throws something else?

Contributor Author

If it throws something unexpected, e.g. std::bad_alloc, then I think it's fine to abort, as the situation is already bad or something is broken inside sleep_abortable. This is how it is now handled in v2.

The only alternative I see is to continue the loop, but then we risk loop spinning if sleep_abortable is broken, so I'm not sure this is fine.

Contributor Author

I guess that if we do want topology_coordinator::run to handle all exceptions internally and never cause a crash, then we should continue here. That's how it is already done in cdc_generation_publisher_fiber. I'll adjust the code here for consistency.

@piodul (Contributor, Author) commented Nov 6, 2023

v2: added an explicit try..catch + on_fatal_internal_error to fence_previous_coordinator and run functions

@@ -2530,6 +2539,9 @@ future<> topology_coordinator::run() noexcept {

        co_await _async_gate.close();
        co_await std::move(cdc_generation_publisher);
    } catch (...) {
        on_fatal_internal_error(slogger, format("raft topology: unhandled exception in topology_coordinator::run: {}", std::current_exception()));
    }
Contributor

I do not like this approach. The code above already handled all exceptions. It is fine to use the plain noexcept C++ keyword to mark a function as non-throwing, more for documentation purposes than to actually terminate the program. Re-implementing noexcept semantics manually just adds noise to the code. If you want to assert, you can do it in raft_state_monitor_fiber when waiting for the _topology_change_coordinator future, which is supposed to be non-exceptional.

…s_coordinator

The fence_previous_coordinator function has a retry loop: if it fails to
perform a group0 operation, it will try again after a 1 second delay.
However, if the topology coordinator is aborted while it waits, an
exception will be thrown and will be propagated out of the function. The
function is supposed to handle all exceptions internally, so this is not
desired.

Fix this by catching the abort_requested_exception and returning from
the function if the exception is caught.
… is aborted

An attempt to fence the previous coordinator may fail because the
current coordinator is aborted. It's not a critical error and it can
happen during normal operations, so lower the verbosity used to print a
message about this error to 'debug'.

Return from the function immediately in that case - the sleep_abortable
call that happens as the next step would fail with
abort_requested_exception anyway, so make it more explicit.
…n function as noexcept"

This reverts commit dcaaa74. The
`noexcept` specifier that it added is only relevant to the function and
not the coroutine returned from that function. This was not the
intention and it looks confusing now, so remove it.
@piodul piodul force-pushed the dont-throw-from-fence-previous-coordinator branch from a8d5c6a to 8829697 on November 6, 2023 at 11:03
@piodul (Contributor, Author) commented Nov 6, 2023

v3:

  • removed noexcept and the code that simulates intended semantics from signatures of run and fence_previous_coordinator
  • assert in raft_state_monitor_fiber that the future returned from topology_coordinator::run is not exceptional
  • in fence_previous_coordinator, continue the loop if sleeping fails due to exceptions other than abort_requested_exception

    try {
        co_await coordinator->run();
    } catch (...) {
        on_fatal_internal_error(slogger, format("raft topology: unhandled exception in topology_coordinator_run: {}", std::current_exception()));
@kostja (Contributor) commented Nov 6, 2023

Shouldn't we initiate a raft leader step down in this case?
@gleb-cloudius (Contributor)

Good idea indeed. The question is whether this should be done everywhere we assert(), or in generic code that handles the exit caused by an assertion.

Contributor

Assertion failure means there's a bug we need to fix, not do a leader stepdown

Contributor Author

Could you tell me why it would be beneficial to step down in this case? This catch block is supposed to be triggered on unhandled exceptions; wouldn't it make more sense to abort ASAP, as we have encountered a bug?

Contributor

> Assertion failure means there's a bug we need to fix, not do a leader stepdown

You are not wrong, but if a node crashes it would be beneficial for the cluster to find out about it faster. If there is a safe way to do it (safe in the sense that it cannot cause harm even if it does not work properly), then why not do it?

Contributor

I'd put it into a broader context. We don't know here whether or not it is a bug - all we know is we caught an exception we can't handle. I think there is a set of reasonable cases when the topology coordinator may fail with an uncaught exception: be it a memory issue or an error from the operating system. In this case we can either fail the node gracefully, or just abort it and disrupt cluster availability.

Contributor

I am also OK with a follow-up patch, but it looks like a call for an error injection test.

Contributor

> We don't know here whether or not it is a bug - all we know is we caught an exception we can't handle. I think there is a set of reasonable cases when the topology coordinator may fail with an uncaught exception

The point of this try..catch is to simulate noexcept but for coroutines; noexcept immediately aborts if the function throws. Any exception flying out of topology coordinator code indicates a bug in the code, because the topology coordinator fiber is supposed to handle all exceptions internally.

Contributor

> I am also OK with a follow-up patch, but it looks like a call for an error injection test.

That would be like asserting that a boolean is true at some point in the code, and then adding an injection just before that assertion that sets the bool to false, to check that the assertion indeed fires.

This try..catch is only there because of a defect of the C++ language, we shouldn't be having this discussion.

Contributor

Maybe I am missing some context, but what makes you think that coordinator->run() cannot fail for a good, valid reason other than abort_requested?

@scylladb-promoter (Contributor)

🔴 CI State: FAILURE

✅ - Build
✅ - Sanity Tests
❌ - Unit Tests

Failed Tests (2/21259):

Build Details:

…r::run

The `topology_coordinator` function is supposed to handle all of the
exceptions internally. Assert, at runtime, that this is the case by
wrapping the `run` invocation with a try..catch; in case of an
exception, step down as a leader first and then abort.
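
A rough sketch of that flow (not the merged code; step_down_as_leader() is a hypothetical placeholder for whatever stepdown call the raft server exposes, and slogger/on_fatal_internal_error are the helpers already used above):

    try {
        co_await coordinator->run();
    } catch (...) {
        auto ex = std::current_exception();
        try {
            // Give up raft leadership first so the rest of the cluster can
            // elect a new topology coordinator quickly...
            co_await step_down_as_leader();   // hypothetical stepdown call
        } catch (...) {
            // ...but stepping down is best effort and must not mask the
            // original error.
        }
        on_fatal_internal_error(slogger, format(
            "raft topology: unhandled exception in topology_coordinator::run: {}", ex));
    }
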
@piodul (Contributor, Author) commented Nov 6, 2023

> 🔴 CI State: FAILURE
>
> ✅ - Build ✅ - Sanity Tests ❌ - Unit Tests
>
> Failed Tests (2/21259):
>
> Build Details:

Strange failure, the test complains that a message didn't appear in the nodes' logs:


=================================== FAILURES ===================================
_______________________ test_topology_streaming_failure ________________________

request = <FixtureRequest for <Function test_topology_streaming_failure>>
manager = <test.pylib.manager_client.ManagerClient object at 0x7fc04d366a90>

    @pytest.mark.asyncio
    async def test_topology_streaming_failure(request, manager: ManagerClient):
        """Fail streaming while doing a topology operation"""
        # decommission failure
        servers = await manager.running_servers()
        await manager.api.enable_injection(servers[2].ip_addr, 'stream_ranges_fail', one_shot=True)
        await manager.decommission_node(servers[2].server_id, expected_error="Decommission failed. See earlier errors")
        servers = await manager.running_servers()
        assert len(servers) == 3
        logs = [await manager.server_open_log(srv.server_id) for srv in servers]
        matches = [await log.grep("storage_service - rollback.*after decommissioning failure to state normal") for log in logs]
>       assert sum(len(x) for x in matches) == 1
E       assert 0 == 1
E        +  where 0 = sum(<generator object test_topology_streaming_failure.<locals>.<genexpr> at 0x7fc04d29ece0>)

test/topology/test_topology_failure_recovery.py:25: AssertionError

but it does appear in the logs:

INFO  2023-11-06 14:59:41,878 [shard 0:strm] load_balancer - Prepared 0 migrations in DC datacenter1
INFO  2023-11-06 14:59:41,878 [shard 0:strm] load_balancer - Prepared 0 migrations
ERROR 2023-11-06 14:59:41,971 [shard 0:strm] storage_service - raft topology: send_raft_topology_cmd(stream_ranges) failed with exception (node state is decommissioning): std::runtime_error (failed status returned from 9593501a-a909-4625-840f-d86e12bb4e87/127.182.241.49)
INFO  2023-11-06 14:59:41,976 [shard 0:strm] storage_service - raft topology: start rolling back topology change
INFO  2023-11-06 14:59:42,076 [shard 0:strm] storage_service - rollback 9593501a-a909-4625-840f-d86e12bb4e87 after decommissioning failure to state normal
INFO  2023-11-06 14:59:42,109 [shard 0:strm] load_balancer - Examining DC datacenter1

Perhaps a race condition between writing to logs and grepping them?

In any case, I don't think it's related, but I'll run this test multiple times before submitting v4.

@piodul piodul force-pushed the dont-throw-from-fence-previous-coordinator branch from 8829697 to 85516c9 on November 7, 2023 at 09:33
@piodul (Contributor, Author) commented Nov 7, 2023

I ran test_topology_streaming_failure ~100 times with my PR and ~100 times without. I managed to reproduce it only once with my PR, and zero times without it. Unless it happens in the CI again, I don't think this should block merging the PR; I really don't think it's related.

I'll create an issue for it.

@piodul (Contributor, Author) commented Nov 7, 2023

v4: added stepdown before on_fatal_internal_error

@gleb-cloudius (Contributor)

> I ran test_topology_streaming_failure ~100 times with my PR and ~100 times without. I managed to reproduce it only once with my PR, and zero times without it. Unless it happens in the CI again, I don't think this should block merging the PR; I really don't think it's related.
>
> I'll create an issue for it.

I saw it yesterday here without your patch: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4548/.

@piodul (Contributor, Author) commented Nov 7, 2023

> I'll create an issue for it.

#15980

@scylladb-promoter (Contributor)

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@scylladb-promoter scylladb-promoter merged commit 07e9522 into scylladb:master Nov 7, 2023
3 checks passed
kbr-scylla added a commit that referenced this pull request Nov 16, 2023
…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it
performs a group0 operation. The current topology coordinator might be
aborted in the meantime, which will result in a `raft::request_aborted`
exception being thrown. After the fix to #15728 was
merged, the exception is caught, but then `sleep_abortable` is called
which immediately throws `abort_requested_exception` as it uses the same
abort source as the group0 operation. The `fence_previous_coordinator`
function, which does all of this, is not supposed to throw
exceptions; if it does, it causes `raft_state_monitor_fiber` to exit,
completely disabling the topology coordinator functionality on that
node.

Modify the code in the following way:

- Catch `abort_requested_exception` thrown from `sleep_abortable` and
  exit the function if it happens. In addition to the described issue,
it will also handle the case when abort is requested while
`sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the
  exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a
`try` block with `on_fatal_internal_error` in the catch handler in order
to implement the behavior that adding `noexcept` was originally supposed
to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator

(cherry picked from commit 07e9522)