Tests / Unit Tests / non-boost tests.topology_experimental_raft.test_raft_cluster_features.debug is flaky #15747
Comments
@kbr-scylla - please have someone look at the latest failure above.
The node logs from the original report are gone, but there are still some from the second report here. This excerpt from https://jenkins.scylladb.com/job/scylla-master/job/build/1671/artifact/testlog/aarch64/debug/scylla-305.log looks very similar to #15728 (comment), and the explanation is the same: the topology coordinator failed to fence the previous coordinator and threw an exception, which caused the topology coordinator to exit and not do anything else.
So it's a duplicate of #15728. It will be fixed by Gleb's patch, which is already queued.
Unfortunately, it's not fixed yet, we had a similar failure today here: https://jenkins.scylladb.com/job/scylla-master/job/build/1684/ Logs are slightly different (there is a difference in the last message):
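The fix mentioned above has a bug: although it catches the exception from the group0 operation, it then calls `sleep_abortable`, which immediately throws `abort_requested_exception`. Will send a fix shortly.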
We should backport to 5.4 to prevent flakiness.
…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which will result in a `raft::request_aborted` exception being thrown. After the fix to #15728 was merged, the exception is caught, but then `sleep_abortable` is called which immediately throws `abort_requested_exception` as it uses the same abort source as the group0 operation.

The `fence_previous_coordinator` function which does all those things is not supposed to throw exceptions, if it does - it causes `raft_state_monitor_fiber` to exit, completely disabling the topology coordinator functionality on that node.

Modify the code in the following way:

- Catch `abort_requested_exception` thrown from `sleep_abortable` and exit the function if it happens. In addition to the described issue, it will also handle the case when abort is requested while `sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a `try` block with `on_fatal_internal_error` in the catch handler in order to implement the behavior that adding `noexcept` was originally supposed to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator

(cherry picked from commit 07e9522)
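
For readers without the patch at hand, the exception handling described in the commit message can be sketched roughly as follows. This is a simplified, synchronous illustration in plain C++, not the actual ScyllaDB or Seastar code: the `abort_source` struct, the exception types, the `fence_result` enum, the `group0_op` callable, and the retry loop are all hypothetical stand-ins. Only the names `fence_previous_coordinator`, `sleep_abortable`, `request_aborted`, and `abort_requested_exception` come from the message above.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-ins for the exception types named in the commit message.
struct request_aborted : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct abort_requested_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical, minimal abort source shared by the group0 operation and the sleep.
struct abort_source {
    bool requested = false;
};

// Stand-in for sleep_abortable: throws at once if an abort was already requested.
void sleep_abortable(const abort_source& as) {
    if (as.requested) {
        throw abort_requested_exception("abort requested");
    }
    // A real implementation would wait here; this sketch returns immediately.
}

enum class fence_result { fenced, aborted };

// Sketch of the fixed control flow: neither kind of abort exception escapes.
// The retry loop on other errors is an assumption made for this illustration.
fence_result fence_previous_coordinator(const std::function<void()>& group0_op,
                                        const abort_source& as) {
    while (true) {
        try {
            group0_op();
            return fence_result::fenced;
        } catch (const request_aborted&) {
            // Aborted during the group0 operation: the real code logs this
            // with lower verbosity; here we just exit explicitly.
            return fence_result::aborted;
        } catch (const std::exception&) {
            // Any other failure: fall through, sleep, and retry.
        }
        try {
            sleep_abortable(as);
        } catch (const abort_requested_exception&) {
            // Abort requested before or during the sleep: exit instead of
            // letting the exception propagate to the caller.
            return fence_result::aborted;
        }
    }
}
```

The point of the sketch is that both abort paths return normally, so the caller (the monitor fiber, in the commit's terms) keeps running. Any exception that still escapes would then reach the outer `try` block the commit message describes.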
Queued backport to 5.4 (will eventually end up in 2024.1).
Removed labels.
Seen in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4217/testReport/junit/(root)/test_raft_cluster_features/Tests___Unit_Tests___test_rolling_upgrade_happy_path_5/
AssertionError: Deadline exceeded, failing test.
Ran on an m5ad.8xlarge instance, so it shouldn't be an infra problem even though it's a timeout.