Test "test_tablets.test_topology_changes" failed in next - TimeoutError #15728
Comments
@kbr-scylla - does that belong to your team?
(Running on spider2)
It's a tablets test, @tgrabiec please take a look. If you find that this is some generic problem with the topology coordinator, then we can take over.
The test failed because the decommission request timed out:
The timeout is ~16 min. Streaming finished successfully within 1 sec, but the process seems to be stuck somewhere after giving up leadership. This node should shut down, but it's not shutting down:
scylla-1209.log:
The node should be told to shut down by the new leader. Looks like the new leader fails, in which case the topology change fiber will exit and not react to anything else:
The fiber probably shouldn't exit if |
failed on @avikivity - could you please backport it to 5.4? Currently it is blocking us during the release process of 5.4.0-rc1.
…s_coordinator() Throwing an error kills the topology coordinator monitor fiber. Instead we retry the operation until it succeeds or the node loses its leadership. This is fine because a quorum is needed for the operation to succeed, and if the quorum is not available the node should relinquish its leadership. Fixes #15728 (cherry picked from commit 65bf587)
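The retry behavior described in this commit message can be illustrated with a minimal, self-contained sketch. This is not the ScyllaDB implementation: `retry_while_leader`, `fence_result`, and both callables are hypothetical stand-ins, and the real code runs on Seastar futures rather than a plain loop.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical outcome type for this sketch; not a type from ScyllaDB.
enum class fence_result { fenced, leadership_lost };

// Retries `op` until it succeeds or `is_leader` reports false.
// `op` models the quorum-dependent operation that may throw;
// `is_leader` models the node's view of its own leadership.
fence_result retry_while_leader(const std::function<void()>& op,
                                const std::function<bool()>& is_leader) {
    while (is_leader()) {
        try {
            op();
            return fence_result::fenced;
        } catch (const std::exception&) {
            // Swallow the error and retry instead of letting it propagate
            // and end the calling fiber. Real code would log and back off.
        }
    }
    return fence_result::leadership_lost;
}
```

The point of the design is that the loop has only two exits, success or loss of leadership, so a transient failure cannot terminate the caller.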
Backported to 5.4. Earlier versions are not vulnerable.
I am still seeing a similar failure. See https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4533/testReport/junit/(root)/test_tablets/Tests___Unit_Tests___test_topology_changes/. Due to #15973, the link might not lead to the failure; refresh the browser a few times to see it.
…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which will result in a `raft::request_aborted` exception being thrown. After the fix to #15728 was merged, the exception is caught, but then `sleep_abortable` is called which immediately throws `abort_requested_exception` as it uses the same abort source as the group0 operation. The `fence_previous_coordinator` function which does all those things is not supposed to throw exceptions, if it does - it causes `raft_state_monitor_fiber` to exit, completely disabling the topology coordinator functionality on that node.

Modify the code in the following way:
- Catch `abort_requested_exception` thrown from `sleep_abortable` and exit the function if it happens. In addition to the described issue, it will also handle the case when abort is requested while `sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a `try` block with `on_fatal_internal_error` in the catch handler in order to implement the behavior that adding `noexcept` was originally supposed to introduce.

Fixes: #15747
Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator
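The two catch rules from this commit message can be sketched in plain C++. Everything here is an assumption made for illustration: the exception structs, `abort_source`, `sleep_abortable_stub`, and `fence_previous_coordinator_sketch` are simplified stand-ins that only borrow the names mentioned above, and the real code is asynchronous Seastar code.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Stand-ins for the exception types named in the commit message.
struct request_aborted : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct abort_requested_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Simplified abort source shared by the group0 operation and the sleep.
struct abort_source {
    bool requested = false;
};

// Models sleep_abortable: throws at once if abort was already requested.
void sleep_abortable_stub(const abort_source& as) {
    if (as.requested) {
        throw abort_requested_exception("abort requested");
    }
}

// Returns true once the operation succeeds, false if the coordinator was
// aborted. Neither abort exception escapes to the caller.
bool fence_previous_coordinator_sketch(const std::function<void()>& group0_op,
                                       const abort_source& as) {
    while (true) {
        try {
            group0_op();
            return true;
        } catch (const request_aborted&) {
            return false;  // aborted during the operation: exit quietly
        } catch (const std::exception&) {
            // Any other failure: fall through, back off, then retry.
        }
        try {
            sleep_abortable_stub(as);
        } catch (const abort_requested_exception&) {
            return false;  // aborted before or during the back-off sleep
        }
    }
}
```

In this model the second `try` is what prevents the failure mode described above: without it, an abort raised by the sleep would leave the function as an exception and end the calling fiber.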
this looks like #15747
…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which will result in a `raft::request_aborted` exception being thrown. After the fix to #15728 was merged, the exception is caught, but then `sleep_abortable` is called which immediately throws `abort_requested_exception` as it uses the same abort source as the group0 operation. The `fence_previous_coordinator` function which does all those things is not supposed to throw exceptions, if it does - it causes `raft_state_monitor_fiber` to exit, completely disabling the topology coordinator functionality on that node.

Modify the code in the following way:
- Catch `abort_requested_exception` thrown from `sleep_abortable` and exit the function if it happens. In addition to the described issue, it will also handle the case when abort is requested while `sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a `try` block with `on_fatal_internal_error` in the catch handler in order to implement the behavior that adding `noexcept` was originally supposed to introduce.

Fixes: #15747
Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator

(cherry picked from commit 07e9522)
https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/staging/job/next/22/testReport/(root)/test_tablets/Tests___Unit_Tests___test_topology_changes_2/