storage_service: node_ops_cmd_handler: decommission rollback fix #16508

gusev-p · 2023-12-21T11:17:25Z

This is a regression introduced by #15903. Before that PR, del_leaving_endpoint took an IP address as a parameter and did nothing if it was called with a non-existent IP. After the PR, del_leaving_endpoint takes a host_id as a parameter, so we need to convert the IP to a host_id before we can call it. The problem is that under certain conditions the node may already have been removed, in which case the conversion is not possible.

This patch restores the original behaviour: we do nothing if we can't find the node by IP.
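The "do nothing if the node is gone" behaviour can be illustrated with a small standalone sketch. This is not the actual ScyllaDB code: `toy_topology`, `find_host_id` and `rollback_leaving` are hypothetical names, `host_id` is simplified to an integer, and only the name `del_leaving_endpoint` comes from the description above (its signature here is an assumption).

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-ins for illustration; not ScyllaDB's real types.
using ip_address = std::string;
using host_id = int;

struct toy_topology {
    std::unordered_map<ip_address, host_id> ip_to_host;
    std::unordered_set<host_id> leaving;

    // Returns std::nullopt when the IP is unknown, e.g. the node was already removed.
    std::optional<host_id> find_host_id(const ip_address& ip) const {
        auto it = ip_to_host.find(ip);
        if (it == ip_to_host.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Assumed shape of the post-#15903 call: it is keyed by host_id, not by IP.
    void del_leaving_endpoint(host_id id) {
        leaving.erase(id);
    }
};

// Rollback step: silently skip the node if it cannot be resolved by IP.
// Returns true if the lookup succeeded, false if the node was not found.
bool rollback_leaving(toy_topology& topo, const ip_address& ip) {
    auto id = topo.find_host_id(ip);
    if (!id) {
        return false; // node already removed: nothing to do
    }
    topo.del_leaving_endpoint(*id);
    return true;
}
```

Returning an empty optional from the lookup keeps the "unknown node" case an ordinary branch rather than an error, which mirrors the pre-#15903 behaviour of ignoring a non-existent IP.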

The problem was revealed by the dtest test_remove_garbage_members_from_group0_after_abort_decommission[Announcing_that_I_have_left_the_ring-]. The test was flaky because in most cases the node died before the gossiper notification reached all the other nodes. To make it fail consistently and reproduce the problem, one can move the info log 'Announcing that I have' after the sleep and add an additional sleep after it in the storage_service::leave_ring function.

Fixes #16466

scylladb-promoter · 2023-12-21T14:48:32Z

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 3 hr 26 min
Builder: spider7.cloudius-systems.com

bhalevy

lgtm. I have a stale branch that extends node ops to use host_ids, so it would be more elegant this way since the host_id would be part of the request.

gusev-p requested a review from tgrabiec as a code owner December 21, 2023 11:17

gusev-p requested a review from bhalevy December 21, 2023 11:17

bhalevy approved these changes Dec 21, 2023

scylladb-promoter closed this in c05fd8c Dec 22, 2023
