Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage_service: node_ops_cmd_handler: decommission rollback fix #16508

Closed
wants to merge 1 commit into from

Conversation

gusev-p
Copy link

@gusev-p gusev-p commented Dec 21, 2023

This is a regression after #15903. Before that PR del_leaving_endpoint took IP as a parameter and did nothing if it was called with a non-existent IP. After the PR del_leaving_endpoint takes host_id as a parameter and we need to convert IP to host_id before we can call it. The problem is that under certain conditions the node may be already removed and we can't do so.

This patch restores original behaviour, we just do nothing if we can't find the node by IP.

The problem was revealed by the dtest test_remove_garbage_members_from_group0_after_abort_decommission[Announcing_that_I_have_left_the_ring-]. The test was flaky as in most cases the node died before the
gossiper notification reached all the other nodes. To make it fail consistently and reproduce the problem one can move the info log Announcing that I have after the sleep and add additional sleep after it in
storage_service::leave_ring function.

Fixes #16466

…the node if's already removed

This is a regression after scylladb#15903. Before these changes
del_leaving_endpoint took IP as a parameter and did nothing
if it was called with a non-existent IP.

The problem was revealed by the dtest test_remove_garbage_members_from_group0_after_abort_decommission[Announcing_that_I_have_left_the_ring-]. The test was
flaky as in most cases the node died before the
gossiper notification reached all the other nodes. To make
it fail consistently and reproduce the problem one
can move the info log 'Announcing that I have' after
the sleep and add additional sleep after it in
storage_service::leave_ring function.

Fixes scylladb#16466
@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 3 hr 26 min
  • Builder: spider7.cloudius-systems.com

Copy link
Member

@bhalevy bhalevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I have a stale branch that extends node ops to use host_id:s, so it would more elegant this way since the host_id would be part of the request.

dgarcia360 pushed a commit to dgarcia360/scylla that referenced this pull request Apr 30, 2024
…the node if's already removed

This is a regression after scylladb#15903. Before these changes
del_leaving_endpoint took IP as a parameter and did nothing
if it was called with a non-existent IP.

The problem was revealed by the dtest test_remove_garbage_members_from_group0_after_abort_decommission[Announcing_that_I_have_left_the_ring-]. The test was
flaky as in most cases the node died before the
gossiper notification reached all the other nodes. To make
it fail consistently and reproduce the problem one
can move the info log 'Announcing that I have' after
the sleep and add additional sleep after it in
storage_service::leave_ring function.

Fixes scylladb#16466

Closes scylladb#16508
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants