Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migration_manager::sync_schema failed with seastar::rpc::closed_error causing repair to fail #12956

Closed
asias opened this issue Feb 22, 2023 · 1 comment
Milestone

Comments

@asias
Copy link
Contributor

asias commented Feb 22, 2023

The issue is split from: #12764 (comment)

@asias
Copy link
Contributor Author

asias commented Feb 22, 2023

Tested with recent master: d7b6cf0

asias added a commit to asias/scylla that referenced this issue Feb 23, 2023
…otstrap

In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956
asias added a commit to asias/scylla that referenced this issue Feb 24, 2023
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956
asias added a commit to asias/scylla that referenced this issue Feb 24, 2023
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956
asias added a commit to asias/scylla that referenced this issue Feb 27, 2023
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956
@DoronArazii DoronArazii added this to the 5.3 milestone Mar 15, 2023
bhalevy pushed a commit to bhalevy/scylla that referenced this issue Apr 16, 2023
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956

(cherry picked from commit 5363616)
bhalevy pushed a commit to bhalevy/scylla that referenced this issue Apr 30, 2023
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956

(cherry picked from commit 5363616)
denesb added a commit that referenced this issue May 17, 2023
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa (PR #12980; Fixes: #11011, #12969)
- 5363616 (part of PR #12970; Fixes: #12764, #12956)
- 5856e69 (part of PR #12970)
- 2b44631 (PR #13028; Fixes: #12989)
- 6373452 (PR #12799; Fixes #12798)

Closes #13531

* github.com:scylladb/scylladb:
  Merge 'Do not mask node operation errors' from Benny Halevy
  Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
  storage_service: Wait for normal state handler to finish in replace
  storage_service: Wait for normal state handler to finish in bootstrap
  storage_service: Send heartbeat earlier for node ops
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants