migration_manager::sync_schema failed with seastar::rpc::closed_error causing repair to fail #12956

asias · 2023-02-22T05:14:15Z

The issue is split from: #12764 (comment)

asias · 2023-02-22T05:32:48Z

Tested with recent master: d7b6cf0

…otstrap

In storage_service::handle_state_normal, storage_service::notify_joined is called, which drops the rpc connections to the node that becomes normal. This causes rpc calls to that node to fail with a seastar::rpc::closed_error error.

Consider this:

- n1 is in the cluster
- n2 is added to join the cluster
- n2 sees that n1 is in normal status
- n2 starts the bootstrap process
- notify_joined on n2 closes the rpc connection to n1 in the middle of bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair fail in a test that sets ring_delay to zero and does not wait for gossip to settle:

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for keyspace=system_distributed_everywhere, status=failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed)})

This patch fixes the race by waiting for the handle_state_normal handler to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956


In storage_service::handle_state_normal, storage_service::notify_joined is called, which drops the rpc connections to the node that becomes normal. This causes rpc calls to that node to fail with a seastar::rpc::closed_error error.

Consider this:

- n1 is in the cluster
- n2 is added to join the cluster
- n2 sees that n1 is in normal status
- n2 starts the bootstrap process
- notify_joined on n2 closes the rpc connection to n1 in the middle of bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair fail in a test that sets ring_delay to zero and does not wait for gossip to settle:

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for keyspace=system_distributed_everywhere, status=failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed)})

This patch fixes the race by waiting for the handle_state_normal handler to finish before the bootstrap process.

Fixes scylladb#12764
Fixes scylladb#12956

(cherry picked from commit 5363616)
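The race described in the commit message can be illustrated with a small, self-contained model. This is only a sketch in Python asyncio, not Scylla or Seastar code: `Peer`, `Connection`, `ClosedError`, `bootstrap`, and the `wait_for_handler` flag are hypothetical names invented for the illustration, and the cooperative scheduling here is a simplification of what happens on a real node. It shows an in-flight call failing when the handler drops the connection underneath it, and succeeding when the handler is awaited first.

```python
import asyncio


class ClosedError(Exception):
    """Hypothetical stand-in for seastar::rpc::closed_error."""


class Connection:
    def __init__(self):
        self.closed = False


class Peer:
    """Hypothetical model of the rpc connection n2 holds to n1."""

    def __init__(self):
        self._conn = Connection()

    def drop_connection(self):
        # Close the current connection; later calls use a fresh one.
        self._conn.closed = True
        self._conn = Connection()

    async def rpc(self, steps=3):
        conn = self._conn  # the call is bound to the connection it started on
        for _ in range(steps):
            await asyncio.sleep(0)  # yield, as a multi-step exchange would
            if conn.closed:
                raise ClosedError("connection is closed")
        return "ok"


async def handle_state_normal(peer):
    # Models the handler whose notify_joined step drops the connection.
    await asyncio.sleep(0)
    peer.drop_connection()


async def bootstrap(peer, wait_for_handler):
    handler = asyncio.create_task(handle_state_normal(peer))
    if wait_for_handler:
        # The fix, as modeled here: let the handler finish before any rpc.
        await handler
    try:
        return await peer.rpc()
    except ClosedError as e:
        return f"failed: {e}"
    finally:
        await handler  # make sure the handler task is always collected
```

In this model, `asyncio.run(bootstrap(Peer(), wait_for_handler=False))` returns `"failed: connection is closed"`, while passing `wait_for_handler=True` returns `"ok"`, because the call then starts on the connection created after the drop.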

This branch backports to branch-5.2 several fixes related to node operations:

- ba919aa (PR #12980; Fixes: #11011, #12969)
- 5363616 (part of PR #12970; Fixes: #12764, #12956)
- 5856e69 (part of PR #12970)
- 2b44631 (PR #13028; Fixes: #12989)
- 6373452 (PR #12799; Fixes #12798)

Closes #13531

* github.com:scylladb/scylladb:
  Merge 'Do not mask node operation errors' from Benny Halevy
  Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
  storage_service: Wait for normal state handler to finish in replace
  storage_service: Wait for normal state handler to finish in bootstrap
  storage_service: Send heartbeat earlier for node ops

scylladb-promoter closed this as completed in 5363616 Mar 1, 2023

scylladb-promoter added the Backport candidate label Mar 1, 2023

DoronArazii added this to the 5.3 milestone Mar 15, 2023

bhalevy mentioned this issue Apr 16, 2023

[Backport 5.2]: node ops backports #13531

Merged

kbr-scylla mentioned this issue Jun 21, 2023

service: raft: force initial snapshot transfer in new cluster #14336

Closed

denesb removed the Backport candidate label Dec 15, 2023
