Decommission can fail due to concurrent fiber dropping RPC connections #11780
This causes `test_topology::test_decommission_node_add_column` to be disabled:

> Flaky due to #11780; causes next promotion failures. We can re-enable it after the issue is fixed or a workaround is found.
Not all, but only those whose location was unknown at the time the first RPC verb in this verb group was sent to them.
I wonder if, instead of closing the client immediately, we could put it in the background and drain all existing RPCs. The drain could potentially be long, though, e.g. RPC streams could be open. And we'd need to ensure that the drains are interrupted when we close the messaging service so we don't prevent shutdowns.
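A minimal standalone sketch of that idea, using hypothetical names (`rpc_client`, `client_registry`, `drain`) rather than Scylla's actual messaging-service API, and plain threads instead of seastar futures; as noted above, a real implementation would also have to interrupt in-flight drains on shutdown rather than block:

```cpp
#include <future>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a per-endpoint RPC client; not Scylla's type.
struct rpc_client {
    void drain() { /* wait for all in-flight RPCs to finish */ }
};

class client_registry {
    std::mutex _mu;
    std::vector<std::future<void>> _draining; // backgrounded drains
public:
    // Instead of destroying the client immediately (failing in-flight RPCs),
    // move it to the background and let existing RPCs run to completion.
    void replace_client(std::shared_ptr<rpc_client> old_client) {
        auto fut = std::async(std::launch::async, [c = std::move(old_client)] {
            c->drain(); // may be long, e.g. if RPC streams are open
        });
        std::lock_guard<std::mutex> g(_mu);
        _draining.push_back(std::move(fut));
    }
    // On messaging-service shutdown we must not wait forever; a real
    // implementation would interrupt the drains instead of blocking here.
    void stop() {
        std::lock_guard<std::mutex> g(_mu);
        for (auto& f : _draining) f.wait();
        _draining.clear();
    }
};
```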
message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client

`get_rpc_client` calculates a `topology_ignored` field when creating a client, which says whether the client's endpoint had topology information when this client was created. This is later used to check whether that client needs to be dropped and replaced with a new client that uses the correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for pending endpoints, even though we had topology information for them. This would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from `topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes scylladb#11780. The verbs used by the decommission operation use RPC client index 1 (see `do_get_rpc_client_idx` in message/messaging_service.cc). From local testing with additional logging I found that by the time this client is created (i.e. when the first verb in this group is used), we already know the topology. The node is pending at that point, hence the bug would cause us to assume we don't know the topology, leading to the RPC client being dropped later, possibly in the middle of a decommission operation.

Fixes: scylladb#11780
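To illustrate the shape of the bug, here is a hedged sketch with simplified, hypothetical types (a `std::set`-based `topology`, string endpoints), not the actual Scylla code: when `with_pending` defaulted to "no", a lookup for a pending endpoint reported no topology info, so `topology_ignored` came out `true` even though the topology was known:

```cpp
#include <cassert>
#include <set>
#include <string>

enum class with_pending { no, yes };

// Simplified stand-in for the topology class; not the real one.
struct topology {
    std::set<std::string> normal_endpoints;
    std::set<std::string> pending_endpoints;

    // Requiring the caller to pass with_pending explicitly (no default
    // argument) is the "avoid similar bugs" part of the fix.
    bool has_endpoint(const std::string& ep, with_pending wp) const {
        return normal_endpoints.count(ep)
            || (wp == with_pending::yes && pending_endpoints.count(ep));
    }
};

int main() {
    topology t;
    t.pending_endpoints.insert("127.0.0.2"); // a bootstrapping node

    // Buggy calculation: pending endpoints were not considered, so the
    // client was marked as created with ignored topology.
    bool topology_ignored_buggy = !t.has_endpoint("127.0.0.2", with_pending::no);
    // Fixed calculation: pending endpoints count as known topology.
    bool topology_ignored_fixed = !t.has_endpoint("127.0.0.2", with_pending::yes);

    assert(topology_ignored_buggy == true);   // caused spurious client drops
    assert(topology_ignored_fixed == false);  // client survives decommission
}
```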
Also improve the test to increase the probability of reproducing scylladb#11780 by injecting sleeps in appropriate places. Without the fix for scylladb#11780 from the earlier commit, the test reproduces the issue in roughly half of all runs in a dev build on my laptop.
Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun

`get_rpc_client` calculates a `topology_ignored` field when creating a client, which says whether the client's endpoint had topology information when this client was created. This is later used to check whether that client needs to be dropped and replaced with a new client that uses the correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for pending endpoints, even though we had topology information for them. This would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from `topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by the decommission operation use RPC client index 1 (see `do_get_rpc_client_idx` in message/messaging_service.cc). From local testing with additional logging I found that by the time this client is created (i.e. when the first verb in this group is used), we already know the topology. The node is pending at that point, hence the bug would cause us to assume we don't know the topology, leading to the RPC client being dropped later, possibly in the middle of a decommission operation.

Fixes: #11780
Closes #11942

* github.com:scylladb/scylladb:
  - test: reenable test_topology::test_decommission_node_add_column
  - test/pylib: util: configurable period in wait_for
  - message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
message: messaging_service: topology independent connection settings for GOSSIP verbs

The gossip verbs are used to learn about the topology of other nodes. If inter-dc/rack encryption is enabled, knowledge of topology is necessary to decide whether it's safe to send unencrypted messages to nodes (i.e., whether the destination lies in the same dc/rack).

The logic in `messaging_service::get_rpc_client`, which decided whether a connection must be encrypted, was this (given that encryption is enabled): if the topology of the peer is known, and the peer is in the same dc/rack, don't encrypt; otherwise encrypt. However, it may happen that node A knows node B's topology, but B doesn't know A's topology. A deduces that B is in the same DC and rack and tries sending B an unencrypted message. As the code currently stands, this would cause B to call `on_internal_error`. This is what I encountered when attempting to fix scylladb#11780.

To guarantee that it's always possible to deliver gossiper verbs (even if one or both sides don't know each other's topology), and to simplify reasoning about the system in general, choose connection settings that are independent of the topology for the connection used by gossiper verbs (other connections are still topology-dependent and use more complex logic to handle the situation of unknown-and-later-known topology). This connection only carries 'rare' and 'cheap' verbs, so it's not a performance problem to always encrypt it (given that encryption is configured). This is also what already happened in the past; the logic was removed at some point during topology knowledge management refactors. We just bring it back.

Fixes scylladb#11992.
Inspired by xemul/scylla#45d48f3d02fd48c6d186cc955cf83d747ac080b9.
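A hedged sketch of the resulting decision logic, with hypothetical simplified types (`encrypt_what`, `verb_group`, `must_encrypt` are illustrative names, not the real code in `messaging_service::get_rpc_client`): the gossip connection picks its encryption setting without consulting topology, so both sides always agree regardless of asymmetric knowledge:

```cpp
#include <optional>

enum class encrypt_what { none, dc, rack, all };
enum class verb_group { gossip, other };

struct peer_topology { bool same_dc; bool same_rack; };

// Hypothetical helper: decide whether the connection to a peer must be
// encrypted, given the configured encryption scope and (possibly unknown)
// peer topology.
bool must_encrypt(encrypt_what cfg, verb_group group,
                  std::optional<peer_topology> topo) {
    if (cfg == encrypt_what::none) return false;
    // GOSSIP verbs: topology-independent. Always encrypt (the verbs are
    // rare and cheap), so A and B can never disagree about this connection.
    if (group == verb_group::gossip) return true;
    // Other verbs: topology-dependent; unknown topology means encrypt.
    if (!topo) return true;
    switch (cfg) {
    case encrypt_what::dc:   return !topo->same_dc;
    case encrypt_what::rack: return !(topo->same_dc && topo->same_rack);
    default:                 return true;
    }
}

int main() {
    // Peer topology unknown: gossip still deliverable, always encrypted.
    bool gossip = must_encrypt(encrypt_what::dc, verb_group::gossip, std::nullopt);
    // Known same-dc peer on another connection: may stay unencrypted.
    bool other = must_encrypt(encrypt_what::dc, verb_group::other,
                              peer_topology{true, false});
    return (gossip && !other) ? 0 : 1;
}
```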
Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun

`get_rpc_client` calculates a `topology_ignored` field when creating a client, which says whether the client's endpoint had topology information when this client was created. This is later used to check whether that client needs to be dropped and replaced with a new client that uses the correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for pending endpoints, even though we had topology information for them. This would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from `topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by the decommission operation use RPC client index 1 (see `do_get_rpc_client_idx` in message/messaging_service.cc). From local testing with additional logging I found that by the time this client is created (i.e. when the first verb in this group is used), we already know the topology. The node is pending at that point, hence the bug would cause us to assume we don't know the topology, leading to the RPC client being dropped later, possibly in the middle of a decommission operation.

Fixes: #11780
Closes #11942

* github.com:scylladb/scylladb:
  - message: messaging_service: check for known topology before calling is_same_dc/rack
  - test: reenable test_topology::test_decommission_node_add_column
  - test/pylib: util: configurable period in wait_for
  - message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
  - message: messaging_service: topology independent connection settings for GOSSIP verbs
I'd like to consider this for backport. Is it a regression? How likely is it to happen in real life?
The RPC client dropping code was introduced in 7bdad47, which didn't make it into 5.1. It's very unlikely to happen in real life: to reproduce it you have to issue a decommission immediately after bootstrapping a node, and even then there's a chance you won't reproduce it.
The decommission operation sends a bunch of RPCs and opens RPC streams. These remote calls may fail due to the connection being dropped by a concurrently running fiber.

This concurrent dropping is caused by `storage_service::handle_state_normal`. Note that even if the operator first checks that every node is considered to be in NORMAL state from the POV of the decommissioned node before starting the operation, it does not guarantee that `handle_state_normal` has finished on this node; it may still be running.

`handle_state_normal` contains a call to `notify_joined`, which in turn calls `remove_rpc_client_with_ignored_topology`, and that drops all connections to this node, causing a concurrently running decommission to fail.
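A minimal sketch of the race, using hypothetical simplified types and plain threads (the real fibers are seastar futures, and `rpc_client`/`send_verb` here are illustrative, not Scylla's API): one fiber performs the decommission's RPCs through a shared client while the `handle_state_normal` notification path drops it underneath:

```cpp
#include <atomic>
#include <memory>
#include <stdexcept>
#include <thread>

// Simplified stand-in for a per-endpoint RPC client.
struct rpc_client {
    std::atomic<bool> closed{false};
    void send_verb() {
        if (closed) throw std::runtime_error("connection dropped");
    }
};

int main() {
    auto client = std::make_shared<rpc_client>();

    // Fiber 1: the decommission operation, sending a series of RPCs.
    std::thread decommission([client] {
        try {
            for (int i = 0; i < 1000; ++i) client->send_verb();
        } catch (const std::exception&) {
            // The reported failure: decommission aborts because its
            // connection was dropped mid-operation.
        }
    });

    // Fiber 2: handle_state_normal -> notify_joined ->
    // remove_rpc_client_with_ignored_topology, still running for this
    // node even though it already looks NORMAL to the operator.
    std::thread handler([client] { client->closed = true; });

    decommission.join();
    handler.join();
}
```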