raft topology: ban left nodes from the cluster #13850
Conversation
In the meantime I'll work on some test infrastructure to automatically test this.
message/messaging_service.cc (outdated):
```cpp
ms._host_connections.erase(start, end);

co_await parallel_for_each(conns, [] (shared_ptr<rpc::server_connection>& conn) {
    return conn->stop();
});
```
Shouldn't we also drop outgoing connections?
We could, but if some service is sending out RPCs to that node, `get_rpc_client` would just reestablish the connection. I guess we could also ban outgoing connections like we do with incoming (and check whether a host is banned in `get_rpc_client` or something).
Alternatively - and this is the approach I took basically - assume that once a node has left, we will not attempt to communicate with it (fingers crossed). This depends on nice behavior of our services. So we don't have to do the firewall in this direction, because we're not sending anything to that node anymore anyway.
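For illustration, a minimal sketch of what a ban check in a `get_rpc_client`-style lookup could look like. This is plain C++ with hypothetical names (`connection_pool`, `_banned_hosts`, `get_client`), not Scylla's actual `messaging_service` members:

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct rpc_client {};  // stand-in for the real outgoing client type

class connection_pool {
    std::unordered_set<std::string> _banned_hosts;   // hypothetical member
    std::unordered_map<std::string, std::shared_ptr<rpc_client>> _clients;
public:
    void ban_host(const std::string& host_id) { _banned_hosts.insert(host_id); }

    // Stand-in for get_rpc_client(): refuse before touching the client cache,
    // so a connection to a banned destination is never (re)established.
    std::shared_ptr<rpc_client> get_client(const std::string& host_id) {
        if (_banned_hosts.count(host_id)) {
            throw std::runtime_error("refusing to connect to banned host " + host_id);
        }
        auto& c = _clients[host_id];
        if (!c) {
            c = std::make_shared<rpc_client>();   // otherwise reestablished on demand
        }
        return c;
    }
};
```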
> We could, but if some service is sending out RPCs to that node, `get_rpc_client` would just reestablish the connection. I guess we could also ban outgoing connections like we do with incoming (and check whether a host is banned in `get_rpc_client` or something).

Of course.

> Alternatively - and this is the approach I took basically - assume that once a node has left, we will not attempt to communicate with it (fingers crossed). This depends on nice behavior of our services. So we don't have to do the firewall in this direction, because we're not sending anything to that node anymore anyway.

I want to reuse this for `isolate_node` as well.
CI state
General question -- is the seastar part really needed? There's ...
IIUC it only works for clients, i.e. outgoing connections/messages. The seastar part is needed for blocking incoming messages. Also, dropping existing connections is not enough; we also need a way to prevent further communication.
I am not an expert in seastar rpc, but provided we go with the approach suggested in the RFC I would consider the following:
Without the exception it would be awkward if a two-way RPC handler wanted to drop a connection: the handler would stop the connection, but then it would still have to construct some kind of response or throw another type of exception. For this use case, though, we only close connections from one-way handlers, so maybe the exception is not necessary.
That's a question for @gleb-cloudius
Shutting down the read side wakes up the reader fiber, which starts the shutdown process and notifies the sender fiber, which shuts down the write side at the end and closes the descriptor.
v2: seastar:
scylla:
I didn't implement dropping/banning of outgoing connections, because it's blocked by #6403. When we send a message, in general, we don't have access to the ...

Still, I think that banning incoming connections provides 99%-100% of the value that this PR is supposed to provide: it effectively isolates removed nodes, as the added test illustrates. I propose that in the scope of this PR, we solve banning incoming connections, and leave banning outgoing connections to a follow-up, after we have #6403.

Unfortunately, there's a bigger problem. Well, it does to some extent - the node is removed from token ring / topology... but it cannot remove itself from group 0, because by that time, it is banned from the cluster :D

cc @kostja @gleb-cloudius We should discuss how to solve this. This IMO supports the theory that the topology coordinator should be responsible for group 0 reconfigurations.
But the coordinator already does that in case a node itself fails. So maybe we only need to silence the error?
CI state
The "error" is that the decommissioning node hangs on |
Then we need to do it with a timeout.
What's the point of having a call which will time out 90% of the time? We expect it to fail, so why do it in the first place?
True. We can not do it at all.
I don't like the resulting UX.
So IMO the decommissioning node should get a confirmation from the topology coordinator that the node got removed entirely. But when we ban left nodes, we make it impossible for the node to get this confirmation. Once it's removed, we're no longer communicating with it, but we still want to tell it that it's removed.
When ... The only other way to "fix" the UX is to issue ...
The guarantee is there if it completes successfully.
Or ban the node only after it gets the confirmation.
@avikivity could you merge please?
@avikivity ping
Please rebase, non-trivial conflicts.
Hm, probably with the fencing PR which was queued 20 minutes ago.
I'll rebase once #14285 is merged, which will simplify the "messaging_service: store the node's host ID" commit (we'll be able to make the host_id field non-optional).
You're risking a maintainer missing the pings again! But I appreciate doing the right thing.
When a node establishes a connection to another node, it always sends a `CLIENT_ID` one-way RPC first. The message contains some metadata such as `broadcast_address`. Include the `host_id` of the sender in that RPC. On the receiving side, store a mapping from that `host_id` to the connection that was just opened. This mapping will be used later when we ban nodes that we remove from the cluster.
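As a rough illustration of that bookkeeping (placeholder types and names; the real code lives in `messaging_service` and uses Seastar's rpc connection types), the registration could look like this:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct server_connection {};  // stands in for the incoming rpc connection object

// Maps a sender's host_id to the connection(s) it has opened to us. Filled in by
// the CLIENT_ID handler, which handles the first message on every new connection.
std::unordered_multimap<std::string, std::weak_ptr<server_connection>> host_connections;

// Called from the CLIENT_ID handler once the sender's host_id is known.
void on_client_id(const std::string& sender_host_id, std::shared_ptr<server_connection> conn) {
    host_connections.emplace(sender_host_id, conn);
}
```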
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will be immediately dropped) when receiving the `CLIENT_ID` verb.
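Continuing the sketch above (still placeholder names, not the actual `messaging_service` code), banning could then be:

```cpp
#include <string>
#include <unordered_set>

std::unordered_set<std::string> banned_hosts;

void stop_connection(server_connection&) { /* stand-in for the asynchronous conn->stop() */ }

void ban_host(const std::string& host_id) {
    banned_hosts.insert(host_id);
    // Drop every connection currently registered for this host...
    auto [begin, end] = host_connections.equal_range(host_id);
    for (auto it = begin; it != end; ++it) {
        if (auto conn = it->second.lock()) {
            stop_connection(*conn);
        }
    }
    host_connections.erase(begin, end);
    // ...and remember the ban; the CLIENT_ID handler checks is_banned() and
    // immediately drops a new connection coming from a banned host.
}

bool is_banned(const std::string& host_id) {
    return banned_hosts.count(host_id) > 0;
}
```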
Saves some redundant typing when passing `raft_topology_cmd` parameters, so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
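The usual way to allow that in C++ is a non-explicit converting constructor. A minimal sketch (field and enumerator names other than `fence_old_reads` are illustrative, not the real struct):

```cpp
struct raft_topology_cmd {
    enum class command { barrier, fence_old_reads, shutdown };
    command cmd;

    // Intentionally not `explicit`: lets a bare command enumerator convert
    // implicitly into a raft_topology_cmd at call sites.
    raft_topology_cmd(command c) : cmd(c) {}
};

void send_cmd(const raft_topology_cmd&) {}

int main() {
    send_cmd(raft_topology_cmd::command::fence_old_reads);                     // after this change
    send_cmd(raft_topology_cmd{raft_topology_cmd::command::fence_old_reads});  // still works
}
```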
We want the decommissioning node to wait, before shutting down, until every node learns that it left the token ring. Otherwise some nodes may still try coordinating writes to that node after it has already shut down, leading to unnecessary failures on the data path (e.g. for CL=ALL writes).

Before this change, a node would shut down immediately after observing that it was in the `left` state; some other nodes might still see it in `decommissioning` state and the topology transition state as `write_both_read_new`, so they'd try to write to that node.

After this change, the node first enters the `left_token_ring` state before entering `left`, while the topology transition state is removed (so we've finished the token ring change - the node no longer has tokens in the ring, but it's still part of the topology). There we perform a read barrier, allowing all nodes to observe that the decommissioning node has indeed left the token ring. Only after that barrier succeeds do we allow the node to shut down.
Currently the decommissioned node waits until it observes that it was moved to the `left` state, then proceeds to leave group 0 and shut down. Unfortunately, this strategy won't work once we introduce banning of nodes that are in the `left` state - there is no guarantee that the decommissioning node will ever observe that it entered the `left` state. The replication of Raft commands races with the ban propagating through the cluster.

We also can't make the node leave as soon as it observes the `left_token_ring` state - that would defeat the purpose of `left_token_ring`: allowing all nodes to observe that the node has left the token ring before it shuts down.

We could introduce yet another state between `left_token_ring` and `left`, which the node waits for before shutting down; the coordinator would request a barrier from the node before moving to the `left` state. The alternative - which we chose here - is to have the coordinator explicitly tell the node to shut down while we're in `left_token_ring`, through a direct RPC. We introduce `raft_topology_cmd::command::shutdown` and send it to the node while in the `left_token_ring` state, after we have requested a cluster barrier.

We don't require the RPC to succeed; we need to allow it to fail to preserve availability. This is because an earlier incarnation of the coordinator may have requested the node to shut down already, so the new coordinator will fail the RPC as the node is already dead. This also improves availability in general - if the node dies while we're in `left_token_ring`, we can proceed. We don't lose safety from that, since we'll ban the node (later commit). We only lose a bit of user experience if there's a failure at this decommission step - the decommissioning node may hang, never receiving the RPC (it will be necessary to shut it down manually).

Another complication arising from banning the node is that it won't be able to leave group 0 on its own; by the time it tries that, it may have already been banned by the cluster (the coordinator moves the node to the `left` state after telling it to shut down). So we get rid of the `leave_group0` step from `raft_decommission()` (which simplifies the function too), putting a `remove_from_raft_config` inside the coordinator code instead - after we've told the node to shut down. (Removing the node from the configuration is also another reason why we need to allow the above RPC to fail; the node won't be able to handle the request once it's outside the configuration, because it handles all coordinator requests by starting a read barrier.)

Finally, a complication arises when the coordinator is the decommissioning node. The node would shut down in the middle of handling the `left_token_ring` state, leading to harmless but awkward errors even though there were no node/network failures (the original coordinator would fail the `left_token_ring` state logic; a new coordinator would take over and do it again, this time succeeding). We fix that by checking whether we're the decommissioning node at the beginning of the `left_token_ring` state handler, and if so, stepping down from leadership by becoming a nonvoter first.
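To condense the above, here is a rough sketch of the order of operations on the coordinator side while in `left_token_ring`. This is not the actual topology coordinator code; the helper functions and the `node_id` type are placeholders standing in for the real facilities:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <string>

using node_id = std::string;  // placeholder for the real host/server id type

// Stubs standing in for the real coordinator facilities; illustrative only.
seastar::future<> become_nonvoter()                   { co_return; }
seastar::future<> global_barrier()                    { co_return; }
seastar::future<> send_shutdown_rpc(node_id)          { co_return; }
seastar::future<> remove_from_raft_config(node_id)    { co_return; }
seastar::future<> move_to_left_state_and_ban(node_id) { co_return; }

seastar::future<> handle_left_token_ring(node_id decommissioned, node_id self) {
    if (decommissioned == self) {
        // The coordinator is the node being decommissioned: step down first so a
        // new coordinator redoes this state cleanly instead of us dying mid-way.
        co_await become_nonvoter();
        co_return;
    }

    co_await global_barrier();  // every node has observed the token-ring change

    try {
        // raft_topology_cmd::command::shutdown; allowed to fail (the node may be
        // dead already, or a previous coordinator incarnation already told it to stop).
        co_await send_shutdown_rpc(decommissioned);
    } catch (...) {
    }

    // The node cannot leave group 0 itself (it may already be banned by then),
    // so the coordinator removes it from the Raft configuration.
    co_await remove_from_raft_config(decommissioned);
    co_await move_to_left_state_and_ban(decommissioned);
}
```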
The "tell the node to shut down" RPC would fail every time in the removenode path (since the node is dead), which is kind of awkward. Besides, for removenode we don't really need the `left_token_ring` state, we don't need to coordinate with the node - writes destined for it are failing anyway (since it's dead) and we can ban the node immediately. Remove the node from group 0 while in `write_both_read_new` transition state (even when we implement abort, in this state it's too late to abort, we're committed to removing the node - so it's fine to remove it from group 0 at this point).
Pause one of the nodes and once it's marked as DOWN, remove it from the cluster. Check that it is not able to perform queries once it unpauses.
`server_sees_others` and similar functions periodically call `get_alive_endpoints`. The period was `.1` seconds; increase it to `.5` to reduce log spam (I checked empirically that `.5` seconds is usually how long it takes in dev mode on my laptop).
v6:
Range-diff:
CI state
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/2354/
rekicking
CI state
@avikivity ready for merging again :)
@avikivity ping
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or removed using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.
This works not only for nodes that restart, but also for nodes that were running behind a network partition when we removed them. Even when the partition resolves, the existing nodes will effectively keep a firewall up against that node.
Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect, a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.