Cluster features on raft: new procedure for joining group 0 #15196
Conversation
Some nice-to-haves that I didn't implement yet, but perhaps could be done in followups:
docs/dev/topology-over-raft.md
Outdated
or rejected.

In order to correlate requests with replies, the joining node uses a UUID called
"join cookie". Until the joining node acknowledges a reply from the coordinator,
I did not look at the code to see how this "join cookie" is used, but can host_id be used instead?
I'm guessing that host_id is generated and persisted early in the procedure. If the procedure crashes after persisting host_id, the next attempt might use some different parameters (which ones though?) but the same host_id.
If "host_id is generated and persisted early in the procedure" then after the crash the same host_id will be used. Theoretically one may change some parameters (cluster name? Snitch?) and and restart, but if we persist those before start joining we can detect the change and fail early. If features are changed we can detect it in join_request rpc since whatever was added into the topology will not match.
One way to deal with the lack of good names for the verb is to rename group0_modify_config to group0_modify_config_legacy and keep using the name group0_modify_config.
> One way to deal with the lack of good names for the verb is to rename group0_modify_config to group0_modify_config_legacy and keep using the name group0_modify_config.

I suppose that it's not a problem that the `join_node` name is less clear than `group0_modify_config`; the bigger problem is that we now have two RPCs, both used in the handshake, named `join_node_request` and `join_node_response`, and both of them are two-way. It's confusing to talk about "response to join_node_request" etc.
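To make the naming issue concrete, here is a sketch of the two verbs with hypothetical wire types (not the real signatures): each verb being two-way means each has its own RPC-level result, so "response" is ambiguous between `join_node_response` and the result of `join_node_request`.

```cpp
#include <string>
#include <vector>

// joining node -> discovery leader; the RPC-level result is itself a reply.
struct join_node_request_params { std::string host_id; std::vector<std::string> supported_features; };
struct join_node_request_result { bool ok = false; std::string rejection_reason; };
join_node_request_result handle_join_node_request(const join_node_request_params&);

// topology coordinator -> joining node; again a two-way verb with its own result.
struct join_node_response_params { bool accepted = false; std::string reason; };
struct join_node_response_result {};  // bare acknowledgement
join_node_response_result handle_join_node_response(const join_node_response_params&);
```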
docs/dev/topology-over-raft.md
Outdated
from the `system.topology` table.

In case of other failures (timeouts, connection issues etc.) the request is
simply retried.
If a node that requests joining is permanently dead, we need to stop retrying eventually.
Yes. This should be implemented as a part of the error handling work.
@gusev-p please review as well. Understanding this new procedure will probably be important to implement upgrades. I'm guessing that to correctly handle potential node bootstraps in a mixed cluster (where some nodes have upgraded to use raft-topology but others did not), we'll want to distinguish whether the joining node uses the new RPC or the old GROUP0_MODIFY_CONFIG. And probably in a fully upgraded cluster we'll reject attempts that use the old way.
docs/dev/topology-over-raft.md
Outdated
eventually removes it.
- If there is no topology request for that node, it is placed in
  the `system.topology` table and an `ok` response is sent back.
- If the node has already joined, an `ok` response is sent back.
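A toy sketch of the dispatch the excerpt above describes (hypothetical types; the real handler operates on `system.topology`):

```cpp
#include <map>
#include <string>

enum class node_state { none, normal };
struct replica_state { node_state state = node_state::none; };
enum class join_reply { ok, retry_later };

join_reply handle_join_node_request(std::map<std::string, replica_state>& topology,
                                    const std::string& host_id) {
    auto it = topology.find(host_id);
    if (it == topology.end()) {
        // No topology request for this node yet: record it in `none` state
        // (standing in for the system.topology insert) and reply ok.
        topology.emplace(host_id, replica_state{node_state::none});
        return join_reply::ok;
    }
    if (it->second.state == node_state::normal) {
        // Already joined: reply ok for idempotency.
        return join_reply::ok;
    }
    // A join request for this node is already pending; the excerpt's other
    // cases (e.g. a node mid-join, as discussed below) are handled elsewhere.
    return join_reply::retry_later;
}
```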
I'm wondering about the scenario where it's in the middle of joining (post-`none` state, but before becoming normal). The node may try to start the procedure from scratch while the coordinator is somewhere in the middle of it. But perhaps handling that is not the responsibility of `JOIN_NODE_REQUEST` but of a later step.
Here, I meant "the topology coordinator accepted it". I hope it's clearer in the current version of the doc.
docs/dev/topology-over-raft.md
Outdated
Upon noticing a node in `none` state and with a `join` topology request, the
coordinator verifies the parameters of the node (e.g. whether it supports all
of the enabled features) and decides whether to accept it or not. It sends
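The core of that feature verification can be expressed as a subset check; a minimal sketch with hypothetical names:

```cpp
#include <algorithm>
#include <set>
#include <string>

// A node may only join if it supports every feature already enabled in the
// cluster, i.e. enabled_features is a subset of the node's supported set.
bool node_supports_all_enabled_features(const std::set<std::string>& enabled_features,
                                        const std::set<std::string>& node_supported) {
    // std::includes requires sorted ranges; std::set is always sorted.
    return std::includes(node_supported.begin(), node_supported.end(),
                         enabled_features.begin(), enabled_features.end());
}
```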
Why do both the handler of JOIN_NODE_REQUEST, and now the topology coordinator, have to verify parameters? Does the coordinator have some additional knowledge? If yes, is the initial verification at all necessary, if there's a second verification done by the coordinator anyway?
I explained the need for initial verification here: #15196 (comment)
As for the topology coordinator, we use it to do the "ultimate" verification and marking the node as accepted (i.e. continuing with bootstrapping/replacing the node) because we need to serialize it with marking a feature as enabled, which is currently also done by the topology coordinator.
service/storage_service.cc
Outdated
slogger.info("raft topology: join: returning ok for idempotency ({})", params.host_id);
co_return result;
} else {
    assert(rs.state == node_state::none);
Please use `on_internal_error` instead.
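A sketch of the suggested change, assuming Seastar's `on_internal_error` helper (which logs the failure and, depending on configuration, throws or aborts, instead of compiling out in release builds the way `assert` does):

```cpp
#include <seastar/util/log.hh>

namespace {
seastar::logger slogger("storage_service");
enum class node_state { none, normal };
}

// Sketch: replace `assert(rs.state == node_state::none)` with an internal
// error that is logged and surfaced even in release builds.
void check_state_is_none(node_state state) {
    if (state != node_state::none) {
        seastar::on_internal_error(slogger, "join: expected node in `none` state");
    }
}
```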
service/storage_service.cc
Outdated
// If the validation failed, we will remove the request at the end.
bool should_drop = std::holds_alternative<join_node_response_params::rejected>(decision);
auto ip = _address_map.find(node.id);
Let's add a FIXME; this should eventually be done by messaging_service, not by us.
Forgot to do it in v2, but I will do it on the next occasion (after the next round of comments).
Following our discussion at the last meeting, I'll try to simplify the implementation so that it doesn't use join cookies. For now, I have responded to the comments that I was sure would still be relevant after the changes.
38834c0 to 6d4f1d4 (Compare)
v2:
@@ -6272,12 +6562,25 @@ void storage_service::init_messaging_service(sharded<db::system_distributed_keys
        return ss.cleanup_tablet(tablet);
    });
});
if (raft_topology_change_enabled) { |
As a follow-up we can move the existing topology RPCs into the `if`.
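A toy model of that suggestion (hypothetical registry type, not the real messaging_service API): register the join-handshake verbs, and eventually all raft-topology verbs, only when raft topology changes are enabled.

```cpp
#include <functional>
#include <map>
#include <string>

using handler = std::function<void()>;

void init_messaging(std::map<std::string, handler>& verbs,
                    bool raft_topology_change_enabled) {
    if (raft_topology_change_enabled) {
        verbs["JOIN_NODE_REQUEST"]  = [] { /* handle join request */ };
        verbs["JOIN_NODE_RESPONSE"] = [] { /* handle join response */ };
        // Follow-up idea from the review: move the existing topology RPCs
        // under this same condition.
    }
}
```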
…from Piotr Dulikowski

This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (#14722, #14232), and the last piece necessary to replace cluster feature checks in gossip.

The current implementation relies on a gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform the feature check, which depends on information in group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then informs the new node about the decision via `JOIN_NODE_RESPONSE`: whether the new node will be joined or not.
- If the new node is supposed to be joined, the topology coordinator will add it to group 0 and the new node will start its group 0 server.
- Otherwise, the request is removed from `system.topology` and the new node is shut down.

The procedure is designed to be idempotent. A node can safely retry joining the cluster even if it shut down while in the middle of the previous attempt and restarted with a different configuration - a random UUID called "join cookie" is used to differentiate the attempts.

More details about the RPC are described in `topology-over-raft.md`.

Fixes: #15152
Closes #15196

* github.com:scylladb/scylladb:
  tests: mark test_blocked_bootstrap as skipped
  storage_service: do not check features in shadow round
  storage_service: remove raft_{boostrap,replace}
  topology_coordinator: relax the check in enable_features
  raft_group0: insert replaced node info before server setup
  storage_service: use join node rpc to join the cluster
  topology_coordinator: handle joining nodes
  topology_state_machine: add join_group0 state
  storage_service: add join node RPC handlers
  raft: expose current_leader in raft::server
  storage_service: extract wait_for_live_nodes_timeout constant
  raft_group0: abstract out node joining handshake
  storage_service: pass raft_topology_change_enabled on rpc init
  rpc: add new join handshake verbs
  docs: document the new join procedure
  topology_state_machine: add supported_features to replica_state
  storage_service: check destination host ID in raft verbs
  group_state_machine: take reference to raft address map
  raft_group0: expose joined_group0
Currently, when the topology coordinator notices a request to join or replace a node, the node is transitioned to an appropriate state and the topology is moved to commit_new_generation/write_both_read_old, in a single group 0 operation. In later commits, the topology coordinator will accept/reject nodes based on the request, so we would like to have a separate step - the topology coordinator accepts, transitions to the bootstrap state, tells the node that it is accepted, and only then continues with the topology transition. This commit adds a new `join_group0` transition state that precedes `commit_cdc_generation`.
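Sketched as an enum, with names taken from the commit message (the actual definition lives in `topology_state_machine` and contains more states):

```cpp
// Ordering of the transition states relevant to this commit.
enum class topology_transition_state {
    join_group0,            // new: coordinator accepted the node and adds it to group 0
    commit_cdc_generation,  // previously the first state of bootstrap/replace
    write_both_read_old,
    // ... remaining transition states ...
};
```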
The topology coordinator is updated to perform verification of joining nodes and to send `JOIN_NODE_RESPONSE` RPC back to the joining node.
Now, the storage_service uses the new RPCs to join the cluster. A new handshaker is implemented and passed to group 0 in order to make it happen.
Currently, information about the replaced node is put into the raft address map after joining group 0 via `join_group0`. However, the new handshake which happens when joining group 0 needs to read the group 0 state (so that it can wait until it sees all normal nodes as UP). Loading the topology state into memory involves resolving IP addresses of the normal nodes, so the information about the replaced node needs to be inserted before the handshake happens. This commit moves insertion of the replaced node's data before the call to `join_group0`.
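A minimal sketch of the required ordering (hypothetical types; `join_group0` is only referenced in a comment):

```cpp
#include <map>
#include <string>

// The replaced node's address must be in the raft address map *before*
// join_group0 runs, because the join handshake loads the topology state and
// resolves the IPs of all normal nodes, including the node being replaced.
struct raft_address_map_model {
    std::map<std::string, std::string> id_to_ip;
};

void join_cluster_for_replace(raft_address_map_model& addr_map,
                              const std::string& replaced_id,
                              const std::string& replaced_ip) {
    // 1. Make the replaced node resolvable.
    addr_map.id_to_ip.emplace(replaced_id, replaced_ip);
    // 2. Only then run the handshake, which may resolve replaced_id:
    // join_group0(...);
}
```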
Currently, `enable_features` requires that there is no topology operation in progress and there are no nodes waiting to be joined. Now that the new handshake is implemented, we can drop the second condition, because nodes in `none` state are not a part of group 0 yet. Additionally, the comments inside `enable_features` are clarified so that they explain why it's safe to only include normal features when doing the barrier and calculating features to enable.
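A sketch of the relaxed precondition (hypothetical representation of the topology snapshot): nodes in `none` state can be ignored because they are not group 0 members yet and will be re-verified by the coordinator before being accepted.

```cpp
#include <set>
#include <string>

struct topology_snapshot {
    bool transition_in_progress = false;
    // Normal nodes that do not support the candidate feature; nodes still in
    // `none` state are deliberately excluded.
    std::set<std::string> normal_nodes_missing_feature;
};

bool can_enable_feature(const topology_snapshot& t) {
    return !t.transition_in_progress && t.normal_nodes_missing_feature.empty();
}
```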
The functionality of `raft_bootstrap` and `raft_replace` is handled by the new handshake, so those functions can be removed.
The new joining procedure safely checks compatibility of supported/enabled features, therefore there is no longer any need to do it in the gossip shadow round.
With the new procedure to join nodes, testing the scenario in `test_blocked_bootstrap` becomes very tricky. To recap, the test does the following:

- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is triggered which causes node 4 to fail after it joins group 0. Note that pre-JOIN_NODE handshake, this would only result in node 4 being added to the group 0 config, but no modification to the group 0 state itself is done - the joining node is supposed to write a request to join.
- Tries to replace node 1 again with node 5, which should succeed.

The bug that this regression test was supposed to check for was that node 5 would try to resolve all IPs of the nodes added to the group 0 config. Because node 4 shuts down before advertising itself in gossip, node 5 would get stuck.

The new procedure to join group 0 complicates the situation because a request to join is written first to group 0, and only then does the topology coordinator modify the group 0 config. It is possible to add an error injection to the topology coordinator code so that it doesn't change the group 0 state and proceeds with bootstrapping the node, but it will only get stuck trying to add the node. If node 5 tries to join in the meantime, the topology coordinator may switch to it and try to bootstrap it instead, but this is basically a 50% chance because it depends on the order of node 4 and node 5's host IDs in the topology_state_machine struct. It should be possible to fix the test with error recovery, but until then it is marked as skipped.
50eb05d to 2c17f81 (Compare)
v5.4:
Apparently, we exceeded some rate limit in GitHub and that caused the CI to fail. Rerunning.
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (#14722, #14232), and the last piece necessary to replace cluster feature checks in gossip.
The current implementation relies on a gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform the feature check, which depends on information in group 0. Hence, we have a dependency cycle.
In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
  - If the node is accepted, it is moved to the `bootstrap`/`replace` state and the topology transitions to the `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that the node was accepted.
  - If the node is rejected, it is moved to the `left` state, told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected, and it shuts down.

The procedure is not retryable - if a node fails to complete it from start to end and crashes in between, it will not be allowed to retry with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add the node again (so that a new host_id is generated).

More details about the procedure and the RPC are described in `topology-over-raft.md`.

Fixes: #15152
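Putting the joining node's side of the handshake together, a sketch with hypothetical helpers (`send_join_node_request`, `wait_for_join_node_response`, and `start_group0_server` are stand-ins, not real Scylla functions):

```cpp
#include <stdexcept>
#include <string>

enum class join_decision { accepted, rejected };
struct join_node_response { join_decision decision; std::string reason; };

// Stand-in declarations for the steps described above.
void send_join_node_request(const std::string& discovery_leader);
join_node_response wait_for_join_node_response();
void start_group0_server();

void join_group0_via_handshake(const std::string& discovery_leader) {
    // 1. Describe ourselves (host_id, supported features, ...) to the leader;
    //    on success we are recorded in system.topology in `none` state.
    send_join_node_request(discovery_leader);
    // 2. The topology coordinator verifies us and replies with its decision.
    auto resp = wait_for_join_node_response();
    if (resp.decision == join_decision::rejected) {
        // Not retryable with the same host_id; the data directory must be
        // cleared so that a fresh host_id is generated.
        throw std::runtime_error("rejected by cluster: " + resp.reason);
    }
    // 3. Accepted: the coordinator has added us to group 0; start our server.
    start_group0_server();
}
```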