
Cluster features on raft: new procedure for joining group 0 #15196

Merged (19 commits, Sep 28, 2023)

Conversation

@piodul piodul commented Aug 28, 2023

This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (#14722, #14232), and the last piece necessary to replace cluster feature checks in gossip.

The current implementation relies on a gossip shadow round to fetch the set of enabled features; the node determines whether it supports all of them and joins only if it is safe to do so. As we move management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, so it is not safe to join group 0 without performing a feature check that itself depends on information stored in group 0. This is a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of the node directly asking the discovery leader to unconditionally add it to the configuration with GROUP0_MODIFY_CONFIG, two new RPCs are added: JOIN_NODE_REQUEST and JOIN_NODE_RESPONSE. The main idea is as follows:

  • The new node sends JOIN_NODE_REQUEST to the discovery leader, carrying information that describes the node, including its supported cluster features. The discovery leader verifies some of the parameters and adds the node in the none state to system.topology.
  • The topology coordinator picks up the request for the node to be joined (i.e. the node in none state), verifies its properties - including cluster features - and then:
    • If the node is accepted, the coordinator transitions it to the bootstrap/replace state and transitions the topology to the join_group0 state. The node is added to group 0 and then JOIN_NODE_RESPONSE is sent to it with the information that it was accepted.
    • Otherwise, the node is moved to the left state, the coordinator tells it via JOIN_NODE_RESPONSE that it was rejected, and it shuts down.

The procedure is not retryable: if a node crashes partway through, it will not be allowed to retry with the same host_id - JOIN_NODE_REQUEST will fail. The data directory must be cleared before attempting to add the node again (so that a new host_id is generated).

More details about the procedure and the RPC are described in topology-over-raft.md.

Fixes: #15152

@piodul

piodul commented Aug 28, 2023

Some nice-to-haves that I didn't implement yet, but perhaps could be done in followups:

  • Error injection tests to verify idempotence properties of the procedure
  • Maybe add a new key in system.scylla_local to store the enabled features sent by the topology coordinator on boot, instead of reusing the existing, legacy enabled_features key (which is no longer relevant to the implementation)

or rejected.

In order to correlate requests with replies, the joining node uses a UUID called
"join cookie". Until the joining node acknowledges a reply from the coordinator,
Contributor

I did not look at how this "join cookie" is used in the code, but could host_id be used instead?

Contributor

I'm guessing that host_id is generated and persisted early in the procedure. If the procedure crashes after persisting host_id, the next attempt might use some different parameters (which ones though?) but the same host_id.

Contributor

If "host_id is generated and persisted early in the procedure", then after a crash the same host_id will be used. Theoretically one may change some parameters (cluster name? snitch?) and restart, but if we persist those before starting to join, we can detect the change and fail early. If the features changed, we can detect that in the join_request RPC, since whatever was added to the topology will not match.

Contributor

One way to deal with the lack of a good name for the verb is to rename group0_modify_config to group0_modify_config_legacy and keep using the name group0_modify_config

Contributor Author

> One way to deal with lack of good names for the verb is to rename group0_modify_config to group0_modify_config_legacy and keep using the name group0_modify_config

I suppose that it's not a problem that the join_node name is less clear than group0_modify_config; the bigger problem is that we have two RPCs now which are both used in the handshake named join_node_request and join_node_response, and both of them are two-way. It's confusing to talk about "response to join_node_request" etc.

from the `system.topology` table.

In case of other failures (timeouts, connection issues etc.) the request is
simply retried.
Contributor

If a node that requests joining is permanently dead we need to stop retrying eventually.

Contributor Author

Yes. This should be implemented as a part of the error handling work.

service/join_node.hh (review thread resolved)
@kbr-scylla
Contributor

@gusev-p please review as well. Understanding this new procedure will probably be important to implement upgrades.

I'm guessing that to correctly handle potential node bootstraps in a mixed cluster (where some nodes have upgraded to use raft-topology but others did not), we'll want to distinguish whether the joining node uses the new RPC or the old GROUP0_MODIFY_CONFIG. And probably in a fully upgraded cluster we'll reject attempts that use the old way.


eventually removes it.
- If there is no topology request for that node, it is placed in
the `system.topology` table and an `ok` response is sent back.
- If the node has already joined, an `ok` response is sent back.
Contributor

I'm wondering about the scenario if it's in the middle of joining (post-none state, but before becoming normal).

The node may try to start the procedure from scratch, while the coordinator is somewhere in the middle of it.

But perhaps handling that is not the responsibility of JOIN_NODE_REQUEST but a later step.

Contributor Author

Here, I meant "the topology coordinator accepted it". I hope it's clearer in the current version of the doc.

docs/dev/topology-over-raft.md (review threads resolved)

Upon noticing a node in `none` state and with `join` topology request, the
coordinator verifies parameters of the node (e.g. whether it supports all
of the enabled features) and decides whether to accept it or not. It sends
Contributor

Why do both the handler of JOIN_NODE_REQUEST, and now the topology coordinator, have to verify parameters? Does the coordinator have some additional knowledge? If yes, is the initial verification at all necessary, if there's a second verification done by the coordinator anyway?

Contributor Author

I explained the need for initial verification here: #15196 (comment)

As for the topology coordinator, we use it to do the "ultimate" verification and marking the node as accepted (i.e. continuing with bootstrapping/replacing the node) because we need to serialize it with marking a feature as enabled, which is currently also done by the topology coordinator.

docs/dev/topology-over-raft.md (review thread resolved)
slogger.info("raft topology: join: returning ok for idempotency ({})", params.host_id);
co_return result;
} else {
assert(rs.state == node_state::none);
Contributor

Please use on_internal_error instead

service/storage_service.cc (review threads resolved)

// If the validation failed, we will remove the request at the end.
bool should_drop = std::holds_alternative<join_node_response_params::rejected>(decision);
auto ip = _address_map.find(node.id);
Contributor

Let's add a FIXME, this should eventually be done by messaging_service, not by us.

Contributor Author

I forgot to do this in v2, but will do it on the next occasion (after the next round of comments).

service/storage_service.cc (review thread resolved)
@piodul

piodul commented Aug 30, 2023

Following our discussion at the last meeting, I'll try to simplify the implementation so that it does not use join cookies. For now, I have responded to the comments that I was sure would still be relevant after the changes.

@piodul

piodul commented Sep 11, 2023

v2:

  • Fixed the bugs that caused the tests to fail.
  • Got rid of the join cookie. Now the node needs to complete the handshake with the same snitch name, cluster name and features, or clear its data directory to generate a new host_id. This makes the procedure a bit simpler.
  • Supported features that the node originally started with are now stored in system.scylla_local under the initial_supported_features key, used only for the duration of the handshake.
  • The snitch name is now persistently stored in system.scylla_local and verified on startup.
  • Storage service verbs related to raft now verify the destination host_id (without this, node replace tests would fail).
  • During the handshake, the joining node now waits for normal nodes to become UP (the nodes to wait for are calculated based on group 0 state).
  • Per-node supported features are now kept in the replica_state structure, allowing for more convenient access.


idl/join_node.idl.hh (review thread resolved)
@@ -6272,12 +6562,25 @@ void storage_service::init_messaging_service(sharded<db::system_distributed_keys
return ss.cleanup_tablet(tablet);
});
});
if (raft_topology_change_enabled) {
Contributor

As a follow-up, we can move the existing topology RPCs into this if.

kbr-scylla added a commit that referenced this pull request Sep 27, 2023
…rom Piotr Dulikowski

This PR implements a new procedure for joining nodes to group 0, based
on the description in the "Cluster features on Raft (v2)" document. This
is a continuation of the previous PRs related to cluster features on
raft (#14722,
#14232), and the last piece
necessary to replace cluster feature checks in gossip.

Current implementation relies on gossip shadow round to fetch the set of
enabled features, determine whether the node supports all of the enabled
features, and joins only if it is safe. As we are moving management of
cluster features to group 0, we encounter a problem: the contents of
group 0 itself may depend on features, hence it is not safe to join it
unless we perform the feature check which depends on information in
group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is
modified, and verification of features and other parameters is offloaded
to an existing node in group 0. Instead of directly asking the discovery
leader to unconditionally add the node to the configuration with
`GROUP0_MODIFY_CONFIG`, two different RPCs are added:
`JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as
follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It
  sends a bunch of information describing the node, including supported
  cluster features. The discovery leader verifies some of the parameters
  and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be
  joined (i.e. the node in `none` state), verifies its properties -
  including cluster features - and then informs the new node about the
  decision via `JOIN_NODE_RESPONSE`: whether the new node will be joined
  or not.
        - If the new node is supposed to be joined, the topology
          coordinator will add it to group 0 and the new node will start
          its group 0 server.
        - Otherwise, the request is removed from `system.topology` and
          the new node is shut down.

The procedure is designed to be idempotent. A node can safely retry
joining the cluster even if it shut down while in the middle of the
previous attempt and restarted with a different configuration - a random
UUID called "join cookie" is used to differentiate the attempts.

More details about the RPC are described in `topology-over-raft.md`.

Fixes: #15152

Closes #15196

* github.com:scylladb/scylladb:
  tests: mark test_blocked_bootstrap as skipped
  storage_service: do not check features in shadow round
  storage_service: remove raft_{boostrap,replace}
  topology_coordinator: relax the check in enable_features
  raft_group0: insert replaced node info before server setup
  storage_service: use join node rpc to join the cluster
  topology_coordinator: handle joining nodes
  topology_state_machine: add join_group0 state
  storage_service: add join node RPC handlers
  raft: expose current_leader in raft::server
  storage_service: extract wait_for_live_nodes_timeout constant
  raft_group0: abstract out node joining handshake
  storage_service: pass raft_topology_change_enabled on rpc init
  rpc: add new join handshake verbs
  docs: document the new join procedure
  topology_state_machine: add supported_features to replica_state
  storage_service: check destination host ID in raft verbs
  group_state_machine: take reference to raft address map
  raft_group0: expose joined_group0

Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_new_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.

This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.

The topology coordinator is updated to perform verification of joining
nodes and to send the `JOIN_NODE_RESPONSE` RPC back to the joining node.

Now, the storage_service uses new RPCs to join the cluster. A new
handshaker is implemented and passed to group0 in order to make it
happen.

Currently, information about the replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about the replaced node needs to be inserted
before the handshake happens.

This commit moves the insertion of the replaced node's data before the call
to `join_group0`.

Currently, `enable_features` requires that there is no topology operation in
progress and there are no nodes waiting to be joined. Now, after the new
handshake is implemented, we can drop the second condition because nodes
in `none` state are not a part of group 0 yet.

Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating the features to enable.

The functionality of `raft_bootstrap` and `raft_replace` is handled by
the new handshake, so those functions can be removed.

The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.

With the new procedure to join nodes, testing the scenario in
`test_blocked_bootstrap` becomes very tricky. To recap, the test does
the following:

- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is
  triggered which causes node 4 to fail after it joins group 0. Note
  that before the JOIN_NODE handshake, this would only result in node 4
  being added to the group 0 config; no modification to the group 0
  state itself would happen - the joining node is the one supposed to
  write a request to join.
- Tries to replace node 1 again with node 5, which should succeed.

The bug that this regression test was supposed to check for was that
node 5 would try to resolve all IPs of nodes added to group 0 config.
Because node 4 shuts down before advertising itself in gossip, node 5
would get stuck.

The new procedure to join group 0 complicates the situation because a
request to join is written first to group 0 and only then the topology
coordinator modifies the group 0 config. It is possible to add an error
injection to the topology coordinator code so that it doesn't change the
group 0 state and proceeds with bootstrapping the node, but it will only
get stuck trying to add the node. If node 5 tries to join in the
meantime, the topology coordinator may switch to it and try to bootstrap
it instead, but this is basically a 50% chance because it depends on the
order of node 4 and node 5's host IDs in the topology_state_machine
struct.

It should be possible to fix the test with error recovery, but until
then it is marked as skipped.
@piodul

piodul commented Sep 27, 2023

v5.4:


@piodul

piodul commented Sep 27, 2023

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3956/

Apparently, we exceeded some rate limit in GitHub and that caused the CI to fail. Rerunning.

