
Cluster features on raft: new procedure for joining group 0 #15196

Merged (19 commits, Sep 28, 2023)

Conversation

@piodul piodul commented Aug 28, 2023

This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (#14722, #14232), and the last piece necessary to replace cluster feature checks in gossip.

The current implementation relies on a gossip shadow round to fetch the set of enabled features; the node determines whether it supports all of them and joins only if it is safe to do so. As we move management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, so it is not safe to join group 0 without performing a feature check that itself depends on information stored in group 0. This is a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of the node directly asking the discovery leader to unconditionally add it to the configuration with GROUP0_MODIFY_CONFIG, two new RPCs are added: JOIN_NODE_REQUEST and JOIN_NODE_RESPONSE. The main idea is as follows:

  • The new node sends JOIN_NODE_REQUEST to the discovery leader, carrying information that describes the node, including its supported cluster features. The discovery leader verifies some of the parameters and adds the node in the none state to system.topology.
  • The topology coordinator picks up the request for the node to be joined (i.e. the node in none state), verifies its properties - including cluster features - and then:
    • If the node is accepted, the coordinator transitions it to the bootstrap/replace state and transitions the topology to the join_group0 state. The node is added to group 0 and then JOIN_NODE_RESPONSE is sent to it with the information that it was accepted.
    • Otherwise, the node is moved to the left state, the coordinator tells it via JOIN_NODE_RESPONSE that it was rejected, and it shuts down.

The procedure is not retryable: if a node crashes partway through, it will not be allowed to retry with the same host_id - JOIN_NODE_REQUEST will fail. The data directory must be cleared before attempting to add the node again (so that a new host_id is generated).

More details about the procedure and the RPC are described in topology-over-raft.md.

Fixes: #15152

@piodul

piodul commented Aug 28, 2023

Some nice-to-haves that I didn't implement yet, but perhaps could be done in followups:

  • Error injection tests to verify idempotence properties of the procedure
  • Maybe add a new key in system.scylla_local to store the enabled features sent by the topology coordinator on boot, instead of reusing the existing, legacy enabled_features key (which is no longer relevant to the implementation)

or rejected.

In order to correlate requests with replies, the joining node uses a UUID called
"join cookie". Until the joining node acknowledges a reply from the coordinator,
Contributor

I did not look at how this "join cookie" is used in the code, but could host_id be used instead?

Contributor

I'm guessing that host_id is generated and persisted early in the procedure. If the procedure crashes after persisting host_id, the next attempt might use some different parameters (which ones though?) but the same host_id.

Contributor

If "host_id is generated and persisted early in the procedure", then after a crash the same host_id will be used. Theoretically one may change some parameters (cluster name? snitch?) and restart, but if we persist those before starting to join, we can detect the change and fail early. If the features changed, we can detect that in the join_request RPC, since whatever was added to the topology will not match.

Contributor

One way to deal with the lack of a good name for the verb is to rename group0_modify_config to group0_modify_config_legacy and keep using the name group0_modify_config

Contributor Author

> One way to deal with lack of good names for the verb is to rename group0_modify_config to group0_modify_config_legacy and keep using the name group0_modify_config

I suppose that it's not a problem that the join_node name is less clear than group0_modify_config; the bigger problem is that we have two RPCs now which are both used in the handshake named join_node_request and join_node_response, and both of them are two-way. It's confusing to talk about "response to join_node_request" etc.

from the `system.topology` table.

In case of other failures (timeouts, connection issues etc.) the request is
simply retried.
Contributor

If a node that requests joining is permanently dead we need to stop retrying eventually.

Contributor Author

Yes. This should be implemented as a part of the error handling work.

service/join_node.hh (review thread resolved)
@kbr-scylla
Contributor

@gusev-p please review as well. Understanding this new procedure will probably be important to implement upgrades.

I'm guessing that to correctly handle potential node bootstraps in a mixed cluster (where some nodes have upgraded to use raft-topology but others did not), we'll want to distinguish whether the joining node uses the new RPC or the old GROUP0_MODIFY_CONFIG. And probably in a fully upgraded cluster we'll reject attempts that use the old way.


eventually removes it.
- If there is no topology request for that node, it is placed in
the `system.topology` table and an `ok` response is sent back.
- If the node has already joined, an `ok` response is sent back.
Contributor

I'm wondering about the scenario if it's in the middle of joining (post-none state, but before becoming normal).

The node may try to start the procedure from scratch, while the coordinator is somewhere in the middle of it.

But perhaps handling that is not the responsibility of JOIN_NODE_REQUEST but a later step.

Contributor Author

Here, I meant "the topology coordinator accepted it". I hope it's clearer in the current version of the doc.

docs/dev/topology-over-raft.md (review threads resolved)

Upon noticing a node in `none` state and with `join` topology request, the
coordinator verifies parameters of the node (e.g. whether it supports all
of the enabled features) and decides whether to accept it or not. It sends
Contributor

Why do both the handler of JOIN_NODE_REQUEST, and now the topology coordinator, have to verify parameters? Does the coordinator have some additional knowledge? If yes, is the initial verification at all necessary, if there's a second verification done by the coordinator anyway?

Contributor Author

I explained the need for initial verification here: #15196 (comment)

As for the topology coordinator, we use it to do the "ultimate" verification and marking the node as accepted (i.e. continuing with bootstrapping/replacing the node) because we need to serialize it with marking a feature as enabled, which is currently also done by the topology coordinator.

docs/dev/topology-over-raft.md (review thread resolved)
slogger.info("raft topology: join: returning ok for idempotency ({})", params.host_id);
co_return result;
} else {
assert(rs.state == node_state::none);
Contributor

Please use on_internal_error instead

service/storage_service.cc (review threads resolved)

// If the validation failed, we will remove the request at the end.
bool should_drop = std::holds_alternative<join_node_response_params::rejected>(decision);
auto ip = _address_map.find(node.id);
Contributor

Let's add a FIXME, this should eventually be done by messaging_service, not by us.

Contributor Author

I forgot to do this in v2, but will do it on the next occasion (after the next round of comments).

service/storage_service.cc (review thread resolved)
@piodul

piodul commented Aug 30, 2023

Following our discussion at the last meeting, I'll try to simplify the implementation so that it does not use join cookies. For now, I have responded to the comments that I was sure would still be relevant after the changes.

@piodul

piodul commented Sep 11, 2023

v2:

  • Fixed the bugs that caused the tests to fail.
  • Got rid of the join cookie. Now the node needs to complete the handshake with the same snitch name, cluster name and features, or clear its data directory to generate a new host_id. This makes the procedure a bit simpler.
  • Supported features that the node originally started with are now stored in system.scylla_local under the initial_supported_features key, used only for the duration of the handshake.
  • The snitch name is now persistently stored in system.scylla_local and verified on startup.
  • Storage service verbs related to raft now verify the destination host_id (without this, node replace tests would fail).
  • During the handshake, the joining node now waits for normal nodes to become UP (the nodes to wait for are calculated based on group 0 state).
  • Per-node supported features are now kept in the replica_state structure, allowing for more convenient access.


idl/join_node.idl.hh (review thread resolved)
@@ -6272,12 +6562,25 @@ void storage_service::init_messaging_service(sharded<db::system_distributed_keys
return ss.cleanup_tablet(tablet);
});
});
if (raft_topology_change_enabled) {
Contributor

As a follow-up, we can move the existing topology RPCs into this if.

kbr-scylla added a commit that referenced this pull request Sep 27, 2023
…rom Piotr Dulikowski

This PR implements a new procedure for joining nodes to group 0, based
on the description in the "Cluster features on Raft (v2)" document. This
is a continuation of the previous PRs related to cluster features on
raft (#14722,
#14232), and the last piece
necessary to replace cluster feature checks in gossip.

Current implementation relies on gossip shadow round to fetch the set of
enabled features, determine whether the node supports all of the enabled
features, and joins only if it is safe. As we are moving management of
cluster features to group 0, we encounter a problem: the contents of
group 0 itself may depend on features, hence it is not safe to join it
unless we perform the feature check which depends on information in
group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is
modified, and verification of features and other parameters is offloaded
to an existing node in group 0. Instead of directly asking the discovery
leader to unconditionally add the node to the configuration with
`GROUP0_MODIFY_CONFIG`, two different RPCs are added:
`JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as
follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It
  sends a bunch of information describing the node, including supported
  cluster features. The discovery leader verifies some of the parameters
  and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be
  joined (i.e. the node in `none` state), verifies its properties -
  including cluster features - and then informs the new node about the
  decision via `JOIN_NODE_RESPONSE`: whether the new node will be joined
  or not.
        - If the new node is supposed to be joined, the topology
          coordinator will add it to group 0 and the new node will start
          its group 0 server.
        - Otherwise, the request is removed from `system.topology` and
          the new node is shut down.

The procedure is designed to be idempotent. A node can safely retry
joining the cluster even if it shut down while in the middle of the
previous attempt and restarted with a different configuration - a random
UUID called "join cookie" is used to differentiate the attempts.

More details about the RPC are described in `topology-over-raft.md`.

Fixes: #15152

Closes #15196

* github.com:scylladb/scylladb:
  tests: mark test_blocked_bootstrap as skipped
  storage_service: do not check features in shadow round
  storage_service: remove raft_{boostrap,replace}
  topology_coordinator: relax the check in enable_features
  raft_group0: insert replaced node info before server setup
  storage_service: use join node rpc to join the cluster
  topology_coordinator: handle joining nodes
  topology_state_machine: add join_group0 state
  storage_service: add join node RPC handlers
  raft: expose current_leader in raft::server
  storage_service: extract wait_for_live_nodes_timeout constant
  raft_group0: abstract out node joining handshake
  storage_service: pass raft_topology_change_enabled on rpc init
  rpc: add new join handshake verbs
  docs: document the new join procedure
  topology_state_machine: add supported_features to replica_state
  storage_service: check destination host ID in raft verbs
  group_state_machine: take reference to raft address map
  raft_group0: expose joined_group0

Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_new_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.

This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.

The topology coordinator is updated to perform verification of joining
nodes and to send the `JOIN_NODE_RESPONSE` RPC back to the joining node.

Now, the storage_service uses new RPCs to join the cluster. A new
handshaker is implemented and passed to group0 in order to make it
happen.

Currently, information about the replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about the replaced node needs to be inserted
before the handshake happens.

This commit moves the insertion of the replaced node's data before the call
to `join_group0`.

Currently, `enable_features` requires that there is no topology operation in
progress and there are no nodes waiting to be joined. Now, after the new
handshake is implemented, we can drop the second condition because nodes
in `none` state are not a part of group 0 yet.

Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating the features to enable.

The functionality of `raft_bootstrap` and `raft_replace` is handled by
the new handshake, so those functions can be removed.

The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.

With the new procedure to join nodes, testing the scenario in
`test_blocked_bootstrap` becomes very tricky. To recap, the test does
the following:

- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is
  triggered which causes node 4 to fail after it joins group 0. Note
  that before the JOIN_NODE handshake, this would only result in node 4
  being added to the group 0 config; no modification to the group 0
  state itself would happen - the joining node is the one supposed to
  write a request to join.
- Tries to replace node 1 again with node 5, which should succeed.

The bug that this regression test was supposed to check for was that
node 5 would try to resolve all IPs of nodes added to group 0 config.
Because node 4 shuts down before advertising itself in gossip, node 5
would get stuck.

The new procedure to join group 0 complicates the situation because a
request to join is written first to group 0 and only then the topology
coordinator modifies the group 0 config. It is possible to add an error
injection to the topology coordinator code so that it doesn't change the
group 0 state and proceeds with bootstrapping the node, but it will only
get stuck trying to add the node. If node 5 tries to join in the
meantime, the topology coordinator may switch to it and try to bootstrap
it instead, but this is basically a 50% chance because it depends on the
order of node 4 and node 5's host IDs in the topology_state_machine
struct.

It should be possible to fix the test with error recovery, but until
then it is marked as skipped.
@piodul

piodul commented Sep 27, 2023

v5.4:


@piodul

piodul commented Sep 27, 2023

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3956/

Apparently, we exceeded some rate limit in GitHub and that caused the CI to fail. Rerunning.

