test_raft_recovery_procedure: bootstrapping node fails to start: waiting for schema information to complete #14066
Comments
Ping. This dtest keeps failing.
Sorry, missed that one.
Why is this test not part of the gating?
Well, once it passes reliably we can add it to gating.
The test simulates majority loss in a 3-node cluster, then performs recovery with 1 node, removing the other two. The new node claims there's a schema mismatch with the existing node:
The schema never synchronizes. But group 0 must be available, because the node managed to join it. Did we somehow lose availability right after joining, so that the first node is not pushing schema commands to the new node?
Schema on the new node is not changing, even after joining group 0. It stays at the initial version:
Is the leader not pushing schema commands to the new node?
But that couldn't be - the node managed to join group 0...
Ah, the recovery procedure is broken! Previously it worked because of schema pulls outside Raft, but we got rid of those pulls!
This is a major problem and a release blocker. We need to modify the recovery procedure so that the recovered Raft state somehow includes schema, which now doesn't happen due to the lack of schema pulls! The old proposal for group 0 recovery included this: but then we settled on a "simpler" solution, which of course turns out to be incorrect...
Note: the implemented recovery/upgrade procedure does ensure schema synchronization in the existing cluster, but a node that joins afterwards receives nothing through Raft. So not only is the recovery procedure broken; upgrade is broken as well.
Configuring Raft to do snapshot pushes even for index 0 would solve this - the incorrect assumption implicitly made in the current recovery procedure implementation is that the initial snapshot contains the full group 0 state. We could say it does; the problem is that we don't send index-0 snapshots.
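To illustrate, here is a toy model of the leader's replication decision - plain Python with hypothetical names, a sketch of the idea rather than Scylla's actual Raft code:

def next_action(snapshot_idx: int, follower_next_idx: int) -> str:
    # Entries up to and including snapshot_idx are not in the log;
    # if the follower still needs them, only a snapshot transfer helps.
    if follower_next_idx <= snapshot_idx:
        return "send_snapshot"
    return "send_log_entries"

# Today the initial snapshot sits at index 0. A fresh follower's next
# index is 1, so the leader never sends the snapshot - and the follower
# never receives the pre-existing group 0 (schema) state.
assert next_action(snapshot_idx=0, follower_next_idx=1) == "send_log_entries"

# With the initial snapshot at index 1, the same follower is behind the
# snapshot, which forces a full state transfer.
assert next_action(snapshot_idx=1, follower_next_idx=1) == "send_snapshot"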
A diff to catch the issue in test.py:
diff --git a/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py b/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py
index cac0064aa5..b229d8fc7e 100644
--- a/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py
+++ b/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py
@@ -85,3 +85,8 @@ async def test_recovery_after_majority_loss(manager: ManagerClient, random_tables
     logging.info("Creating another table")
     await random_tables.add_table(ncolumns=5)
+
+    logging.info("Booting new node")
+    await manager.server_add(config={
+        'consistent_cluster_management': True
+    })
Fortunately, the test with the above modification passes on 5.2 and 5.3 - probably because of the outside-Raft schema pulls.
Just leave the pulls in place during recovery for now. Schedule a real fix for 5.4.
Pushing a snapshot at index 0 sounds like a good idea to me as well.
Let's add a separate test case, not patch an existing test case. This is a distinct scenario.
Why? The separate test case would do the exact same thing as this one, with the additional bootstrap step.
The pull needs to happen after recovery (or upgrade) has already been done - when a new node joins a cluster that has recovered (upgraded). We're already operating in Raft mode at that moment.
We can add a new type of command, "sync schema from leader", and send it after the recovery.
Alternative: when setting up the initial snapshot during recovery/upgrade, set its index to 1 (or 2). Then it will all happen automagically.
POC:
diff --git a/service/raft/raft_group0.cc b/service/raft/raft_group0.cc
index dc444ad215..f13ee8cd33 100644
--- a/service/raft/raft_group0.cc
+++ b/service/raft/raft_group0.cc
@@ -420,14 +420,16 @@ future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, bool as_
     if (server == nullptr) {
         // This is the first time discovery is run. Create and start a Raft server for group 0 on this node.
         raft::configuration initial_configuration;
+        auto b = false;
         if (g0_info.id == my_id) {
             // We were chosen as the discovery leader.
             // We should start a new group with this node as voter.
             group0_log.info("Server {} chosen as discovery leader; bootstrapping group 0 from scratch", my_id);
             initial_configuration.current.emplace(my_addr, true);
+            b = true;
         }
         // Bootstrap the initial configuration
-        co_await raft_sys_table_storage(qp, group0_id, my_id).bootstrap(std::move(initial_configuration));
+        co_await raft_sys_table_storage(qp, group0_id, my_id).bootstrap(std::move(initial_configuration), b);
         co_await start_server_for_group0(group0_id, ss, qp, mm, cdc_gen_service);
         server = &_raft_gr.group0();
         // FIXME if we crash now or after getting added to the config but before storing group 0 ID,
diff --git a/service/raft/raft_sys_table_storage.cc b/service/raft/raft_sys_table_storage.cc
index abf4ae7e01..4b945814ee 100644
--- a/service/raft/raft_sys_table_storage.cc
+++ b/service/raft/raft_sys_table_storage.cc
@@ -301,8 +301,8 @@ future<> raft_sys_table_storage::execute_with_linearization_point(std::function<
     }
 }
-future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation) {
-    raft::snapshot_descriptor snapshot;
+future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation, bool b) {
+    raft::snapshot_descriptor snapshot{.idx{uint64_t{b}}};
     snapshot.id = raft::snapshot_id::create_random_id();
     snapshot.config = std::move(initial_configuation);
     co_await store_snapshot_descriptor(snapshot, 0);
diff --git a/service/raft/raft_sys_table_storage.hh b/service/raft/raft_sys_table_storage.hh
index 27573b5b6a..55a068f1e4 100644
--- a/service/raft/raft_sys_table_storage.hh
+++ b/service/raft/raft_sys_table_storage.hh
@@ -71,9 +71,7 @@ class raft_sys_table_storage : public raft::persistence {
     // Persist initial configuration of a new Raft group.
     // To be called before start for the new group.
-    // Uses a special snapshot id (0) to identify the snapshot
-    // descriptor.
-    future<> bootstrap(raft::configuration initial_configuation);
+    future<> bootstrap(raft::configuration initial_configuation, bool b);
 private:
     future<> do_store_log_entries(const std::vector<raft::log_entry_ptr>& entries);

The modified test passes with this.
Cc @DoronArazii this test is failing consistently in dtest-daily-release (but work is in progress) |
When we upgrade a cluster to use Raft, or perform the manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In 91f609d we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering a snapshot transfer and forcing them to pull the group 0 state.

It's unnecessary to do this for a cluster which bootstraps with Raft enabled right away, but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that.

Extend the Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - the node would never join, waiting for schema).

Fixes: scylladb#14066
Closes scylladb#14336
Hey, I think we've been hit by this issue in Scylla 5.2. We cannot bootstrap a new node as it's stuck on:
Is this planned to be backported to 5.2 soon? Or is there some kind of workaround that we could use for a one-time fix of a bootstrap?
The commit that introduced the issue is not part of 5.2. Are you using consistent cluster management in 5.2? A discussion in a closed issue is not the best way to proceed; perhaps let's move to Slack? Generally, if schema propagation gets stuck, try an innocuous schema change, like creating/dropping an empty keyspace.
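For reference, a minimal sketch of that workaround using the Python cassandra-driver; the contact point and keyspace name below are placeholders, not anything prescribed in the thread:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # placeholder contact point; adjust for your cluster
session = cluster.connect()

# An innocuous schema change: creating and dropping an empty keyspace
# bumps the schema version, which can nudge stuck schema propagation.
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS schema_nudge "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("DROP KEYSPACE schema_nudge")

cluster.shutdown()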
Big thanks for the tip with the schema change, @kostja, it helped! (Btw: I did ask on Slack but did not get any answers.)
@kbr-scylla please evaluate for backport |
No need to backport since in 5.2 schema is still being pulled outside Raft. |
Removing backport candidate label per above. |
When we upgrade a cluster to use Raft, or perform the manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In 91f609d we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering a snapshot transfer and forcing them to pull the group 0 state.

It's unnecessary to do this for a cluster which bootstraps with Raft enabled right away, but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that.

Extend the Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - the node would never join, waiting for schema).

Fixes: scylladb#14066
Closes scylladb#14336
(cherry picked from commit ff386e7)

Backport note: contrary to the claims above, it turns out that it is actually necessary to create snapshots in clusters which bootstrap with Raft, because tombstones in the current schema state expire, hence applying schema mutations from old Raft log entries is not really idempotent. Snapshot transfer, which transfers group 0 history and state_ids, prevents old entries from applying schema mutations over the latest schema state.
Ref: scylladb#16683
Seen consistently since https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/257/testReport/raft_recovery_test/TestRaftRecoverProcedure/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split011___test_raft_recovery_procedure_3_/
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/257/artifact/logs-full.release.011/1684987618896_raft_recovery_test.py%3A%3ATestRaftRecoverProcedure%3A%3Atest_raft_recovery_procedure%5B3%5D/node4.log