
test_raft_recovery_procedure: bootstrapping node fails to start: waiting for schema information to complete #14066

Closed
bhalevy opened this issue May 29, 2023 · 28 comments
Labels: P1 Urgent, status/release blocker, symptom/ci stability, tests/dtest

bhalevy (Member) commented May 29, 2023

Seen failing consistently since https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/257/testReport/raft_recovery_test/TestRaftRecoverProcedure/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split011___test_raft_recovery_procedure_3_/

ccmlib.node.TimeoutError: 25 May 2023 04:06:55 [node4] Missing: ['Starting listening for CQL clients|storage_service .* Starting to bootstrap|(compaction|database) -.*Resharding']:

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/257/artifact/logs-full.release.011/1684987618896_raft_recovery_test.py%3A%3ATestRaftRecoverProcedure%3A%3Atest_raft_recovery_procedure%5B3%5D/node4.log

Scylla version 5.4.0~dev-0.20230525.b0c40a2a0349 with build-id 1b6ce827bc6343fe511c3dc6cba526c27ed52e8e starting ...
...
INFO  2023-05-25 03:57:02,346 [shard 0] storage_service - entering JOINING mode
INFO  2023-05-25 03:57:02,347 [shard 0] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-05-25 03:57:02,351 [shard 1] compaction - [Compact system.scylla_local 34af3880-fab0-11ed-a61a-5b019f191e9e] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-11-big-Data.db:level=0]. 62kB to 57kB (~91% of original) in 5ms = 11MB/s. ~256 total partitions merged to 2.
INFO  2023-05-25 03:57:02,353 [shard 0] compaction - [Compact system.scylla_local 34acee90-fab0-11ed-b7ce-5b009f191e9e] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/mc-10-big-Data.db:level=0]. 11kB to 58kB (~514% of original) in 20ms = 2MB/s. ~256 total partitions merged to 2.
INFO  2023-05-25 03:57:02,354 [shard 0] compaction - [Compact system.local 34b0bf20-fab0-11ed-b7ce-5b009f191e9e] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/me-8-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/mc-6-big-Data.db:level=0:origin=compaction]
INFO  2023-05-25 03:57:02,354 [shard 0] storage_service - waiting for ring information
INFO  2023-05-25 03:57:02,354 [shard 0] storage_service - waiting for schema information to complete
INFO  2023-05-25 03:57:02,362 [shard 0] compaction - [Compact system.local 34b0bf20-fab0-11ed-b7ce-5b009f191e9e] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/me-12-big-Data.db:level=0]. 13kB to 7kB (~54% of original) in 6ms = 1MB/s. ~256 total partitions merged to 1.
INFO  2023-05-25 03:57:02,362 [shard 0] compaction - [Compact system.scylla_local 34b1f7a0-fab0-11ed-b7ce-5b009f191e9e] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-14-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-12-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/mc-10-big-Data.db:level=0:origin=compaction]
INFO  2023-05-25 03:57:02,369 [shard 0] compaction - [Compact system.scylla_local 34b1f7a0-fab0-11ed-b7ce-5b009f191e9e] Compacted 3 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-16-big-Data.db:level=0]. 70kB to 59kB (~83% of original) in 5ms = 11MB/s. ~384 total partitions merged to 2.
INFO  2023-05-25 03:57:02,369 [shard 0] compaction - [Compact system.local 34b30910-fab0-11ed-b7ce-5b009f191e9e] Compacting [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/me-10-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/me-12-big-Data.db:level=0:origin=compaction]
INFO  2023-05-25 03:57:02,375 [shard 0] compaction - [Compact system.local 34b30910-fab0-11ed-b7ce-5b009f191e9e] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-iosncndn/test/node4/data/system/local-7ad54392bcdd35a684174e047860b377/me-14-big-Data.db:level=0]. 13kB to 7kB (~54% of original) in 4ms = 1MB/s. ~256 total partitions merged to 1.
INFO  2023-05-25 03:57:03,354 [shard 0] storage_service - waiting for schema information to complete
...
INFO  2023-05-25 04:06:55,414 [shard 0] storage_service - waiting for schema information to complete
INFO  2023-05-25 04:06:55,758 [shard 0] compaction_manager - Asked to stop
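
For context, the dtest start-up check is essentially a log watch: the node counts as booted once a log line matches one of the patterns listed in the error above, which never happens for node4. A minimal illustration using Python's re module (this is not the actual ccmlib code, and the "healthy" line below is a made-up example):

import re

# The pattern from the TimeoutError above.
pattern = re.compile(
    r"Starting listening for CQL clients"
    r"|storage_service .* Starting to bootstrap"
    r"|(compaction|database) -.*Resharding"
)

stuck = "INFO  2023-05-25 04:06:55,414 [shard 0] storage_service - waiting for schema information to complete"
healthy = "INFO  ... Starting listening for CQL clients on 127.0.92.4:9042"  # illustrative only

print(bool(pattern.search(stuck)))    # False -> the watcher keeps waiting until it times out
print(bool(pattern.search(healthy)))  # True  -> what a successful boot would log
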
bhalevy added the triage/master label May 29, 2023
bhalevy (Member, Author) commented May 29, 2023

Cc @kostja @kbr-scylla

kostja self-assigned this May 29, 2023
bhalevy (Member, Author) commented Jun 12, 2023

Ping, this dtest keeps failing.

kostja (Contributor) commented Jun 12, 2023

Sorry, I missed that one.

kostja (Contributor) commented Jun 12, 2023

Why is this test not part of the gating?

bhalevy (Member, Author) commented Jun 12, 2023

> Why is this test not part of the gating?

Well, once it passes reliably we can add it to gating. For now, the horse has already left the barn.

bhalevy added the tests/dtest, P2 High Priority, and symptom/ci stability labels and removed the triage/master label Jun 13, 2023
kbr-scylla (Contributor) commented:

The test simulates majority loss in a 3-node cluster, then performs the recovery procedure with the single surviving node, removing the other two.
It then tries to boot a new node (ending up with a 2-node cluster).

The new node reports a schema mismatch with the existing node:

INFO  2023-06-20 17:48:34,447 [shard 0] migration_manager - Schema mismatch for 127.0.92.1 (59adb24e-f3cd-3e02-97f0-5b395827453f != 1b3d1ee9-2783-3500-8fb8-c7f47ac79209).
INFO  2023-06-20 17:48:34,457 [shard 0] migration_manager - Checking schema state for 127.0.92.1.

The schema never synchronizes.

But group 0 must be available, because the node managed to join it. Did we somehow lose availability right after joining, so the first node is not pushing schema commands to the new node?
Or perhaps it did push all schema commands, but schema is not synchronizing for some other reason - e.g. the set of enabled features is different? But why would that happen?

kbr-scylla (Contributor) commented:

The schema on the new node is not changing, even after it joins group 0; it stays at the initial version:

INFO  2023-06-20 17:48:20,243 [shard 0] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f

...

INFO  2023-06-20 17:48:34,414 [shard 0] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f

Is the leader not pushing schema commands to the new node?
Or is there no leader?

kbr-scylla (Contributor) commented:

> Or is there no leader?

But that couldn't be - the node managed to join group 0...

kbr-scylla (Contributor) commented:

Ah, of course.

The recovery procedure is broken!
There are no commands to send after recovery, which deletes the Raft data...

Previously this worked because of schema pulls outside Raft, but we got rid of those pulls!
And the new node is not pulling a snapshot, because we assume that we started from the initial state, so no snapshot is required!

cc @gleb-cloudius @kostja @tgrabiec

kbr-scylla (Contributor) commented:

This is a major problem and a release blocker.
Fortunately in 5.3 we still pull schema outside Raft.

We need to modify the recovery procedure so the recovered Raft state somehow includes schema.
This is complicated because different nodes may potentially start from different states; we need to ensure that schema (and, in the future, topology) is synchronized between those nodes before recreating group 0.

That synchronization now doesn't happen, due to the lack of schema pulls!

The old proposal for group 0 recovery included this:
https://docs.google.com/document/d/1ZN1y0-VlXh_hzcpIXqHLqsM-2E1bGE2L56JXXa7e3ZA/edit

but then we settled on a "simpler" solution, which of course turns out to be incorrect...
(well, it was correct while we depended on schema pulls, but it's not correct in the long run)

kbr-scylla added the P1 Urgent and status/release blocker labels and removed the P2 High Priority label Jun 20, 2023
kbr-scylla (Contributor) commented:

Note: the implemented recovery/upgrade procedure does ensure schema synchronization within the existing cluster.
However, new nodes won't get the schema: we don't send it through Raft (it's not part of the initial snapshot) and we no longer pull it outside Raft.

So not only is the recovery procedure broken; upgrade is broken as well.
If you start a cluster with Raft disabled, then upgrade it, then try to boot a new node, the new node won't be able to obtain the current schema.

kbr-scylla (Contributor) commented:

Configuring Raft to push snapshots even at index 0 would solve this. The incorrect assumption implicitly made in the current recovery procedure implementation is that the initial snapshot contains the full group 0 state; we could say it does, but the problem is that we don't send index-0 snapshots.
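
To make the failure mode concrete, here is a toy model of the replication decision (this is not ScyllaDB's Raft implementation; all names are illustrative). With the initial snapshot at index 0 and an empty log, a joiner that is also at index 0 is never considered behind, so nothing carrying the group 0 state is ever sent; bumping the snapshot index to 1 forces a snapshot transfer:

from dataclasses import dataclass, field

@dataclass
class Leader:
    snapshot_idx: int      # index covered by the leader's snapshot
    snapshot_state: dict   # state machine content at that index (e.g. the schema)
    log: list = field(default_factory=list)  # entries after the snapshot

    def replicate_to(self, follower_idx: int):
        """What the leader sends to a follower whose log ends at follower_idx."""
        if follower_idx < self.snapshot_idx:
            return ("install_snapshot", self.snapshot_state)
        missing = self.log[follower_idx - self.snapshot_idx:]
        return ("append_entries", missing)  # may be empty (heartbeat only)

# After recovery/upgrade the leader's state machine holds the schema, but its
# Raft data was recreated: empty log, initial snapshot at index 0.
recovered_schema = {"ks.table": "CREATE TABLE ..."}

broken = Leader(snapshot_idx=0, snapshot_state=recovered_schema)
print(broken.replicate_to(0))  # ('append_entries', []) -> the joiner never gets the schema

fixed = Leader(snapshot_idx=1, snapshot_state=recovered_schema)
print(fixed.replicate_to(0))   # ('install_snapshot', {...}) -> schema transferred via snapshot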

kbr-scylla (Contributor) commented:

diff to catch the issue in test.py:

diff --git a/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py b/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py
index cac0064aa5..b229d8fc7e 100644
--- a/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py
+++ b/test/topology_raft_disabled/test_raft_upgrade_majority_loss.py
@@ -85,3 +85,8 @@ async def test_recovery_after_majority_loss(manager: ManagerClient, random_table
 
     logging.info("Creating another table")
     await random_tables.add_table(ncolumns=5)
+
+    logging.info("Booting new node")
+    await manager.server_add(config={
+        'consistent_cluster_management': True
+    })

kbr-scylla (Contributor) commented:

Fortunately, the test with the above modification passes on 5.2 and 5.3 - probably because of the outside-Raft schema pulls.

kostja (Contributor) commented Jun 20, 2023

Just leave the pulls in place during recovery for now. Schedule a real fix for 5.4

kostja (Contributor) commented Jun 20, 2023

Pushing snapshot at index 0 sounds like a good idea to me as well.

kostja (Contributor) commented Jun 20, 2023

> diff to catch the issue in test.py: [...]

Let's add a separate test case, not patch an existing test case. This is a distinct scenario.

kbr-scylla (Contributor) commented:

> Let's add a separate test case, not patch an existing test case. This is a distinct scenario.

Why? The separate test case would do the exact same thing as this one, just with the additional bootstrap step.

> Just leave the pulls in place during recovery for now. Schedule a real fix for 5.4

The pull would need to happen after recovery (or upgrade) has already been done - when a new node joins a cluster that has recovered (upgraded). We're already operating in Raft mode at that moment.

gleb-cloudius (Contributor) commented:

We can add a new type of command "sync schema from leader" and send it after the recovery.

kbr-scylla (Contributor) commented:

Alternative: when setting up the initial snapshot during recovery/upgrade, set its index to 1 (or 2). Then it will all happen automagically.

kbr-scylla (Contributor) commented:

> Alternative: when setting up the initial snapshot during recovery/upgrade, set its index to 1 (or 2). Then it will all happen automagically.

POC:

diff --git a/service/raft/raft_group0.cc b/service/raft/raft_group0.cc
index dc444ad215..f13ee8cd33 100644
--- a/service/raft/raft_group0.cc
+++ b/service/raft/raft_group0.cc
@@ -420,14 +420,16 @@ future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, bool as_
         if (server == nullptr) {
             // This is the first time discovery is run. Create and start a Raft server for group 0 on this node.
             raft::configuration initial_configuration;
+            auto b = false;
             if (g0_info.id == my_id) {
                 // We were chosen as the discovery leader.
                 // We should start a new group with this node as voter.
                 group0_log.info("Server {} chosen as discovery leader; bootstrapping group 0 from scratch", my_id);
                 initial_configuration.current.emplace(my_addr, true);
+                b = true;
             }
             // Bootstrap the initial configuration
-            co_await raft_sys_table_storage(qp, group0_id, my_id).bootstrap(std::move(initial_configuration));
+            co_await raft_sys_table_storage(qp, group0_id, my_id).bootstrap(std::move(initial_configuration), b);
             co_await start_server_for_group0(group0_id, ss, qp, mm, cdc_gen_service);
             server = &_raft_gr.group0();
             // FIXME if we crash now or after getting added to the config but before storing group 0 ID,
diff --git a/service/raft/raft_sys_table_storage.cc b/service/raft/raft_sys_table_storage.cc
index abf4ae7e01..4b945814ee 100644
--- a/service/raft/raft_sys_table_storage.cc
+++ b/service/raft/raft_sys_table_storage.cc
@@ -301,8 +301,8 @@ future<> raft_sys_table_storage::execute_with_linearization_point(std::function<
     }
 }
 
-future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation) {
-    raft::snapshot_descriptor snapshot;
+future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation, bool b) {
+    raft::snapshot_descriptor snapshot{.idx{uint64_t{b}}};
     snapshot.id = raft::snapshot_id::create_random_id();
     snapshot.config = std::move(initial_configuation);
     co_await store_snapshot_descriptor(snapshot, 0);
diff --git a/service/raft/raft_sys_table_storage.hh b/service/raft/raft_sys_table_storage.hh
index 27573b5b6a..55a068f1e4 100644
--- a/service/raft/raft_sys_table_storage.hh
+++ b/service/raft/raft_sys_table_storage.hh
@@ -71,9 +71,7 @@ class raft_sys_table_storage : public raft::persistence {
 
     // Persist initial configuration of a new Raft group.
     // To be called before start for the new group.
-    // Uses a special snapshot id (0) to identify the snapshot
-    // descriptor.
-    future<> bootstrap(raft::configuration initial_configuation);
+    future<> bootstrap(raft::configuration initial_configuation, bool b);
 private:
 
     future<> do_store_log_entries(const std::vector<raft::log_entry_ptr>& entries);

The modified test passes with this.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 21, 2023
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: scylladb#14066
bhalevy (Member, Author) commented Jun 25, 2023

Cc @DoronArazii: this test is failing consistently in dtest-daily-release (but work on a fix is in progress).

DoronArazii added this to the 5.3 milestone Jun 25, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 29, 2023
margdoc pushed a commit to margdoc/scylla that referenced this issue Jun 30, 2023
Closes scylladb#14336
DoronArazii modified the milestones: 5.3, 5.4 on Jul 4, 2023
gdubicki commented Aug 3, 2023

Hey, I think we are being hit by this issue in Scylla 5.2. We cannot bootstrap a new node; it's stuck on:

[shard 0] storage_service - waiting for schema information to complete

Is this planned to be backported to 5.2 soon?

Or is there some kind of workaround that we could use as a one-time fix for the bootstrap?

kostja (Contributor) commented Aug 3, 2023

The commit that introduced the issue is not part of 5.2. Are you using consistent cluster management in 5.2? A discussion in a closed issue is not the best way to proceed; perhaps let's move to Slack? Generally, if schema propagation gets stuck, try an innocuous schema change, like creating and dropping an empty keyspace.
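
A minimal sketch of that workaround using the Python driver (the contact point and keyspace name are placeholders; adjust for your cluster):

from cassandra.cluster import Cluster

# Trigger an innocuous schema change: create and immediately drop an empty keyspace.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS schema_nudge "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute("DROP KEYSPACE IF EXISTS schema_nudge")

cluster.shutdown()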

gdubicki commented Aug 4, 2023

Big thanks for the tip about the schema change, @kostja, it helped! (Btw, I did ask on Slack but did not get any answers.)

avikivity (Member) commented:

@kbr-scylla please evaluate for backport

kbr-scylla (Contributor) commented:

No need to backport since in 5.2 schema is still being pulled outside Raft.
Also backporting it would be a bit of a pain because it depends on some earlier changes.

avikivity (Member) commented:

Removing backport candidate label per above.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jan 31, 2024
Closes scylladb#14336

(cherry picked from commit ff386e7)

Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots even in clusters which bootstrap
with Raft enabled, because tombstones in the current schema state expire,
so applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over the
latest schema state.

Ref: scylladb#16683
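
To see why replaying old schema mutations stops being idempotent once tombstones are purged, here is a toy last-write-wins model (this is not ScyllaDB's actual schema-table machinery; names are illustrative):

def merge(state, key, value, ts):
    """Last-write-wins merge of a single cell; value=None represents a tombstone."""
    old = state.get(key)
    if old is None or ts > old[1]:
        state[key] = (value, ts)

state = {}
merge(state, "ks.t.col_x", "int", ts=1)  # old log entry: add a column
merge(state, "ks.t.col_x", None, ts=2)   # later entry: drop it (tombstone)

# While the tombstone is present, replaying the old entry is a no-op:
merge(state, "ks.t.col_x", "int", ts=1)
assert state["ks.t.col_x"] == (None, 2)

# Once the tombstone expires and is purged...
del state["ks.t.col_x"]

# ...replaying the same old entry resurrects the dropped column:
merge(state, "ks.t.col_x", "int", ts=1)
assert state["ks.t.col_x"] == ("int", 1)  # no longer idempotent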