
[Backport 5.2] Raft snapshot fixes #17087

Merged

Conversation

kbr-scylla (Contributor)

Backports required to fix #16683 in 5.2:

  • when creating the first group 0 server, create a snapshot with a non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
  • add an API to trigger a Raft snapshot
  • use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with an index-0 snapshot.

Gleb Natapov and others added 2 commits January 31, 2024 16:50
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded though, since the state of group0 will not be re-created from the
log on restart: the entries will have outdated ids and will be skipped.
As a result, the in-memory state machine state will not be restored.
This is not a problem now since schema state is restored outside of Raft
code.
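
To make this concrete, here is a minimal self-contained sketch
(illustrative types only, not Scylla's actual
`raft_sys_table_storage::bootstrap` code) of why the initial config-only
snapshot needs a non-empty id:

// Sketch only: a snapshot id that is all-zero is treated as "no snapshot"
// and skipped at load time, so the bootstrap path must generate a real id
// for the initial, config-only snapshot.
#include <array>
#include <cstdint>
#include <random>

struct snapshot_id {
    std::array<uint64_t, 2> bits{};   // all-zero means "empty" id
    bool empty() const { return bits[0] == 0 && bits[1] == 0; }
    static snapshot_id create_random() {
        std::mt19937_64 rng{std::random_device{}()};
        return {{rng(), rng()}};
    }
};

struct snapshot_descriptor {
    snapshot_id id;   // must be non-empty for the snapshot to be loaded on restart
};

snapshot_descriptor make_initial_snapshot() {
    // Previously the id was left empty, so the snapshot was ignored on start.
    return {.id = snapshot_id::create_random()};
}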

Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070)
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
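
A self-contained sketch of the mechanism (illustrative types, not the
actual Raft fsm code): with the initial snapshot at index 1 and the log
truncated up to it, an empty joining follower is necessarily behind the
snapshot, so the leader has to fall back to a snapshot transfer.

// Illustrative sketch: a brand-new follower (next index 1) cannot be caught
// up via AppendEntries when the leader's snapshot already covers index 1,
// so the leader sends the snapshot, which carries the group 0 state.
#include <cstdint>

enum class catch_up_action { append_entries, install_snapshot };

struct leader_log_view {
    uint64_t snapshot_idx;   // index covered by the leader's snapshot
    uint64_t first_log_idx;  // first entry still present in the log
};

catch_up_action choose_catch_up(const leader_log_view& log, uint64_t follower_next_idx) {
    if (follower_next_idx <= log.snapshot_idx || follower_next_idx < log.first_log_idx) {
        // The entries the follower needs are only available through the snapshot.
        return catch_up_action::install_snapshot;
    }
    return catch_up_action::append_entries;
}

// With snapshot_idx == 1, an empty joining follower (next index 1) always takes
// the install_snapshot branch; with snapshot_idx == 0 it would not.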

It's unnecessary to do this for a cluster which bootstraps with Raft
enabled right away, but it also doesn't hurt, so we keep the logic
simple and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - the node
would never join, waiting for schema).

Fixes: scylladb#14066

Closes scylladb#14336

(cherry picked from commit ff386e7)

Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots in clusters which bootstrap with
Raft, because tombstones in the current schema state expire, hence
applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over the
latest schema state.

Ref: scylladb#16683
@scylladb-promoter (Contributor)

@@ -303,6 +304,9 @@ future<> raft_sys_table_storage::execute_with_linearization_point(std::function<

future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation, bool nontrivial_snapshot) {
auto init_index = nontrivial_snapshot ? raft::index_t{1} : raft::index_t{0};
utils::get_local_injector().inject("raft_sys_table_storage::bootstrap/init_index_0", [&init_index] {
Contributor

Do we need to backport this if it is not used?

kbr-scylla (Contributor Author)

No, I'll remove

@kbr-scylla (Contributor Author)

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-5.2/job/scylla-ci/91/

Damn. This is #12972, and this backport will cause more snapshot transfers, triggering the issue more often. Looks like we're not ignoring the error in 5.2 dtests :( (we only do it in SCT).

@mykaul (Contributor) commented Feb 1, 2024

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-5.2/job/scylla-ci/91/

Damn. This is #12972, and this backport will cause more snapshot transfers, triggering the issue more often. Looks like we're not ignoring the error in 5.2 dtests :( (we only do it in SCT).

But from that perspective, it's an artificial failure - it doesn't really mean the functionality is broken.
Or - do we need to backport some dtest fix to 5.2?

@kbr-scylla (Contributor Author) commented Feb 1, 2024

But from that perspective, it's an artificial failure - it doesn't really mean the functionality is broken.

The problem is that Scylla closes connections after learning topology, and this may happen in the middle of a snapshot transfer. This is not a problem for Raft, which retries the snapshot transfer indefinitely until it succeeds, so everything works. But there is an ERROR in the logs due to that single failed transfer, and the dtest infrastructure considers any ERROR a failure unless it is explicitly ignored by the test.

There was a fix that put some order into bootstrap so that the connection closing happens before we start snapshot transfers: #14354

But that fix was pretty complex and depended on other previous refactors, so we didn't backport it.

We have two options here:

  • a change to dtest in 5.2 to ignore this pattern in the logs (in all dtests)
  • change to Scylla to make it a WARN instead of ERROR

Actually, I think I'll do the latter in this PR; the process will be smoother this way.

michaelhly and others added 2 commits February 1, 2024 13:10
In case the snapshot update fails, we don't truncate the commit log.

Fixes scylladb#9603

Closes scylladb#15540

(cherry picked from commit 1640f83)
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.

In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec7 but
the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from an interrupted snapshot transfer by
retrying it eventually, and everything works. However, an ERROR is
reported due to that one failed snapshot transfer, and dtests don't like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs, even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
@kbr-scylla (Contributor Author) commented Feb 1, 2024

I prepared this workaround:

diff --git a/raft/server.cc b/raft/server.cc
index 48e15cf948..39a850c2c0 100644
--- a/raft/server.cc
+++ b/raft/server.cc
@@ -1064,6 +1064,18 @@ future<> server_impl::io_fiber(index_t last_stable) {
     co_return;
 }
 
+static bool is_closed_error(std::exception_ptr ep) {
+    try {
+        std::rethrow_exception(ep);
+    } catch (const seastar::rpc::remote_verb_error& e) {
+        return std::string_view{e.what()} == "connection is closed";
+    } catch (const seastar::rpc::closed_error&) {
+        return true;
+    } catch (...) {
+        return false;
+    }
+}
+
 void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
     seastar::abort_source as;
     uint64_t id = _next_snapshot_transfer_id++;
@@ -1079,7 +1091,11 @@ void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
             _snapshot_transfers.erase(dst);
             auto reply = raft::snapshot_reply{.current_term = _fsm->get_current_term(), .success = false};
             if (f.failed()) {
-                logger.error("[{}] Transferring snapshot to {} failed with: {}", _id, dst, f.get_exception());
+                auto ep = f.get_exception();
+                // Report our or remote's closed_error as WARNs instead of ERRORs.
+                // Workaround for scylladb/scylladb#12972 for ScyllaDB 5.2.
+                auto level = is_closed_error(ep) ? log_level::warn : log_level::error;
+                logger.log(level, "[{}] Transferring snapshot to {} failed with: {}", _id, dst, ep);
             } else {
                 logger.trace("[{}] Transferred snapshot to {}", _id, dst);
                 reply = f.get();

and tested manually that it works.

I'll also need to include #17106 in the backport so we need to get that into master first.

This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify the number of trailing entries to keep, if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.
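
A hedged usage sketch of the new API (the method name
`trigger_snapshot` comes from this PR; the exact signature, assumed here
to take a `seastar::abort_source*`, should be checked against
`raft/server.hh`):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "raft/server.hh"

// Ask the Raft server to create a snapshot at its currently applied state
// and truncate the log (this PR keeps no trailing entries).
seastar::future<> snapshot_now(raft::server& srv, seastar::abort_source& as) {
    co_await srv.trigger_snapshot(&as);   // signature assumed, see lead-in
}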

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7),
but a problem remains with these existing deployments coming from 5.2:
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and an HTTP endpoint that uses it.

In a follow-up PR, we plan to modify the group 0 server startup logic to
call this API automatically if it sees that no snapshot is present yet
(to fix the aforementioned 5.2 deployments once they upgrade).

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
… from Kamil Braun

The persisted snapshot index may be 0 if the snapshot was created in an
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade procedure, on a cluster with existing schema. So a deployment
with an index-0 snapshot is broken and we need to fix it. We can use the
new `raft::server::trigger_snapshot` API for that.
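
A sketch of the repair-on-restart idea (function and parameter names are
illustrative, not the actual `raft_group0` startup code; the
`trigger_snapshot` signature is assumed as above):

#include <cstdint>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "raft/server.hh"

// If the group 0 snapshot loaded from disk sits at index 0 (a deployment
// that upgraded to Raft on 5.2), ask the server to create a fresh snapshot
// so that future joining nodes receive a snapshot transfer.
seastar::future<> maybe_fix_legacy_snapshot(raft::server& group0_server,
                                            uint64_t persisted_snapshot_idx,
                                            seastar::abort_source& as) {
    if (persisted_snapshot_idx == 0) {
        co_await group0_server.trigger_snapshot(&as);
    }
    co_return;
}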

Also add a test.

Fixes scylladb#16683

Closes scylladb#17072

* github.com:scylladb/scylladb:
  test: add test for fixing a broken group 0 snapshot
  raft_group0: trigger snapshot if existing snapshot index is 0

(cherry picked from commit 181f68f)

Backport note: test_raft_fix_broken_snapshot had to be removed because
the "error injections enabled at startup" feature does not yet exist in
5.2.
@kbr-scylla (Contributor Author) commented Feb 2, 2024

new version:

@scylladb-promoter (Contributor)

Add workaround for scylladb/python-driver#295.

Also, an assert made at the end of the test was false; it is fixed, with
an appropriate comment added.

(cherry picked from commit 74bf60a)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.

To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.

Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.

But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` also holds the latest snapshot
ID. And since 1640f83, storing that ID and truncating the log in
`system.raft` happen atomically.

Closes scylladb#17106

(cherry picked from commit c911bf1)
@scylladb-promoter (Contributor)

@mykaul added the Field-Tier1 (Urgent issues as per FieldEngineering request) and area/raft labels Feb 4, 2024
@scylladb-promoter scylladb-promoter merged commit 9291eaf into scylladb:branch-5.2 Feb 7, 2024
3 checks passed