
[Backport 5.2] Raft snapshot fixes #17087

Merged

Conversation

kbr-scylla (Contributor)

Backports required to fix #16683 in 5.2:

  • when creating the first group 0 server, create a snapshot with a non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
  • add an API to trigger a Raft snapshot
  • use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with an index-0 snapshot.

Gleb Natapov and others added 2 commits January 31, 2024 16:50
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded though, since the state of group0 will not be re-created from the
log on restart: the entries will have outdated ids and will be skipped.
As a result, the in-memory state machine state will not be restored.
This is not a problem now since schema state is restored outside of Raft
code.
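
To make this concrete, here is a minimal self-contained sketch
(illustrative types only, not Scylla's actual
`raft_sys_table_storage::bootstrap` code) of why the initial config-only
snapshot needs a non-empty id:

// Sketch only: a snapshot id that is all-zero is treated as "no snapshot"
// and skipped at load time, so the bootstrap path must generate a real id
// for the initial, config-only snapshot.
#include <array>
#include <cstdint>
#include <random>

struct snapshot_id {
    std::array<uint64_t, 2> bits{};   // all-zero means "empty" id
    bool empty() const { return bits[0] == 0 && bits[1] == 0; }
    static snapshot_id create_random() {
        std::mt19937_64 rng{std::random_device{}()};
        return {{rng(), rng()}};
    }
};

struct snapshot_descriptor {
    snapshot_id id;   // must be non-empty for the snapshot to be loaded on restart
};

snapshot_descriptor make_initial_snapshot() {
    // Previously the id was left empty, so the snapshot was ignored on start.
    return {.id = snapshot_id::create_random()};
}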

Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070)
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
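
A self-contained sketch of the mechanism (illustrative types, not the
actual Raft fsm code): with the initial snapshot at index 1 and the log
truncated up to it, an empty joining follower is necessarily behind the
snapshot, so the leader has to fall back to a snapshot transfer.

// Illustrative sketch: a brand-new follower (next index 1) cannot be caught
// up via AppendEntries when the leader's snapshot already covers index 1,
// so the leader sends the snapshot, which carries the group 0 state.
#include <cstdint>

enum class catch_up_action { append_entries, install_snapshot };

struct leader_log_view {
    uint64_t snapshot_idx;   // index covered by the leader's snapshot
    uint64_t first_log_idx;  // first entry still present in the log
};

catch_up_action choose_catch_up(const leader_log_view& log, uint64_t follower_next_idx) {
    if (follower_next_idx <= log.snapshot_idx || follower_next_idx < log.first_log_idx) {
        // The entries the follower needs are only available through the snapshot.
        return catch_up_action::install_snapshot;
    }
    return catch_up_action::append_entries;
}

// With snapshot_idx == 1, an empty joining follower (next index 1) always takes
// the install_snapshot branch; with snapshot_idx == 0 it would not.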

It's unnecessary to do this for a cluster which bootstraps with Raft
enabled right away, but it also doesn't hurt, so we keep the logic
simple and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - the node
would never join, waiting for schema).

Fixes: scylladb#14066

Closes scylladb#14336

(cherry picked from commit ff386e7)

Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots in clusters which bootstrap with
Raft, because tombstones in the current schema state expire, hence
applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over the
latest schema state.

Ref: scylladb#16683
@scylladb-promoter (Contributor)

@@ -303,6 +304,9 @@ future<> raft_sys_table_storage::execute_with_linearization_point(std::function<

future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation, bool nontrivial_snapshot) {
auto init_index = nontrivial_snapshot ? raft::index_t{1} : raft::index_t{0};
utils::get_local_injector().inject("raft_sys_table_storage::bootstrap/init_index_0", [&init_index] {
Contributor

Do we need to backport this if it is not used?

kbr-scylla (Contributor Author)

No, I'll remove

@kbr-scylla (Contributor Author)

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-5.2/job/scylla-ci/91/

Damn. This is #12972, and this backport will cause more snapshot transfers, triggering the issue more often. Looks like we're not ignoring the error in 5.2 dtests :( (we only do it in SCT).

@mykaul (Contributor) commented Feb 1, 2024

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-5.2/job/scylla-ci/91/

Damn. This is #12972, and this backport will cause more snapshot transfers, triggering the issue more often. Looks like we're not ignoring the error in 5.2 dtests :( (we only do it in SCT).

But from that perspective, it's an artificial failure - it doesn't really mean the functionality is broken.
Or - do we need to backport some dtest fix to 5.2?

@kbr-scylla (Contributor Author) commented Feb 1, 2024

But from that perspective, it's an artificial failure - it doesn't really mean the functionality is broken.

The problem is that Scylla closes connections after learning topology, and this may happen in the middle of a snapshot transfer. This is not a problem for Raft, which retries the snapshot transfer indefinitely until it succeeds, so everything works. But there is an ERROR in the logs due to that single failed transfer, and the dtest infrastructure considers any ERROR a failure unless it is explicitly ignored by the test.

There was a fix that put some order into bootstrap so that the connection closing happens before we start snapshot transfers: #14354

But that fix was pretty complex and depended on other previous refactors, so we didn't backport it.

We have two options here:

  • a change to dtest in 5.2 to ignore this pattern in the logs (in all dtests)
  • change to Scylla to make it a WARN instead of ERROR

Actually, I think I'll do the latter in this PR; the process will be smoother this way.

michaelhly and others added 2 commits February 1, 2024 13:10
In case the snapshot update fails, we don't truncate the commit log.

Fixes scylladb#9603

Closes scylladb#15540

(cherry picked from commit 1640f83)
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.

In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec7 but
the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from an interrupted snapshot transfer by
retrying it eventually, and everything works. However, an ERROR is
reported due to that one failed snapshot transfer, and dtests don't like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs, even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
@kbr-scylla (Contributor Author) commented Feb 1, 2024

I prepared this workaround:

diff --git a/raft/server.cc b/raft/server.cc
index 48e15cf948..39a850c2c0 100644
--- a/raft/server.cc
+++ b/raft/server.cc
@@ -1064,6 +1064,18 @@ future<> server_impl::io_fiber(index_t last_stable) {
     co_return;
 }
 
+static bool is_closed_error(std::exception_ptr ep) {
+    try {
+        std::rethrow_exception(ep);
+    } catch (const seastar::rpc::remote_verb_error& e) {
+        return std::string_view{e.what()} == "connection is closed";
+    } catch (const seastar::rpc::closed_error&) {
+        return true;
+    } catch (...) {
+        return false;
+    }
+}
+
 void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
     seastar::abort_source as;
     uint64_t id = _next_snapshot_transfer_id++;
@@ -1079,7 +1091,11 @@ void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
             _snapshot_transfers.erase(dst);
             auto reply = raft::snapshot_reply{.current_term = _fsm->get_current_term(), .success = false};
             if (f.failed()) {
-                logger.error("[{}] Transferring snapshot to {} failed with: {}", _id, dst, f.get_exception());
+                auto ep = f.get_exception();
+                // Report our or remote's closed_error as WARNs instead of ERRORs.
+                // Workaround for scylladb/scylladb#12972 for ScyllaDB 5.2.
+                auto level = is_closed_error(ep) ? log_level::warn : log_level::error;
+                logger.log(level, "[{}] Transferring snapshot to {} failed with: {}", _id, dst, ep);
             } else {
                 logger.trace("[{}] Transferred snapshot to {}", _id, dst);
                 reply = f.get();

and tested manually that it works.

I'll also need to include #17106 in the backport so we need to get that into master first.

This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify the number of trailing entries to keep, if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.
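
A hedged usage sketch of the new API (the method name
`trigger_snapshot` comes from this PR; the exact signature, assumed here
to take a `seastar::abort_source*`, should be checked against
`raft/server.hh`):

#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "raft/server.hh"

// Ask the Raft server to create a snapshot at its currently applied state
// and truncate the log (this PR keeps no trailing entries).
seastar::future<> snapshot_now(raft::server& srv, seastar::abort_source& as) {
    co_await srv.trigger_snapshot(&as);   // signature assumed, see lead-in
}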

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7),
but a problem remains with these existing deployments coming from 5.2:
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and an HTTP endpoint that uses it.

In a follow-up PR, we plan to modify the group 0 server startup logic to
call this API automatically if it sees that no snapshot is present yet
(to fix the aforementioned 5.2 deployments once they upgrade).

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
… from Kamil Braun

The persisted snapshot index may be 0 if the snapshot was created in an
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade procedure, on a cluster with existing schema. So a deployment
with an index-0 snapshot is broken and we need to fix it. We can use the
new `raft::server::trigger_snapshot` API for that.
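
A sketch of the repair-on-restart idea (function and parameter names are
illustrative, not the actual `raft_group0` startup code; the
`trigger_snapshot` signature is assumed as above):

#include <cstdint>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "raft/server.hh"

// If the group 0 snapshot loaded from disk sits at index 0 (a deployment
// that upgraded to Raft on 5.2), ask the server to create a fresh snapshot
// so that future joining nodes receive a snapshot transfer.
seastar::future<> maybe_fix_legacy_snapshot(raft::server& group0_server,
                                            uint64_t persisted_snapshot_idx,
                                            seastar::abort_source& as) {
    if (persisted_snapshot_idx == 0) {
        co_await group0_server.trigger_snapshot(&as);
    }
    co_return;
}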

Also add a test.

Fixes scylladb#16683

Closes scylladb#17072

* github.com:scylladb/scylladb:
  test: add test for fixing a broken group 0 snapshot
  raft_group0: trigger snapshot if existing snapshot index is 0

(cherry picked from commit 181f68f)

Backport note: test_raft_fix_broken_snapshot had to be removed because
the "error injections enabled at startup" feature does not yet exist in
5.2.
@kbr-scylla (Contributor Author) commented Feb 2, 2024

new version:

@scylladb-promoter (Contributor)

Add workaround for scylladb/python-driver#295.

Also, an assert made at the end of the test was false; it is fixed, with
an appropriate comment added.

(cherry picked from commit 74bf60a)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.

To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.

Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.

But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` also holds the latest snapshot
ID. And since 1640f83, storing that ID and truncating the log in
`system.raft` happen atomically.

Closes scylladb#17106

(cherry picked from commit c911bf1)
@scylladb-promoter (Contributor)

@mykaul added the Field-Tier1 (Urgent issues as per FieldEngineering request) and area/raft labels Feb 4, 2024
@scylladb-promoter scylladb-promoter merged commit 9291eaf into scylladb:branch-5.2 Feb 7, 2024
3 checks passed