[Backport 5.2] Raft snapshot fixes #17087
Conversation
We create a snapshot (config-only, but still), but do not assign it any id. Because of that, it is not loaded on start. We do want it to be loaded, though, since the state of group 0 will not be re-created from the log on restart: the entries will have an outdated id and will be skipped. As a result, the in-memory state machine state will not be restored. This is not a problem now, since schema state is restored outside of Raft code.

Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070)
When we upgrade a cluster to use Raft, or perform the manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In 91f609d we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering a snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for a cluster which bootstraps with Raft enabled right away, but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that.

Extend the Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - the node would never join, waiting for schema).

Fixes: scylladb#14066
Closes scylladb#14336
(cherry picked from commit ff386e7)

Backport note: contrary to the claims above, it turns out that it is actually necessary to create snapshots in clusters which bootstrap with Raft, because tombstones in the current schema state expire, hence applying schema mutations from old Raft log entries is not really idempotent. Snapshot transfer, which transfers group 0 history and state_ids, prevents old entries from applying schema mutations over the latest schema state.
Ref: scylladb#16683
CI state
```diff
@@ -303,6 +304,9 @@ future<> raft_sys_table_storage::execute_with_linearization_point(std::function<

 future<> raft_sys_table_storage::bootstrap(raft::configuration initial_configuation, bool nontrivial_snapshot) {
     auto init_index = nontrivial_snapshot ? raft::index_t{1} : raft::index_t{0};
+    utils::get_local_injector().inject("raft_sys_table_storage::bootstrap/init_index_0", [&init_index] {
```
Do we need to backport this if it is not used?
No, I'll remove
Damn. This is #12972, and this backport will cause more snapshot transfers, triggering the issue more often. Looks like we're not ignoring the error in 5.2 dtests :( (we only do it in SCT).

But from that perspective, it's an artificial failure - it doesn't really mean the functionality is broken.
The problem is that Scylla is closing connections after learning topology, and this may happen in the middle of a snapshot transfer. This is not a problem for Raft, which retries the snapshot transfer indefinitely until it succeeds, so everything works. But there is an ERROR in the logs due to that single failed transfer, and the dtest infrastructure considers any ERRORs failures unless they are explicitly ignored by the test.

There was a fix that put some order into bootstrap so that the connection closing happens before we start snapshot transfers: #14354. But that fix was pretty complex and depended on other previous refactors, so we didn't backport it. We have two options here:
Actually I think I'll do the latter in this PR; the process will be smoother this way.
In case the snapshot update fails, we don't truncate the commit log.

Fixes scylladb#9603
Closes scylladb#15540
(cherry picked from commit 1640f83)
When a node joins the cluster, it closes connections after learning topology information from other nodes, in order to reopen them with correct encryption, compression, etc. In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot transfer. This was fixed in later versions by putting some order into the bootstrap process with 50e8ec7, but the fix was not backported due to its many prerequisites and complexity.

Raft automatically recovers from an interrupted snapshot transfer by retrying it eventually, so everything works. However, an ERROR is reported due to that one failed snapshot transfer, and dtests don't like ERRORs -- they report the test case as failed if an ERROR happened in any node's logs, even if the test passed otherwise. Here we apply a simple workaround to please dtests -- in this particular scenario, turn the ERROR into a WARN.
I prepared this workaround:

```diff
diff --git a/raft/server.cc b/raft/server.cc
index 48e15cf948..39a850c2c0 100644
--- a/raft/server.cc
+++ b/raft/server.cc
@@ -1064,6 +1064,18 @@ future<> server_impl::io_fiber(index_t last_stable) {
     co_return;
 }

+static bool is_closed_error(std::exception_ptr ep) {
+    try {
+        std::rethrow_exception(ep);
+    } catch (const seastar::rpc::remote_verb_error& e) {
+        return std::string_view{e.what()} == "connection is closed";
+    } catch (const seastar::rpc::closed_error&) {
+        return true;
+    } catch (...) {
+        return false;
+    }
+}
+
 void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
     seastar::abort_source as;
     uint64_t id = _next_snapshot_transfer_id++;
@@ -1079,7 +1091,11 @@ void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
         _snapshot_transfers.erase(dst);
         auto reply = raft::snapshot_reply{.current_term = _fsm->get_current_term(), .success = false};
         if (f.failed()) {
-            logger.error("[{}] Transferring snapshot to {} failed with: {}", _id, dst, f.get_exception());
+            auto ep = f.get_exception();
+            // Report our or remote's closed_error as WARNs instead of ERRORs.
+            // Workaround for scylladb/scylladb#12972 for ScyllaDB 5.2.
+            auto level = is_closed_error(ep) ? log_level::warn : log_level::error;
+            logger.log(level, "[{}] Transferring snapshot to {} failed with: {}", _id, dst, ep);
         } else {
             logger.trace("[{}] Transferred snapshot to {}", _id, dst);
             reply = f.get();
```

and tested manually that it works. I'll also need to include #17106 in the backport, so we need to get that into
This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify the number of trailing entries left, if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems: e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which then leads to inconsistent schema state and triggers assertion failures, as observed in scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7), but a problem remains with existing deployments coming from 5.2: we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables).

The PR adds the API to `raft::server` and an HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade).
Closes scylladb#16816

* github.com:scylladb/scylladb:
  * raft: remove `empty()` from `fsm_output`
  * test: add test for manual triggering of Raft snapshots
  * api: add HTTP endpoint to trigger Raft snapshots
  * raft: server: add `trigger_snapshot` API
  * raft: server: track last persisted snapshot descriptor index
  * raft: server: framework for handling server requests
  * raft: server: inline `poll_fsm_output`
  * raft: server: fix indentation
  * raft: server: move `io_fiber`'s processing of `batch` to a separate function
  * raft: move `poll_output()` from `fsm` to `server`
  * raft: move `_sm_events` from `fsm` to `server`
  * raft: fsm: remove constructor used only in tests
  * raft: fsm: move trace message from `poll_output` to `has_output`
  * raft: fsm: extract `has_output()`
  * raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  * raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport notes:
- `has_output()` has a smaller condition in the backported version (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
… from Kamil Braun

The persisted snapshot index may be 0 if the snapshot was created in an older version of Scylla, which means a snapshot transfer won't be triggered to a bootstrapping node. Commands present in the log may not cover all schema changes --- group 0 might have been created through the upgrade procedure, on a cluster with existing schema. So a deployment with an index=0 snapshot is broken, and we need to fix it. We can use the new `raft::server::trigger_snapshot` API for that. Also add a test.

Fixes scylladb#16683
Closes scylladb#17072

* github.com:scylladb/scylladb:
  * test: add test for fixing a broken group 0 snapshot
  * raft_group0: trigger snapshot if existing snapshot index is 0

(cherry picked from commit 181f68f)

Backport note: test_raft_fix_broken_snapshot had to be removed because the "error injections enabled at startup" feature does not yet exist in 5.2.
Force-pushed from c254ef0 to 7f29213.
new version:
CI state
Add a workaround for scylladb/python-driver#295. Also, an assertion made at the end of the test was false; it is fixed, with an appropriate comment added.

(cherry picked from commit 74bf60a)
At the end of the test, we wait until a restarted node receives a snapshot from the leader, and then verify that the log has been truncated. To check the snapshot, the test used the `system.raft_snapshots` table, while the log is stored in `system.raft`. Unfortunately, the two tables are not updated atomically when Raft persists a snapshot (scylladb#9603). We first update `system.raft_snapshots`, then `system.raft` (see `raft_sys_table_storage::store_snapshot_descriptor`). So after the wait finishes, there's no guarantee the log has been truncated yet -- there's a race between the test's last check and Scylla doing that last delete.

But we can check the snapshot using `system.raft` instead of `system.raft_snapshots`, as `system.raft` has the latest ID. And since 1640f83, storing that ID and truncating the log in `system.raft` happens atomically.

Closes scylladb#17106
(cherry picked from commit c911bf1)
Force-pushed from 7f29213 to 4e257c5.
CI state
Backports required to fix #16683 in 5.2: