Add an API to trigger snapshot in Raft servers #16816

kbr-scylla · 2024-01-17T10:28:00Z

This allows the user of raft::server to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in #16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to raft::server and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

kbr-scylla · 2024-01-17T10:31:59Z

raft/fsm.hh

+        // Always 0 for non-local snapshots.
+        size_t max_trailing_entries;
+
+        // FIXME: include max_trailing_bytes here and in store_snapshot_descriptor


raft/fsm.hh

gleb-cloudius · 2024-01-17T11:20:52Z

raft/server.cc

@@ -1003,10 +1003,10 @@ future<> server_impl::io_fiber(index_t last_stable) {
            }

            if (batch.snp) {
-                auto& [snp, is_local] = *batch.snp;
+                auto& [snp, is_local, max_trailing_entries] = *batch.snp;


assert that max_trailing_entries if !is_local?

raft/server.cc

tzach · 2024-01-17T13:28:34Z

@kbr-scylla to validate: this is not related to snapshot API in /api-doc/storage_service.json, right?
This PR is about Raft log snapshot, while the storage one is sstable snapshot.

kbr-scylla · 2024-01-17T13:30:50Z

@tzach yes

tzach · 2024-01-17T13:36:51Z

It will be good to document why and when one should use the new API.

scylladb-promoter · 2024-01-17T13:54:32Z

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 2 hr 25 min
Builder: i-08149ebcfe29ff6f9 (m5d.12xlarge)

This looks like a minor oversight, in `server_impl::abort` there are multiple calls to `set_exception` on the different promises, only one of them would not receive `*_aborted`.

…shot_descriptor` This parameter says how many entries at most should be left trailing before the snapshot index. There are multiple places where this decision is made: - in `applier_fiber` when the server locally decides to take a snapshot due to log size pressure; this applies to the in-memory log - in `fsm::step` when the server received an `install_snapshot` message from the leader; this also applies to the in-memory log - and in `io_fiber` when calling `store_snapshot_descriptor`; this applies to the on-disk log. The logic of how many entries should be left trailing is calculated twice: - first, in `applier_fiber` or in `fsm::step` when truncating the in-memory log - and then again as the snapshot descriptor is being persisted. The logic is to take `_config.snapshot_trailing` for locally generated snapshots (coming from `applier_fiber`) and `0` for remote snapshots (from `fsm::step`). But there is already an error injection that changes the behavior of `applier_fiber` to leave `0` trailing entries. However, this doesn't affect the following `store_snapshot_descriptor` call which still uses `_config.snapshot_trailing`. So if the server got restarted, the entries which were truncated in-memory would get "revived" from disk. Fortunately, this is test-only code. However in future commits we'd like to change the logic of `applier_fiber` even further. So instead of having a separate calculation of trailing entries inside `io_fiber`, it's better for it to use the number that was already calculated once. This number is passed to `fsm::apply_snapshot` (by `applier_fiber` or `fsm::step`) and can then be received by `io_fiber` from `fsm_output` to use it inside `store_snapshot_descriptor`.

Also use the more efficient coroutine-specific `condition_variable::when` instead of `wait`.

In a later commit we'll move `poll_output` out of `fsm` and it won't have access to internals logged by this message (`_log.stable_idx()`). Besides, having it in `has_output` gives a more detailed trace. In particular we can now see values such as `stable_idx` and `last_idx` from the moment of returning a new fsm output, not only when poll started waiting for it (a lot of time can pass between these two events).

This constructor does not provide persisted commit index. It was only used in tests, so move it there, to the helper `fsm_debug` which inherits from `fsm`. Test cases which used `fsm` directly instead of `fsm_debug` were modified to use `fsm_debug` so they can access the constructor. `fsm_debug` doesn't change the behavior of `fsm`, only adds some helper members. This will be useful in following commits too.

In later commits we will use it to wake up `io_fiber` directly from `raft::server` based on events generated by `raft::server` itself -- not only from events generated by `raft::fsm`. `raft::fsm` still obtains a reference to the condition variable so it can keep signaling it.

`server` was the only user of this function and it can now be implemented using `fsm`'s public interface. In later commits we'll extend the logic of `io_fiber` to also subscribe to other events, triggered by `server` API calls, not only to outputs from `fsm`.

…unction

kbr-scylla · 2024-01-18T17:14:37Z

May be we can change the io_fiber to wait on two different sources of events (fsm generated and server generated),

This is the approach I took in new version.

Basically the _sm_events condition variable and polling logic was moved into raft::server. Then the logic was extended with second source of events.

scylladb-promoter · 2024-01-18T19:45:41Z

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - dtest

Build Details:

Duration: 2 hr 32 min
Builder: spider8.cloudius-systems.com

gleb-cloudius · 2024-01-21T08:38:50Z

raft/fsm.cc

@@ -41,10 +41,6 @@ fsm::fsm(server_id id, term_t current_term, server_id voted_for, log log,
    }
 }

-fsm::fsm(server_id id, term_t current_term, server_id voted_for, log log,


I am not sure there is a merit to this patch. persisting commit index it optional. We happen to do it in Scylla and this is the only non test user, but the functionality is general.

Easier to make a conscious decision "I don't want to restore commit index" by explicitly providing index_t{0} in constructor. Minor nuisance but always better to make someone think for a second than not IMO.

Also fewer constructors to extend when adding new stuff like I do in later commit.

OK, but IMO the problem with requiring explicit index_t{0} is that it is not clear from the API that it is optional. How a user of the API knows it can provide index_t{0} and not the real value and the code will work?

You're pointing to a larger problem here --- the fact that fsm constructor is not documented at all.
I can fill this hole.

Documentation is good. Making signature more descriptive is also good (does not obliterate the need for the documentation of course). Making it std::optional would be more descriptive, but OTH not sure it is worth it.

raft/server.cc

gleb-cloudius · 2024-01-21T09:33:15Z

raft/server.cc

+                    logger.info("[{}] while taking snapshot term={} idx={} id={} due to request,"
+                           " fsm received a later snapshot at idx={}", _id, snp.term, snp.idx, snp.id, _fsm->log_last_snapshot_idx());
+                }
+                _stats.snapshots_taken++;


Can a lot of this be unified with the snapshot taken code above in the same function?

There are a bunch of differences, I don't see a straightforward way. Also it's only 2 "duplications" (but only on the surface) then IMO not worth pursuing this. With 3 or more I would consider further.

kbr-scylla · 2024-01-23T15:42:45Z

new version:

trigger_snapshot takes abort_source pointer instead of reference
trigger_snapshot returns true if it does create a snapshot and false otherwise (used it for logging in the rest endpoint)
logger calls inside trigger_snapshot demoted to debug (also one call was broken, fixed)

Diff: https://github.com/scylladb/scylladb/compare/3977cd16fa52327ae4a90c0ff592db1a0b731b99..e0a3b74b7e0f170abca4c7be5b0fdece7ba0c3a2

Also https://github.com/kbr-scylla/scylladb/commits/auto-snapshot/ contains the follow-up (last two commits), will send PR after this one is merged

Also verified that this + follow-up is backportable to 5.2 (with some pain, but doable) https://github.com/kbr-scylla/scylladb/commits/backport-auto-snapshot/ and manually checked that the snapshot will automatically be created after restart.

This allows the user of `raft::server` to ask it to create a snapshot and truncate the Raft log. In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables).

This uses the `trigger_snapshot()` API added in previous commit on a server running for the given Raft group. It can be used for example in tests or in the context of disaster recovery (ref scylladb#16683).

Nobody remembered to keep this function up to date when adding stuff to `fsm_output`. Turns out that it's not being used by any Raft logic but only in some tests. That use case can now be replaced with `fsm::has_output()` which is also being used by `raft::server` code.

kbr-scylla · 2024-01-23T15:49:00Z

Another push, forgot to demote logs inside applier_fiber: https://github.com/scylladb/scylladb/compare/e0a3b74b7e0f170abca4c7be5b0fdece7ba0c3a2..1824c12975bc83668fdb2626885aedd4a69d8911

scylladb-promoter · 2024-01-23T20:29:49Z

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 4 hr 40 min
Builder: i-0ab25805e8f4c184f (m5ad.8xlarge)

kbr-scylla · 2024-01-24T09:03:47Z

Hi @scylladb/scylla-maint, who would like to review a change inside Raft core?

kbr-scylla · 2024-01-25T09:50:03Z

@tgrabiec please review

kbr-scylla · 2024-01-26T11:51:37Z

@tgrabiec ping please review

kbr-scylla · 2024-01-26T15:31:57Z

@scylladb/scylla-maint we need to merge this and my follow-up (which I didn't send yet but will after this one is merged) ASAP, the problem I'm fixing is already triggering PagerDuty on customer clusters and the only workaround we have is to perform 1000 schema changes.

avikivity · 2024-01-27T15:11:24Z

This allows the user of raft::server to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in #16683.

Shouldn't this be fixed by always sending a snapshot? 5.2 is vulnerable to it when adding a new node. I remember it's worked around since the new node pulls the schema itself, but it seems wrong.

What does "Add an API" mean? an operator has to call it manually?

In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables).

Seems like this applies to group0 too. And should be automatic. The service that uses Raft should configure it with a function to determine whether

The PR adds the API to raft::server and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.)

avikivity · 2024-01-27T15:11:48Z

Ah I see you promise to call it automatically.

kbr-scylla · 2024-01-29T10:00:44Z

What does "Add an API" mean? an operator has to call it manually?

In this PR yes, the raft/server API is used to implement a REST API which is then used in tests and can be called manually, but I'll send a follow-up to call it automatically too.

kbr-scylla · 2024-01-29T10:03:30Z

Shouldn't this be fixed by always sending a snapshot? 5.2 is vulnerable to it when adding a new node. I remember it's worked around since the new node pulls the schema itself, but it seems wrong.

Raft can't send a snapshot if there is none. So the fix is to create one. Yes 5.2 and 2023.1 is vulnerable (and we got paged on Friday because of this) and we need to merge this and the follow-up and backport them ASAP. I'll take care of the backports.

denesb

I have gone through the changes but of course I cannot say that I verified that they are correct. I will rely on Gleb to have done that.

Is there a way for users to cause harm to a cluster by using this new rest API?

gleb-cloudius · 2024-01-29T12:43:21Z

Is there a way for users to cause harm to a cluster by using this new rest API?

User may only create performance problems using this.

denesb · 2024-01-29T13:06:22Z

Ok, queued.

This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) Closes scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call (cherry picked from commit d202d32) Backport notes: - `has_output()` has a smaller condition in the backported version (because the condition was smaller in `poll_output()`) - `process_fsm_output` has a smaller body (because `io_fiber` had a smaller body) in the backported version

This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) Closes scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call (cherry picked from commit d202d32) Backport notes: - `has_output()` has a smaller condition in the backported version (because the condition was smaller in `poll_output()`) - `process_fsm_output` has a smaller body (because `io_fiber` had a smaller body) in the backported version - the HTTP API is only started if `raft_group_registry` is started

This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) Closes scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call (cherry picked from commit d202d32) Backport note: the HTTP API is only started if raft_group_registry is started.

kbr-scylla requested review from tgrabiec and gleb-cloudius January 17, 2024 10:28

kbr-scylla mentioned this pull request Jan 17, 2024

Bootstrap fails due to inconsistent schema after group0 catch up when table was recreated in the past #16683

Closed

kbr-scylla commented Jan 17, 2024

View reviewed changes

gleb-cloudius reviewed Jan 17, 2024

View reviewed changes

raft/fsm.hh Show resolved Hide resolved

gleb-cloudius reviewed Jan 17, 2024

View reviewed changes

raft/server.cc Outdated Show resolved Hide resolved

kbr-scylla added 10 commits January 18, 2024 18:05

raft: server: pass *_aborted to set_exception call

40cd91c

This looks like a minor oversight, in `server_impl::abort` there are multiple calls to `set_exception` on the different promises, only one of them would not receive `*_aborted`.

raft: fsm: extract has_output()

f6d4377

Also use the more efficient coroutine-specific `condition_variable::when` instead of `wait`.

raft: server: move io_fiber's processing of batch to a separate f…

5277809

…unction

raft: server: fix indentation

754a7b5

raft: server: inline poll_fsm_output

8d9b0a6

kbr-scylla force-pushed the trigger-snapshot branch from 120b236 to 29198cf Compare January 18, 2024 17:12