Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an API to trigger snapshot in Raft servers #16816

Merged
merged 16 commits into from
Jan 30, 2024

Conversation

kbr-scylla
Copy link
Contributor

This allows the user of raft::server to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in #16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to raft::server and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

// Always 0 for non-local snapshots.
size_t max_trailing_entries;

// FIXME: include max_trailing_bytes here and in store_snapshot_descriptor
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raft/server.cc Outdated
@@ -1003,10 +1003,10 @@ future<> server_impl::io_fiber(index_t last_stable) {
}

if (batch.snp) {
auto& [snp, is_local] = *batch.snp;
auto& [snp, is_local, max_trailing_entries] = *batch.snp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that max_trailing_entries if !is_local?

raft/server.cc Outdated Show resolved Hide resolved
@tzach
Copy link
Contributor

tzach commented Jan 17, 2024

@kbr-scylla to validate: this is not related to snapshot API in /api-doc/storage_service.json, right?
This PR is about Raft log snapshot, while the storage one is sstable snapshot.

@kbr-scylla
Copy link
Contributor Author

@tzach yes

@tzach
Copy link
Contributor

tzach commented Jan 17, 2024

It will be good to document why and when one should use the new API.

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 25 min
  • Builder: i-08149ebcfe29ff6f9 (m5d.12xlarge)

This looks like a minor oversight, in `server_impl::abort` there are
multiple calls to `set_exception` on the different promises, only one of
them would not receive `*_aborted`.
…shot_descriptor`

This parameter says how many entries at most should be left trailing
before the snapshot index. There are multiple places where this
decision is made:
- in `applier_fiber` when the server locally decides to take a snapshot
  due to log size pressure; this applies to the in-memory log
- in `fsm::step` when the server received an `install_snapshot` message
  from the leader; this also applies to the in-memory log
- and in `io_fiber` when calling `store_snapshot_descriptor`; this
  applies to the on-disk log.

The logic of how many entries should be left trailing is calculated
twice:
- first, in `applier_fiber` or in `fsm::step` when truncating the
  in-memory log
- and then again as the snapshot descriptor is being persisted.

The logic is to take `_config.snapshot_trailing` for locally generated
snapshots (coming from `applier_fiber`) and `0` for remote snapshots
(from `fsm::step`).

But there is already an error injection that changes the behavior of
`applier_fiber` to leave `0` trailing entries. However, this doesn't
affect the following `store_snapshot_descriptor` call which still uses
`_config.snapshot_trailing`. So if the server got restarted, the entries
which were truncated in-memory would get "revived" from disk.
Fortunately, this is test-only code.

However in future commits we'd like to change the logic of
`applier_fiber` even further. So instead of having a separate
calculation of trailing entries inside `io_fiber`, it's better for it to
use the number that was already calculated once. This number is passed to
`fsm::apply_snapshot` (by `applier_fiber` or `fsm::step`) and can then
be received by `io_fiber` from `fsm_output` to use it inside
`store_snapshot_descriptor`.
Also use the more efficient coroutine-specific
`condition_variable::when` instead of `wait`.
In a later commit we'll move `poll_output` out of `fsm` and it won't
have access to internals logged by this message (`_log.stable_idx()`).

Besides, having it in `has_output` gives a more detailed trace. In
particular we can now see values such as `stable_idx` and `last_idx`
from the moment of returning a new fsm output, not only when poll
started waiting for it (a lot of time can pass between these two
events).
This constructor does not provide persisted commit index. It was only
used in tests, so move it there, to the helper `fsm_debug` which
inherits from `fsm`.

Test cases which used `fsm` directly instead of `fsm_debug` were
modified to use `fsm_debug` so they can access the constructor.
`fsm_debug` doesn't change the behavior of `fsm`, only adds some helper
members. This will be useful in following commits too.
In later commits we will use it to wake up `io_fiber` directly from
`raft::server` based on events generated by `raft::server` itself -- not
only from events generated by `raft::fsm`.

`raft::fsm` still obtains a reference to the condition variable so it
can keep signaling it.
`server` was the only user of this function and it can now be
implemented using `fsm`'s public interface.

In later commits we'll extend the logic of `io_fiber` to also subscribe
to other events, triggered by `server` API calls, not only to outputs
from `fsm`.
@kbr-scylla
Copy link
Contributor Author

May be we can change the io_fiber to wait on two different sources of events (fsm generated and server generated),

This is the approach I took in new version.

Basically the _sm_events condition variable and polling logic was moved into raft::server. Then the logic was extended with second source of events.

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - dtest

Build Details:

  • Duration: 2 hr 32 min
  • Builder: spider8.cloudius-systems.com

@@ -41,10 +41,6 @@ fsm::fsm(server_id id, term_t current_term, server_id voted_for, log log,
}
}

fsm::fsm(server_id id, term_t current_term, server_id voted_for, log log,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure there is a merit to this patch. persisting commit index it optional. We happen to do it in Scylla and this is the only non test user, but the functionality is general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to make a conscious decision "I don't want to restore commit index" by explicitly providing index_t{0} in constructor. Minor nuisance but always better to make someone think for a second than not IMO.

Also fewer constructors to extend when adding new stuff like I do in later commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but IMO the problem with requiring explicit index_t{0} is that it is not clear from the API that it is optional. How a user of the API knows it can provide index_t{0} and not the real value and the code will work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're pointing to a larger problem here --- the fact that fsm constructor is not documented at all.
I can fill this hole.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is good. Making signature more descriptive is also good (does not obliterate the need for the documentation of course). Making it std::optional would be more descriptive, but OTH not sure it is worth it.

raft/server.cc Outdated Show resolved Hide resolved
raft/server.cc Outdated Show resolved Hide resolved
raft/server.cc Outdated Show resolved Hide resolved
raft/server.cc Outdated Show resolved Hide resolved
logger.info("[{}] while taking snapshot term={} idx={} id={} due to request,"
" fsm received a later snapshot at idx={}", _id, snp.term, snp.idx, snp.id, _fsm->log_last_snapshot_idx());
}
_stats.snapshots_taken++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a lot of this be unified with the snapshot taken code above in the same function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a bunch of differences, I don't see a straightforward way. Also it's only 2 "duplications" (but only on the surface) then IMO not worth pursuing this. With 3 or more I would consider further.

@kbr-scylla kbr-scylla force-pushed the trigger-snapshot branch 2 times, most recently from 8ba435f to e0a3b74 Compare January 23, 2024 15:39
@kbr-scylla
Copy link
Contributor Author

new version:

  • trigger_snapshot takes abort_source pointer instead of reference
  • trigger_snapshot returns true if it does create a snapshot and false otherwise (used it for logging in the rest endpoint)
  • logger calls inside trigger_snapshot demoted to debug (also one call was broken, fixed)

Diff: https://github.com/scylladb/scylladb/compare/3977cd16fa52327ae4a90c0ff592db1a0b731b99..e0a3b74b7e0f170abca4c7be5b0fdece7ba0c3a2

Also https://github.com/kbr-scylla/scylladb/commits/auto-snapshot/ contains the follow-up (last two commits), will send PR after this one is merged

Also verified that this + follow-up is backportable to 5.2 (with some pain, but doable) https://github.com/kbr-scylla/scylladb/commits/backport-auto-snapshot/ and manually checked that the snapshot will automatically be created after restart.

This allows the user of `raft::server` to ask it to create a snapshot
and truncate the Raft log. In a later commit we'll add a REST endpoint
to Scylla to trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
This uses the `trigger_snapshot()` API added in previous commit on a
server running for the given Raft group.

It can be used for example in tests or in the context of disaster
recovery (ref scylladb#16683).
Nobody remembered to keep this function up to date when adding stuff to
`fsm_output`.

Turns out that it's not being used by any Raft logic but only in some
tests. That use case can now be replaced with `fsm::has_output()` which
is also being used by `raft::server` code.
@kbr-scylla
Copy link
Contributor Author

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 4 hr 40 min
  • Builder: i-0ab25805e8f4c184f (m5ad.8xlarge)

@kbr-scylla
Copy link
Contributor Author

Hi @scylladb/scylla-maint, who would like to review a change inside Raft core?

@kbr-scylla
Copy link
Contributor Author

@tgrabiec please review

@kbr-scylla
Copy link
Contributor Author

@tgrabiec ping please review

@kbr-scylla
Copy link
Contributor Author

@scylladb/scylla-maint we need to merge this and my follow-up (which I didn't send yet but will after this one is merged) ASAP, the problem I'm fixing is already triggering PagerDuty on customer clusters and the only workaround we have is to perform 1000 schema changes.

@avikivity
Copy link
Member

This allows the user of raft::server to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in #16683.

Shouldn't this be fixed by always sending a snapshot? 5.2 is vulnerable to it when adding a new node. I remember it's worked around since the new node pulls the schema itself, but it seems wrong.

What does "Add an API" mean? an operator has to call it manually?

In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables).

Seems like this applies to group0 too. And should be automatic. The service that uses Raft should configure it with a function to determine whether

The PR adds the API to raft::server and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.)

@avikivity
Copy link
Member

Ah I see you promise to call it automatically.

@kbr-scylla
Copy link
Contributor Author

What does "Add an API" mean? an operator has to call it manually?

In this PR yes, the raft/server API is used to implement a REST API which is then used in tests and can be called manually, but I'll send a follow-up to call it automatically too.

@kbr-scylla
Copy link
Contributor Author

Shouldn't this be fixed by always sending a snapshot? 5.2 is vulnerable to it when adding a new node. I remember it's worked around since the new node pulls the schema itself, but it seems wrong.

Raft can't send a snapshot if there is none. So the fix is to create one. Yes 5.2 and 2023.1 is vulnerable (and we got paged on Friday because of this) and we need to merge this and the follow-up and backport them ASAP. I'll take care of the backports.

Copy link
Contributor

@denesb denesb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone through the changes but of course I cannot say that I verified that they are correct. I will rely on Gleb to have done that.

Is there a way for users to cause harm to a cluster by using this new rest API?

@gleb-cloudius
Copy link
Contributor

gleb-cloudius commented Jan 29, 2024

Is there a way for users to cause harm to a cluster by using this new rest API?

User may only create performance problems using this.

@denesb
Copy link
Contributor

denesb commented Jan 29, 2024

Ok, queued.

@scylladb-promoter scylladb-promoter merged commit d202d32 into scylladb:master Jan 30, 2024
4 checks passed
kbr-scylla pushed a commit to kbr-scylla/scylladb that referenced this pull request Jan 31, 2024
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
kbr-scylla pushed a commit to kbr-scylla/scylladb that referenced this pull request Jan 31, 2024
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
kbr-scylla pushed a commit to kbr-scylla/scylladb that referenced this pull request Feb 2, 2024
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
kbr-scylla pushed a commit to kbr-scylla/scylladb that referenced this pull request Feb 2, 2024
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport note: the HTTP API is only started if raft_group_registry is
started.
kbr-scylla pushed a commit to kbr-scylla/scylladb that referenced this pull request Feb 2, 2024
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32)

Backport note: the HTTP API is only started if raft_group_registry is
started.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants