raft topology: send cdc generation data in parts #13962
Conversation
@kbr-scylla please review |
CI state |
Please also rebase the PR on |
Force-pushed from 5ac7640 to e3d341d
v2:
Force-pushed from 3c91bc8 to d2fd8ca
v3:
CI state |
Force-pushed from d2fd8ca to 4c65a94
Force-pushed from 4c65a94 to 498e090
v4:
v5:
CI state |
@gleb-cloudius @piodul please review |
What happens if one command succeeds and the next fails? Do we end up in an inconsistent state? |
Force-pushed from 498e090 to 8620d19
Force-pushed from 8620d19 to 5697dc1
Force-pushed from 4bfe367 to 744599f
@scylladb/scylla-maint please review and trigger CI for this pull request |
v11:
CI state |
Looks like the second node booted by the test hung due to topology coordinator errors:
Please fix the order of commits for the reasons explained by Piotr here: #13962 (comment), and please investigate the failure.
Provide useful descriptions for `write_mutations` and `broadcast_tables_query` that are stored in `system.group0_history`. Reduces the scope of issue scylladb#13370.
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size` (half of the commitlog segment size), so with some additional information it can exceed this threshold, resulting in an exception being thrown when the mutation is written to the commitlog. A batch of raft commands has a size of at most `group0_state_machine::merger::max_command_size` (half of the commitlog segment size). It doesn't carry additional metadata, but it may have a size of exactly `max_mutation_size`. This shouldn't cause any trouble, but it is preferable to be careful. Make `raft_sys_table_storage::_max_mutation_size` and `group0_state_machine::merger::max_command_size` stricter to leave space for metadata. Fixed typo "1204" => "1024".
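As a rough illustration of the sizing described above (the names and the concrete reserve value below are hypothetical, not the actual Scylla constants), the idea is simply to subtract a metadata margin from the half-segment limit:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Hypothetical numbers for illustration only; Scylla derives these from config.
    constexpr std::size_t commitlog_segment_size = 32 * 1024 * 1024; // e.g. 32 MB (1024, not 1204)
    constexpr std::size_t half_segment = commitlog_segment_size / 2;
    constexpr std::size_t metadata_reserve = 10 * 1024;              // assumed margin for metadata

    // Stricter limits so a mutation/command plus its metadata stays below the threshold.
    constexpr std::size_t max_mutation_size = half_segment - metadata_reserve;
    constexpr std::size_t max_command_size  = half_segment - metadata_reserve;

    std::printf("max_mutation_size=%zu max_command_size=%zu\n", max_mutation_size, max_command_size);
}
```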
…_size` `get_cdc_generation_mutations` splits data into mutations of maximal size `mutation_size_threshold`. Before this commit it was hardcoded to 2 MB. Calculate `mutation_size_threshold` so that it leaves space for CDC generation data and does not exceed `max_command_size`.
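A minimal sketch of the splitting idea (the types and function name here are invented; the real `get_cdc_generation_mutations` operates on CDC generation rows and schema mutations):

```cpp
#include <cstddef>
#include <vector>

struct row      { std::size_t size_bytes; };
struct mutation { std::vector<row> rows; std::size_t size_bytes = 0; };

// Greedily pack rows into mutations so that no mutation exceeds the threshold
// (a single oversized row still gets its own mutation).
std::vector<mutation> split_into_mutations(const std::vector<row>& rows,
                                           std::size_t mutation_size_threshold) {
    std::vector<mutation> out(1);
    for (const auto& r : rows) {
        if (!out.back().rows.empty() &&
            out.back().size_bytes + r.size_bytes > mutation_size_threshold) {
            out.emplace_back();
        }
        out.back().rows.push_back(r);
        out.back().size_bytes += r.size_bytes;
    }
    return out;
}
```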
…ion_mutations snapshot Topology snapshots contain only the mutation of the current CDC generation data, but don't contain any previous or future generations. If a new generation of data is being broadcast but hasn't been entirely applied yet, the applied part won't be sent in a snapshot. In this scenario, new or delayed nodes can never get the applied part. Send the entire cdc_generations_v3 table in the snapshot to resolve this problem. As a follow-up, a mechanism to remove old CDC generations will be introduced.
Force-pushed from 744599f to d3eafbf
v12:
… from all merged commands If `group0_state_machine` applies all commands individually (without batching), the resulting current `state_id` -- which will be compared with the `prev_state_id` of the next command if it is a guarded command -- equals the maximum of the `next_state_id` of all commands applied up to this point. That's because the current `state_id` is obtained from the history table by taking the row with the largest clustering key.

When `group0_state_machine::apply` is called with a batch of commands, the current `state_id` is loaded from `system.group0_history` into `merger::last_group0_state_id` only once. When a command is merged, its `next_state_id` overwrites `last_group0_state_id`, regardless of their order.

Consider the following situation. The leader sends two unguarded `write_mutations` commands concurrently, with timeuuids T1 and T2, where T1 < T2. The leader waits for them to be applied and then sends a guarded `topology_change` with `prev_state_id` equal to T2. Suppose that the command with timeuuid T2 is committed first, and that these commands are small enough that both `write_mutations` could be merged into one command. A follower can get all three of these commands before its `fsm` polls them. In this situation, `group0_state_machine::apply` is called with all three of them and `merger` will merge both `write_mutations` into one command. After that, `merger::last_group0_state_id` will be equal to T1 (this command was committed as the second one). When it processes the `topology_change` command, it will compare its `prev_state_id` with `merger::last_group0_state_id`, turning this command into a no-op (which wouldn't happen if the commands were applied individually).

Such a scenario leads to inconsistent results: one replica applies `topology_change`, but another makes it a no-op.
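The fix implied above could look roughly like this (names are approximate and a plain integer stands in for a timeuuid): when merging, keep the maximum observed `next_state_id` instead of unconditionally overwriting.

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in for a timeuuid: assume it orders by time, so T1 < T2.
using state_id_type = std::uint64_t;

struct merger {
    // Loaded once from system.group0_history when the batch starts.
    state_id_type last_group0_state_id;

    void merge_command(state_id_type next_state_id) {
        // Taking the max reproduces what applying the commands one by one would
        // yield; plain assignment is what caused the T1/T2 divergence above.
        last_group0_state_id = std::max(last_group0_state_id, next_state_id);
    }
};
```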
This function takes the guard and calls its destructor. It's used to avoid calling the destructor explicitly.
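A minimal sketch of what such a helper might look like (hypothetical definition): by taking the guard by value, its destructor runs when the function returns, so callers never invoke `~group0_guard()` directly.

```cpp
struct group0_guard {
    ~group0_guard() { /* releases whatever the guard holds */ }
};

// Destroys `guard` at the end of this scope; nothing else to do.
void release_guard(group0_guard guard) {
    (void)guard;
}
```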
Force-pushed from a61cd5d to 6f3a00d
Broadcasts all mutations returned from `prepare_new_cdc_generation_data` except the last one. Each mutation is sent in a separate raft command. The function takes a `group0_guard`; if the number of mutations is greater than one, the guard is dropped and a new one is created and returned, otherwise the old one is returned. Commands are sent in parallel and unguarded (the guard used for sending the last mutation guarantees that the term hasn't changed). Returns the generation's UUID, the guard, and the last mutation, which will be sent with additional topology data by the caller.

If we sent the last mutation in a `write_mutations` command, we would use a total of `n + 1` commands instead of `n-1 + 1` (where `n` is the number of mutations), so it's better to send it in `topology_change` (we need to send it after all `write_mutations`) together with some small metadata.

With the default commitlog segment size, `mutation_size_threshold` will be 4 MB. In large clusters (e.g. 100 nodes, 64 shards per node, 256 vnodes), CDC generation data can reach a size of 30 MB, so there will be no more than 8 commands. In a multi-DC cluster with 100 ms latencies between DCs, this operation should take about 200 ms since we send the commands concurrently. Even if the commands were replicated sequentially by Raft, it should take no more than 1.6 s, which is negligible compared to the bootstrap operation (bootstrapping is quick if there is no data in the cluster, but a cluster of 100 nodes usually holds tons of data, so streaming/repair will take much longer: hours or days). Fixes FIXME in PR scylladb#13683.
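A simplified, synchronous sketch of the flow described above (the real code is asynchronous and uses different types; every name here is illustrative, and the mutation list is assumed non-empty):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct mutation {};
struct group0_guard {};

// Stubs standing in for the real machinery.
void send_unguarded_write_mutations(const mutation&) { /* one raft command per mutation */ }
group0_guard take_new_guard() { return {}; }

// Broadcast all mutations except the last; return the (possibly re-taken) guard
// and the last mutation, which the caller sends inside the guarded topology_change.
std::pair<group0_guard, mutation>
broadcast_all_but_last(std::vector<mutation> mutations, group0_guard guard) {
    for (std::size_t i = 0; i + 1 < mutations.size(); ++i) {
        send_unguarded_write_mutations(mutations[i]); // concurrent in the real code
    }
    if (mutations.size() > 1) {
        // The old guard may no longer be valid after sending; take a fresh one.
        guard = take_new_guard();
    }
    return {std::move(guard), std::move(mutations.back())};
}
```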
…_unknown` exception in topology coordinator loop When the topology_coordinator fiber gets `raft::commit_status_unknown`, it prints an error. This exception is not an error in this case; it can be thrown when the leader has changed. It can happen in `add_entry_unguarded` while sending a part of the CDC generation data in the `write_mutations` command. Catch this exception in `topology_coordinator::run` and print a warning.
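Roughly, the change amounts to catching the exception in the coordinator loop and downgrading it to a warning. A self-contained sketch with stand-in types (the real code lives in `topology_coordinator::run` and uses Scylla's logger, not `std::cerr`):

```cpp
#include <exception>
#include <iostream>

namespace raft { struct commit_status_unknown : std::exception {}; } // stand-in

// One iteration of the coordinator loop; may throw while sending commands.
void run_one_iteration() { throw raft::commit_status_unknown{}; }

int main() {
    try {
        run_one_iteration();
    } catch (const raft::commit_status_unknown&) {
        // Expected when the leader changes mid-flight: warn and continue, don't fail.
        std::cerr << "raft topology: commit status unknown, retrying\n";
    }
}
```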
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size` is limited to less than 1 MB. It adds an injection which copies the mutations generated by `get_cdc_generation_mutations` n times, where n is picked so that the memory size of all mutations exceeds `max_command_size`. This test passes if the CDC generation data is committed by raft in multiple commands. If all the data is committed in a single command, the leader node will loop trying to send the raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
Force-pushed from 6f3a00d to 4e3c97d
CI state |
The CDC generation data can be large and not fit in a single command. This PR splits it into multiple mutations by smartly picking a `mutation_size_threshold` and sending each mutation as a separate group 0 command. Commands are sent sequentially to avoid concurrency problems.
Topology snapshots contain only the mutation of the current CDC generation data, but don't contain any previous or future generations. If a new generation of data is being broadcast but hasn't been entirely applied yet, the applied part won't be sent in a snapshot. New or delayed nodes can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this problem.
A mechanism to remove old CDC generations will be introduced as a follow-up.