
Raft - failed to transfer snapshot due to too-large mutation #13864

Closed
1 of 2 tasks
soyacz opened this issue May 12, 2023 · 76 comments
Labels: area/raft, P1 Urgent, status/regression, tests/longevity (Issue detected during longevity)

@soyacz
Contributor

soyacz commented May 12, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

In this test, we create 1 node, create 5000 tables, and then add another node to the cluster.
As soon as the additional node boots, we see the following errors (with some context):

May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] gossip - InetAddress 10.4.2.1 is now UP, status = UNKNOWN
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] raft_group_registry - marking Raft server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 as alive for raft groups
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] raft - [0316ba6c-7727-4468-815b-e02259fcf0dd] Transferring snapshot to ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 failed with: seastar::rpc::remote_verb_error (connection is closed)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
[...]
May 11 17:16:06 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:07 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] raft - [0316ba6c-7727-4468-815b-e02259fcf0dd] Transferring snapshot to ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 failed with: seastar::rpc::remote_verb_error (Mutation of 28395492 bytes is too large for the maximum size of 16777216)

Meanwhile, the added node fails to pull the schema:

May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - setup_group0: joining group 0...
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 found no local group 0. Discovering...
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 found group 0 with group id ab396540-efc3-11ed-b012-8f3cc913200d, leader 0316ba6c-7727-4468-815b-e02259fcf0dd
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - Server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 is starting group 0 with id ab396540-efc3-11ed-b012-8f3cc913200d
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard 10] compaction - [Compact system.local 8136ad50-f01f-11ed-8d60-2e89c5521e4e] Compacted 2 sstables to [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/mc-100-big-Data.db:level=0]. 28kB to 15kB (~53% of original) in 17ms = 903kB/s. ~256 total partitions merged to 1.
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 joined group 0 with group id ab396540-efc3-11ed-b012-8f3cc913200d
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - setup_group0: successfully joined group 0.
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - setup_group0: ensuring that the cluster has fully upgraded to use Raft...
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] storage_service - Set host_id=0316ba6c-7727-4468-815b-e02259fcf0dd to be owned by node=10.4.1.11
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] gossip - InetAddress 10.4.1.11 is now UP, status = NORMAL
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Requesting schema pull from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Pulling schema from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Fail to pull schema from 10.4.1.11: seastar::rpc::closed_error (connection is closed)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Requesting schema pull from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Pulling schema from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla-jmx[5792]: Picked up JAVA_TOOL_OPTIONS:
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Stopping Scylla JMX...
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: scylla-jmx.service: Deactivated successfully.
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Stopped Scylla JMX.
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Started Scylla JMX.
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla-jmx[5802]: Picked up JAVA_TOOL_OPTIONS:
May 11 17:16:06 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla-jmx[5802]: Using config file: /etc/scylla/scylla.yaml
May 11 17:16:07 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Fail to pull schema from 10.4.1.11: std::invalid_argument (Mutation of 28395492 bytes is too large for the maximum size of 16777216)
May 11 17:16:07 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Started Scylla Housekeeping restart mode.

This is the applied schema (multiplied 5000 times):
https://github.com/scylladb/scylla-qa-internal/blob/master/cust_d/templated_tables_mv.yaml

Impact

Failed to add new nodes.

How frequently does it reproduce?

Hard to say; this is the first occurrence. Previously we ran this test with 2023.1.0~rc1-20230208.fe3cc281ec73 and didn't face this issue (some details here: #12972).

Installation details

Kernel Version: 5.15.0-1035-aws
Scylla version (or git commit hash): 2023.1.0~rc5-20230429.a47bcb26e42e with build-id d2644a8364f13d14d25be6b9d3c69f84612192bd

Cluster size: 1 node (i3.8xlarge)

Scylla Nodes used in this run:

  • longevity-5000-tables-2023-1-db-node-944683ea-2 (52.16.116.17 | 10.4.2.1) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-944683ea-1 (34.249.171.184 | 10.4.1.11) (shards: 30)

OS / Image: ami-05e7801837cea47d9 (aws: eu-west-1)

Test: scale-5000-tables-test
Test id: 944683ea-6f39-4248-9317-9a2ff15f5713
Test name: enterprise-2023.1/scale/scale-5000-tables-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 944683ea-6f39-4248-9317-9a2ff15f5713
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 944683ea-6f39-4248-9317-9a2ff15f5713

Logs:

Jenkins job URL

@DoronArazii

@kostja please have a look

@kostja
Contributor

kostja commented May 15, 2023

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

@gleb-cloudius
Contributor

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

@mykaul
Contributor

mykaul commented May 15, 2023

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

I thought so too, @avikivity wasn't happy with that approach.

@gleb-cloudius
Contributor

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

I thought so too, @avikivity wasn't happy with that approach.

Isn't he? I do not see how this is a regression, since Raft uses the same schema pull code as the non-Raft case. But how long is he willing to postpone topology/tablet work in order to fix this non-regression?

@kbr-scylla
Contributor

We should understand what that too-large mutation is.

Are we trying to commit such a large mutation to the Raft log? I doubt it. Each table is created by a separate mutation in a separate Raft command, so even though there are 5000 tables in the cluster, we won't (shouldn't) create a large Raft command because of it.

So it must be related somehow to the snapshot pulling code. Perhaps if the tables are living in the same keyspace, the whole description of all 5000 tables is represented by a single mutation, and that mutation is indeed large?

But even then, if I recall correctly, this error message is coming from the commitlog (?) when we're trying to put a too-large mutation in the commitlog. In that case, why do we even need to involve the commitlog during schema pulls? Are we actually putting the entire thing we pull into the commitlog? If that's the case, the solution should be easy - split the mutation into smaller ones before storing it.

@mykaul
Contributor

mykaul commented May 15, 2023

@kbr-scylla - do you understand which limit, or is it both that are problematic?
This:
Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
and/or
Mutation of 28395492 bytes is too large for the maximum size of 16777216)

@gleb-cloudius
Contributor

There was a looong discussion about all this here https://github.com/scylladb/scylla-enterprise/issues/2435 where it happened without any raft. There is no point repeating it here.

@gleb-cloudius
Contributor

gleb-cloudius commented May 15, 2023

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

I thought so too, @avikivity wasn't happy with that approach.

And I have it documented that, while he may not be happy, he considers it a solution :) :
https://github.com/scylladb/scylla-enterprise/issues/2435#issuecomment-1274523327
https://github.com/scylladb/scylla-enterprise/issues/2435#issuecomment-1330897308

@kbr-scylla
Contributor

@kbr-scylla - do you understand which limit, or is it both that are problematic? This: Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit) and/or Mutation of 28395492 bytes is too large for the maximum size of 16777216)

The second limit is based on the commitlog segment size limit (IIRC it's the segment limit times some constant), and it's causing the failure here.

    , commitlog_segment_size_in_mb(this, "commitlog_segment_size_in_mb", value_status::Used, 64,
        "Sets the size of the individual commitlog file segments. A commitlog segment may be archived, deleted, or recycled after all its data has been flushed to SSTables. This amount of data can potentially include commitlog segments from every table in the system. The default size is usually suitable for most commitlog archiving, but if you want a finer granularity, 8 or 16 MB is reasonable. See Commit log archive configuration.\n"
        "Related information: Commit log archive configuration")

We hit it when we try to store the mutation in the commitlog.

The first limit appears when we query the table. It's just a soft limit and won't cause a failure, only a warning, but there's also a corresponding hard limit, and it could also cause a failure if we reach it.

    , max_memory_for_unlimited_query_soft_limit(this, "max_memory_for_unlimited_query_soft_limit", liveness::LiveUpdate, value_status::Used, uint64_t(1) << 20,
            "Maximum amount of memory a query, whose memory consumption is not naturally limited, is allowed to consume, e.g. non-paged and reverse queries. "
            "This is the soft limit, there will be a warning logged for queries violating this limit.")
    , max_memory_for_unlimited_query_hard_limit(this, "max_memory_for_unlimited_query_hard_limit", "max_memory_for_unlimited_query", liveness::LiveUpdate, value_status::Used, (uint64_t(100) << 20),
            "Maximum amount of memory a query, whose memory consumption is not naturally limited, is allowed to consume, e.g. non-paged and reverse queries. "
            "This is the hard limit, queries violating this limit will be aborted.")

@gleb-cloudius
Contributor

Do we enforce the hard query limit for internal queries?

@kbr-scylla
Contributor

I haven't checked thoroughly but I think yes, it goes through the same database::query/query_mutations path which uses get_unlimited_query_max_result_size. The fact that we got the soft limit warning is evidence.

@gleb-cloudius
Contributor

I haven't checked thoroughly but I think yes, it goes through the same database::query/query_mutations path which uses get_unlimited_query_max_result_size. The fact that we got the soft limit warning is evidence.

We may warn, but not enforce. IIRC there was such an idea, but I am not sure it was ever implemented.

@mykaul
Contributor

mykaul commented May 15, 2023

So I see 3(?) issues here (and thanks Gleb for pointing out the relevant previous discussion):

  1. Commit log segment size - I think it's fair and reasonable to increase the size. I'm somewhat worried it may not be tested (@roydahan ?)
  2. query limit - may be benign? What if there's a hard limit?
  3. Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

@kbr-scylla
Contributor

Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

I guess we do need to persist the mutations, but not necessarily in a single segment.

It's possible to split a large mutation into smaller ones across clustering key boundaries, for example. E.g. if there's a single mutation describing all 5000 tables, we can split it into 5000 mutations describing a single table each.

We do a lot of mutation splitting in CDC code so it's doable.

This won't solve the query limit problem. But that's also solvable, we could do a paged query.

I guess the question is whether all that is worth it.

@gleb-cloudius
Contributor

Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

I guess we do need to persist the mutations, but not necessarily in a single segment.

They are very intentionally stored in a single segment. In fact a lot of effort was spent on it. Why?
https://github.com/scylladb/scylla-enterprise/issues/2435#issuecomment-1276474157

Everything was already discussed previously.

@kostja
Contributor

kostja commented May 15, 2023

Increasing the limit is a viable workaround if few people stumble over it rarely.

@gleb-cloudius no need to repeat the discussion. We should proceed to implementing one of the solutions discussed. The fact that more people stumble over this changes the impact of the problem from medium to high, and this impacts the priority with which we should proceed to implementing the solution.

@gleb-cloudius
Contributor

Our internal testing is not more people. And IMO the limit should be increased on a case-by-case basis. So QA should re-run with a larger limit.

@gleb-cloudius
Contributor

And the previous discussion did not lead to any meaningful conclusion about the resolution, except the agreement that the workaround is good enough.

@roydahan

This internal test is based on a schema of a customer we have.

@gleb-cloudius
Contributor

The linked issue also mentions an existing customer that applied the workaround. Maybe even the same one.

@mykaul
Contributor

mykaul commented May 15, 2023

This internal test is based on a schema of a customer we have.

Then let's increase the commit log segment for this specific case and retry.
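
Assuming the limit being hit is the one derived from the data commitlog segment size (as discussed above), the retry would only need a one-line scylla.yaml override on the nodes; 128 MB is an illustrative value that comfortably exceeds the ~28 MB schema mutation seen in this run:

    # scylla.yaml - illustrative per-test workaround
    commitlog_segment_size_in_mb: 128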

@gleb-cloudius
Contributor

Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

I guess we do need to persist the mutations, but not necessarily in a single segment.

They are very intentionally stored in a single segment. In fact a lot of effort was spent on it. Why? scylladb/scylla-enterprise#2435 (comment)

If we make raft_barrier mandatory on boot, we may not need to store the schema pull in the commitlog.

@roydahan

This internal test is based on a schema of a customer we have.

Then let's increase the commit log segment for this specific case and retry.

And how will users know what to set and when?
If we have documentation on how and when to do it, I'll be happy to follow.

@kbr-scylla
Contributor

We may consider backporting e6099c4 to 5.2/2023.1

kbr-scylla added a commit that referenced this issue Aug 2, 2023
…Patryk Jędrzejczak

Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that the segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`.

Additionally, we do not derive the commitlog sync period for the schema commitlog anymore, because the schema commitlog runs in batch mode, so it doesn't need this parameter. This has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable

(cherry picked from commit e6099c4)
kbr-scylla pushed a commit that referenced this issue Aug 2, 2023
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.

(cherry picked from commit 5b167a4)
@kbr-scylla
Contributor

I backported 4cd5847 to 5.2 (so it will eventually land in 2023.1 as well), which allows configuring the schema commitlog segment size separately.
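
With that change, the expectation (sketched below, assuming defaults elsewhere) is that the schema commitlog limit can be raised on its own without touching the data commitlog; 128 MB matches the default chosen in the commit above:

    # scylla.yaml - schema commitlog segment size, separate from the data commitlog
    schema_commitlog_segment_size_in_mb: 128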

@kbr-scylla
Contributor

Reproducer based on master branch: kbr-scylla@ce905a7

I don't think there's anything else interesting to do with this issue, closing.

@kostja
Contributor

kostja commented Aug 2, 2023

I think we're mixing two problems here. One is the atomicity of multi-entry schema updates. The key outstanding issue here is #9603. However, Tomek's commit doesn't fix it, because these two statements are still executed as independent updates, and each can land in its own segment.

Perhaps it enables fixing this problem, but doesn't immediately fix it.

Another issue is the atomicity of snapshot transfer, which doesn't need the entire snapshot data to be written to the commit log as a single mutation at all - we can write every mutation of the snapshot to the commit log separately, after all, if any such write fails we will not update the aforementioned snapshot descriptor anyway.

So I think for the purposes of the snapshot transfer it is actually fine to use individual writes to the commit log.

To summarize, I believe your conclusions, @kbr-scylla, and the direction you took with this issue are incorrect.

@kostja kostja reopened this Aug 2, 2023
@gleb-cloudius
Contributor

See #2805

But is the atomic commitlog write still needed with schema over Raft? If Raft snapshot application fails in the middle, it will be retried on reboot (well, we do not require a Raft barrier on reboot now, but we will eventually, and we can still add a persistent flag barrier_ob_boot that is set before snapshot application and cleared after).

@kbr-scylla
Contributor

One is the atomicity of multi-entry schema updates. The key outstanding issue here is #9603. However, Tomek's commit doesn't fix it, because these two statements are still executed as independent updates, and each can land in its own segment.

#9603 has nothing to do with schema update atomicity.

Another issue is the atomicity of snapshot transfer, which doesn't need the entire snapshot data to be written to the commit log as a single mutation at all - we can write every mutation of the snapshot to the commit log separately, after all, if any such write fails we will not update the aforementioned snapshot descriptor anyway.

What does the snapshot descriptor have to do with it? If you write a schema mutation, it immediately becomes observable for the next boot regardless of whether you update snapshot descriptor or not. If you have a batch of schema mutations and you successfully write only some of them, you will observe broken schema state.

@kbr-scylla
Contributor

But does the atomic commitlog write is still needed with schema over raft? If raft snapshot application fails in the middle it will be re-tried on reboot (well we do not require raft barrier on reboot now, but we will eventually and we still can add a persistent flag barrier_ob_boot that will be set before snapshot application and cleared after).

We replay committed entries on boot from the last snapshot descriptor, barrier or not. This should solve this problem but it doesn't because of our "broken" implementation of snapshot transfer which modifies the state too early (it should be modified in load_snapshot, transfer_snapshot should only save the data pulled from the other node). So even when we replay the entries on reboot we may still be in some broken half-applied state because of a snapshot pull that failed in the middle.

@kbr-scylla
Contributor

So even when we replay the entries on reboot we may still be in some broken half-applied state because of a snapshot pull that failed in the middle.

(If we don't use atomic commitlog updates.)

@gleb-cloudius
Contributor

So even when we replay the entries on reboot we may still be in some broken half-applied state because of a snapshot pull that failed in the middle.

(If we don't use atomic commitlog updates.)

Well, yes. If the snapshot transfer fails and the node reboots, Raft thinks that it is on the previous snapshot version, but in practice it is in some inconsistent state. Maybe we need to modify the snapshot transfer to be correct.

@kostja
Contributor

kostja commented Aug 3, 2023

Raft snapshot transfer is not atomic after this patch anyway: it consists of at least 4 independent commit log writes, and a failure can happen in between each of them:

  • schema transfer
  • topology state transfer
  • writing the snapshot descriptor to disk (2 writes currently)

There is a broader issue of restarting in an inconsistent state. There is no point in patching one fragment of this problem.

@kostja
Contributor

kostja commented Aug 3, 2023

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state. The raft snapshot itself contains information about user schema and cluster topology. Its latest state doesn't impact what we do at boot, before we catch up with group0. So, the harm of starting in the partial state is imaginary.

@gleb-cloudius
Contributor

Raft snapshot transfer is not atomic after this patch anyway: it consists of at least 4 independent commit log writes, and a failure can happen in between each of them:

* schema transfer

* topology state transfer

* writing the snapshot descriptor to disk (2 writes currently)

There is a broader issue of restarting in an inconsistent state. There is no point in patching one fragment of this problem.

There is no requirement for a snapshot transfer to be atomic in Raft. The problem is that we mix transfer with application.

@gleb-cloudius
Contributor

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0.

This is not the case today, but it is planned eventually. But then we should also make sure we do not replay the regular commit log before this as well.

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

@kostja
Contributor

kostja commented Aug 3, 2023

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

@kbr-scylla
Contributor

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state.

What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries.

@kbr-scylla
Contributor

What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries.

For example, we just debugged #14944 with @piodul

Turns out the problem happens because on restart, we observe a partially applied topology_change command!

@gleb-cloudius
Contributor

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

@gleb-cloudius
Contributor

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state.

What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries.

On local schema, yes, but what depends on non-system schema during boot?

@kostja
Contributor

kostja commented Aug 3, 2023

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic.

@kostja kostja closed this as completed Aug 3, 2023
@kostja
Contributor

kostja commented Aug 3, 2023

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic.

Moreover, voting in such case would be fine - thanks to quorum guarantees, the majority will not vote for an outdated leader.

@gleb-cloudius
Contributor

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc?

This is not hard to imagine. If a cluster has no leader at the time the outdated node rejoins, it will get a vote request without getting any entries.

@gleb-cloudius
Contributor

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic.

Moreover, voting in such case would be fine - thanks to quorum guarantees, the majority will not vote for an outdated leader.

That is what "we will not be voted as a leader" means above, yes.
