
Raft - failed to transfer snapshot due to too-large mutation #13864

Closed
1 of 2 tasks
soyacz opened this issue May 12, 2023 · 76 comments
Labels: area/raft, P1 Urgent, status/regression, tests/longevity (Issue detected during longevity)

@soyacz
Contributor

soyacz commented May 12, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

In this test, we create 1 node, create 5000 tables, and then add another node to the cluster.
As soon as the additional node boots, we see the following errors (with some context):

May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] gossip - InetAddress 10.4.2.1 is now UP, status = UNKNOWN
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] raft_group_registry - marking Raft server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 as alive for raft groups
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] raft - [0316ba6c-7727-4468-815b-e02259fcf0dd] Transferring snapshot to ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 failed with: seastar::rpc::remote_verb_error (connection is closed)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
[...]
May 11 17:16:06 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
May 11 17:16:07 longevity-5000-tables-2023-1-db-node-944683ea-1 scylla[5704]:  [shard  0] raft - [0316ba6c-7727-4468-815b-e02259fcf0dd] Transferring snapshot to ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 failed with: seastar::rpc::remote_verb_error (Mutation of 28395492 bytes is too large for the maximum size of 16777216)

Meanwhile, the added node fails to pull the schema:

May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - setup_group0: joining group 0...
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 found no local group 0. Discovering...
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 found group 0 with group id ab396540-efc3-11ed-b012-8f3cc913200d, leader 0316ba6c-7727-4468-815b-e02259fcf0dd
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - Server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 is starting group 0 with id ab396540-efc3-11ed-b012-8f3cc913200d
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard 10] compaction - [Compact system.local 8136ad50-f01f-11ed-8d60-2e89c5521e4e] Compacted 2 sstables to [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/mc-100-big-Data.db:level=0]. 28kB to 15kB (~53% of original) in 17ms = 903kB/s. ~256 total partitions merged to 1.
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - server ef7a485b-b1ea-4e5b-82bc-2ab43fbe5be5 joined group 0 with group id ab396540-efc3-11ed-b012-8f3cc913200d
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - setup_group0: successfully joined group 0.
May 11 17:16:02 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] raft_group0 - setup_group0: ensuring that the cluster has fully upgraded to use Raft...
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] storage_service - Set host_id=0316ba6c-7727-4468-815b-e02259fcf0dd to be owned by node=10.4.1.11
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] gossip - InetAddress 10.4.1.11 is now UP, status = NORMAL
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Requesting schema pull from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Pulling schema from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Fail to pull schema from 10.4.1.11: seastar::rpc::closed_error (connection is closed)
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Requesting schema pull from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Pulling schema from 10.4.1.11:0
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla-jmx[5792]: Picked up JAVA_TOOL_OPTIONS:
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Stopping Scylla JMX...
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: scylla-jmx.service: Deactivated successfully.
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Stopped Scylla JMX.
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Started Scylla JMX.
May 11 17:16:03 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla-jmx[5802]: Picked up JAVA_TOOL_OPTIONS:
May 11 17:16:06 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla-jmx[5802]: Using config file: /etc/scylla/scylla.yaml
May 11 17:16:07 longevity-5000-tables-2023-1-db-node-944683ea-2 scylla[5691]:  [shard  0] migration_manager - Fail to pull schema from 10.4.1.11: std::invalid_argument (Mutation of 28395492 bytes is too large for the maximum size of 16777216)
May 11 17:16:07 longevity-5000-tables-2023-1-db-node-944683ea-2 systemd[1]: Started Scylla Housekeeping restart mode.

This is the applied schema (multiplied 5000 times):
https://github.com/scylladb/scylla-qa-internal/blob/master/cust_d/templated_tables_mv.yaml

Impact

Failed to add new nodes.

How frequently does it reproduce?

Hard to say; this is the first occurrence. Previously we ran this test with 2023.1.0~rc1-20230208.fe3cc281ec73 and didn't face this issue (some details here: #12972).

Installation details

Kernel Version: 5.15.0-1035-aws
Scylla version (or git commit hash): 2023.1.0~rc5-20230429.a47bcb26e42e with build-id d2644a8364f13d14d25be6b9d3c69f84612192bd

Cluster size: 1 node (i3.8xlarge)

Scylla Nodes used in this run:

  • longevity-5000-tables-2023-1-db-node-944683ea-2 (52.16.116.17 | 10.4.2.1) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-944683ea-1 (34.249.171.184 | 10.4.1.11) (shards: 30)

OS / Image: ami-05e7801837cea47d9 (aws: eu-west-1)

Test: scale-5000-tables-test
Test id: 944683ea-6f39-4248-9317-9a2ff15f5713
Test name: enterprise-2023.1/scale/scale-5000-tables-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 944683ea-6f39-4248-9317-9a2ff15f5713
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 944683ea-6f39-4248-9317-9a2ff15f5713

Logs:

Jenkins job URL

@DoronArazii

@kostja please have a look

@kostja
Contributor

kostja commented May 15, 2023

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

@gleb-cloudius
Contributor

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

@mykaul
Contributor

mykaul commented May 15, 2023

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

I thought so too, @avikivity wasn't happy with that approach.

@gleb-cloudius
Contributor

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

I thought so too, @avikivity wasn't happy with that approach.

Isn't he? I do not see how this is a regression, since Raft uses the same schema pull code as the non-Raft case. But how long is he willing to postpone topology/tablet work in order to fix this non-regression?

@kbr-scylla
Contributor

We should understand what that too-large mutation is.

Are we trying to commit such a large mutation to the Raft log? I doubt it. Each table is created by a separate mutation in a separate Raft command, so even though there are 5000 tables in the cluster, we won't (shouldn't) create a large Raft command because of it.

So it must be related somehow to the snapshot pulling code. Perhaps if the tables are living in the same keyspace, the whole description of all 5000 tables is represented by a single mutation, and that mutation is indeed large?

But even then, if I recall correctly, this error message is coming from the commitlog (?) when we're trying to put a too-large mutation in the commitlog. In that case, why do we even need to involve the commitlog during schema pulls? Are we actually putting the entire thing we pull into the commitlog? If that's the case, the solution should be easy - split the mutation into smaller ones before storing it.

@mykaul
Contributor

mykaul commented May 15, 2023

@kbr-scylla - do you understand which limit, or is it both that are problematic?
This:
Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
and/or
Mutation of 28395492 bytes is too large for the maximum size of 16777216)

@gleb-cloudius
Contributor

There was a looong discussion about all this here https://github.com/scylladb/scylla-enterprise/issues/2435 where it happened without any raft. There is no point repeating it here.

@gleb-cloudius
Contributor

gleb-cloudius commented May 15, 2023

@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently.

They should increase the limit. Problem solved.

I thought so too, @avikivity wasn't happy with that approach.

And I have it documented that, while he may not be happy, he considers it a solution :) :
https://github.com/scylladb/scylla-enterprise/issues/2435#issuecomment-1274523327
https://github.com/scylladb/scylla-enterprise/issues/2435#issuecomment-1330897308

@kbr-scylla
Contributor

@kbr-scylla - do you understand which limit, or is it both that are problematic? This: Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit) and/or Mutation of 28395492 bytes is too large for the maximum size of 16777216)

The second limit is based on the commitlog segment size limit (IIRC it's the segment limit times some constant), and it's causing the failure here.

    , commitlog_segment_size_in_mb(this, "commitlog_segment_size_in_mb", value_status::Used, 64,
        "Sets the size of the individual commitlog file segments. A commitlog segment may be archived, deleted, or recycled after all its data has been flushed to SSTables. This amount of data can potentially include commitlog segments from every table in the system. The default size is usually suitable for most commitlog archiving, but if you want a finer granularity, 8 or 16 MB is reasonable. See Commit log archive configuration.\n"
        "Related information: Commit log archive configuration")

We hit it when we try to store the mutation in the commitlog.

The first limit appears when we query the table. It's just a soft limit and won't cause a failure, only a warning, but there's also a corresponding hard limit, and it could also cause a failure if we reach it.

    , max_memory_for_unlimited_query_soft_limit(this, "max_memory_for_unlimited_query_soft_limit", liveness::LiveUpdate, value_status::Used, uint64_t(1) << 20,
            "Maximum amount of memory a query, whose memory consumption is not naturally limited, is allowed to consume, e.g. non-paged and reverse queries. "
            "This is the soft limit, there will be a warning logged for queries violating this limit.")
    , max_memory_for_unlimited_query_hard_limit(this, "max_memory_for_unlimited_query_hard_limit", "max_memory_for_unlimited_query", liveness::LiveUpdate, value_status::Used, (uint64_t(100) << 20),
            "Maximum amount of memory a query, whose memory consumption is not naturally limited, is allowed to consume, e.g. non-paged and reverse queries. "
            "This is the hard limit, queries violating this limit will be aborted.")

@gleb-cloudius
Contributor

Do we enforce the hard query limit for internal queries?

@kbr-scylla
Contributor

I haven't checked thoroughly but I think yes, it goes through the same database::query/query_mutations path which uses get_unlimited_query_max_result_size. The fact that we got the soft limit warning is evidence.

@gleb-cloudius
Contributor

I haven't checked thoroughly but I think yes, it goes through the same database::query/query_mutations path which uses get_unlimited_query_max_result_size. The fact that we got the soft limit warning is evidence.

We may warn, but not enforce. IIRC there was such an idea, but I am not sure it was ever implemented.

@mykaul
Contributor

mykaul commented May 15, 2023

So I see 3(?) issues here (and thanks Gleb for pointing out the relevant previous discussion):

  1. Commit log segment size - I think it's fair and reasonable to increase the size. I'm somewhat worried it may not be tested (@roydahan ?)
  2. query limit - may be benign? What if there's a hard limit?
  3. Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

@kbr-scylla
Contributor

Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

I guess we do need to persist the mutations, but not necessarily in a single segment.

It's possible to split a large mutation into smaller ones across clustering key boundaries, for example. E.g. if there's a single mutation describing all 5000 tables, we can split it into 5000 mutations describing a single table each.

We do a lot of mutation splitting in CDC code so it's doable.

This won't solve the query limit problem. But that's also solvable, we could do a paged query.

I guess the question is whether all that is worth it.

@gleb-cloudius
Contributor

Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

I guess we do need to persist the mutations, but not necessarily in a single segment.

They are very intentionally stored in a single segment. In fact a lot of effort was spent on it. Why?
https://github.com/scylladb/scylla-enterprise/issues/2435#issuecomment-1276474157

Everything was already discussed previously.

@kostja
Contributor

kostja commented May 15, 2023

Increasing the limit is a viable workaround if few people stumble over it rarely.

@gleb-cloudius no need to repeat the discussion. We should proceed to implementing one of the solutions discussed. The fact that more people stumble over this changes the impact of the problem from medium to high, and this impacts the priority with which we should proceed to implementing the solution.

@gleb-cloudius
Contributor

Our internal testing is not more people. And IMO the limit should be increased on a case-by-case basis. So QA should re-run with a larger limit.

@gleb-cloudius
Contributor

And the previous discussion did not lead to any meaningful conclusion about the resolution, except the agreement that the workaround is good enough.

@roydahan

This internal test is based on a schema of a customer we have.

@gleb-cloudius
Contributor

The linked issue also mentions an existing customer that applied the workaround. Maybe even the same one.

@mykaul
Contributor

mykaul commented May 15, 2023

This internal test is based on a schema of a customer we have.

Then let's increase the commit log segment for this specific case and retry.
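
Assuming the limit being hit is the one derived from the data commitlog segment size (as discussed above), the retry would only need a one-line scylla.yaml override on the nodes; 128 MB is an illustrative value that comfortably exceeds the ~28 MB schema mutation seen in this run:

    # scylla.yaml - illustrative per-test workaround
    commitlog_segment_size_in_mb: 128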

@gleb-cloudius
Contributor

Should snapshot transfer go through the commitlog? - I'm unsure I understand the conclusion here.

I guess we do need to persist the mutations, but not necessarily in a single segment.

They are very intentionally stored in a single segment. In fact a lot of effort was spent on it. Why? scylladb/scylla-enterprise#2435 (comment)

If we make raft_barrier mandatory on boot, we may not need to store the schema pull in the commitlog.

@roydahan

This internal test is based on a schema of a customer we have.

Then let's increase the commit log segment for this specific case and retry.

And how will users know what to set and when?
If we have documentation on how and when to do it, I'll be happy to follow.

@kbr-scylla
Contributor

We may consider backporting e6099c4 to 5.2/2023.1

kbr-scylla added a commit that referenced this issue Aug 2, 2023
…Patryk Jędrzejczak

Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that the segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`.

Additionally, we do not derive the commitlog sync period for the schema commitlog anymore, because the schema commitlog runs in batch mode, so it doesn't need this parameter. This has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable

(cherry picked from commit e6099c4)
kbr-scylla pushed a commit that referenced this issue Aug 2, 2023
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.

(cherry picked from commit 5b167a4)
@kbr-scylla
Contributor

I backported 4cd5847 to 5.2 (so it will eventually land in 2023.1 as well), which allows configuring the schema commitlog segment size separately.
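
With that change, the expectation (sketched below, assuming defaults elsewhere) is that the schema commitlog limit can be raised on its own without touching the data commitlog; 128 MB matches the default chosen in the commit above:

    # scylla.yaml - schema commitlog segment size, separate from the data commitlog
    schema_commitlog_segment_size_in_mb: 128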

@kbr-scylla
Contributor

Reproducer based on master branch: kbr-scylla@ce905a7

I don't think there's anything else interesting to do with this issue, closing.

@kostja
Contributor

kostja commented Aug 2, 2023

I think we're mixing two problems here. One is the atomicity of multi-entry schema updates. The key outstanding issue here is #9603. However, Tomek's commit doesn't fix it, because these two statements are still executed as independent updates, and each can land in its own segment.

Perhaps it enables fixing this problem, but doesn't immediately fix it.

Another issue is the atomicity of snapshot transfer, which doesn't need the entire snapshot data to be written to the commit log as a single mutation at all - we can write every mutation of the snapshot to the commit log separately, after all, if any such write fails we will not update the aforementioned snapshot descriptor anyway.

So I think for the purposes of the snapshot transfer it is actually fine to use individual writes to the commit log.

To summarize, I believe your conclusions, @kbr-scylla, and the direction you took with this issue are incorrect.

@kostja kostja reopened this Aug 2, 2023
@gleb-cloudius
Contributor

See #2805

But is the atomic commitlog write still needed with schema over Raft? If Raft snapshot application fails in the middle, it will be retried on reboot (well, we do not require a Raft barrier on reboot now, but we will eventually, and we can still add a persistent flag barrier_ob_boot that is set before snapshot application and cleared after).

@kbr-scylla
Contributor

One is the atomicity of multi-entry schema updates. The key outstanding issue here is #9603. However, Tomek's commit doesn't fix it, because these two statements are still executed as independent updates, and each can land in its own segment.

#9603 has nothing to do with schema update atomicity.

Another issue is the atomicity of snapshot transfer, which doesn't need the entire snapshot data to be written to the commit log as a single mutation at all - we can write every mutation of the snapshot to the commit log separately, after all, if any such write fails we will not update the aforementioned snapshot descriptor anyway.

What does the snapshot descriptor have to do with it? If you write a schema mutation, it immediately becomes observable for the next boot regardless of whether you update snapshot descriptor or not. If you have a batch of schema mutations and you successfully write only some of them, you will observe broken schema state.

@kbr-scylla
Contributor

But does the atomic commitlog write is still needed with schema over raft? If raft snapshot application fails in the middle it will be re-tried on reboot (well we do not require raft barrier on reboot now, but we will eventually and we still can add a persistent flag barrier_ob_boot that will be set before snapshot application and cleared after).

We replay committed entries on boot from the last snapshot descriptor, barrier or not. This should solve this problem but it doesn't because of our "broken" implementation of snapshot transfer which modifies the state too early (it should be modified in load_snapshot, transfer_snapshot should only save the data pulled from the other node). So even when we replay the entries on reboot we may still be in some broken half-applied state because of a snapshot pull that failed in the middle.

@kbr-scylla
Contributor

So even when we replay the entries on reboot we may still be in some broken half-applied state because of a snapshot pull that failed in the middle.

(If we don't use atomic commitlog updates.)

@gleb-cloudius
Contributor

So even when we replay the entries on reboot we may still be in some broken half-applied state because of a snapshot pull that failed in the middle.

(If we don't use atomic commitlog updates.)

Well, yes. If the snapshot transfer fails and the node reboots, Raft thinks that it is on the previous snapshot version, but in practice it is in some inconsistent state. Maybe we need to modify the snapshot transfer to be correct.

@kostja
Contributor

kostja commented Aug 3, 2023

Raft snapshot transfer is not atomic after this patch anyway: it consists of at least 4 independent commit log writes, and a failure can happen in between each of them:

  • schema transfer
  • topology state transfer
  • writing the snapshot descriptor to disk (2 writes currently)

There is a broader issue of restarting in an inconsistent state. There is no point in patching one fragment of this problem.

@kostja
Contributor

kostja commented Aug 3, 2023

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state. The raft snapshot itself contains information about user schema and cluster topology. Its latest state doesn't impact what we do at boot, before we catch up with group0. So, the harm of starting in the partial state is imaginary.

@gleb-cloudius
Contributor

Raft snapshot transfer is not atomic after this patch anyway: it consists of at least 4 independent commit log writes, and a failure can happen in between each of them:

* schema transfer

* topology state transfer

* writing the snapshot descriptor to disk (2 writes currently)

There is a broader issue of restarting in an inconsistent state. There is no point in patching one fragment of this problem.

There is no requirement for a snapshot transfer to be atomic in Raft. The problem is that we mix transfer with application.

@gleb-cloudius
Contributor

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0.

This is not the case today, but it is planned eventually. But then we should also make sure we do not replay the regular commit log before this as well.

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

@kostja
Contributor

kostja commented Aug 3, 2023

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

@kbr-scylla
Contributor

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state.

What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries.

@kbr-scylla
Contributor

What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries.

For example, we just debugged #14944 with @piodul

Turns out the problem happens because on restart, we observe a partially applied topology_change command!

@gleb-cloudius
Contributor

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

@gleb-cloudius
Contributor

The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state.

What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries.

On local schema, yes, but what depends on non-system schema during boot?

@kostja
Contributor

kostja commented Aug 3, 2023

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic.

@kostja kostja closed this as completed Aug 3, 2023
@kostja
Contributor

kostja commented Aug 3, 2023

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic.

Moreover, voting in such case would be fine - thanks to quorum guarantees, the majority will not vote for an outdated leader.

@gleb-cloudius
Contributor

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc?

This is not hard to imagine. If a cluster has no leader at the time the outdated node rejoins, it will get a vote request without getting any entries.

@gleb-cloudius
Contributor

We're not full members of raft group0 before that either.

Why? A snapshot transfer does not mean a node is bootstrapping.

In order to vote, we need to catch up with the log.

No we do not. But we will not be voted as a leader.

How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic.

Moreover, voting in such case would be fine - thanks to quorum guarantees, the majority will not vote for an outdated leader.

That is what "we will not be voted as a leader" means above, yes.
