Raft - failed transfer snapshot due too large mutation #13864
Comments
@kostja please have a look |
@gleb-cloudius @kbr-scylla I don't think we will be able to avoid solving this issue. it pops up too frequently. |
They should increase the limit. Problem solved. |
I thought so too, @avikivity wasn't happy with that approach. |
Isn't he? I do not see how this is a regression, since raft uses the same schema pull code as the non-raft case. But how long is he willing to push back topology/tablet work in order to fix this non-regression? |
We should understand what is that too large mutation. Are we trying to commit such a large mutation to the Raft log? I doubt it. Each table is created by a separate mutation in a separate Raft command, so even though there are 5000 tables in the cluster, we won't (shouldn't) create a large Raft command because of it. So it must be related somehow to the snapshot pulling code. Perhaps if the tables are living in the same keyspace, the whole description of all 5000 tables is represented by a single mutation, and that mutation is indeed large? But even then, if I recall correctly, this error message is coming from the commitlog (?) when we're trying to put a too-large mutation in the commitlog. In that case, why do we even need to involve the commitlog during schema pulls? Are we actually putting the entire thing we pull into the commitlog? If that's the case, the solution should be easy - split the mutation into smaller ones before storing it. |
@kbr-scylla - do you understand which limit, or is it both that are problematic? |
There was a looong discussion about all this here https://github.com/scylladb/scylla-enterprise/issues/2435 where it happened without any raft. There is no point repeating it here. |
And I have it documented that while he may not be happy, he considers it a solution :) : |
The second limit is based on the commitlog segment size limit (IIRC it's the segment limit times some constant), and it's causing the failure here; we hit it when we try to store the mutation in the commitlog. The first limit appears when we query the table. It's just a soft limit and won't cause a failure, only a warning, but there's also a corresponding hard limit, and it could also cause a failure if we reach it:

```cpp
, max_memory_for_unlimited_query_soft_limit(this, "max_memory_for_unlimited_query_soft_limit", liveness::LiveUpdate, value_status::Used, uint64_t(1) << 20,
    "Maximum amount of memory a query, whose memory consumption is not naturally limited, is allowed to consume, e.g. non-paged and reverse queries. "
    "This is the soft limit, there will be a warning logged for queries violating this limit.")
, max_memory_for_unlimited_query_hard_limit(this, "max_memory_for_unlimited_query_hard_limit", "max_memory_for_unlimited_query", liveness::LiveUpdate, value_status::Used, (uint64_t(100) << 20),
    "Maximum amount of memory a query, whose memory consumption is not naturally limited, is allowed to consume, e.g. non-paged and reverse queries. "
    "This is the hard limit, queries violating this limit will be aborted.")
```
|
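For context, both options quoted above are declared `liveness::LiveUpdate`, so they can be raised in `scylla.yaml` (or at runtime) when a cluster hits the hard limit. A sketch of what that would look like, with purely illustrative values — whether raising them is the right fix is exactly what this thread debates:

```yaml
# scylla.yaml -- illustrative values only
max_memory_for_unlimited_query_soft_limit: 1048576    # 1 MiB default; exceeding it only logs a warning
max_memory_for_unlimited_query_hard_limit: 209715200  # raised from the 100 MiB default; queries over this abort
```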
Do we enforce the hard query limit for internal queries? |
I haven't checked thoroughly but I think yes, it goes through the same |
We may warn, but not enforce. IIRC there was such an idea, but I am not sure it was ever implemented. |
So I see 3(?) issues here (and thanks Gleb for pointing out the relevant previous discussion):
|
I guess we do need to persist the mutations, but not necessarily in a single segment. It's possible to split a large mutation into smaller ones across clustering key boundaries, for example. E.g. if there's a single mutation describing all 5000 tables, we can split it into 5000 mutations describing a single table each. We do a lot of mutation splitting in CDC code, so it's doable. This won't solve the query limit problem. But that's also solvable, we could do a paged query. I guess the question is whether all that is worth it. |
They are very intentionally stored in a single segment. In fact, a lot of effort was spent on it. Why? Everything was discussed previously already. |
Increasing the limit is a viable workaround if few people stumble over it rarely. @gleb-cloudius no need to repeat the discussion. We should proceed to implementing one of the solutions discussed. The fact that more people stumble over this changes the impact of the problem from medium to high, and this impacts the priority with which we should proceed to implementing the solution. |
Our internal testing is not more people. And IMO the limit should be increased on a case-by-case basis. So QA should re-run with a larger limit. |
And the previous discussion did not lead to any meaningful conclusion about the resolution, except the agreement that the workaround is good enough. |
This internal test is based on a schema of a customer we have. |
The linked issue also mentions an existing customer that applied the workaround. Maybe even the same one. |
Then let's increase the commitlog segment size for this specific case and retry. |
If we make raft_barrier mandatory on boot, we may not need to store the schema pull in the commitlog. |
And how will users know what to set and when? |
We may consider backporting e6099c4 to 5.2/2023.1 |
…Patryk Jędrzejczak Fixes #14668 In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`. Additionally, we no longer derive the commitlog sync period for the schema commitlog, because the schema commitlog runs in batch mode, so it doesn't need this parameter. This has also been discussed in #14668. Closes #14704 * github.com:scylladb/scylladb: replica: do not derive the commitlog sync period for schema commitlog config: set schema_commitlog_segment_size_in_mb to 128 config: add schema_commitlog_segment_size_in_mb variable (cherry picked from commit e6099c4)
In #14668, we have decided to introduce a new scylla.yaml variable for the schema commitlog segment size. The segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. Therefore, increasing the schema commitlog segment size is sometimes necessary. (cherry picked from commit 5b167a4)
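Per the commits above, the schema commitlog segment size became a separately configurable option. A cluster whose schema mutations exceed the default cap could then raise it in `scylla.yaml` along these lines (128 is the default chosen in #14668; a larger value is shown only as an illustration of the workaround):

```yaml
# scylla.yaml -- schema commitlog segment size, in MB (option added by #14704)
schema_commitlog_segment_size_in_mb: 256
```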
I backported 4cd5847 to 5.2 (so it will eventually land in 2023.1 as well) which allows configuring the schema commitlog segment size separately. |
Reproducer based on I don't think there's anything else interesting to do with this issue, closing. |
I think we're mixing two problems here. One is atomicity of multi-entry updates of the schema. The key outstanding issue here is #9603. However, Tomek's commit doesn't fix it, because these two statements are still executed as independent updates, and each can land in its own segment. Perhaps it enables fixing this problem, but doesn't immediately fix it. Another issue is the atomicity of snapshot transfer, which doesn't need the entire snapshot data to be written to the commit log as a single mutation at all - we can write every mutation of the snapshot to the commit log separately; after all, if any such write fails, we will not update the aforementioned snapshot descriptor anyway. So I think for the purposes of the snapshot transfer it is actually fine to use individual writes to the commit log. To summarize, I believe your conclusions @kbr-scylla and the direction you took with this issue are incorrect. |
But is the atomic commitlog write still needed with schema over raft? If raft snapshot application fails in the middle, it will be retried on reboot (well, we do not require a raft barrier on reboot now, but we will eventually, and we can still add a persistent flag
#9603 has nothing to do with schema update atomicity.
What does the snapshot descriptor have to do with it? If you write a schema mutation, it immediately becomes observable for the next boot regardless of whether you update snapshot descriptor or not. If you have a batch of schema mutations and you successfully write only some of them, you will observe broken schema state. |
We replay committed entries on boot from the last snapshot descriptor, barrier or not. This should solve this problem, but it doesn't, because of our "broken" implementation of snapshot transfer, which modifies the state too early (it should be modified in load_snapshot; transfer_snapshot should only save the data pulled from the other node). So even when we replay the entries on reboot, we may still be in some broken half-applied state because of a snapshot pull that failed in the middle. |
(If we don't use atomic commitlog updates.) |
Well, yes. If snapshot transfer fails and the node reboots, raft thinks that it is on the previous snapshot version, but in practice it is in some inconsistent state. Maybe we need to modify snapshot transfer to be correct. |
Raft snapshot transfer is not atomic after this patch anyway: it consists of at least 4 independent commit log writes, and a failure can happen in between each of them:
There is a broad issue of restart in inconsistent state. There is no point in patching one fragment of this problem. |
The issue of a restart in inconsistent state is non-existent. We are not supposed to serve queries until we catch up with group0. We're not full members of raft group0 before that either. So there is nothing we can do to user data or cluster consistency in this state. The raft snapshot itself contains information about user schema and cluster topology. Its latest state doesn't impact what we do at boot, before we catch up with group0. So, the harm of starting in the partial state is imaginary. |
There is no requirement for a snapshot transfer to be atomic in Raft. The problem is that we mix transfer with application. |
This is not the case today, but it is planned eventually. But then we should also make sure we do not replay the regular commitlog before this as well.
Why? A snapshot transfer does not mean a node is bootstrapping. |
In order to vote, we need to catch up with the log. |
What about internal data that we may need to read before we catch up with group 0? Lots of things depend on schema, not only user queries. |
No we do not. But we will not be voted as a leader. |
On local schema, yes, but what depends on non-system schema during boot? |
How do you imagine we get request_vote rpc but not get append entries rpc? I mean, of course it's possible theoretically, or with an asymmetric partitioning, but in practice it presumes reordering of TCP traffic. |
Moreover, voting in such case would be fine - thanks to quorum guarantees, the majority will not vote for an outdated leader. |
This is not hard to imagine. If a cluster has no leader at the time an outdated node rejoins it, the node will get a vote request without getting any entries. |
That's what "we will not be voted as a leader" means above, yes. |
Issue description
In the test, we create 1 node, then create 5000 tables, and then add more nodes to the cluster.
As soon as the additional node boots, we can see errors (with some context):
Meanwhile, the added node fails pulling the schema:
This is the applied schema (multiplied 5k times):
https://github.com/scylladb/scylla-qa-internal/blob/master/cust_d/templated_tables_mv.yaml
Impact
New nodes fail to be added.
How frequently does it reproduce?
Hard to say, this is the first occurrence. Previously we ran this test with
2023.1.0~rc1-20230208.fe3cc281ec73
and didn't face this issue (some details here: #12972)
Installation details
Kernel Version: 5.15.0-1035-aws
Scylla version (or git commit hash):
2023.1.0~rc5-20230429.a47bcb26e42e
with build-id d2644a8364f13d14d25be6b9d3c69f84612192bd
Cluster size: 1 node (i3.8xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-05e7801837cea47d9
(aws: eu-west-1)
Test:
scale-5000-tables-test
Test id:
944683ea-6f39-4248-9317-9a2ff15f5713
Test name:
enterprise-2023.1/scale/scale-5000-tables-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 944683ea-6f39-4248-9317-9a2ff15f5713
$ hydra investigate show-logs 944683ea-6f39-4248-9317-9a2ff15f5713
Logs:
Jenkins job URL