Cassandra-stress failed to create test table due to lack of schema agreement #9906
Comments
Whoever is triaging this, please note it's a test with "slow disks" (using EBS gp3), so the nodes are probably disk-bound. |
Happens also on 4.6. Installation details: Test: Issue description ==================================== run
node 1 log has
cassandra-stress creates keyspace
starting to get
storage_service - fail to update schema_version for
==================================== Restore Monitor Stack command: Test id: Logs: |
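For reference, this kind of schema disagreement can be checked manually by comparing the schema_version each node reports for itself and for its peers. A minimal sketch follows; the node addresses are placeholders, not taken from this run:

```bash
# Compare the local schema version on each node with the versions it sees for its peers.
# When these stay different for long enough, DDL issued by cassandra-stress gives up,
# which matches the failure reported above.
for node in 10.0.1.1 10.0.1.2 10.0.1.3; do
  echo "== $node =="
  cqlsh "$node" -e "SELECT schema_version FROM system.local;"
  cqlsh "$node" -e "SELECT peer, schema_version FROM system.peers;"
done
```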
I realize it is a little late to mention this, but the logs say: The disks are literally full. It is perhaps not strange that it stalls...? |
Quoting bentsi: The out-of-disk-space happens after the schema disagreement issues, so the question stands as to why we have that. The cause of the out-of-disk-space is a different nemesis that inflicts this on purpose, to see how we handle it. In any case, this is a new workload with a slower disk, so we are not blocking 4.6 on this for now |
Happened again on a different run. Installation details: Test:
Issue description ==================================== node1 is reporting a schema disagreement
cassandra-stress
==================================== Restore Monitor Stack command: Test id: Logs: Jenkins job URL |
@eliransin / @psarna sending this your way as well - I am not sure this is not a client issue. I checked the last report on 4.6.
schema only
I did not have time to look, but I think it does converge,
at least according to the logs - not sure if it's a client issue or something else taking too long to converge |
@tgrabiec can you please look at the last output and see if you can explain why the schema is still being pulled |
Had what looks like the same issue happen on
and fails around 10 seconds later
on the scylla side we can see that keyspace599 took around 16 seconds to be created and it was created previously:
after it is created, the schema is successfully propagated:
then when c-s fails, this is what the scylla log contains:
Maybe there's an issue with reporting already-existing keyspaces to the driver? Installation details: Kernel Version: 5.11.0-1022-aws. Scylla running with shards number (live nodes):
OS / Image: Test:
Logs:
|
I looked at db-cluster-f23d21db. The schema converges, but one of the nodes, longevity-tls-1tb-7d-4-6-db-node-f23d21db-3, seems slower than the other nodes. I don't see an excessive number of pulls. It converges eventually. Probably the schema pulls are slow because the disk is overloaded (since it is a test with a slow disk). Probably a duplicate of #1459 @slivne Let me know if I should dig deeper. |
@roydahan - the ebs (slow disks) setup is new - can we do a run on 4.5 and see whether it happens on 4.5 as well (if it's simple - not sure the AMI will be able to support this). @tgrabiec thanks - yes I need more help
|
The same error happened in a 4h-100gb run of 4.6.rc5-0.20220203.5694ec189 with build-id e06d3596d9782c9c4bdc0e06125e5b867dd117a5: Table creation started at
longevity-100gb-4h-4-6-db-node-d05cb422-1 [3.87.140.168 | 10.0.14.238]
longevity-100gb-4h-4-6-db-node-d05cb422-2 [54.172.252.153 | 10.0.13.204]
longevity-100gb-4h-4-6-db-node-d05cb422-3 [3.90.9.134 | 10.0.14.204]
longevity-100gb-4h-4-6-db-node-d05cb422-4 [54.145.171.94 | 10.0.12.9]
longevity-100gb-4h-4-6-db-node-d05cb422-5 (18.212.110.153 | 10.0.14.148)
longevity-100gb-4h-4-6-db-node-d05cb422-6 (3.88.239.241 | 10.0.13.128)
Because of the long time it took to reach schema agreement, the c-s thread failed (but it took 4 hours for it to end). Nodes:
Logs:
|
@ShlomiBalalis What kind of disk was used in d05cb422-5e2e-4510-b85a-3ee529209672? Are the nodes uniform, or is longevity-100gb-4h-4-6-db-node-d05cb422-5 special in some way? Looks like the schema merge was delayed for about a minute on longevity-100gb-4h-4-6-db-node-d05cb422-5: Logs of node 5:
Looking at that shard's metrics (ignore the "commitlog" in the titles, the graphs show the total for all I/O queues): the I/O queues experience high queuing time during the schema merge. The queue (seastar side) grows large (mainly the commitlog and query classes). The queue to the disk is low (0-1 per shard, not shown above; above you can see the aggregate for all shards). There is concurrent user read load ramping up during the schema change. CPU is underutilized. No task quota violations. You can see that the total bandwidth is lower during the schema change (1.4 GiB/s -> 1 GiB/s), and when it grows back to its previous high the queues get drained and progress is made.
Here's one hypothesis. The disk runs close to its capacity during this event. During the schema change (or right before it), something makes the latency of requests higher than expected by the I/O scheduler. This effectively reduces the disk's bandwidth, which causes an overload situation that slows down the schema merge. When bandwidth goes back to normal, the situation heals and the schema change completes. One suspect is the fsyncs issued during the schema merge. \cc @xemul |
The nodes are uniform, I believe the data disks are the default nvme that is attached with |
@mykaul are we planning to chase after this issue in 5.2? |
No, and eventually we'll need to see if it reproduces with Raft. |
The issue was reproduced while testing 5.2 (using raft): Issue description: In this run, the User Cassandra Stress thread (that uses this profile) was DOA because the cluster failed to achieve schema agreement:
Grepping the logs of the nodes, it seems that node 1's schema version was updated nearly instantly, but nodes 2 to 4 took over 10 seconds more to do the same (see the grep sketch after this report):
Node 2:
Node 3:
Node 4:
The run does use raft.
Impact: Major. How frequently does it reproduce? It's the only stress thread in this run that failed with this error.
Installation details: Kernel Version: 5.15.0-1028-aws. Cluster size: 4 nodes (i3.4xlarge). Scylla Nodes used in this run:
OS / Image: Test: Logs and commands
Logs:
|
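The per-node comparison described in the report above ("grepping the logs of the nodes") can be reproduced with a small sketch like the one below; the "Schema version changed" message text and the db-cluster directory layout are assumptions, so adjust them to match the collected logs:

```bash
# Print the last schema-version change each node logged, to spot the lagging node.
# The message text and the db-cluster archive layout are assumptions; adjust them
# to whatever the collected node logs actually contain.
for dir in db-cluster-*/longevity-*-db-node-*; do
  echo "== $dir =="
  grep -h 'Schema version changed' "$dir"/messages.log | tail -n 1
done
```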
@kostja - can your team take a look at the latest failure? |
To simplify the scenario, what happens here is as follows:
|
Raft tables write to the commitlog in strongly consistent mode, that is, they don't return until the commit log confirms the writes. If the I/O subsystem is overloaded, we can quite realistically get more than 10 seconds to commit a write. On top of that, the current schema refresh code (the code path that loads the committed writes into memory and prints the referenced commit message) is not optimal - essentially it reloads the entire schema on each schema change. @tgrabiec recently sent a pull request to speed this up. I'm looking at it with hope. So I'm afraid that, as much as we promised correctness with raft-based topology changes, we haven't delivered on speed, and the above two changes should improve on that. Meanwhile, what we can do next to further isolate the problem is to reproduce the failure with raft_group0 and raft logs enabled. That would allow us to correlate the time a schema change happens in Raft with the time it hits the commit log and the time it is then loaded into the schema cache. There is a small chance that the fixes mentioned above are not sufficient to address the issue; seeing the raft logs would help us identify that. |
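Since the suggestion above is to reproduce with raft_group0 and raft logging enabled, here is a minimal sketch of turning those loggers up; the exact logger names depend on the Scylla version, so treat "raft" and "raft_group0" as assumptions and check the node's logger list first:

```bash
# List the available loggers and raise the raft-related ones to trace through the
# REST API on the default port 10000. The logger names "raft" and "raft_group0"
# are assumptions; pick whatever the first command actually reports.
curl -s http://localhost:10000/system/logger | tr ',' '\n' | grep -i raft
curl -s -X POST "http://localhost:10000/system/logger/raft?level=trace"
curl -s -X POST "http://localhost:10000/system/logger/raft_group0?level=trace"
```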
I can see in the logs:
So the optimization to not flush the memtables is not active, which is probably the cause. Are you overwriting the scylla.yaml? By default it should have this: force_schema_commit_log: true
|
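A quick way to confirm what a node is actually running with, as opposed to what the test's scylla.yaml says, is sketched below; the system.config virtual table is assumed to be available in the Scylla version under test, and the yaml path is the standard package location:

```bash
# What the node was started from (the file) vs. the live value it is running with.
grep -n 'force_schema_commit_log' /etc/scylla/scylla.yaml
# The system.config virtual table exposes effective configuration values; if it is
# not available in this version, rely on the file check above.
cqlsh -e "SELECT name, value FROM system.config WHERE name = 'force_schema_commit_log';"
```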
@juliayakovlev - please see above - and in general, would be good to know, what if any, we change from default YAML. |
@tgrabiec |
Yes, the default is used, but the default is to not force the schema commit log, in order to support rollback during a rolling upgrade. If the default is used, you need another round of restarts after the full upgrade. For new installs we put the force_schema_commit_log: true setting in scylla.yaml. |
@tgrabiec I'm a bit confused with your last comment. |
@roydahan Yes, the scylla.yaml in the repo has "force_schema_commit_log: true" since the feature was introduced. In general, settings in scylla.yaml should be preserved unless explicitly changed. |
Going back to this, I think I found out what happened with this attribute (force_schema_commit_log). However, in this case, this attribute's default value is different between db/config.cc (false) and scylla.yaml (true): hence, the tests didn't change this attribute and also didn't include it in our test scylla.yaml. @tgrabiec why do we have such a case where the default in the code is different from the yaml? |
First, you should know two facts about the schema commit log feature. It's not possible to enable it on a node in a hot manner, while the node is running; you need a restart. Secondly, you can't safely downgrade the node if the feature is enabled. So the default is false because, when upgrading, we want to enable this feature only after all nodes are upgraded, to allow downgrades during the rolling upgrade. You need a round of restarts after the upgrade. We want to avoid the need for a round of restarts in fresh clusters, hence scylla.yaml forces the feature to be enabled (there is no need to support rollback there, so we can do it). |
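To tie the two facts above together, the enablement flow they imply looks roughly like the sketch below; the node names, file path, and service name are the standard package defaults and are assumptions rather than something taken from this issue:

```bash
# After ALL nodes run the upgraded version, enable the schema commit log and do one
# more rolling restart -- the setting cannot be flipped while a node is running.
NODES=(node1 node2 node3)   # placeholder hostnames
for node in "${NODES[@]}"; do
  # If the key already exists in scylla.yaml, edit it in place instead of appending.
  ssh "$node" 'grep -q "^force_schema_commit_log:" /etc/scylla/scylla.yaml ||
    echo "force_schema_commit_log: true" | sudo tee -a /etc/scylla/scylla.yaml'
  ssh "$node" 'sudo systemctl restart scylla-server'
  # Confirm the node is back to UN in nodetool status before moving to the next one.
  ssh "$node" 'nodetool status'
done
```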
@roydahan - can you provide an update on this? |
Looks like using the force_schema_commit_log solves the issue. |
Changed MS to 5.2 since the issue is resolved with this flag in 5.2. |
Installation details
Kernel version:
5.11.0-1022-aws
Scylla version (or git commit hash):
4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24
Cluster size: 6 nodes (r5b.2xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI):
ami-0668e251eb4f0f394
(aws: eu-west-1)
Scylla running with shards number (live nodes):
longevity-100gb-4h-master-db-node-bce0924c-1 (34.252.180.109 | 10.0.3.53): 8 shards
longevity-100gb-4h-master-db-node-bce0924c-2 (34.244.58.179 | 10.0.1.246): 8 shards
longevity-100gb-4h-master-db-node-bce0924c-3 (3.250.166.213 | 10.0.0.220): 8 shards
longevity-100gb-4h-master-db-node-bce0924c-4 (54.216.84.128 | 10.0.1.149): 8 shards
longevity-100gb-4h-master-db-node-bce0924c-5 (54.75.69.84 | 10.0.2.137): 8 shards
longevity-100gb-4h-master-db-node-bce0924c-6 (34.244.236.61 | 10.0.3.61): 8 shards
Test:
longevity-100gb-4h-ebs-gp3-test
Test name:
longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
The test started by running a cassandra-stress load. There are a lot of WriteTimeoutExceptions, but the nodes are not overloaded.
In parallel, during the first nemesis (NoCorruptRepair), 10 keyspaces are created by running cassandra-stress commands.
Creation of the drop_table_during_repair_ks_0.standard1 table failed because of lack of schema agreement (see the log cassandra-stress-l0-c0-k1-c8336276-d647-4b88-90dd-34615541c1f0.log in the loader-set logs): schema_version_loading_failed and mutation_write_timeout_exception on the longevity-100gb-4h-master-db-node-bce0924c-3 node. The same failure occurred for the drop_table_during_repair_ks_9 keyspace.
Restore Monitor Stack command:
$ hydra investigate show-monitor bce0924c-4be2-49a3-afdd-35cf020461b8
Restore monitor on AWS instance using Jenkins job
Show all stored logs command:
$ hydra investigate show-logs bce0924c-4be2-49a3-afdd-35cf020461b8
Test id:
bce0924c-4be2-49a3-afdd-35cf020461b8
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/bce0924c-4be2-49a3-afdd-35cf020461b8/20220109_060331/db-cluster-bce0924c.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/bce0924c-4be2-49a3-afdd-35cf020461b8/20220109_060331/loader-set-bce0924c.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/bce0924c-4be2-49a3-afdd-35cf020461b8/20220109_060331/sct-runner-bce0924c.tar.gz
Jenkins job URL