reader_concurrency_semaphore times out a lot on 2/6 nodes and a little on 4/6 at ~90% cluster load #13322
Comments
@denesb - can you take a look? (I assume this is without your latest changes in this area - perhaps they need backports)
@vponomaryov I don't see anything wrong in the printouts. This seems like plain overload to me.
Overload would be at 100% load; the problem appeared at 90% load.
We have run it only a few times in the last year: as you can see, this is one such failure. Also, one more weekly run happened here: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-cdc-100gb-4h-test/381/consoleFull So, either the bug is flaky or, more probably, the disk perf (real perf; the io_properties.yaml file is the same on all nodes) in the failed run was really bad...
I don't see any disk reads in the semaphore printouts; it's all cache. Of course that is just a snapshot; metrics would give a more accurate picture.
@vponomaryov how is the data prepared for this test? And what is the read pattern? Has something changed around that recently?
Graphs confirm there are no disk reads whatsoever; reads are served 100% from cache.
Shards 3, 12 and, to some extent, 13 are busier on all the nodes that have timeouts. Consequently, reads on these shards queue up and time out.
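(Aside, for anyone checking the per-shard imbalance themselves: below is a minimal sketch of pulling per-shard read-queue depth from the monitoring stack's Prometheus API. The Prometheus address, the metric name scylla_database_queued_reads, and its labels are assumptions and may differ per Scylla/monitoring version.)

```python
# Sketch: list shards with a non-empty read queue, per node, via an
# instant query against Prometheus. Metric name and labels are assumed.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed monitoring address

def queued_reads_by_shard(metric: str = "scylla_database_queued_reads"):
    resp = requests.get(PROM_URL, params={"query": metric}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        labels = sample["metric"]
        value = float(sample["value"][1])
        if value > 0:
            print(f"{labels.get('instance')} shard {labels.get('shard')}: "
                  f"{value:.0f} queued reads")

if __name__ == "__main__":
    queued_reads_by_shard()
```

A persistent non-zero queue on the same few shards across nodes would match the pattern described above (hot shards queueing reads until they time out).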
@vponomaryov were speculative reads always on in this test?
No, no changes have been made for more than 1 year.
Yes, it has been enabled for 3 years already in the test config used here.
I have no idea what recent changes could have contributed to this failure, if any.
We encountered it during testing of 5.2 and bisected it there:
The only CDC-related change is: @kbr- can you take a look? Now that this change has found its way onto the release branch...
(For the future: please don't tag my old account, instead tag my new account @kbr-scylla so I get a notification on my work email.) The CDC change you mentioned only affects the LWT path, but there are no LWT queries in the cassandra-stress profiles that @vponomaryov has posted. So it's not that. @denesb is it plausible that it's the same cause as #11803? If so, maybe we should rerun once the fix gets promoted on 5.2 (9d384e3) (currently 5.2 promotion appears to be stuck though... there was a dtest backport missing, it should be backported now, so let's hope it promotes)
It is plausible. Worth a try.
@denesb should this case block 5.2?
I have no idea, because I still don't know what caused these timeouts. #11803, the issue @kbr-scylla suggested, is a plausible cause. The fix for that was backported to 5.2 only last week, so I think it is worth re-running the reproducer and seeing if it still reproduces with 9d384e3 (the 5.2 backport of the fix). If the problem doesn't reproduce, then we can close this as a duplicate of #11803.
Okay, thanks. @vponomaryov FYI ^^ please keep us posted (RC5 has the fix).
@vponomaryov any updates?
Re-ran the reproducer. The overall behavior stayed, but we haven't reached the case with timeouts. There are still lots of the semaphore-related messages.
Logs: test-id: 444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f
@denesb can you please take a look and provide your feedback on how bad this is?
I don't understand, you say you didn't reach the case with timeout, but you post a timeout from the logs below. Do you mean that you didn't see as many timeouts as before? If yes, how significant was the difference? Did the amount of timeouts go back to the "old" value (before you opened this issue)? Or not quite?
At first look, it still looks like reads are CPU-bound. I will have a look at the metrics later.
There are no timeouts in
JFYI: the test |
Does that mean that the way the cluster behaves is within normal parameters? Or do you still consider the current state to be a regression (only a smaller one than it used to be)? What I want to find out with all these questions is whether we are still dealing with a regression here, compared to the "good" state before. I just want to avoid going on a wild goose chase, investigating a regression that doesn't exist.
The diff in timeouts is the following: scylla_version:
scylla_version:
And one of the recent runs with
So, it can be said that
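(Aside: to make such a per-version comparison reproducible, here is a rough sketch for tallying timeout-related lines per node log. The log path glob and the matched substrings are assumptions; adjust them to the actual messages emitted by the Scylla version under test.)

```python
# Sketch: count lines matching timeout-related substrings in each node's log.
import glob
from collections import Counter

PATTERNS = ("timed out", "reader_concurrency_semaphore")  # assumed substrings

def count_timeout_lines(log_glob: str = "logs/*/system.log") -> Counter:
    counts = Counter()
    for path in glob.glob(log_glob):  # assumed log layout
        with open(path, errors="replace") as f:
            counts[path] = sum(any(p in line for p in PATTERNS) for line in f)
    return counts

if __name__ == "__main__":
    for path, n in sorted(count_timeout_lines().items()):
        print(f"{path}: {n} matching lines")
```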
@denesb can you please comment on Roy's assumption ^^
I have no idea whether they are the same or not. Timeouts are just the first symptom one will see if something goes wrong on a node. The root cause could be a wide array of things (including, of course, the semaphore itself).
It is important not to conflate different issues just because they all print semaphore diagnostics to the logs. These issues can have wildly different causes, and conflating them all means we will never close any of them (we already have way too many unclosed timeout-related issues). Based on #13322 (comment) I'm leaning towards closing this issue and focusing on the other ones that have not been investigated yet.
Thanks, Botond. /Cc @roydahan, I'm going with Botond ^^ and closing it. Please raise a flag if you have any objection.
Issue description
Impact
A lot of query timeouts.
Installation details
Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 5.3.0~dev-20230316.5705df77a155 with build-id 3aa7c396c9e0eab7f08c8ab7dd82a548cf8d41e6
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0cbdbeb82ff6b7e4f (aws: eu-west-1)
Test: longevity-cdc-100gb-4h-test
Test id: 99364f24-5e01-4f5f-9acf-90656769f227
Test name: scylla-master/longevity/longevity-cdc-100gb-4h-test
Test config file(s):
Details:
In this test run the no_corrupt_repair nemesis was attempted. Only the following parts of it were executed before and during the buggy behavior of Scylla:
drop_table_during_repair_ks_%integer%.standard1
The load was growing from 80% to 90%, and the following errors started appearing in the DB logs:
node-5:
node-3:
node-1:
node-5, which had the highest number of permits in the above kind of error, started returning a lot of query timeouts, causing the following errors:
And the following is visible in the loader logs:
The above problems caused the peers info queries, made by SCT as part of the nemesis, to time out on this node-5 for almost 3 hours. The timeouts stopped when the loader started decreasing the load, reaching the end of the load schedule. So, as the load finished, our timeouts stopped and the test tearDown started. The rest of the errors in SCT are SCT-related.
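(Aside: when triaging reports like this, the permit/memory counters in the semaphore diagnostics are the first thing to compare across nodes. Below is a rough sketch that extracts them from log lines piped on stdin; the regex is an assumption about the dump format and may need adjusting per Scylla version.)

```python
# Sketch: pull "used/total count" and "used/total memory" out of
# reader_concurrency_semaphore diagnostics lines read from stdin.
# The expected shape, e.g. "... semaphore <name> with 100/100 count and
# 14858916/16778362 memory resources ...", is an assumption.
import re
import sys

DIAG = re.compile(
    r"semaphore\s+\S+\s+with\s+(\d+)/(\d+)\s+count\s+and\s+"
    r"(\d+)/(\d+)\s+memory\s+resources"
)

for line in sys.stdin:
    m = DIAG.search(line)
    if m:
        count_used, count_total, mem_used, mem_total = map(int, m.groups())
        print(f"count {count_used}/{count_total}, "
              f"memory {mem_used}/{mem_total}")
```

Usage would be something like `grep -i semaphore system.log | python3 parse_semaphore.py` (script name hypothetical).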
NOTE: This bug's effect is very similar to #12552, but this one is not related to an upgrade and appears when 90% load is reached.
Monitor:
$ hydra investigate show-monitor 99364f24-5e01-4f5f-9acf-90656769f227
$ hydra investigate show-logs 99364f24-5e01-4f5f-9acf-90656769f227
Logs:
Jenkins job URL