cqlsh fails with "Operation timed out for system.peers" (after nodetool refresh used)
#11016
Comments
Happening also this week, run mostly during

Installation details
Kernel Version: 5.15.0-1015-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Logs: No logs captured during this run.
@fruch how do I get to the cluster node logs?
It's currently running, so they aren't collected yet.
Looking at the logs, amongst the thousands of messages there are many "reader_concurrency_semaphore timed out" messages on shard 12:
Until 2022-07-15T18:31:04+00:00 there are close to 1000 "reader_concurrency_semaphore timed out" messages on shard 12. @denesb please look into this.
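Counting these messages per shard is a quick way to confirm one shard is the outlier. A minimal sketch, assuming Scylla-style log lines with a `[shard N]` tag (the sample lines below are synthetic, invented for illustration):

```python
import re
from collections import Counter

# Synthetic sample; real node logs have this general shape, contents invented.
log = """\
INFO  2022-07-15 18:30:01 [shard 12] reader_concurrency_semaphore - semaphore timed out
INFO  2022-07-15 18:30:02 [shard 12] reader_concurrency_semaphore - semaphore timed out
INFO  2022-07-15 18:30:03 [shard 3] reader_concurrency_semaphore - semaphore timed out
"""

# Count "reader_concurrency_semaphore ... timed out" lines, keyed by shard.
per_shard = Counter(
    m.group(1)
    for line in log.splitlines()
    if "reader_concurrency_semaphore" in line and "timed out" in line
    for m in [re.search(r"\[shard (\d+)\]", line)]
    if m
)
print(per_shard.most_common())
```

Running the same counter over a full node log would show whether shard 12 dominates as described above.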
The actual diagnostics dump is missing from the logs. Wasn't this fixed already?
Where do you expect the fix to be?
This log is a multi-line one, and we had the problem in the past of SCT only copying the first line into its own logs. There was an issue (don't remember which) where this was discussed, and I thought it was fixed.
I hope the job also collects the node logs verbatim. |
The answer is no (I'll try to handle it in scylladb/scylla-cluster-tests#5026), but the faulty machine is up; here is one example from it:
I need a coredump from this node while it is producing these symptoms.
The test case ended, so it's a bit too late for that.
Ok. Next time you reproduce this, please kill scylla with SIGABRT and upload the coredump, so I can have a look at what the problematic shard is doing.
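The SIGABRT approach works because the default disposition for SIGABRT terminates the process and dumps core (subject to ulimit and core_pattern settings). A sketch of the mechanism on a dummy child process; for a real node you would signal the scylla PID instead:

```python
import signal
import subprocess
import sys

# Dummy long-running child standing in for the scylla process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(600)"])

# Equivalent of: kill -ABRT <pid>. The kernel writes a core file if
# core dumps are enabled (ulimit -c, /proc/sys/kernel/core_pattern).
child.send_signal(signal.SIGABRT)

# On POSIX, Popen.wait() returns the negative signal number for a
# signal-terminated process: -6 for SIGABRT.
ret = child.wait()
print(ret)
```

On a systemd host the resulting core is typically retrievable with `coredumpctl` for upload.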
@denesb why do we have multi-line messages in the log? |
Reproduced 3 times during longevity-twcs-48h

Installation details
Kernel Version: 5.15.0-1015-aws
Scylla Nodes used in this run:
OS / Image:
Test:

Issue description
During the test, 3 nemeses failed with the same ReadTimeout error when "describe keyspaces" was sent to node1 (private IP 10.0.0.17):

Scylla-bench commands running at this time:

Logs:
I know it's not ideal, but this report is very long, and if I made it single-line it would be completely unreadable for humans.
We could print it as a JSON string that could easily be pretty-printed later with a JSON pretty-printer.
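That trade-off is cheap to recover from on the reading side: any JSON pretty-printer turns the single-line record back into a readable form on demand. A minimal sketch, with invented field names standing in for the real diagnostics dump:

```python
import json

# Hypothetical single-line diagnostics record, as it might appear in the log
# (the field names here are invented for illustration):
line = '{"shard": 12, "semaphore": "user", "waiters": 978, "permits": [{"count": 1, "memory": 16384}]}'

# Round-trip through the standard json module to get the human-readable form.
pretty = json.dumps(json.loads(line), indent=2)
print(pretty)
```

The same result is available from the shell via `python -m json.tool`, so no special tooling is needed beyond a Python installation.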
This also makes the log unreadable for humans without the tool at hand. The question is: who do we want to make the logs easy to read for? Humans or scripts?
I think that in most cases the target audience for this particular error is scylla engineering rather than users, since it involves an internal data structure that's not exposed in the user interface. Therefore I'd expect scylla staff dealing with it to have the tool.
This is far from being the only multi-line log message. Are we requiring all of them to be converted to single-line JSON? Note that multi-line logs are especially common in engineer-oriented logs, typically where more details are required.
Maybe we should add an option to scylla, to be set in automated runs, that prints those logs on a single line.
I don't mind that, but again, how far are we willing to go to avoid parsing multi-line logs? From where I'm sitting this doesn't seem like a hard thing to do: logs have a very well-defined start sequence, which is easy to find, and it can thus be used to split the log stream into individual multi-line logs.
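The splitting approach described above can be sketched in a few lines. This assumes a Scylla-style record-start sequence (severity followed by a timestamp); the exact pattern is an assumption, not the actual SCT implementation:

```python
import re

# Assumed record-start pattern for Scylla-style log lines, e.g.
#   "INFO  2022-07-15 18:31:04,123 [shard 12] reader_concurrency_semaphore - ..."
RECORD_START = re.compile(r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\d{4}-\d{2}-\d{2} ")

def split_records(lines):
    """Group a raw log stream into complete (possibly multi-line) records.

    A new record begins at every line matching RECORD_START; any line that
    does not match is treated as a continuation of the previous record.
    """
    record = []
    for line in lines:
        if RECORD_START.match(line) and record:
            yield "\n".join(record)
            record = []
        record.append(line.rstrip("\n"))
    if record:
        yield "\n".join(record)
```

With this, a multi-line diagnostics dump travels as one record, so a consumer like SCT would copy all of its lines rather than just the first.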
Is this a duplicate of #10405?
@xemul ^^ |
Do we have a metric showing how many IOs a cache-missing request needs?
The plan is to generalize CPU and IO classes so that the default IO class would just naturally disappear.
Yup:

    // This is small enough, and well-defined. Easier to just read it all at once
    future<> sstable::read_toc() noexcept {
        ...
        return with_file(new_sstable_component_file(_read_error_handler, component_type::TOC, open_flags::ro), [this] (file f) {
            auto bufptr = allocate_aligned_buffer<char>(4096, 4096);
            auto buf = bufptr.get();
            auto fut = f.dma_read(0, buf, 4096); // <<<<< uses default IO class
Not the case here
Erm... it doesn't look like loading at all, in fact:
on all nodes
Ah, so probably streaming happens because some node(s) are being decommissioned.
So node 10.4.0.178 first got decommissioned,
then immediately joined back,
and repair kicked in,
then removenode happened,
then repair again.
So most of the streaming badness happens here
Happened again during a TRUNCATE command; it failed after 4m, even though the timeout is 600s.
Installation details
Kernel Version: 5.15.0-1031-aws
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
@xemul where does this stand?
/Cc @fruch ^^ |
There's no specific reproducer or case for this one; it will wait for the next release, when we'll run some of those cases again to see if this is still happening.
We can start with master. |
@fruch, we are building 2023.1.1, so perhaps we should re-run the test that hit it in the first place.
This has been happening in multiple cases over the last year, so why try reproducing it on 2023.1.1 and not on any other release? Anyhow, we don't have any specific case that reproduces it clearly.
I'm closing this for the time being. If it reproduces - please re-open. |
Installation details
Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.1.dev-20220706.a0ffbf3291b7 with build-id 3490fa9f14da510e97a1d0f53f693cac13a70494
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-07d73e5ea1fc772eb (aws: eu-west-1)
Test: longevity-50gb-3days
Test id: fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description
During disrupt_nodetool_refresh, when the test tries to verify that the snapshot was refreshed correctly, the cqlsh command times out on system.peers.

$ hydra investigate show-monitor fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
$ hydra investigate show-logs fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a

Logs: No logs captured during this run. The test is still running.
http://34.246.190.165:3000/d/ks-master/longevity-50gb-3days-scylla-per-server-metrics-nemesis-master?orgId=1&from=now-6h&to=now
Jenkins job URL
it's ~2022-07-11 12:02:19