[longevity-tls-50gb-3d]: test timeout event that stress finishes in time #5345
Comments
Reproduced in 5.0.5 as well with a shorter longevity.
Installation details
Kernel Version: 5.15.0-1018-gcp
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
Yet it still attempted to connect to a node that was (successfully) decommissioned during the test:
Logs:
|
IIRC we had a similar issue, and the problem was something with the driver or the log writing taking too long, so although the stress finished, it was stuck writing to logs or something like that... |
We are not using docker for cassandra-stress, at least not yet |
I think this error might be related to my 2 other commits: #5074 - where we time out the stress thread based on duration. We add a 10-minute margin to the stress command and it might be too little (especially when the no-warmup flag is not set). But this could also be the result of the c-s hang we notice sometimes (that's why we added this fix). The other thing is that we didn't see this error before because timeouts were not caught/working properly - this was fixed in #5311. I propose to prolong this 10m period by 5% of the duration (so for short c-s runs we don't wait too long). |
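A minimal sketch of the timeout calculation being proposed here, assuming a Python helper along these lines; the function name and the exact margin formula are an illustrative reading of the comment, not the actual SCT code.

```python
# Illustrative only: one reading of the proposal above, not SCT's real code.
def stress_timeout_seconds(duration_seconds: int) -> int:
    """How long to wait for a cassandra-stress thread before raising a timeout.

    Today the margin is a fixed 10 minutes on top of the stress duration;
    the proposal is to also add 5% of the duration, so long runs get more
    slack while short runs still finish waiting quickly.
    """
    fixed_margin = 10 * 60                               # current 10-minute margin
    proportional_margin = int(0.05 * duration_seconds)   # proposed extra 5% of duration
    return duration_seconds + fixed_margin + proportional_margin

# Example: a 3-day stress command would get roughly 3.6 extra hours of margin
# instead of only 10 minutes.
print(stress_timeout_seconds(3 * 24 * 3600))
```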
I verified stress logs, and in the first issue comment's test, in ... So I don't think increasing the timeout will help. We need to find out why these c-s threads freeze, e.g. see:
Installation details
Kernel Version: 5.15.0-1021-azure
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
Logs:
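One way to start answering "why do these c-s threads freeze" is a JVM thread dump of the stuck cassandra-stress process on the loader. This is only a hedged sketch: the process-matching string, file names and the use of jstack are assumptions, not part of SCT.

```python
# Hedged sketch: capture thread dumps of any cassandra-stress JVMs still running
# on a loader, so we can see where they are stuck. Assumes jstack is installed.
import subprocess

def dump_stuck_stress_threads() -> None:
    # Find PIDs whose command line mentions cassandra-stress (match string is a guess).
    pids = subprocess.run(["pgrep", "-f", "cassandra-stress"],
                          capture_output=True, text=True).stdout.split()
    for pid in pids:
        dump = subprocess.run(["jstack", pid], capture_output=True, text=True).stdout
        with open(f"cassandra-stress-{pid}-threaddump.txt", "w") as f:
            f.write(dump)

dump_stuck_stress_threads()
```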
|
Got something similar in a 2022.2 job (although the test wasn't timed out).
Installation details
Kernel Version: 5.15.0-1022-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
In the beginning two stress threads were started among many others:
The test run reached the end
After that we got a timeout error for the stress threads
All the other stress threads finished, but the ones mentioned above failed:
Logs:
|
We are most likely seeing something similar in a rolling upgrade test:
Installation details
Kernel Version: 3.10.0-1160.76.1.el7.x86_64
Scylla Nodes used in this run:
OS / Image:
Test:
Issue description
The test timed out during step 5 (out of 10), because these steps took some 30 minutes more than they usually do (we run c-s and gemini stress multiple times in this test).
Logs:
|
Same issue reproduced with Azure:
Issue description
Some cassandra-stress threads continue trying to connect to the removed node after c-s reported its results, which causes an error:
Impact
Test marked as failed with error or critical error.
How frequently does it reproduce?
Latest 3 runs on Azure.
Installation details
Kernel Version: 5.15.0-1051-azure
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
What are the loader logs showing? I don't think that it's related to this year-old issue |
I agree with Israel. This error looks related to scylladb/java-driver#258: the c-s failed to stop after the DecommissionStreamingErr nemesis. |
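To answer "what are the loader logs showing", a rough check like the sketch below could count how often cassandra-stress still mentions the decommissioned node after it printed its results. The log path, node IP and the end-of-run marker are assumptions for illustration, not fixed SCT or c-s strings.

```python
# Hedged sketch: count late mentions of the decommissioned node in a c-s log.
# The path, IP and end-of-run marker below are illustrative assumptions.
from pathlib import Path

DECOMMISSIONED_IP = "10.0.3.117"                   # hypothetical removed node
log_path = Path("cassandra-stress-l0-c0-k1.log")   # hypothetical loader log file

results_seen = False
late_mentions = 0
for line in log_path.read_text(errors="replace").splitlines():
    if "Results:" in line:          # c-s prints a results section when it finishes
        results_seen = True
    elif results_seen and DECOMMISSIONED_IP in line:
        late_mentions += 1

print(f"Lines mentioning {DECOMMISSIONED_IP} after the results section: {late_mentions}")
```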
Installation details
Kernel Version: 5.15.0-1020-aws
Scylla version (or git commit hash):
5.2.0~dev-20221002.060dda8e00b7
with build-id b4e08d869feb2fecb04c6ea45eb8946cade10c70
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
...
OS / Image:
ami-014a8e66eac5a37c5
(aws: us-east-1)
Test:
longevity-50gb-3days
Test id:
ee8833a4-d087-47f6-9865-3db37c29f94b
Test name:
scylla-master/longevity/longevity-50gb-3days
Test config file(s):
Issue description
The test times out at 15:13, while waiting for stress threads to finish:
While most main stress threads are finished, one still fails on the timeout:
Looking at the log, it seems like it keeps trying to contact a non-existing node, and that slows it down a bit.
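For illustration, a minimal sketch of the failure mode described above, assuming the stress commands run as plain Python threads joined against one shared deadline (SCT's real implementation differs): every thread but one returns in time, and the one still retrying the removed node is reported as stuck when the deadline passes.

```python
# Minimal sketch (not SCT code): join stress threads against a shared deadline
# and report which ones are still alive when the deadline passes.
import threading
import time

def wait_for_stress_threads(threads: list[threading.Thread], deadline_s: float) -> list[str]:
    """Return the names of threads still running after deadline_s seconds."""
    end = time.monotonic() + deadline_s
    stuck = []
    for t in threads:
        t.join(timeout=max(0.0, end - time.monotonic()))
        if t.is_alive():   # e.g. still retrying the decommissioned node
            stuck.append(t.name)
    return stuck
```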
$ hydra investigate show-monitor ee8833a4-d087-47f6-9865-3db37c29f94b
$ hydra investigate show-logs ee8833a4-d087-47f6-9865-3db37c29f94b
Logs:
Jenkins job URL