
[longevity-tls-50gb-3d]: test timeout even though stress finishes in time #5345

Open
fruch opened this issue Oct 11, 2022 · 10 comments

@fruch
Contributor

fruch commented Oct 11, 2022

Installation details

Kernel Version: 5.15.0-1020-aws
Scylla version (or git commit hash): 5.2.0~dev-20221002.060dda8e00b7 with build-id b4e08d869feb2fecb04c6ea45eb8946cade10c70

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-ee8833a4-9 (34.228.71.135 | 10.12.10.33) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-8 (54.237.196.139 | 10.12.11.224) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-7 (3.91.44.54 | 10.12.9.54) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-6 (54.224.49.190 | 10.12.9.114) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-5 (54.221.151.171 | 10.12.11.178) (shards: 14)
    ...

OS / Image: ami-014a8e66eac5a37c5 (aws: us-east-1)

Test: longevity-50gb-3days
Test id: ee8833a4-d087-47f6-9865-3db37c29f94b
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description

Test times out at 15:13, while waiting for stress threads to finish:

< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2022-10-10 15:13:36.217: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=11a70335-2bb5-44aa-973f-26a68ce1c51e, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 140, in test_custom_time
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.verify_stress_thread(cs_thread_pool=stress)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1906, in verify_stress_thread
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     results, errors = cs_thread_pool.verify_results()
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 281, in verify_results
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     for future in concurrent.futures.as_completed(self.results_futures, timeout=self.timeout):
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     waiter.event.wait(wait_timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/threading.py", line 600, in wait
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     signaled = self._cond.wait(timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/threading.py", line 324, in wait
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     gotit = waiter.acquire(True, timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 241, in critical_failure_handler
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     raise CriticalTestFailure("Critical Error has failed the test")  # pylint: disable=raise-missing-from
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > sdcm.tester.CriticalTestFailure: Critical Error has failed the test

While most of the main stress threads have finished, one still fails on the timeout:

43826232:< t:2022-10-10 15:03:53,951 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 259800 seconds!\n\nCommand: \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=JG6YDU4LBWMI2FJ2CBP7; cassandra-stress read cl=QUORUM duration=4320m -mode cql3 native   user=cassandra password=cassandra -rate threads=50 -pop seq=1..100000000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.12.9.39,10.12.11.81,10.12.8.12,10.12.9.74,10.12.11.178,10.12.9.114 -errors skip-unsupported-columns\
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-10-10 15:05:12.385: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=f5e9fa3c-51a6-43a9-8a9b-e485c37815ab during_nemesis=NoCorruptRepair duration=3d10h50m53s: node=Node longevity-tls-50gb-3d-master-loader-node-ee8833a4-2 [18.234.34.198 | 10.12.9.30] (seed: False)
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress read cl=QUORUM duration=4320m -mode cql3 native  -rate threads=50 -pop seq=1..100000000 -log interval=5
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 1: WARN  15:04:04,928 Error creating netty channel to /10.12.9.114:9042

Looking at the log, it seems like it keeps trying to contact a non-existent node, and that slows it down a bit.

  • Restore Monitor Stack command: $ hydra investigate show-monitor ee8833a4-d087-47f6-9865-3db37c29f94b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ee8833a4-d087-47f6-9865-3db37c29f94b

Logs:

Jenkins job URL

@KnifeyMoloko
Contributor

Reproduced in 5.0.5 as well, with a shorter longevity test.

Installation details

Kernel Version: 5.15.0-1018-gcp
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee

Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

  • longevity-10gb-3h-5-0-db-node-fb62010e-0-9 (34.75.168.233 | 10.142.0.33) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-8 (34.75.168.233 | 10.142.0.42) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-7 (34.75.168.233 | 10.142.0.157) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-6 (35.185.44.132 | 10.142.0.150) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-5 (104.196.222.105 | 10.142.0.148) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-4 (34.148.58.38 | 10.142.0.147) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-3 (34.73.145.132 | 10.142.0.139) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-2 (35.196.146.49 | 10.142.0.135) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-1 (35.229.126.210 | 10.142.0.130) (shards: 16)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5935604957551590924 (gce: us-east1)

Test: longevity-10gb-3h-gce-test
Test id: fb62010e-329a-4d63-aeac-50e1c1e13341
Test name: scylla-5.0/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Issue description

The test timed out waiting for the c-s threads to finish. Looking in the loader logs, however, I see that the 2 longest c-s runs did finish:

Total GC memory           : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Total GC time             :    0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Avg GC time               :    NaN ms
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
StdDev GC time            :    0.0 ms
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Total operation time      : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
END

Yet it still attempted to connect to a node that was (successfully) decommissioned during the test:

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed


  • Restore Monitor Stack command: $ hydra investigate show-monitor fb62010e-329a-4d63-aeac-50e1c1e13341
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fb62010e-329a-4d63-aeac-50e1c1e13341

Logs:

Jenkins job URL

@fgelcer
Contributor

fgelcer commented Oct 13, 2022

IIRC we had a similar issue, where the problem was the driver or the log writing taking too long, so the stress command, although finished, was stuck writing to logs or something like that...
Could that be the case here, but now, because it is running inside a Docker container, we are having problems writing logs within the container?

@fruch
Contributor Author

fruch commented Oct 13, 2022

IIRC we had a similar issue, where the problem was the driver or the log writing taking too long, so the stress command, although finished, was stuck writing to logs or something like that...
Could that be the case here, but now, because it is running inside a Docker container, we are having problems writing logs within the container?

We are not using Docker for cassandra-stress, at least not yet.

@soyacz
Contributor

soyacz commented Oct 17, 2022

I think this error might be related to my 2 other commits:

#5074 - where, based on duration, we time out the stress thread. We add a 10-minute margin to the stress command timeout, and it might be too little (especially when the no-warmup flag is not set). But this could also be the result of the c-s hang we notice sometimes (that's why we added this fix).

The other thing is that we didn't see this error before because timeouts were not being caught/working properly - this was fixed in #5311.

I propose to extend this 10-minute margin by 5% of the duration (so for short c-s runs we don't wait too long); see the sketch below.
We could also add the no-warmup flag everywhere - but I don't know if that would be the right approach.
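
As a rough illustration of the proposed margin (a sketch with names of my own choosing, not the actual SCT change): keep the 10-minute floor, but scale the margin with the stress duration, so the 4320m commands in this issue get a few hours of slack instead of a flat 10 minutes.

from datetime import timedelta

def stress_timeout(duration: timedelta, margin_fraction: float = 0.05) -> timedelta:
    # flat 10-minute floor, or 5% of the stress duration, whichever is larger
    margin = max(timedelta(minutes=10), duration * margin_fraction)
    return duration + margin

# today's flat margin for the 4320m commands above gives 4320m + 10m = 259800s
# (the exact timeout seen in the first comment); a 5% margin would give ~3h36m of slack
print(stress_timeout(timedelta(minutes=4320)))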

@soyacz
Contributor

soyacz commented Oct 17, 2022

I verified the stress logs, and in the test from the first issue comment, in cassandra-stress-l1-c0-k1-30fb05d9-67fe-4988-8df3-997b0aa6cf22.log from loader-node-2, I can see it froze (in this log we can also see lines starting again after 11h, which I don't understand). In the other test I also found c-s logs that go silent after some time, leading to the timeout error.

So I don't think increasing the timeout will help.

We need to find out why these c-s threads freeze.
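
One cheap way to spot such a freeze earlier (a hypothetical helper, not something SCT has today) would be to watch the c-s log's modification time and flag a thread whose log has gone silent while it is still supposedly running:

import os
import time

def cs_log_looks_frozen(log_path: str, silence_threshold_s: int = 30 * 60) -> bool:
    # if the c-s log file hasn't been appended to for `silence_threshold_s`
    # while the stress thread is still "running", the process has likely hung
    return (time.time() - os.path.getmtime(log_path)) > silence_threshold_s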

e.g. see cassandra-stress-l0-c0-k1-07f02534-d226-4550-976a-2f795c9801bc.log from the test on master:

Installation details

Kernel Version: 5.15.0-1021-azure
Scylla version (or git commit hash): 5.2.0~dev-20221011.fcf0628bc5d0 with build-id 8f3895a59a9d0e6bcc3c0f8932dee690e20feff9

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-9 (20.172.146.207 | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-8 (4.246.166.158 | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-7 (52.188.208.105 | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-6 (20.124.241.150 | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-5 (20.163.207.223 | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-4 (20.185.18.243 | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-3 (20.127.113.247 | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-2 (20.185.18.203 | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-1 (52.255.150.93 | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-dev-x86_64-2022-10-11T04-25-04Z (azure: eastus)

Test: longevity-10gb-3h-azure-test
Test id: 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Issue description

  • Restore Monitor Stack command: $ hydra investigate show-monitor 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825

Logs:

Jenkins job URL

@ilya-rarov
Contributor

ilya-rarov commented Nov 13, 2022

Got something similar in a 2022.2 job (although the test didn't time out).

Installation details

Kernel Version: 5.15.0-1022-aws
Scylla version (or git commit hash): 2022.2.0~rc4-20221106.f5714e0db12f with build-id f4a927b8a00fbcd8d48640835192aeaa7968b1f2
Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.2/relocatable/2022-11-06T15:44:05Z/scylla-enterprise-x86_64-package.tar.gz
Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-9 (35.178.200.195 | 10.3.1.68) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-8 (13.40.154.111 | 10.3.3.247) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-7 (18.130.108.114 | 10.3.0.41) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-6 (3.9.16.111 | 10.3.1.133) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-5 (13.40.151.248 | 10.3.1.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-4 (52.56.200.142 | 10.3.0.176) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-33 (13.40.115.39 | 10.3.0.101) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-32 (13.40.123.99 | 10.3.2.159) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-31 (3.9.115.191 | 10.3.0.118) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-30 (3.8.170.243 | 10.3.3.89) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3 (18.135.104.106 | 10.3.0.170) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-29 (13.40.101.236 | 10.3.0.29) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-28 (18.130.232.235 | 10.3.3.117) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27 (18.170.61.6 | 10.3.3.209) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-26 (18.134.198.186 | 10.3.1.5) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-25 (18.169.157.184 | 10.3.3.5) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-24 (35.176.158.13 | 10.3.1.214) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-23 (3.8.131.38 | 10.3.1.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-22 (3.9.177.44 | 10.3.0.93) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-21 (35.178.161.42 | 10.3.3.223) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-20 (35.176.103.236 | 10.3.0.99) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-2 (3.8.139.104 | 10.3.2.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-19 (18.133.175.26 | 10.3.2.156) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-18 (52.56.125.9 | 10.3.2.89) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-17 (35.178.199.29 | 10.3.1.188) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-16 (52.56.195.17 | 10.3.1.19) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-15 (18.135.105.217 | 10.3.3.172) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-14 (13.40.73.39 | 10.3.1.213) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-13 (3.8.143.101 | 10.3.0.34) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-12 (18.133.238.4 | 10.3.0.217) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-11 (18.132.9.223 | 10.3.0.175) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-10 (18.133.244.123 | 10.3.1.199) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-1 (52.56.171.91 | 10.3.1.49) (shards: 14)

OS / Image: ami-06afe72a2cd370dbb (aws: eu-west-2)

Test: longevity-50gb-3days-aws-i4i-test
Test id: a94220e9-cdb8-4806-9590-185b346bbad1
Test name: enterprise-2022.2/longevity/longevity-50gb-3days-aws-i4i-test
Test config file(s):

Issue description

At the beginning, two stress threads were started among many others:

  • on loader-node-3
2022-11-10 08:45:28.520: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=6e3c6801-ef57-4c16-b225-6feef198cdab: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False)
stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native  -rate threads=100 -pop seq=1..100000000 -log interval=5
  • on loader-node-1
2022-11-10 08:45:42.036: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=a34b8e2a-6b22-4646-bf4c-68770daf3ff3: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-1 [35.176.112.176 | 10.3.3.100] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=4320m -mode cql3 native -rate threads=20

The test run reached the end:

2022-11-13 08:57:02.971: (InfoEvent Severity.NORMAL) period_type=not-set event_id=62d39dc9-a9d2-4342-a90b-8d76bf5c2e9c: message=TEST_END

After that, we got a timeout error for the stress threads:

2022-11-13 08:57:28.879: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=74e00f6b-61c4-4c9f-801e-128a19ee2464, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 262, in test_custom_time
    self.verify_stress_thread(cs_thread_pool=stress)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1916, in verify_stress_thread
    results, errors = cs_thread_pool.verify_results()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 273, in verify_results
    for future in concurrent.futures.as_completed(self.results_futures, timeout=self.timeout):
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
    raise TimeoutError(
concurrent.futures._base.TimeoutError: 1 (of 3) futures unfinished
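
The "1 (of 3) futures unfinished" message is the standard concurrent.futures.as_completed timeout behaviour: it raises as soon as the timeout expires while any stress future is still running, no matter how many of the others already finished. A minimal standalone sketch of that failure mode (illustrative only, not SCT code):

import concurrent.futures
import time

def hung_stress():
    time.sleep(5)  # stands in for a c-s process that never returns

def finished_stress():
    return "OK"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(finished_stress), pool.submit(hung_stress)]
    try:
        for future in concurrent.futures.as_completed(futures, timeout=1):
            print(future.result())
    except concurrent.futures.TimeoutError as exc:
        print(exc)  # prints "1 (of 2) futures unfinished"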

All the other stress threads finished, but the ones mentioned above failed:

2022-11-13 08:57:31.945: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=a34b8e2a-6b22-4646-bf4c-68770daf3ff3 duration=3d0h11m49s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-1 [35.176.112.176 | 10.3.3.100] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=4320m -mode cql3 native -rate threads=20
errors:

Stress command completed with bad status 1: WARN  08:57:16,007 Error creating netty channel to /10.3.0.176:9042
com.datastax.shaded.netty.channe
2022-11-13 08:57:38.326: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=6e3c6801-ef57-4c16-b225-6feef198cdab duration=3d0h12m9s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False)
stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native  -rate threads=100 -pop seq=1..100000000 -log interval=5
errors:

Stress command completed with bad status 1: WARN  08:57:23,897 Error creating netty channel to /10.3.0.170:9042
com.datastax.shaded.netty.channe
  • Restore Monitor Stack command: $ hydra investigate show-monitor a94220e9-cdb8-4806-9590-185b346bbad1
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a94220e9-cdb8-4806-9590-185b346bbad1

Logs:

Jenkins job URL

@fgelcer
Contributor

fgelcer commented Nov 29, 2022

We are most likely seeing something similar in a rolling upgrade test:

Installation details

Kernel Version: 3.10.0-1160.76.1.el7.x86_64
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
Cluster size: 4 nodes (n1-highmem-8)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-db-node-edcca94b-0-4 (34.139.250.105 | 10.142.15.192) (shards: 8)
  • rolling-upgrade--centos-db-node-edcca94b-0-3 (35.243.160.90 | 10.142.0.189) (shards: 8)
  • rolling-upgrade--centos-db-node-edcca94b-0-2 (35.196.83.10 | 10.142.0.152) (shards: 8)
  • rolling-upgrade--centos-db-node-edcca94b-0-1 (34.74.49.187 | 10.142.0.141) (shards: 8)

OS / Image: https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/family/centos-7 (gce: us-east1)

Test: rolling-upgrade-centos7-test
Test id: edcca94b-e7eb-4751-ba67-513d9f6cc34a
Test name: scylla-5.1/rolling-upgrade/rolling-upgrade-centos7-test
Test config file(s):

Issue description

The test timed out during step 5 (out of 10) because these steps took some 30 minutes longer than they usually do (we run c-s and gemini stress multiple times in this test).
There are no errors or stress command failures, but things seem to be delayed.


  • Restore Monitor Stack command: $ hydra investigate show-monitor edcca94b-e7eb-4751-ba67-513d9f6cc34a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs edcca94b-e7eb-4751-ba67-513d9f6cc34a

Logs:

Jenkins job URL

@aleksbykov
Contributor

The same issue reproduced on Azure:

Issue description


Some cassandra-stress threads continue trying to connect to the removed node after c-s reported its results, which causes this error:

2023-11-12 20:18:50.601: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=3b624bf0-6e0b-4a6a-a174-131a8d2aabc7, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 166, in test_custom_time
self.verify_stress_thread(cs_thread_pool=stress)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2165, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
raise TimeoutError(
concurrent.futures._base.TimeoutError: 2 (of 2) futures unfinished

Impact

The test is marked as failed with an error or critical error.

How frequently does it reproduce?

The latest 3 runs on Azure.

Installation details

Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231112.8d618bbfc657 with build-id d2846e72c5ad6493e5eb93f1eb9227eca2761f92

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-5be183a7-eastus-8 (40.87.17.249 | 10.0.0.6) (shards: -1)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-7 (20.168.251.30 | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-6 (20.163.160.37 | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-5 (20.232.101.255 | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-4 (20.231.84.94 | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-3 (20.231.84.53 | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-2 (20.231.84.13 | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-1 (20.228.216.41 | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-12T14-55-14 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 5be183a7-473d-45d5-a1c5-fb8308c56f5c
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 5be183a7-473d-45d5-a1c5-fb8308c56f5c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5be183a7-473d-45d5-a1c5-fb8308c56f5c

Logs:

Jenkins job URL
Argus

@fruch
Contributor Author

fruch commented Nov 14, 2023

@aleksbykov

What are the loader logs showing?

I don't think it's related to this year-old issue.

@enaydanov
Contributor

@fruch @aleksbykov

I can agree with Israel. This error looks related to scylladb/java-driver#258:

The c-s failed to stop after the DecommissionStreamingErr nemesis.
