
[longevity-tls-50gb-3d]: test timeout even though stress finishes in time #5345

Open
fruch opened this issue Oct 11, 2022 · 10 comments

@fruch
Contributor

fruch commented Oct 11, 2022

Installation details

Kernel Version: 5.15.0-1020-aws
Scylla version (or git commit hash): 5.2.0~dev-20221002.060dda8e00b7 with build-id b4e08d869feb2fecb04c6ea45eb8946cade10c70

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-ee8833a4-9 (34.228.71.135 | 10.12.10.33) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-8 (54.237.196.139 | 10.12.11.224) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-7 (3.91.44.54 | 10.12.9.54) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-6 (54.224.49.190 | 10.12.9.114) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-ee8833a4-5 (54.221.151.171 | 10.12.11.178) (shards: 14)
    ...

OS / Image: ami-014a8e66eac5a37c5 (aws: us-east-1)

Test: longevity-50gb-3days
Test id: ee8833a4-d087-47f6-9865-3db37c29f94b
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description

Test times out at 15:13, while waiting for stress threads to finish:

< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2022-10-10 15:13:36.217: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=11a70335-2bb5-44aa-973f-26a68ce1c51e, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 140, in test_custom_time
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.verify_stress_thread(cs_thread_pool=stress)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1906, in verify_stress_thread
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     results, errors = cs_thread_pool.verify_results()
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 281, in verify_results
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     for future in concurrent.futures.as_completed(self.results_futures, timeout=self.timeout):
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     waiter.event.wait(wait_timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/threading.py", line 600, in wait
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     signaled = self._cond.wait(timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/threading.py", line 324, in wait
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     gotit = waiter.acquire(True, timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 241, in critical_failure_handler
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     raise CriticalTestFailure("Critical Error has failed the test")  # pylint: disable=raise-missing-from
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > sdcm.tester.CriticalTestFailure: Critical Error has failed the test

While most of the main stress threads have finished, one still fails on the timeout:

43826232:< t:2022-10-10 15:03:53,951 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 259800 seconds!\n\nCommand: \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=JG6YDU4LBWMI2FJ2CBP7; cassandra-stress read cl=QUORUM duration=4320m -mode cql3 native   user=cassandra password=cassandra -rate threads=50 -pop seq=1..100000000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.12.9.39,10.12.11.81,10.12.8.12,10.12.9.74,10.12.11.178,10.12.9.114 -errors skip-unsupported-columns\
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-10-10 15:05:12.385: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=f5e9fa3c-51a6-43a9-8a9b-e485c37815ab during_nemesis=NoCorruptRepair duration=3d10h50m53s: node=Node longevity-tls-50gb-3d-master-loader-node-ee8833a4-2 [18.234.34.198 | 10.12.9.30] (seed: False)
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress read cl=QUORUM duration=4320m -mode cql3 native  -rate threads=50 -pop seq=1..100000000 -log interval=5
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 1: WARN  15:04:04,928 Error creating netty channel to /10.12.9.114:9042

Looking at the log, it seems like it keeps trying to contact a non-existent node, and that slows it down a bit.

  • Restore Monitor Stack command: $ hydra investigate show-monitor ee8833a4-d087-47f6-9865-3db37c29f94b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ee8833a4-d087-47f6-9865-3db37c29f94b

Logs:

Jenkins job URL

@KnifeyMoloko
Contributor

Reproduced in 5.0.5 as well, with a shorter longevity test.

Installation details

Kernel Version: 5.15.0-1018-gcp
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee

Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

  • longevity-10gb-3h-5-0-db-node-fb62010e-0-9 (34.75.168.233 | 10.142.0.33) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-8 (34.75.168.233 | 10.142.0.42) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-7 (34.75.168.233 | 10.142.0.157) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-6 (35.185.44.132 | 10.142.0.150) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-5 (104.196.222.105 | 10.142.0.148) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-4 (34.148.58.38 | 10.142.0.147) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-3 (34.73.145.132 | 10.142.0.139) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-2 (35.196.146.49 | 10.142.0.135) (shards: 16)
  • longevity-10gb-3h-5-0-db-node-fb62010e-0-1 (35.229.126.210 | 10.142.0.130) (shards: 16)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5935604957551590924 (gce: us-east1)

Test: longevity-10gb-3h-gce-test
Test id: fb62010e-329a-4d63-aeac-50e1c1e13341
Test name: scylla-5.0/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Issue description

The test timed out waiting for the c-s threads to finish. Looking in the loader logs, however, I see that the 2 longest c-s runs did finish:

Total GC memory           : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Total GC time             :    0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Avg GC time               :    NaN ms
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
StdDev GC time            :    0.0 ms
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Total operation time      : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
END

Yet it still attempted to connect to a node that was (successfully) decommissioned during the test:

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed


  • Restore Monitor Stack command: $ hydra investigate show-monitor fb62010e-329a-4d63-aeac-50e1c1e13341
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fb62010e-329a-4d63-aeac-50e1c1e13341

Logs:

Jenkins job URL

@fgelcer
Contributor

fgelcer commented Oct 13, 2022

IIRC we had a similar issue, where the problem was the driver or the log writing taking too long, so the stress command, although finished, was stuck writing to logs or something like that...
Could that be the case here, but now, because it is running inside a Docker container, we are having problems writing logs within the container?

@fruch
Contributor Author

fruch commented Oct 13, 2022

IIRC we had a similar issue, where the problem was the driver or the log writing taking too long, so the stress command, although finished, was stuck writing to logs or something like that...
Could that be the case here, but now, because it is running inside a Docker container, we are having problems writing logs within the container?

We are not using Docker for cassandra-stress, at least not yet.

@soyacz
Contributor

soyacz commented Oct 17, 2022

I think this error might be related to my 2 other commits:

#5074 - where, based on duration, we time out the stress thread. We add a 10-minute margin to the stress command timeout, and it might be too little (especially when the no-warmup flag is not set). But this could also be the result of the c-s hang we notice sometimes (that's why we added this fix).

The other thing is that we didn't see this error before because timeouts were not being caught/working properly - this was fixed in #5311.

I propose to extend this 10-minute margin by 5% of the duration (so for short c-s runs we don't wait too long); see the sketch below.
We could also add the no-warmup flag everywhere - but I don't know if that would be the right approach.
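
As a rough illustration of the proposed margin (a sketch with names of my own choosing, not the actual SCT change): keep the 10-minute floor, but scale the margin with the stress duration, so the 4320m commands in this issue get a few hours of slack instead of a flat 10 minutes.

from datetime import timedelta

def stress_timeout(duration: timedelta, margin_fraction: float = 0.05) -> timedelta:
    # flat 10-minute floor, or 5% of the stress duration, whichever is larger
    margin = max(timedelta(minutes=10), duration * margin_fraction)
    return duration + margin

# today's flat margin for the 4320m commands above gives 4320m + 10m = 259800s
# (the exact timeout seen in the first comment); a 5% margin would give ~3h36m of slack
print(stress_timeout(timedelta(minutes=4320)))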

@soyacz
Contributor

soyacz commented Oct 17, 2022

I verified the stress logs, and in the test from the first issue comment, in cassandra-stress-l1-c0-k1-30fb05d9-67fe-4988-8df3-997b0aa6cf22.log from loader-node-2, I can see it froze (in this log we can also see lines starting again after 11h, which I don't understand). In the other test I also found c-s logs that go silent after some time, leading to the timeout error.

So I don't think increasing the timeout will help.

We need to find out why these c-s threads freeze.
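
One cheap way to spot such a freeze earlier (a hypothetical helper, not something SCT has today) would be to watch the c-s log's modification time and flag a thread whose log has gone silent while it is still supposedly running:

import os
import time

def cs_log_looks_frozen(log_path: str, silence_threshold_s: int = 30 * 60) -> bool:
    # if the c-s log file hasn't been appended to for `silence_threshold_s`
    # while the stress thread is still "running", the process has likely hung
    return (time.time() - os.path.getmtime(log_path)) > silence_threshold_s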

e.g. see cassandra-stress-l0-c0-k1-07f02534-d226-4550-976a-2f795c9801bc.log from the test on master:

Installation details

Kernel Version: 5.15.0-1021-azure
Scylla version (or git commit hash): 5.2.0~dev-20221011.fcf0628bc5d0 with build-id 8f3895a59a9d0e6bcc3c0f8932dee690e20feff9

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-9 (20.172.146.207 | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-8 (4.246.166.158 | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-7 (52.188.208.105 | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-6 (20.124.241.150 | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-5 (20.163.207.223 | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-4 (20.185.18.243 | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-3 (20.127.113.247 | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-2 (20.185.18.203 | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-2eb4c2ab-eastus-1 (52.255.150.93 | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-dev-x86_64-2022-10-11T04-25-04Z (azure: eastus)

Test: longevity-10gb-3h-azure-test
Test id: 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Issue description

  • Restore Monitor Stack command: $ hydra investigate show-monitor 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825

Logs:

Jenkins job URL

@ilya-rarov
Contributor

ilya-rarov commented Nov 13, 2022

Got something similar in a 2022.2 job (although the test didn't time out).

Installation details

Kernel Version: 5.15.0-1022-aws
Scylla version (or git commit hash): 2022.2.0~rc4-20221106.f5714e0db12f with build-id f4a927b8a00fbcd8d48640835192aeaa7968b1f2
Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.2/relocatable/2022-11-06T15:44:05Z/scylla-enterprise-x86_64-package.tar.gz
Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-9 (35.178.200.195 | 10.3.1.68) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-8 (13.40.154.111 | 10.3.3.247) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-7 (18.130.108.114 | 10.3.0.41) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-6 (3.9.16.111 | 10.3.1.133) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-5 (13.40.151.248 | 10.3.1.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-4 (52.56.200.142 | 10.3.0.176) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-33 (13.40.115.39 | 10.3.0.101) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-32 (13.40.123.99 | 10.3.2.159) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-31 (3.9.115.191 | 10.3.0.118) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-30 (3.8.170.243 | 10.3.3.89) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3 (18.135.104.106 | 10.3.0.170) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-29 (13.40.101.236 | 10.3.0.29) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-28 (18.130.232.235 | 10.3.3.117) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27 (18.170.61.6 | 10.3.3.209) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-26 (18.134.198.186 | 10.3.1.5) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-25 (18.169.157.184 | 10.3.3.5) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-24 (35.176.158.13 | 10.3.1.214) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-23 (3.8.131.38 | 10.3.1.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-22 (3.9.177.44 | 10.3.0.93) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-21 (35.178.161.42 | 10.3.3.223) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-20 (35.176.103.236 | 10.3.0.99) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-2 (3.8.139.104 | 10.3.2.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-19 (18.133.175.26 | 10.3.2.156) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-18 (52.56.125.9 | 10.3.2.89) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-17 (35.178.199.29 | 10.3.1.188) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-16 (52.56.195.17 | 10.3.1.19) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-15 (18.135.105.217 | 10.3.3.172) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-14 (13.40.73.39 | 10.3.1.213) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-13 (3.8.143.101 | 10.3.0.34) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-12 (18.133.238.4 | 10.3.0.217) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-11 (18.132.9.223 | 10.3.0.175) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-10 (18.133.244.123 | 10.3.1.199) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-1 (52.56.171.91 | 10.3.1.49) (shards: 14)

OS / Image: ami-06afe72a2cd370dbb (aws: eu-west-2)

Test: longevity-50gb-3days-aws-i4i-test
Test id: a94220e9-cdb8-4806-9590-185b346bbad1
Test name: enterprise-2022.2/longevity/longevity-50gb-3days-aws-i4i-test
Test config file(s):

Issue description

At the beginning, two stress threads were started among many others:

  • on loader-node-3
2022-11-10 08:45:28.520: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=6e3c6801-ef57-4c16-b225-6feef198cdab: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False)
stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native  -rate threads=100 -pop seq=1..100000000 -log interval=5
  • on loader-node-1
2022-11-10 08:45:42.036: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=a34b8e2a-6b22-4646-bf4c-68770daf3ff3: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-1 [35.176.112.176 | 10.3.3.100] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=4320m -mode cql3 native -rate threads=20

The test run reached the end:

2022-11-13 08:57:02.971: (InfoEvent Severity.NORMAL) period_type=not-set event_id=62d39dc9-a9d2-4342-a90b-8d76bf5c2e9c: message=TEST_END

After that, we got a timeout error for the stress threads:

2022-11-13 08:57:28.879: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=74e00f6b-61c4-4c9f-801e-128a19ee2464, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 262, in test_custom_time
    self.verify_stress_thread(cs_thread_pool=stress)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1916, in verify_stress_thread
    results, errors = cs_thread_pool.verify_results()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 273, in verify_results
    for future in concurrent.futures.as_completed(self.results_futures, timeout=self.timeout):
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
    raise TimeoutError(
concurrent.futures._base.TimeoutError: 1 (of 3) futures unfinished
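
The "1 (of 3) futures unfinished" message is the standard concurrent.futures.as_completed timeout behaviour: it raises as soon as the timeout expires while any stress future is still running, no matter how many of the others already finished. A minimal standalone sketch of that failure mode (illustrative only, not SCT code):

import concurrent.futures
import time

def hung_stress():
    time.sleep(5)  # stands in for a c-s process that never returns

def finished_stress():
    return "OK"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(finished_stress), pool.submit(hung_stress)]
    try:
        for future in concurrent.futures.as_completed(futures, timeout=1):
            print(future.result())
    except concurrent.futures.TimeoutError as exc:
        print(exc)  # prints "1 (of 2) futures unfinished"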

All the other stress threads finished, but the ones mentioned above failed:

2022-11-13 08:57:31.945: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=a34b8e2a-6b22-4646-bf4c-68770daf3ff3 duration=3d0h11m49s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-1 [35.176.112.176 | 10.3.3.100] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=4320m -mode cql3 native -rate threads=20
errors:

Stress command completed with bad status 1: WARN  08:57:16,007 Error creating netty channel to /10.3.0.176:9042
com.datastax.shaded.netty.channe
2022-11-13 08:57:38.326: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=6e3c6801-ef57-4c16-b225-6feef198cdab duration=3d0h12m9s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False)
stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native  -rate threads=100 -pop seq=1..100000000 -log interval=5
errors:

Stress command completed with bad status 1: WARN  08:57:23,897 Error creating netty channel to /10.3.0.170:9042
com.datastax.shaded.netty.channe
  • Restore Monitor Stack command: $ hydra investigate show-monitor a94220e9-cdb8-4806-9590-185b346bbad1
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a94220e9-cdb8-4806-9590-185b346bbad1

Logs:

Jenkins job URL

@fgelcer
Contributor

fgelcer commented Nov 29, 2022

We are most likely seeing something similar in a rolling upgrade test:

Installation details

Kernel Version: 3.10.0-1160.76.1.el7.x86_64
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
Cluster size: 4 nodes (n1-highmem-8)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-db-node-edcca94b-0-4 (34.139.250.105 | 10.142.15.192) (shards: 8)
  • rolling-upgrade--centos-db-node-edcca94b-0-3 (35.243.160.90 | 10.142.0.189) (shards: 8)
  • rolling-upgrade--centos-db-node-edcca94b-0-2 (35.196.83.10 | 10.142.0.152) (shards: 8)
  • rolling-upgrade--centos-db-node-edcca94b-0-1 (34.74.49.187 | 10.142.0.141) (shards: 8)

OS / Image: https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/family/centos-7 (gce: us-east1)

Test: rolling-upgrade-centos7-test
Test id: edcca94b-e7eb-4751-ba67-513d9f6cc34a
Test name: scylla-5.1/rolling-upgrade/rolling-upgrade-centos7-test
Test config file(s):

Issue description

The test timed out during step 5 (out of 10) because these steps took some 30 minutes longer than they usually do (we run c-s and gemini stress multiple times in this test).
There are no errors or stress command failures, but things seem to be delayed.


  • Restore Monitor Stack command: $ hydra investigate show-monitor edcca94b-e7eb-4751-ba67-513d9f6cc34a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs edcca94b-e7eb-4751-ba67-513d9f6cc34a

Logs:

Jenkins job URL

@aleksbykov
Contributor

The same issue reproduced on Azure:

Issue description


Some cassandra-stress threads continue trying to connect to the removed node after c-s reported its results, which causes this error:

2023-11-12 20:18:50.601: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=3b624bf0-6e0b-4a6a-a174-131a8d2aabc7, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 166, in test_custom_time
self.verify_stress_thread(cs_thread_pool=stress)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2165, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
raise TimeoutError(
concurrent.futures._base.TimeoutError: 2 (of 2) futures unfinished

Impact

The test is marked as failed with an error or critical error.

How frequently does it reproduce?

The latest 3 runs on Azure.

Installation details

Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231112.8d618bbfc657 with build-id d2846e72c5ad6493e5eb93f1eb9227eca2761f92

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-5be183a7-eastus-8 (40.87.17.249 | 10.0.0.6) (shards: -1)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-7 (20.168.251.30 | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-6 (20.163.160.37 | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-5 (20.232.101.255 | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-4 (20.231.84.94 | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-3 (20.231.84.53 | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-2 (20.231.84.13 | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-5be183a7-eastus-1 (20.228.216.41 | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-12T14-55-14 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 5be183a7-473d-45d5-a1c5-fb8308c56f5c
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 5be183a7-473d-45d5-a1c5-fb8308c56f5c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5be183a7-473d-45d5-a1c5-fb8308c56f5c

Logs:

Jenkins job URL
Argus

@fruch
Contributor Author

fruch commented Nov 14, 2023

@aleksbykov

What are the loader logs showing?

I don't think it's related to this year-old issue.

@enaydanov
Contributor

@fruch @aleksbykov

I can agree with Israel. This error looks related to scylladb/java-driver#258:

The c-s failed to stop after the DecommissionStreamingErr nemesis.
