cqlsh fails with "Operation timed out for system.peers" (after nodetool refresh used)
#11016
Comments
Happening also this week, run mostly during

Installation details
Kernel Version: 5.15.0-1015-aws
Scylla Nodes used in this run:
OS / Image:
Test:
Logs: No logs captured during this run.
@fruch how do I get to the cluster node logs?
It's currently running, so they aren't collected yet.
Looking at the logs, amongst the thousands of messages there are many "reader_concurrency_semaphore timed out" messages on shard 12:
Until 2022-07-15T18:31:04+00:00 there are close to 1000 "reader_concurrency_semaphore timed out" messages on shard 12. @denesb please look into this.
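Counting these messages per shard is a quick way to confirm one shard is the outlier. A minimal sketch, assuming Scylla-style log lines with a `[shard N]` tag (the sample lines below are synthetic, invented for illustration):

```python
import re
from collections import Counter

# Synthetic sample; real node logs have this general shape, contents invented.
log = """\
INFO  2022-07-15 18:30:01 [shard 12] reader_concurrency_semaphore - semaphore timed out
INFO  2022-07-15 18:30:02 [shard 12] reader_concurrency_semaphore - semaphore timed out
INFO  2022-07-15 18:30:03 [shard 3] reader_concurrency_semaphore - semaphore timed out
"""

# Count "reader_concurrency_semaphore ... timed out" lines, keyed by shard.
per_shard = Counter(
    m.group(1)
    for line in log.splitlines()
    if "reader_concurrency_semaphore" in line and "timed out" in line
    for m in [re.search(r"\[shard (\d+)\]", line)]
    if m
)
print(per_shard.most_common())
```

Running the same counter over a full node log would show whether shard 12 dominates as described above.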
The actual diagnostics dump is missing from the logs. Wasn't this fixed already?
Where do you expect the fix to be?
This log is a multi-line one, and we had the problem in the past of SCT only copying the first line into its own logs. There was an issue (don't remember which) where this was discussed, and I thought it was fixed.
I hope the job also collects the node logs verbatim. |
The answer is no (I'll try to handle it in scylladb/scylla-cluster-tests#5026), but the faulty machine is up; here is one example from it:
I need a coredump from this node while it is producing these symptoms.
The test case ended, so it's a bit too late for that.
Ok. Next time you reproduce this, please kill scylla with SIGABRT and upload the coredump, so I can have a look at what the problematic shard is doing.
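The SIGABRT approach works because the default disposition for SIGABRT terminates the process and dumps core (subject to ulimit and core_pattern settings). A sketch of the mechanism on a dummy child process; for a real node you would signal the scylla PID instead:

```python
import signal
import subprocess
import sys

# Dummy long-running child standing in for the scylla process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(600)"])

# Equivalent of: kill -ABRT <pid>. The kernel writes a core file if
# core dumps are enabled (ulimit -c, /proc/sys/kernel/core_pattern).
child.send_signal(signal.SIGABRT)

# On POSIX, Popen.wait() returns the negative signal number for a
# signal-terminated process: -6 for SIGABRT.
ret = child.wait()
print(ret)
```

On a systemd host the resulting core is typically retrievable with `coredumpctl` for upload.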
@denesb why do we have multi-line messages in the log? |
Reproduced 3 times during longevity-twcs-48h

Installation details
Kernel Version: 5.15.0-1015-aws
Scylla Nodes used in this run:
OS / Image:
Test:

Issue description
During the test, 3 nemeses failed with the same ReadTimeout error when "describe keyspaces" was sent to node1 (private IP 10.0.0.17):

Scylla-bench commands running at this time:

Logs:
I know it's not ideal, but this report is very long, and if I made it single-line it would be completely unreadable for humans.
We could print it as a JSON string that could easily be pretty-printed later with a JSON pretty-printer.
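That trade-off is cheap to recover from on the reading side: any JSON pretty-printer turns the single-line record back into a readable form on demand. A minimal sketch, with invented field names standing in for the real diagnostics dump:

```python
import json

# Hypothetical single-line diagnostics record, as it might appear in the log
# (the field names here are invented for illustration):
line = '{"shard": 12, "semaphore": "user", "waiters": 978, "permits": [{"count": 1, "memory": 16384}]}'

# Round-trip through the standard json module to get the human-readable form.
pretty = json.dumps(json.loads(line), indent=2)
print(pretty)
```

The same result is available from the shell via `python -m json.tool`, so no special tooling is needed beyond a Python installation.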
This also makes the log unreadable for humans without the tool at hand. The question is: who do we want to make the logs easy to read for? Humans or scripts?
I think that in most cases the target audience for this particular error is scylla engineering rather than users, since it involves an internal data structure that's not exposed in the user interface. Therefore I'd expect scylla staff dealing with it to have the tool.
This is far from being the only multi-line log message. Are we requiring all of them to be converted to single-line JSON? Note that multi-line logs are especially common in engineer-oriented logs, typically where more details are required.
Maybe we should add an option to scylla, to be set in automated runs, that prints those logs on a single line.
I don't mind that, but again, how far are we willing to go to avoid parsing multi-line logs? From where I'm sitting this doesn't seem like a hard thing to do: logs have a very well-defined start sequence, which is easy to find, and it can thus be used to split the log stream into individual multi-line logs.
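The splitting approach described above can be sketched in a few lines. This assumes a Scylla-style record-start sequence (severity followed by a timestamp); the exact pattern is an assumption, not the actual SCT implementation:

```python
import re

# Assumed record-start pattern for Scylla-style log lines, e.g.
#   "INFO  2022-07-15 18:31:04,123 [shard 12] reader_concurrency_semaphore - ..."
RECORD_START = re.compile(r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\d{4}-\d{2}-\d{2} ")

def split_records(lines):
    """Group a raw log stream into complete (possibly multi-line) records.

    A new record begins at every line matching RECORD_START; any line that
    does not match is treated as a continuation of the previous record.
    """
    record = []
    for line in lines:
        if RECORD_START.match(line) and record:
            yield "\n".join(record)
            record = []
        record.append(line.rstrip("\n"))
    if record:
        yield "\n".join(record)
```

With this, a multi-line diagnostics dump travels as one record, so a consumer like SCT would copy all of its lines rather than just the first.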
Is this a duplicate of #10405?
@xemul ^^ |
Do we have a metric showing how many IOs a cache-missing request needs?
The plan is to generalize CPU and IO classes so that the default IO class would just naturally disappear.
Yup:

    // This is small enough, and well-defined. Easier to just read it all at once
    future<> sstable::read_toc() noexcept {
        ...
        return with_file(new_sstable_component_file(_read_error_handler, component_type::TOC, open_flags::ro), [this] (file f) {
            auto bufptr = allocate_aligned_buffer<char>(4096, 4096);
            auto buf = bufptr.get();
            auto fut = f.dma_read(0, buf, 4096); // <<<<< uses default IO class
Not the case here
Erm... it doesn't look like loading at all, in fact:
on all nodes
Ah, so probably streaming happens because some node(s) are being decommissioned.
So node 10.4.0.178 first got decommissioned,
then immediately joined back,
and repair kicked in,
then removenode happened,
then repair again.
So most of the streaming badness happens here
Happened again during a TRUNCATE command; it failed after 4m, even though the timeout is 600s.
Installation details
Kernel Version: 5.15.0-1031-aws
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
@xemul where does this stand?
/Cc @fruch ^^ |
There's no specific reproducer or case for this one; it will wait for the next release, when we'll run some of those cases again to see if this is still happening.
We can start with master. |
@fruch, we are building 2023.1.1, so perhaps we should re-run the test that hit it in the first place.
This has been happening in multiple cases over the last year, so why try reproducing it on 2023.1.1 and not on any other release? Anyhow, we don't have any specific case that reproduces it clearly.
I'm closing this for the time being. If it reproduces - please re-open. |
Installation details
Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.1.dev-20220706.a0ffbf3291b7 with build-id 3490fa9f14da510e97a1d0f53f693cac13a70494
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-07d73e5ea1fc772eb (aws: eu-west-1)
Test: longevity-50gb-3days
Test id: fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description
During disrupt_nodetool_refresh, when the test tries to verify that the snapshot was refreshed correctly, the cqlsh command times out on system.peers.

$ hydra investigate show-monitor fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
$ hydra investigate show-logs fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a

Logs: No logs captured during this run. The test is still running.
http://34.246.190.165:3000/d/ks-master/longevity-50gb-3days-scylla-per-server-metrics-nemesis-master?orgId=1&from=now-6h&to=now
Jenkins job URL
it's ~2022-07-11 12:02:19