
cqlsh fails with "Operation timed out for system.peers" (after nodetool refresh used) #11016

Closed
fruch opened this issue Jul 11, 2022 · 68 comments
Labels: bug, P2 High Priority, status/pending qa reproduction (Pending for QA team to reproduce the issue)


fruch commented Jul 11, 2022

Installation details

Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.1.dev-20220706.a0ffbf3291b7 with build-id 3490fa9f14da510e97a1d0f53f693cac13a70494
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-9 (63.35.163.81 | 10.0.3.187) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-8 (52.31.41.157 | 10.0.0.95) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-7 (34.244.145.19 | 10.0.0.195) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-6 (34.241.120.110 | 10.0.0.111) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-5 (34.244.16.3 | 10.0.1.141) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-4 (34.253.116.115 | 10.0.0.12) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-3 (3.251.71.251 | 10.0.1.19) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-2 (52.16.243.180 | 10.0.2.34) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-11 (3.251.65.93 | 10.0.2.8) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-10 (34.243.184.173 | 10.0.0.66) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-1 (63.33.208.165 | 10.0.1.209) (shards: 14)

OS / Image: ami-07d73e5ea1fc772eb (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description

During disrupt_nodetool_refresh, when the test tries to verify that the snapshot was refreshed correctly,
the cqlsh command times out on system.peers.

Command: 'cqlsh --no-color -u cassandra -p \'cassandra\' --request-timeout=120 --connect-timeout=60 --ssl -e "SELECT * FROM keyspace1.standard1 WHERE key=0x32373131364f334f3830" 10.0.0.95 9042'

Exit code: 1

Stdout:



Stderr:

Connection error: ('Unable to connect to any servers', {'10.0.0.95': ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out for system.peers - received only 0 responses from 1 CL=ONE." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',)})

  • Restore Monitor Stack command: $ hydra investigate show-monitor fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a

Logs:

No logs captured during this run; the test is still running.
http://34.246.190.165:3000/d/ks-master/longevity-50gb-3days-scylla-per-server-metrics-nemesis-master?orgId=1&from=now-6h&to=now

Jenkins job URL

It happened at ~2022-07-11 12:02:19:
image
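For context, cqlsh connects through the Python driver, and the driver reads system.local and system.peers while building cluster metadata on connect; that is why a replica-side timeout on these tables surfaces as "Unable to connect to any servers". A minimal sketch that exercises the same read directly (assumptions: the python cassandra-driver package, the node IP from the report, the same credentials as in the test command; SSL setup omitted):

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Credentials/IP as in the failing test command; SSL options omitted for brevity.
auth = PlainTextAuthProvider(username="cassandra", password="cassandra")
cluster = Cluster(contact_points=["10.0.0.95"], port=9042, auth_provider=auth)
session = cluster.connect()  # the driver refreshes metadata from system.local/system.peers here

# The same read the coordinator failed to answer within the timeout:
for row in session.execute("SELECT peer, rpc_address, data_center FROM system.peers"):
    print(row.peer, row.rpc_address, row.data_center)
cluster.shutdown()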

@fruch fruch added the triage/master Looking for assignee label Jul 11, 2022
@DoronArazii DoronArazii added bug and removed triage/master Looking for assignee labels Jul 13, 2022
@DoronArazii DoronArazii added this to the 5.1 milestone Jul 14, 2022

fruch commented Jul 17, 2022

This also happened in this week's run, mostly during cqlsh "describe keyspaces" on node-1.

Command: 'cqlsh --no-color -u cassandra -p \'cassandra\'  --request-timeout=120 --connect-timeout=60 --ssl -e "describe keyspaces" 10.0.1.31 9042'

Exit code: 1

Stdout:



Stderr:

Connection error: ('Unable to connect to any servers', {'10.0.1.31': ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out for system.peers - received only 0 responses from 1 CL=ONE." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',)})

It seems that since disrupt_remove_node_then_add_node added a new node at ~7:00, node-1 has been having lots of read failures
for hours, until ~16:30.

image

http://34.244.142.156:3000/d/overview-master/longevity-50gb-3days-scylla-per-server-metrics-nemesis-master?orgId=1&from=1657863680494&to=1657903953915

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.dev-20220714.98aa3ec99b9b with build-id e27b38549f89da2f016567fc243c8404a6460ebc
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-9 (34.244.142.219 | 10.0.1.13) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-8 (34.252.229.242 | 10.0.3.40) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-7 (3.250.124.209 | 10.0.1.101) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-6 (34.244.161.163 | 10.0.0.77) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-5 (54.75.84.148 | 10.0.2.140) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-4 (34.244.80.175 | 10.0.1.123) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-3 (3.250.10.77 | 10.0.3.184) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-2 (34.243.250.187 | 10.0.3.232) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-1 (34.246.186.17 | 10.0.1.31) (shards: 14)

OS / Image: ami-08cbc2649e53039f8 (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: 8dce4a7e-e326-4373-bd48-e772fbfbe457
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Logs:

No logs captured during this run.

Jenkins job URL


bhalevy commented Jul 17, 2022

The read errors on 10.0.1.31 are all on shard 12.

image


bhalevy commented Jul 17, 2022

@fruch how do I get to the cluster node logs?


fruch commented Jul 17, 2022

@fruch how do I get to the cluster node logs?

It's currently running, so they aren't collected yet.
I'll send you the details on how to connect.


bhalevy commented Jul 17, 2022

Looking at
longevity-tls-50gb-3d-master-db-node-8dce4a7e-1.log.gz

Amongst the thousands of "Commitlog shutdown complete" messages (Cc @elcallio: why do we have so many of them? Does this indicate a problem?)

There are many reader_concurrency_semaphore timed out messages on shard 12:

2022-07-15T06:58:07+00:00 longevity-tls-50gb-3d-master-db-node-8dce4a7e-1     !INFO | scylla[7943]:  [shard 12] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 4/100 count and 68358/170959831 memory resources: timed out, dumping permit diagnostics:
...
2022-07-15T06:58:36+00:00 longevity-tls-50gb-3d-master-db-node-8dce4a7e-1     !INFO | scylla[7943]:  [shard 12] reader_concurrency_semaphore - (rate limiting dropped 16984 similar messages) Semaphore _read_concurrency_sem with 3/100 count and 53000/170959831 memory resources: timed out, dumping permit diagnostics:

Until 2022-07-15T18:31:04+00:00 there are close to 1000 reader_concurrency_semaphore timed-out messages on shard 12.
After that, other shards start popping up.
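For reference, a shell sketch for counting those timeouts per shard (the log file name is taken from the comment above; adjust the path as needed):

zcat longevity-tls-50gb-3d-master-db-node-8dce4a7e-1.log.gz \
  | grep reader_concurrency_semaphore | grep 'timed out' \
  | grep -oE 'shard +[0-9]+' | sort | uniq -c | sort -rn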

@denesb please look into this.


denesb commented Jul 18, 2022

The actual diagnostics dump is missing from the logs. Wasn't this fixed already?


fruch commented Jul 18, 2022

The actual diagnostics dump is missing from the logs. Wasn't this fixed already?

Where do you expect the fix to be?


denesb commented Jul 18, 2022

This log message is a multi-line one, and in the past we had the problem of SCT only copying the first line into its own logs. There was an issue (I don't remember which) where this was discussed, and I thought it was fixed.


bhalevy commented Jul 18, 2022

I hope the job also collects the node logs verbatim.


fruch commented Jul 18, 2022

I hope the job also collects the node logs verbatim.

The answer is no (I'll try to handle it in scylladb/scylla-cluster-tests#5026).

But the faulty machine is still up; here is one example from it:

Jul 15 16:17:56 longevity-tls-50gb-3d-master-db-node-8dce4a7e-1 scylla[46538]:  [shard 12] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 5/100 count and 105314/>
  permits  count  memory  table/description/state
  3        3      66K     keyspace1.standard1/data-query/active/blocked
  1        1      19K     keyspace1.standard1/data-query/active/used
  1        1      16K     mview.users_by_first_name/data-query/active/used
  2        0      2K      mview.users/push-view-updates-2/active/unused
  2        0      0B      mview.users_by_last_name/mutation-query/waiting
  2        0      0B      mview.users_by_first_name/mutation-query/waiting
  48       0      0B      mview.users_by_last_name/data-query/waiting
  23       0      0B      mview.users/mutation-query/waiting
  49       0      0B      mview.users/data-query/waiting
  1551     0      0B      keyspace1.standard1/data-query/waiting
  46       0      0B      mview.users_by_first_name/data-query/waiting

  1728     5      103K    total

  Total: 1728 permits with 5 count and 103K memory resources


denesb commented Jul 18, 2022

I need a coredump from this node while it is producing these symptoms.


fruch commented Jul 18, 2022

I need a coredump from this node while it is producing these symptoms.

The test case ended, so it's a bit too late for that.


denesb commented Jul 18, 2022

Ok. Next time you reproduce this, please kill scylla with SIGABRT and upload the coredump, so I can have a look at what the problematic shard is doing.
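For reference, something along these lines should work (a sketch, assuming systemd-coredump is configured on the node; exact invocations may differ per image):

sudo pkill -ABRT scylla                  # make the running scylla process abort and dump core
coredumpctl list scylla                  # locate the newly generated core
coredumpctl dump scylla -o scylla.core   # extract it to a file for upload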

@roydahan

@denesb why do we have multi-line messages in the log?
It's not ideal to work with.

@juliayakovlev

Reproduced 3 times during longevity-twcs-48h 5.1.dev-20220719.274691f45e3d

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.dev-20220719.274691f45e3d with build-id 8bedc6db4fe1f245a70c67d2d2135b056a361a78
Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-360458dd-9 (54.154.212.57 | 10.0.1.137) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-8 (54.77.206.42 | 10.0.0.8) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-7 (52.210.48.154 | 10.0.0.96) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-6 (54.216.80.92 | 10.0.2.251) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-5 (52.18.227.181 | 10.0.0.156) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-4 (54.77.222.231 | 10.0.2.198) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-3 (3.250.195.217 | 10.0.0.227) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-2 (3.250.177.61 | 10.0.3.20) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-12 (18.202.58.144 | 10.0.3.64) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-11 (34.248.144.163 | 10.0.2.101) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-10 (34.254.225.234 | 10.0.1.152) (shards: 7)
  • longevity-twcs-48h-master-db-node-360458dd-1 (52.18.27.2 | 10.0.0.17) (shards: 7)

OS / Image: ami-02742dcbeb0063751 (aws: eu-west-1)

Test: longevity-twcs-48h-test
Test id: 360458dd-1013-44a9-b93f-0f3e25f6d73f
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):

Issue description

During the test, 3 nemeses failed with the same ReadTimeout error when sending "describe keyspaces" to node1 (private IP 10.0.0.17):

< t:2022-07-23 01:54:37,902 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-07-23 01:54:37.900: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=b79a5f9b-7031-4cdf-9fa4-e4d35b184f6a: nemesis_name=DeleteByRowsRange target_node=Node longevity-twcs-48h-master-db-node-360458dd-5 [52.18.227.181 | 10.0.0.156] (seed: False)
< t:2022-07-23 01:54:38,633 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042"...
< t:2022-07-23 01:54:49,137 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042"; Exit status: 1

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3614, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2025, in disrupt_delete_by_rows_range
    self.verify_initial_inputs_for_delete_nemesis()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1859, in verify_initial_inputs_for_delete_nemesis
    test_keyspaces = self.cluster.get_test_keyspaces()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4027, in get_test_keyspaces
    keyspaces = self.nodes[0].run_cqlsh("describe keyspaces").stdout.split()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2804, in run_cqlsh
    cqlsh_out = self.remoter.run(cmd, timeout=timeout + 30,  # we give 30 seconds to cqlsh timeout mechanism to work
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042'

Exit code: 1

Stdout:



Stderr:

Connection error: ('Unable to connect to any servers', {'10.0.0.17': ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out for system.peers - received only 0 responses from 1 CL=ONE." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',)})
< t:2022-07-23 02:27:47,247 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-07-23 02:27:47.245: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=3a413c58-870e-4256-b54b-5d0eeacfebbd: nemesis_name=NoCorruptRepair target_node=Node longevity-twcs-48h-master-db-node-360458dd-6 [54.216.80.92 | 10.0.2.251] (seed: False)
< t:2022-07-23 02:34:09,469 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042"...
< t:2022-07-23 02:34:21,474 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042"; Exit status: 1

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3614, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1393, in disrupt_no_corrupt_repair
    self._prepare_test_table(ks=f'drop_table_during_repair_ks_{i}', table='standard1')
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1649, in _prepare_test_table
    test_keyspaces = self.cluster.get_test_keyspaces()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4027, in get_test_keyspaces
    keyspaces = self.nodes[0].run_cqlsh("describe keyspaces").stdout.split()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2804, in run_cqlsh
    cqlsh_out = self.remoter.run(cmd, timeout=timeout + 30,  # we give 30 seconds to cqlsh timeout mechanism to work
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042'

Exit code: 1

Stdout:



Stderr:

Connection error: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for system.local - received only 0 responses from 1 CL=ONE." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
< t:2022-07-23 03:53:46,125 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-07-23 03:53:46.123: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=d4a96399-abf7-40df-ae84-ed1b2673105f: nemesis_name=Truncate target_node=Node longevity-twcs-48h-master-db-node-360458dd-6 [54.216.80.92 | 10.0.2.251] (seed: False)
< t:2022-07-23 03:57:28,783 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042"...
< t:2022-07-23 03:57:45,289 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042"; Exit status: 1

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3614, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1662, in disrupt_truncate
    self._prepare_test_table(ks=keyspace_truncate)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1649, in _prepare_test_table
    test_keyspaces = self.cluster.get_test_keyspaces()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4027, in get_test_keyspaces
    keyspaces = self.nodes[0].run_cqlsh("describe keyspaces").stdout.split()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2804, in run_cqlsh
    cqlsh_out = self.remoter.run(cmd, timeout=timeout + 30,  # we give 30 seconds to cqlsh timeout mechanism to work
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.0.0.17 9042'

Exit code: 1

Stdout:



Stderr:

Connection error: ('Unable to connect to any servers', {'10.0.0.17': ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out for system_schema.keyspaces - received only 0 responses from 1 CL=ONE." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',)})

Node1 is not overloaded:
Screenshot from 2022-07-24 16-31-56

The following scylla-bench commands were running at this time:

2022-07-24 03:41:46.731: (ScyllaBenchEvent Severity.NORMAL) period_type=end event_id=492a4182-6a07-4651-a6b5-25a2ec5067e1 duration=2d0h0m2s: node=Node longevity-twcs-48h-master-loader-node-360458dd-1 [54.154.112.100 | 10.0.2.244] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=1658461302364681340 -connection-count 100 -max-rate 50000 --timeout 120s -duration=2880m -error-at-row-limit 1000

2022-07-24 03:43:04.974: (ScyllaBenchEvent Severity.NORMAL) period_type=end event_id=65040e53-bdb6-4650-bed3-eb2a7a0d2534 duration=2d0h0m0s: node=Node longevity-twcs-48h-master-loader-node-360458dd-2 [52.208.136.220 | 10.0.1.124] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=20000 -concurrency=100 -replication-factor=3 -clustering-row-count=10000000 -clustering-row-size=200  -rows-per-request=100 -start-timestamp=1658461302364681340 -write-rate 125 -distribution hnormal --connection-count 100 -duration=2880m -error-at-row-limit 1000

2022-07-24 03:44:30.772: (ScyllaBenchEvent Severity.NORMAL) period_type=end event_id=206d78d9-5680-4b93-a856-f01ba5a139b1 duration=2d0h0m0s: node=Node longevity-twcs-48h-master-loader-node-360458dd-3 [34.244.200.85 | 10.0.0.159] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=20000 -concurrency=100 -replication-factor=3 -clustering-row-count=10000000 -clustering-row-size=200  -rows-per-request=100 -start-timestamp=1658461302364681340 -write-rate 125 -distribution uniform --connection-count 100 -duration=2880m -error-at-row-limit 1000
  • Restore Monitor Stack command: $ hydra investigate show-monitor 360458dd-1013-44a9-b93f-0f3e25f6d73f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 360458dd-1013-44a9-b93f-0f3e25f6d73f

Logs:

Jenkins job URL


denesb commented Jul 25, 2022

@denesb why do we have multi-line messages in the log? It's not something ideal to work with.

I know it's not ideal, but this report is very long, and if I made it single-line it would be completely unreadable for humans.


bhalevy commented Jul 25, 2022

@denesb why do we have multi-line messages in the log? It's not something ideal to work with.

I know it's not ideal, but this report is very long, and if I made it single-line it would be completely unreadable for humans.

We could print it as a JSON string that could easily be pretty-printed later with a JSON pretty-printer.


denesb commented Jul 25, 2022

@denesb why do we have multi-line messages in the log? It's not something ideal to work with.

I know it's not ideal, but this report is very long, and if I made it single-line it would be completely unreadable for humans.

We could print it as a JSON string that could easily be pretty-printed later with a JSON pretty-printer.

This also makes the log unreadable for humans without the tool at hand. The question is: who do we want to make the logs easy to read for? Humans or scripts?


bhalevy commented Jul 25, 2022

@denesb why do we have multi-line messages in the log? It's not something ideal to work with.

I know it's not ideal, but this report is very long, and if I made it single-line it would be completely unreadable for humans.

We could print it as a JSON string that could easily be pretty-printed later with a JSON pretty-printer.

This also makes the log unreadable for humans without the tool at hand. The question is: who do we want to make the logs easy to read for? Humans or scripts?

I think that in most cases this particular error's target audience is Scylla engineering rather than users, since it involves an internal data structure that's not exposed in the user interface.

Therefore I'd expect scylla staff dealing with it to have the tool.
It's commonly available (e.g. perl-JSON-PP-4.06-2.fc34.noarch).
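For example, once the dump is emitted as a single-line JSON payload, pretty-printing it is a one-liner with json_pp from that package (the payload below is a hypothetical stand-in, not the real diagnostics format):

echo '{"semaphore":"_read_concurrency_sem","shard":12,"permits":[{"table":"keyspace1.standard1","state":"waiting","count":1551}]}' | json_pp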


denesb commented Jul 25, 2022

This is far from being the only multi-line log message. Are we requiring all of them to be converted to single-line JSON? Note that multi-line logs are especially common in engineer-oriented logs, typically where more detail is required.
The other question is: is this additional step for interpreting logs worth it just so log-parsing scripts can be a little simpler?
It is not that hard to make log parsing work with multi-line logs.


bhalevy commented Jul 25, 2022

Maybe we should add an option to Scylla, to be set in automated runs, that prints those logs on a single line.


denesb commented Jul 25, 2022

I don't mind that, but again, how far are we willing to go to avoid having to parse multi-line logs? From where I'm sitting this doesn't seem like a hard thing to do. Log records have a very well-defined start sequence, which is easy to find, so it can be used to split the log stream into individual multi-line records.
Maybe there is a genuine reason I don't know about why this is not as simple as I imagine. If there is indeed, I don't mind doing some extra work to avoid multi-line logs.
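For reference, a rough sketch of that sequencing approach (assumptions: journald/syslog-style lines where every record starts with a timestamp prefix; the regex and file name are illustrative, not SCT's actual code):

import re

# A new record starts with a syslog-style timestamp, e.g. "Jul 15 16:17:56 host scylla[...]".
RECORD_START = re.compile(r'^[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2} ')

def records(lines):
    current = []
    for line in lines:
        if RECORD_START.match(line) and current:
            yield ''.join(current)
            current = []
        current.append(line)
    if current:
        yield ''.join(current)

with open('system.log') as f:
    for rec in records(f):
        if 'reader_concurrency_semaphore' in rec and 'timed out' in rec:
            print(rec)  # the whole multi-line diagnostics dump, not just the first line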


bhalevy commented Jul 26, 2022

Is this a duplicate of #10405?

@DoronArazii

@xemul ^^


xemul commented Mar 2, 2023

  1. (not necessarily the root cause) In the load_and_stream case the CPU sched group used is not the streaming one


xemul commented Mar 2, 2023

image

1500 requests in the query queue is way too much.


xemul commented Mar 2, 2023

Also note the default IO class:

image

Minutes of delay are no wonder -- the class has a shares value of 1 (one, literally). However, some sstable access still happens in this class, something like TOC reading on open or similar.
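To illustrate the effect (a sketch; only the default class's single share is taken from the dashboard above, the other share values are made up), a shares-based scheduler gives each class roughly shares/total of the bandwidth when everyone is busy:

# Hypothetical share values; only default=1 comes from the screenshot above.
shares = {'default': 1, 'statement': 1000, 'compaction': 1000, 'streaming': 200}
total = sum(shares.values())
for name, s in shares.items():
    print(f"{name}: ~{100 * s / total:.2f}% of disk bandwidth when all classes are busy")
# default ends up with ~0.05%, so even a tiny TOC read queued there can wait a very long time.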


xemul commented Mar 2, 2023

image
image
image

High CPU starvation times. The CPU cannot keep up with compaction and flushing. The statement group fares much better, but it also doesn't get what it needs.


xemul commented Mar 2, 2023

Do we have a metric showing how many IOs a cache-missing request needs?


xemul commented Mar 2, 2023

Also note the default IO class ... some sstable access still happens in this class, something like TOC reading on open or similar.

The plan is to generalize CPU and IO classes so that the default IO class would just naturally disappear.


xemul commented Mar 2, 2023

in the [default] class some sstable access still happens, something like TOC reading on open or similar.

Yup:

// This is small enough, and well-defined. Easier to just read it all at once
future<> sstable::read_toc() noexcept {
    ...
    return with_file(new_sstable_component_file(_read_error_handler, component_type::TOC, open_flags::ro), [this] (file f) {
        auto bufptr = allocate_aligned_buffer<char>(4096, 4096);
        auto buf = bufptr.get();
        auto fut = f.dma_read(0, buf, 4096); // <<<<< uses default IO class


xemul commented Mar 2, 2023

  1. (not necessarily the root cause) In the load_and_stream case the CPU sched group used is not the streaming one

Not the case here

$ fgrep 'Loading new' */system.log | cut -f18 -d' '
load_and_stream=false,
load_and_stream=false,
load_and_stream=false,
load_and_stream=false,
load_and_stream=false,


xemul commented Mar 2, 2023

Erm... this doesn't look like loading at all, in fact:

$ fgrep -i 'loading new' */system.log | cut -f3,4,11,12,13,21 -d' '
10:53:43 longevity-cdc-3d-400gb-master-db-node-43d44169-1 - Loading new
10:53:44 longevity-cdc-3d-400gb-master-db-node-43d44169-1 - Done loading status=succeeded
10:53:56 longevity-cdc-3d-400gb-master-db-node-43d44169-3 - Loading new
10:53:56 longevity-cdc-3d-400gb-master-db-node-43d44169-3 - Done loading status=succeeded
10:54:21 longevity-cdc-3d-400gb-master-db-node-43d44169-5 - Loading new
10:54:22 longevity-cdc-3d-400gb-master-db-node-43d44169-5 - Done loading status=succeeded
10:54:31 longevity-cdc-3d-400gb-master-db-node-43d44169-7 - Loading new
10:54:32 longevity-cdc-3d-400gb-master-db-node-43d44169-7 - Done loading status=succeeded
10:54:42 longevity-cdc-3d-400gb-master-db-node-43d44169-9 - Loading new
10:54:43 longevity-cdc-3d-400gb-master-db-node-43d44169-9 - Done loading status=succeeded

On all nodes load_new_sstables finishes in less than a minute.


xemul commented Mar 2, 2023

Ah, so the streaming probably happens because some node(s) were being decommissioned.


xemul commented Mar 2, 2023

So node 10.4.0.178 first got decommissioned

11:06:31 node-1 scylla[115526]:  [shard  0] storage_service - decommission[9858669e-686e-4520-acac-beef650cfe5b]: Added node=10.4.0.178 as leaving node, coordinator=10.4.0.178
11:06:33 node-1 scylla[115526]:  [shard  0] storage_service - decommission[9858669e-686e-4520-acac-beef650cfe5b]: Removed node=10.4.0.178 as leaving node, coordinator=10.4.0.178
11:06:37 node-1 scylla[115526]:  [shard 10] rpc - client 10.4.0.178:7000: ignoring error response: abort requested
11:06:38 node-1 scylla[115526]:  [shard 10] rpc - client 10.4.0.178:7000: ignoring error response: abort requested
11:06:39 node-1 scylla[115526]:  [shard 10] rpc - client 10.4.0.178:7000: ignoring error response: abort requested
11:06:41 node-1 scylla[115526]:  [shard  0] gossip - Got shutdown message from 10.4.0.178, received_generation=1670659740, local_generation=1670659740
11:06:41 node-1 scylla[115526]:  [shard  0] gossip - InetAddress 10.4.0.178 is now DOWN, status = shutdown

then immediately joins back

11:06:41 node-1 scylla[115526]:  [shard  0] gossip - Skip marking node 10.4.0.178 with status = shutdown as UP
11:09:12 node-1 scylla[115526]:  [shard  0] storage_service - Node 10.4.0.178 state jump to normal
11:09:12 node-1 scylla[115526]:  [shard  0] gossip - Fail to send EchoMessage to 10.4.0.178: seastar::rpc::closed_error (connection is closed)
11:09:13 node-1 scylla[115526]:  [shard  0] gossip - InetAddress 10.4.0.178 is now UP, status = NORMAL
11:09:31 node-1 scylla[115526]:  [shard  0] storage_service - Node 10.4.0.178 state jump to normal

and repair kicks in

11:09:59 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #2ff21c23-787b-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-system_traces-index-0 with peers=10.4.0.178, slave
11:09:59 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #2ff21c23-787b-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-system_traces-index-0 succeeded, peers={10.4.0.178}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
...
16:55:57 node-1 scylla[115526]:  [shard  0] repair - repair[9ee191d8-afd6-47b2-aa94-a4cf20ca6fdb]: Skipped sending repair_flush_hints_batchlog to nodes=[10.4.0.178, 10.4.0.193, 10.4.2.92, 10.4.2.198, 10.4.2.61, 10.4.0.90]
16:55:57 node-1 scylla[115526]:  [shard  0] repair - repair[9ee191d8-afd6-47b2-aa94-a4cf20ca6fdb]: Repair 13 out of 2364 ranges, shard=0, keyspace=system_auth, table={role_members, roles, role_attributes}, range=(-8965166271892268491, -8>
...
16:56:31 node-1 scylla[115526]:  [shard  7] repair - repair[9ee191d8-afd6-47b2-aa94-a4cf20ca6fdb]: shard 0 stats: repair_reason=repair, keyspace=system_auth, tables={role_members, roles, role_attributes}, ranges_nr=788, round_nr=2364, ro>

then removenode happens

17:32:42 node-1 scylla[115526]:  [shard  0] range_streamer - Removenode with [10.4.2.61, 10.4.2.198, 10.4.0.178, 10.4.0.193] for keyspace=system_distributed started, nodes_to_stream=4
...
19:56:55 node-1 scylla[115526]:  [shard  0] range_streamer - Removenode with 10.4.0.178 for keyspace=cdc_test succeeded, took 8649.633 seconds

then repair again

16:46:21 node-1 scylla[115526]:  [shard  0] repair - repair[435f5247-0b64-4ebc-91d9-b9fae548fca6]: Skipped sending repair_flush_hints_batchlog to nodes=[10.4.0.178, 10.4.0.193, 10.4.3.254, 10.4.2.198, 10.4.2.61, 10.4.0.90]

So most of the streaming badness happens here

11:10:04 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #326c1cd2-787b-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-0 with peers=10.4.0.178, slave
11:33:56 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #326c1cd2-787b-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-0 succeeded, peers={10.4.0.178}, tx=9826269 KiB, 6858.35 KiB/s, rx=0 KiB, 0.00 KiB/s
11:33:56 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #8867f480-787e-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-1 with peers=10.4.0.178, slave
12:06:33 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #8867f480-787e-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-1 succeeded, peers={10.4.0.178}, tx=6748884 KiB, 3449.80 KiB/s, rx=0 KiB, 0.00 KiB/s
12:06:33 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #1675ef30-7883-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-2 with peers=10.4.0.178, slave
12:32:14 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #1675ef30-7883-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-2 succeeded, peers={10.4.0.178}, tx=12576249 KiB, 8159.79 KiB/s, rx=0 KiB, 0.00 KiB/s
12:32:14 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #ad1dde40-7886-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-3 with peers=10.4.0.178, slave
13:02:55 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #f6683d80-788a-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-4 with peers=10.4.0.178, slave
13:02:55 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #ad1dde40-7886-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-3 succeeded, peers={10.4.0.178}, tx=4443914 KiB, 2413.93 KiB/s, rx=0 KiB, 0.00 KiB/s
13:25:37 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #f6683d80-788a-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-4 succeeded, peers={10.4.0.178}, tx=10475443 KiB, 7689.65 KiB/s, rx=0 KiB, 0.00 KiB/s
13:25:37 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #2263ac00-788e-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-5 with peers=10.4.0.178, slave
13:50:27 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #9aa8acd0-7891-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-6 with peers=10.4.0.178, slave
13:50:27 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #2263ac00-788e-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-5 succeeded, peers={10.4.0.178}, tx=10061054 KiB, 6751.17 KiB/s, rx=0 KiB, 0.00 KiB/s
14:12:49 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #9aa8acd0-7891-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-6 succeeded, peers={10.4.0.178}, tx=9683191 KiB, 7215.92 KiB/s, rx=0 KiB, 0.00 KiB/s
14:12:49 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #ba81e000-7894-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-7 with peers=10.4.0.178, slave
14:36:36 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #ba81e000-7894-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-7 succeeded, peers={10.4.0.178}, tx=10221807 KiB, 7162.53 KiB/s, rx=0 KiB, 0.00 KiB/s
14:36:36 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #0d23b740-7898-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-8 with peers=10.4.0.178, slave
14:59:37 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #0d23b740-7898-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-8 succeeded, peers={10.4.0.178}, tx=9358207 KiB, 6776.35 KiB/s, rx=0 KiB, 0.00 KiB/s
14:59:37 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #444a08c0-789b-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-9 with peers=10.4.0.178, slave
15:27:02 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #18ca07a0-789f-11ed-8ad9-5b3cdf9b2c62] Executing streaming plan for Rebuild-cdc_test-index-10 with peers=10.4.0.178, slave
15:27:02 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #444a08c0-789b-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-9 succeeded, peers={10.4.0.178}, tx=11300724 KiB, 6869.72 KiB/s, rx=0 KiB, 0.00 KiB/s
15:50:24 node-1 scylla[115526]:  [shard  0] stream_session - [Stream #18ca07a0-789f-11ed-8ad9-5b3cdf9b2c62] Streaming plan for Rebuild-cdc_test-index-10 succeeded, peers={10.4.0.178}, tx=1117325 KiB, 797.42 KiB/s, rx=0 KiB, 0.00 KiB/s
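A quick way to total the streamed volume from these lines (a sketch; the log path is assumed):

grep 'Streaming plan for Rebuild-cdc_test' node-1/system.log \
  | grep -o 'tx=[0-9]* KiB' \
  | awk -F'[= ]' '{sum += $2} END {printf "total tx: %.1f GiB\n", sum / 1024 / 1024}'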


fruch commented Apr 13, 2023

Happened again on 2023.1.0~rc3

During a TRUNCATE command, it failed after 4 minutes, even though the timeout is 600s:

Command: 'cqlsh --no-color   --request-timeout=600 --connect-timeout=60  -e "TRUNCATE ks_truncate_large_partition.test_table USING TIMEOUT 600s" 10.12.3.74 9042'
Exit code: 1
Stdout:
Stderr:
Connection error: ('Unable to connect to any servers', {'10.12.3.74:9042': ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out for system.local - received only 0 responses from 1 CL=ONE." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',)})

Installation details

Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 2023.1.0~rc3-20230321.80de75947b7a with build-id 6e1d6cb6cac9242e7ed7bfd8b07c1fc5998281dc

Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-7 (34.206.53.69 | 10.12.1.253) (shards: 14)
  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-6 (44.203.17.236 | 10.12.0.89) (shards: 14)
  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-5 (44.204.242.23 | 10.12.3.74) (shards: 14)
  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-4 (34.201.209.223 | 10.12.1.88) (shards: 14)
  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-3 (3.235.100.219 | 10.12.3.152) (shards: 14)
  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-2 (44.200.154.165 | 10.12.1.118) (shards: 14)
  • longevity-cdc-100gb-4h-2023-1-db-node-50584be7-1 (3.237.62.175 | 10.12.3.208) (shards: 14)

OS / Image: ami-0d66fc9f6080b90ff (aws: us-east-1)

Test: longevity-cdc-100gb-4h-test
Test id: 50584be7-7384-4ca7-b77a-b26d06e6f4be
Test name: enterprise-2023.1/longevity/longevity-cdc-100gb-4h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 50584be7-7384-4ca7-b77a-b26d06e6f4be
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 50584be7-7384-4ca7-b77a-b26d06e6f4be

Logs:

Jenkins job URL


DoronArazii commented Jul 26, 2023

@xemul where does this stand?
The last reproducer is from April. Do we need another one with the latest version?

@DoronArazii DoronArazii modified the milestones: 5.3, 5.4 Jul 26, 2023
@DoronArazii DoronArazii added the P2 High Priority label Jul 26, 2023

xemul commented Jul 26, 2023

The last reproducer is from April. Do we need another one with the latest version?

I think we do. Since then we've fixed tons of default IO class misuse indirectly with 66e4391, and this had a visible effect, e.g. on #13753. This one can be affected too.

@xemul xemul self-assigned this Jul 26, 2023
@DoronArazii

/Cc @fruch ^^

@DoronArazii DoronArazii added the status/pending qa reproduction Pending for QA team to reproduce the issue label Jul 26, 2023

fruch commented Jul 26, 2023

/Cc @fruch ^^

There's no specific reproducer or case for this one; it will have to wait for the next release, when we'll run some of those cases again to see if this is still happening.

@DoronArazii DoronArazii removed the triage/master Looking for assignee label Jul 27, 2023

mykaul commented Aug 10, 2023

/Cc @fruch ^^

There's no specific reproducer or case for this one; it will have to wait for the next release, when we'll run some of those cases again to see if this is still happening.

We can start with master.

@DoronArazii DoronArazii modified the milestones: 5.4, 6.0 Sep 3, 2023

fgelcer commented Sep 6, 2023

@fruch, we are building 2023.1.1, so perhaps we should re-run the test that hit it in the first place.


fruch commented Sep 6, 2023

@fruch, we are building 2023.1.1, so perhaps we should re-run the test that hit it in the first place.

This has been happening in multiple cases for the last year; why try reproducing it on 2023.1.1 and not on any other release?

Anyhow, we don't have any specific case that reproduces it clearly.


mykaul commented Mar 10, 2024

I'm closing this for the time being. If it reproduces, please reopen.

@mykaul mykaul closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 10, 2024