reader_concurrency_semaphore times out a lot on 2/6 nodes and a little on 4/6 at ~90% cluster load #13322
Comments
@denesb - can you take a look? (I assume this is without your latest changes in this area - perhaps they need backports)
@vponomaryov I don't see anything wrong in the printouts. This seems like plain overload to me.
Overload would be at 100% load; the problem appeared at 90% load.
We have run it only a few times in the last year: as you can see, this is one such failure. Also, one more weekly run happened here: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-cdc-100gb-4h-test/381/consoleFull So, either the bug is flaky or, more probably, the disk perf (real perf; the io_properties.yaml file is the same on all nodes) in the failed run was really bad...
I don't see any disk reads in the semaphore printouts; it's all cache. Of course that is just a snapshot; metrics would give a more accurate picture.
@vponomaryov how is the data prepared for this test? And what is the read pattern? Has something changed around that recently?
Graphs confirm there are no disk reads whatsoever; reads are served 100% from cache.
Shards 3, 12 and, to some extent, 13 are busier on all the nodes that have timeouts. Consequently, reads on these shards queue up and time out.
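(Aside, for anyone checking the per-shard imbalance themselves: below is a minimal sketch of pulling per-shard read-queue depth from the monitoring stack's Prometheus API. The Prometheus address, the metric name scylla_database_queued_reads, and its labels are assumptions and may differ per Scylla/monitoring version.)

```python
# Sketch: list shards with a non-empty read queue, per node, via an
# instant query against Prometheus. Metric name and labels are assumed.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed monitoring address

def queued_reads_by_shard(metric: str = "scylla_database_queued_reads"):
    resp = requests.get(PROM_URL, params={"query": metric}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        labels = sample["metric"]
        value = float(sample["value"][1])
        if value > 0:
            print(f"{labels.get('instance')} shard {labels.get('shard')}: "
                  f"{value:.0f} queued reads")

if __name__ == "__main__":
    queued_reads_by_shard()
```

A persistent non-zero queue on the same few shards across nodes would match the pattern described above (hot shards queueing reads until they time out).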
@vponomaryov were speculative reads always on in this test?
No, no changes have been made for more than 1 year.
Yes, it has been enabled for 3 years already in the test config used here.
I have no idea what recent changes could have contributed to this failure, if any.
We encountered it during testing of 5.2 and bisected it there:
The only CDC-related change is: @kbr- can you take a look? Now that this change has found its way onto the release branch...
(For the future: please don't tag my old account, instead tag my new account @kbr-scylla so I get a notification on my work email.) The CDC change you mentioned only affects the LWT path, but there are no LWT queries in the cassandra-stress profiles that @vponomaryov has posted. So it's not that. @denesb is it plausible that it's the same cause as #11803? If so, maybe we should rerun once the fix gets promoted on 5.2 (9d384e3) (currently 5.2 promotion appears to be stuck though... there was a dtest backport missing, it should be backported now, so let's hope it promotes)
It is plausible. Worth a try.
@denesb should this case block 5.2?
I have no idea, because I still don't know what caused these timeouts. #11803, the issue @kbr-scylla suggested, is a plausible cause. The fix for that was backported to 5.2 only last week, so I think it is worth re-running the reproducer and seeing if it still reproduces with 9d384e3 (the 5.2 backport of the fix). If the problem doesn't reproduce, then we can close this as a duplicate of #11803.
Okay, thanks. @vponomaryov FYI ^^ please keep us posted (RC5 has the fix).
@vponomaryov any updates?
Re-ran the reproducer. The overall behavior stayed, but we haven't reached the case with timeouts. There are still lots of the semaphore-related messages.
Logs: test-id: 444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f
@denesb can you please take a look and provide your feedback on how bad this is?
I don't understand, you say you didn't reach the case with timeout, but you post a timeout from the logs below. Do you mean that you didn't see as many timeouts as before? If yes, how significant was the difference? Did the amount of timeouts go back to the "old" value (before you opened this issue)? Or not quite?
At first look, it still looks like reads are CPU-bound. I will have a look at the metrics later.
There are no timeouts in
JFYI: the test |
Does that mean that the way the cluster behaves is within normal parameters? Or do you still consider the current state to be a regression (only a smaller one than it used to be)? What I want to find out with all these questions is whether we are still dealing with a regression here, compared to the "good" state before. I just want to avoid going on a wild goose chase, investigating a regression that doesn't exist.
The diff in timeouts is the following: scylla_version:
scylla_version:
And one of the recent runs with
So, it can be said that
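(Aside: to make such a per-version comparison reproducible, here is a rough sketch for tallying timeout-related lines per node log. The log path glob and the matched substrings are assumptions; adjust them to the actual messages emitted by the Scylla version under test.)

```python
# Sketch: count lines matching timeout-related substrings in each node's log.
import glob
from collections import Counter

PATTERNS = ("timed out", "reader_concurrency_semaphore")  # assumed substrings

def count_timeout_lines(log_glob: str = "logs/*/system.log") -> Counter:
    counts = Counter()
    for path in glob.glob(log_glob):  # assumed log layout
        with open(path, errors="replace") as f:
            counts[path] = sum(any(p in line for p in PATTERNS) for line in f)
    return counts

if __name__ == "__main__":
    for path, n in sorted(count_timeout_lines().items()):
        print(f"{path}: {n} matching lines")
```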
@denesb can you please comment on Roy's assumption ^^
I have no idea whether they are the same or not. Timeouts are just the first symptom one will see if something goes wrong on a node. The root cause could be a wide array of things (including, of course, the semaphore itself).
It is important not to conflate different issues just because they all print semaphore diagnostics to the logs. These issues can have wildly different causes, and conflating them all means we will never close any of them (we already have way too many unclosed timeout-related issues). Based on #13322 (comment) I'm leaning towards closing this issue and focusing on the other ones that have not been investigated yet.
Thanks, Botond. /Cc @roydahan, I'm going with Botond ^^ and closing it. Please raise a flag if you have any objection.
Issue description
Impact
A lot of query timeouts.
Installation details
Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 5.3.0~dev-20230316.5705df77a155 with build-id 3aa7c396c9e0eab7f08c8ab7dd82a548cf8d41e6
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0cbdbeb82ff6b7e4f (aws: eu-west-1)
Test: longevity-cdc-100gb-4h-test
Test id: 99364f24-5e01-4f5f-9acf-90656769f227
Test name: scylla-master/longevity/longevity-cdc-100gb-4h-test
Test config file(s):
Details:
In this test run the no_corrupt_repair nemesis was attempted. Only the following parts of it were executed before and during the buggy behavior of Scylla:
drop_table_during_repair_ks_%integer%.standard1
The load was growing from 80% to 90%, and the following errors started appearing in the DB logs:
node-5:
node-3:
node-1:
node-5, which had the highest number of permits in the above kind of error, started returning a lot of query timeouts, causing the following errors:
And the following is visible in the loader logs:
The above problems caused the peers info queries, made by SCT as part of the nemesis, to time out on this node-5 for almost 3 hours. The timeouts stopped when the loader started decreasing the load, reaching the end of the load schedule. So, as the load finished, our timeouts stopped and the test tearDown started. The rest of the errors in SCT are SCT-related.
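(Aside: when triaging reports like this, the permit/memory counters in the semaphore diagnostics are the first thing to compare across nodes. Below is a rough sketch that extracts them from log lines piped on stdin; the regex is an assumption about the dump format and may need adjusting per Scylla version.)

```python
# Sketch: pull "used/total count" and "used/total memory" out of
# reader_concurrency_semaphore diagnostics lines read from stdin.
# The expected shape, e.g. "... semaphore <name> with 100/100 count and
# 14858916/16778362 memory resources ...", is an assumption.
import re
import sys

DIAG = re.compile(
    r"semaphore\s+\S+\s+with\s+(\d+)/(\d+)\s+count\s+and\s+"
    r"(\d+)/(\d+)\s+memory\s+resources"
)

for line in sys.stdin:
    m = DIAG.search(line)
    if m:
        count_used, count_total, mem_used, mem_total = map(int, m.groups())
        print(f"count {count_used}/{count_total}, "
              f"memory {mem_used}/{mem_total}")
```

Usage would be something like `grep -i semaphore system.log | python3 parse_semaphore.py` (script name hypothetical).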
NOTE: This bug's effect is very similar to #12552, but this one is not related to an upgrade and appears when 90% load is reached.
Monitor:
$ hydra investigate show-monitor 99364f24-5e01-4f5f-9acf-90656769f227
$ hydra investigate show-logs 99364f24-5e01-4f5f-9acf-90656769f227
Logs:
Jenkins job URL