
reader_concurrency_semaphore times out a lot on 2/6 nodes and a little on 4/6 at ~90% load in the cluster #13322

Closed
vponomaryov opened this issue Mar 24, 2023 · 37 comments
Labels: bug, P2 High Priority, status/pending qa reproduction (Pending for QA team to reproduce the issue), status/regression

vponomaryov (Contributor) commented Mar 24, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Impact

A lot of query timeouts.

Installation details

Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 5.3.0~dev-20230316.5705df77a155 with build-id 3aa7c396c9e0eab7f08c8ab7dd82a548cf8d41e6

Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-cdc-100gb-4h-master-db-node-99364f24-6 (54.217.15.173 | 10.4.1.244) (shards: 14)
  • longevity-cdc-100gb-4h-master-db-node-99364f24-5 (52.49.70.224 | 10.4.3.200) (shards: 14)
  • longevity-cdc-100gb-4h-master-db-node-99364f24-4 (54.75.3.62 | 10.4.1.37) (shards: 14)
  • longevity-cdc-100gb-4h-master-db-node-99364f24-3 (63.33.189.173 | 10.4.3.164) (shards: 14)
  • longevity-cdc-100gb-4h-master-db-node-99364f24-2 (34.250.254.31 | 10.4.1.227) (shards: 14)
  • longevity-cdc-100gb-4h-master-db-node-99364f24-1 (54.229.65.47 | 10.4.3.85) (shards: 14)

OS / Image: ami-0cbdbeb82ff6b7e4f (aws: eu-west-1)

Test: longevity-cdc-100gb-4h-test
Test id: 99364f24-5e01-4f5f-9acf-90656769f227
Test name: scylla-master/longevity/longevity-cdc-100gb-4h-test
Test config file(s):

Details:

In this test run, the no_corrupt_repair nemesis was attempted.
Only the following steps had been executed before and during the buggy Scylla behavior:

  • Create table drop_table_during_repair_ks_%integer%.standard1
  • Wait for schema agreement
  • Repeat the above two steps 9 more times

The load was growing from 80% to 90%, and the following errors started appearing in the DB logs:

node-5:

2023-03-19T05:16:03+00:00 longevity-cdc-100gb-4h-master-db-node-99364f24-5     !INFO | scylla[5639]:  [shard  3] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 1/100 count and 18462/169911255 memory resources: timed out, dumping permit diagnostics:
permits	count	memory	table/description/state
1	1	18K	cdc_test.test_table_postimage/data-query/active/used
745	0	0B	cdc_test.test_table_postimage/mutation-query/waiting_for_admission
18	0	0B	cdc_test.test_table/mutation-query/waiting_for_admission
777	0	0B	cdc_test.test_table_preimage/mutation-query/waiting_for_admission
1103	0	0B	cdc_test.test_table_preimage/data-query/waiting_for_admission
1125	0	0B	cdc_test.test_table_postimage/data-query/waiting_for_admission
726	0	0B	cdc_test.test_table_preimage_postimage/mutation-query/waiting_for_admission
1175	0	0B	cdc_test.test_table_preimage_postimage/data-query/waiting_for_admission
231	0	0B	cdc_test.test_table/data-query/waiting_for_admission

5901	1	18K	total

Total: 5901 permits with 1 count and 18K memory resources

...
2023-03-19T05:16:33+00:00 longevity-cdc-100gb-4h-master-db-node-99364f24-5     !INFO | scylla[5639]:  [shard  3] reader_concurrency_semaphore - (rate limiting dropped 7817 similar messages) Semaphore _read_concurrency_sem with 1/100 count and 17672/169911255 memory resources: timed out, dumping permit diagnostics:
permits	count	memory	table/description/state
1	1	17K	cdc_test.test_table/data-query/active/used
1689	0	0B	cdc_test.test_table_postimage/mutation-query/waiting_for_admission
1700	0	0B	cdc_test.test_table_preimage/mutation-query/waiting_for_admission
443	0	0B	cdc_test.test_table/data-query/waiting_for_admission
2075	0	0B	cdc_test.test_table_preimage/data-query/waiting_for_admission
2103	0	0B	cdc_test.test_table_preimage_postimage/data-query/waiting_for_admission
1584	0	0B	cdc_test.test_table_preimage_postimage/mutation-query/waiting_for_admission
2018	0	0B	cdc_test.test_table_postimage/data-query/waiting_for_admission

11613	1	17K	total

Total: 11613 permits with 1 count and 17K memory resources

node-3:

2023-03-19T05:16:52+00:00 longevity-cdc-100gb-4h-master-db-node-99364f24-3     !INFO | scylla[5558]:  [shard  3] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 1/100 count and 17535/169911255 memory resources: timed out, dumping permit diagnostics:
permits	count	memory	table/description/state
1	1	17K	cdc_test.test_table_preimage_postimage/data-query/active/used
2	0	0B	cdc_test.test_table/data-query/waiting_for_admission
37	0	0B	cdc_test.test_table_preimage_postimage/data-query/waiting_for_admission
41	0	0B	cdc_test.test_table_preimage/data-query/waiting_for_admission
29	0	0B	cdc_test.test_table_postimage/mutation-query/waiting_for_admission
38	0	0B	cdc_test.test_table_postimage/data-query/waiting_for_admission
30	0	0B	cdc_test.test_table_preimage/mutation-query/waiting_for_admission
30	0	0B	cdc_test.test_table_preimage_postimage/mutation-query/waiting_for_admission

208	1	17K	total

Total: 208 permits with 1 count and 17K memory resources

node-1:

2023-03-19T05:20:11+00:00 longevity-cdc-100gb-4h-master-db-node-99364f24-1     !INFO | scylla[5523]:  [shard  3] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 1/100 count and 17554/169911255 memory resources: timed out, dumping permit diagnostics:
permits	count	memory	table/description/state
1	1	17K	cdc_test.test_table_preimage/data-query/active/used
2	0	0B	cdc_test.test_table_preimage/data-query/waiting_for_admission
2	0	0B	cdc_test.test_table_postimage/data-query/waiting_for_admission
2	0	0B	cdc_test.test_table_postimage/mutation-query/waiting_for_admission
2	0	0B	cdc_test.test_table/data-query/waiting_for_admission
6	0	0B	cdc_test.test_table_preimage_postimage/data-query/waiting_for_admission

15	1	17K	total

Total: 15 permits with 1 count and 17K memory resources
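
For a live view of the same pressure, the reader-concurrency and read-queue counters can also be scraped from the affected node's metrics endpoint (a sketch, assuming Scylla's default Prometheus port 9180; exact metric names differ between versions, so the filter below is deliberately broad rather than assuming specific series):

# Dump reader-concurrency / read-queue related metrics from node-5 (10.4.3.200),
# dropping the HELP/TYPE comment lines. Adjust the grep pattern to whatever
# series your Scylla version actually exposes.
curl -s http://10.4.3.200:9180/metrics \
  | grep -E 'reader_concurrency|queued_reads|active_reads' \
  | grep -v '^#'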

Node-5, which had the highest number of permits in this kind of error, started returning a lot of query timeouts, causing the following errors:

07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING > [control connection] Error connecting to 10.4.3.200:9042:
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING > Traceback (most recent call last):
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3606, in cassandra.cluster.ControlConnection._reconnect_internal
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3683, in cassandra.cluster.ControlConnection._try_connect
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 1085, in cassandra.connection.Connection.wait_for_response
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 1129, in cassandra.connection.Connection.wait_for_responses
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 1127, in cassandra.connection.Connection.wait_for_responses
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 1640, in cassandra.connection.ResponseWaiter.deliver
07:16:09  < t:2023-03-19 05:16:08,441 f:cluster.py      l:3357 c:cassandra.cluster    p:WARNING > cassandra.OperationTimedOut: errors=None, last_host=None
...
          < t:2023-03-19 05:18:22,621 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > 'get_peers_info': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.4.3.200:9042': OperationTimedOut('errors=None, last_host=None')})', retrying [#0]

The following is visible in the loader logs:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during UNLOGGED_BATCH write query at consistency QUORUM (2 replica were required but only 0 acknowledged the write)
...
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (3 responses were required but only 2 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.

The above problems caused the peers-info queries, made by SCT as part of the nemesis, to time out on node-5 for almost 3 hours. The timeouts stopped when the loader started decreasing the load toward the end of the load schedule.
So, as the load wound down, our timeouts stopped and the test tearDown started. The rest of the errors in SCT are SCT-related.

NOTE: This bug's effect is very similar to #12552.
But this one is not related to an upgrade and appears when ~90% load is reached.

Monitor:

  • Restore Monitor Stack command: $ hydra investigate show-monitor 99364f24-5e01-4f5f-9acf-90656769f227
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 99364f24-5e01-4f5f-9acf-90656769f227
  • Screenshot: 2023-03-24 18-14-36 (attached)

Logs:

Jenkins job URL

mykaul (Contributor) commented Mar 26, 2023

@denesb - can you take a look? (I assume this is without your latest changes in this area - perhaps they need backports)

denesb (Contributor) commented Mar 27, 2023

@vponomaryov I don't see anything wrong in the printouts. This seems like plain overload to me.
Are these timeouts regressions? Did this workload not cause timeouts up to now?

@DoronArazii DoronArazii removed the triage/master Looking for assignee label Mar 27, 2023
@DoronArazii DoronArazii added this to the 5.3 milestone Mar 27, 2023
vponomaryov (Contributor, Author):

@vponomaryov I don't see anything wrong in the printouts. This seems like plain overload to me.

Overload would be at 100%; the problem appeared at 90% load.

Are these timeouts regressions? Did this workload not cause timeouts up to now?

We have run it only a few times in the last year:
Screenshot: 2023-03-27 16-55-24 (attached)

As you can see, there is only one such failure.

Also, one more weekly run happened here: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-cdc-100gb-4h-test/381/consoleFull
It passed OK during this nemesis.

So, either the bug is flaky or, more likely, the actual disk performance in the failed run was really bad (the io_properties.yaml file is the same on all nodes, but real performance can still differ)...

denesb (Contributor) commented Mar 27, 2023

So, either the bug is flaky or, more likely, the actual disk performance in the failed run was really bad (the io_properties.yaml file is the same on all nodes, but real performance can still differ)...

I don't see any disk reads in the semaphore printouts; it's all cache. Of course, that is just a snapshot; metrics would give a more accurate picture.

denesb (Contributor) commented Mar 29, 2023

Switching to the shard view, it looks like the problem is shard imbalance. Some shards are at 100% load, while others are as low as 70%:
(graph screenshot attached)

denesb (Contributor) commented Mar 29, 2023

Even more apparent when selecting a single node and looking at queued reads:
(graph screenshot attached)
This is the queued-reads graph on node 10.4.3.200.
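
The per-shard numbers behind this panel can also be queried directly from the monitoring stack's Prometheus (a sketch; <monitor-host> is a placeholder, and scylla_database_queued_reads is assumed to be the series behind the panel, so verify against the actual Grafana panel query):

# Per-shard queued reads on node 10.4.3.200, via the Prometheus HTTP API.
# <monitor-host> is a placeholder for the monitoring instance; jq prints one
# "shard N: value" line per shard label returned by the query.
curl -sG 'http://<monitor-host>:9090/api/v1/query' \
  --data-urlencode 'query=scylla_database_queued_reads{instance=~"10.4.3.200.*"}' \
  | jq -r '.data.result[] | "shard \(.metric.shard): \(.value[1])"'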

denesb (Contributor) commented Mar 29, 2023

The same shards that have a lot of reads queued (shards 3 and 12) are pegged at 100% load:
(graph screenshot attached)

denesb (Contributor) commented Mar 29, 2023

Shards 3 and 12 don't serve significantly more requests than the other ones:
(graph screenshot attached)

denesb (Contributor) commented Mar 29, 2023

The two problematic shards (3 and 12) feature prominently on all cache graphs:
(graph screenshot attached)
They seem to be reading many more partitions and rows than the other shards.

denesb (Contributor) commented Mar 29, 2023

@vponomaryov how is the data prepared for this test? And what is the read-pattern? Has something changed around that recently?

denesb (Contributor) commented Mar 29, 2023

So, either the bug is flaky or, more likely, the actual disk performance in the failed run was really bad (the io_properties.yaml file is the same on all nodes, but real performance can still differ)...

I don't see any disk reads in the semaphore printouts; it's all cache. Of course, that is just a snapshot; metrics would give a more accurate picture.

Graphs confirm there are no disk reads whatsoever; reads are served 100% from cache.

denesb (Contributor) commented Mar 29, 2023

Shards 3, 12 and, to some extent, 13 are busier on all the nodes that have timeouts. Consequently, reads on these shards queue up and time out.

denesb (Contributor) commented Mar 29, 2023

Looking at just the reads (on nodes 1, 3 and 5):
(graph screenshot attached)

There are significantly more read requests against the problematic shards.

denesb (Contributor) commented Mar 29, 2023

Looks like speculative reads are at least partially responsible for the additional load on those shards:
(graph screenshot attached)

I wonder if this is another case of speculative reads being a self-fulfilling prophecy: some shards experience a temporary spike, speculative reads kick in, and now those shards are more loaded on the other nodes too. The increased load leads to higher latencies, so speculative reads stay on and keep the pain-train rolling.
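
The thread does not say whether these are driver-side speculative executions or the tables' speculative_retry option; if it is the latter, the current setting can be inspected and, as an experiment, switched off on one of the hot tables (a sketch; the node address and table name are taken from the diagnostics above, and the ALTER should be reverted afterwards):

# Show the speculative_retry setting of the cdc_test tables
# (keyspace/table names taken from the permit dumps above).
cqlsh 10.4.3.200 -e "SELECT table_name, speculative_retry FROM system_schema.tables WHERE keyspace_name = 'cdc_test';"

# Experimentally disable speculative reads on one hot table and watch whether
# the per-shard imbalance on the replicas goes away; revert when done.
cqlsh 10.4.3.200 -e "ALTER TABLE cdc_test.test_table_postimage WITH speculative_retry = 'NONE';"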

denesb (Contributor) commented Mar 29, 2023

No sign of imbalance on the CQL user side:
(graph screenshot attached)

denesb (Contributor) commented Mar 29, 2023

@vponomaryov were speculative reads always on in this test?

mykaul (Contributor) commented Mar 29, 2023

Looks like speculative reads are at least partially responsible for the additional load on those shards: (graph screenshot)

I wonder if this is another case of speculative reads being a self-fulfilling prophecy: some shards experience a temporary spike, speculative reads kick in, and now those shards are more loaded on the other nodes too. The increased load leads to higher latencies, so speculative reads stay on and keep the pain-train rolling.

I was never a fan of speculative reads. 'Let's make more noise' was always a strange tactic, IMHO. The fact that there's no feedback loop whatsoever on its effectiveness is what really bothers me.

vponomaryov (Contributor, Author) commented Mar 29, 2023

@vponomaryov how is the data prepared for this test? And what is the read-pattern? Has something changed around that recently?

No, no changes have been made for more than a year.
The queries applied here can be found here:

@vponomaryov were speculative reads always on in this test?

Yes, it has been enabled for 3 years already in the test config used here.

denesb (Contributor) commented Mar 30, 2023

I have no idea what recent changes could have contributed to this failure, if any.
If this test now fails consistently, you can try bisecting it.

fruch (Contributor) commented Apr 18, 2023

We encountered it during testing of 5.2:

and bisected it to this range:

🟢 ❯ git log 80de75947b7a..f90af81a13cc --oneline
f90af81a13 release: prepare for 2023.1.0-rc4
7bdea325c9 Merge branch 'branch-5.2' of github.com:scylladb/scylladb into next-2023.1
1fba43c317 docs: minor improvments to the Raft Handling Failures and recovery procedure sections
e380c24c69 Merge 'Improve database shutdown verbosity' from Pavel Emelyanov
2b050e83c8 Merge branch 'branch-5.2' of github.com:scylladb/scylladb into next-2023.1
76a76a95f4 Update tools/java submodule (hdrhistogram with Java 11)
f6837afec7 doc: update the Ubuntu version used in the image
6350c8836d Revert "repair: Reduce repair reader eviction with diff shard count"
5457948437 Update seastar submodule (rpc cancellation during negotiation)
da41001b5c .gitmodules: point seastar submodule at scylla-seastar.git
dd61e8634c doc: related https://github.com/scylladb/scylladb/issues/12754; add the missing information about reporting latencies to the upgrade guide 5.1 to 5.2
b642b4c30e doc: fix the service name in upgrade guides
c013336121 db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts
b6b35ce061 service: storage_proxy: sequence CDC preimage select with Paxos learn
069e38f02d transport server: fix unexpected server errors handling
61a8003ad1 (tag: scylla-5.2.0-rc3) release: prepare for 5.2.0-rc3

The only CDC-related change is:
b6b35ce service: storage_proxy: sequence CDC preimage select with Paxos learn

@kbr- can you take a look? Now that this change has found its way onto the release branch...
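
For reference, a commit range like the one above can be narrowed down automatically with git bisect, assuming a reproducer script (run-repro.sh is hypothetical here; it would build the tree, run the longevity reproducer, and exit non-zero when the timeouts appear):

# 80de75947b7a is the last known-good build, f90af81a13cc the first known-bad one.
git bisect start f90af81a13cc 80de75947b7a
# run-repro.sh is a placeholder reproducer: exit 0 when the run is clean,
# non-zero when the semaphore timeouts show up.
git bisect run ./run-repro.sh
git bisect reset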

kbr-scylla (Contributor) commented Apr 19, 2023

@fruch

@kbr- can you take a look? Now that this change has found its way onto the release branch...

(for the future: please don't tag my old account, instead tag my new account @kbr-scylla so I get a notification on my work email)

The CDC change you mentioned only affects the LWT path, but there are no LWT queries in the cassandra-stress profiles that @vponomaryov has posted. So it's not that.

@denesb is it plausible that it's the same cause as #11803? If so maybe we should rerun once the fix gets promoted on 5.2 (9d384e3) (currently 5.2 promotion appears to be stuck though... there was a dtest backport missing, it should be backported now so let's hope it promotes)

denesb (Contributor) commented Apr 19, 2023

@denesb is it plausible that it's the same cause as #11803? If so maybe we should rerun once the fix gets promoted on 5.2 (9d384e3) (currently 5.2 promotion appears to be stuck though... there was a dtest backport missing, it should be backported now so let's hope it promotes)

It is plausible. Worth a try.

@roydahan roydahan modified the milestones: 5.3, 5.2 Apr 19, 2023
DoronArazii:

@denesb should this case block 5.2?

/Cc @mykaul @avikivity @roydahan

denesb (Contributor) commented Apr 27, 2023

@denesb should this case block 5.2?

I have no idea, because I still don't know what caused these timeouts. #11803, the issue @kbr-scylla suggested, is a plausible cause. The fix for that was backported to 5.2 only last week, so I think it is worth re-running the reproducer and seeing whether it still reproduces with 9d384e3 (the 5.2 backport of the fix).

If the problem doesn't reproduce, we can close this as a duplicate of #11803.
If it still reproduces, then we need to do a bisect, because I don't know how else to proceed with this investigation.

DoronArazii:

Okay, thanks, @vponomaryov FYI ^^ please keep us posted (RC5 has the fix).
Adding the QA-Reproduction label.

@DoronArazii DoronArazii added the status/pending qa reproduction Pending for QA team to reproduce the issue label Apr 27, 2023
DoronArazii:

@vponomaryov any updates?

vponomaryov (Contributor, Author) commented May 2, 2023

Ran 5.2.0-rc5 Scylla here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-cdc-100gb-4h-test/3

The overall behavior stayed, but we didn't reach the case with timeouts.
So it looks like the performance became a bit better and we don't hit the issue with the same load values.

There are still lots of semaphore errors:

 2023-05-02T10:31:15+00:00 longevity-cdc-100gb-4h-dev-db-node-444b5fc5-1     !INFO | scylla[5576]:  [shard 12] reader_concurrency_semaphore - (rate limiting dropped 837 similar messages) Semaphore _read_concurrency_sem with 1/100 count and 16536/170917888 memory resources: timed out, dumping permit diagnostics:
 permits count   memory  table/description/state
 1       1       16K     cdc_test.test_table_postimage/mutation-query/active/used
 2063    0       0B      cdc_test.test_table_preimage_postimage/mutation-query/waiting
 1745    0       0B      cdc_test.test_table_preimage_postimage/data-query/waiting
 2065    0       0B      cdc_test.test_table_postimage/mutation-query/waiting
 1677    0       0B      cdc_test.test_table_postimage/data-query/waiting
 
 7551    1       16K     total
 
 Total: 7551 permits with 1 count and 16K memory resources

Logs:

test-id: 444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f
db_cluster: https://cloudius-jenkins-test.s3.amazonaws.com/444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f/20230502_111936/db-cluster-444b5fc5.tar.gz
sct-runner: https://cloudius-jenkins-test.s3.amazonaws.com/444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f/20230502_111936/sct-444b5fc5.log.tar.gz
sct-runner-events: https://cloudius-jenkins-test.s3.amazonaws.com/444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f/20230502_111936/sct-runner-events-444b5fc5.tar.gz
loader-set: https://cloudius-jenkins-test.s3.amazonaws.com/444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f/20230502_111936/loader-set-444b5fc5.tar.gz
monitor-set: https://cloudius-jenkins-test.s3.amazonaws.com/444b5fc5-a9cb-4cd2-8edc-27cc92f7ea6f/20230502_111936/monitor-set-444b5fc5.tar.gz

roydahan commented May 2, 2023

@denesb can you please take a look and provide your feedback on how bad this is?

denesb (Contributor) commented May 2, 2023

The overall behavior stayed, but we didn't reach the case with timeouts.

I don't understand: you say you didn't reach the case with timeouts, but you post a timeout from the logs below. Do you mean that you didn't see as many timeouts as before? If yes, how significant was the difference? Did the number of timeouts go back to the "old" value (before you opened this issue)? Or not quite?

So it looks like the performance became a bit better and we don't hit the issue with the same load values.

At first look, it still looks like reads are CPU-bound. I will have a look at the metrics later.

vponomaryov (Contributor, Author) commented May 2, 2023

The overall behavior stayed, but we didn't reach the case with timeouts.

I don't understand: you say you didn't reach the case with timeouts, but you post a timeout from the logs below. Do you mean that you didn't see as many timeouts as before? If yes, how significant was the difference? Did the number of timeouts go back to the "old" value (before you opened this issue)? Or not quite?

There are no timeouts in the loaders, but retries were applied; the limit of 10 retries per query was just never reached.
So yes, since the loaders stayed within the retry limit and reported no timeouts, I can say the performance became a bit better: we still have lots of semaphore timeouts, but fewer than in the test run used for this bug report.

I will have a look at the metrics later.

JFYI: the test tearDown started at 2023-05-02 10:32:35,192 (UTC/DB time) and it influenced the load.

denesb (Contributor) commented May 2, 2023

There are no timeouts in the loaders, but retries were applied; the limit of 10 retries per query was just never reached. So yes, since the loaders stayed within the retry limit and reported no timeouts, I can say the performance became a bit better: we still have lots of semaphore timeouts, but fewer than in the test run used for this bug report.

Does that mean that the way the cluster behaves is within normal parameters? Or do you still consider the current state to be a regression (only a smaller one than it used to be)?

What I want to find out with all these questions is whether we are still dealing with a regression here, compared to the "good" state before. I just want to avoid going on a wild-goose chase, investigating a regression that doesn't exist.

@mykaul mykaul modified the milestones: 5.2, 5.3 May 2, 2023
vponomaryov (Contributor, Author) commented May 3, 2023

The diff in timeouts is as follows:

scylla_version: 5.3.0~dev-20230316.5705df77a155 (The one used for the bug report)

DB1: 1571    (1,5k)
DB2: 2585    (2,6k)
DB3: 135757  (135k)
DB4: 23739   (23,7k)
DB5: 8587278 (8,5M) <- !
DB6: 15849   (15,8k)

scylla_version: 5.2.0-rc5

DB1: 13975 (14k)
DB2: 0
DB3: 1433  (1,4k)
DB4: 3655  (3,6k)
DB5: 0
DB6: 95011 (95k)

And one of the recent runs with 5.1:
scylla_version: 5.1.9-0.20230423.ba1a57bd5531 with build-id dc1d4a867f95e936e52c60fa6f3cfe15654f0de4

DB1: 3474362 (3,5M)
DB2: 957     (1k)
DB3: 34156   (34k)
DB4: 513925  (513k)
DB5: 561     (561)
DB6: 166     (166)

So, it can be said that 5.2.0-rc5 doesn't have a regression compared to the 5.1 run.
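
It is not stated how these counts were collected; a rough log-side approximation (which will undercount, since the messages are rate-limited) can be grepped from the unpacked db-cluster log archives, roughly like this (the file layout inside the archive is an assumption):

# Count reader_concurrency_semaphore timeout dumps per node in the unpacked
# db-cluster log bundle; adjust the glob to the actual archive layout.
for log in db-cluster-*/*db-node*/messages.log; do
  printf '%s\t' "$log"
  grep -c 'reader_concurrency_semaphore.*timed out' "$log"
done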

roydahan commented May 4, 2023

@denesb it looks like #13759 is very similar to this one. I don't know if/when the symptoms we see are causing timeouts and when not (but it's quite obvious that this is flaky).

DoronArazii:

@denesb can you please refer to Roy's assumption ^^

denesb (Contributor) commented May 8, 2023

@denesb it looks like #13759 is very similar to this one. I don't know if/when the symptoms we see are causing timeouts and when not (but it's quite obvious that this is flaky).

I have no idea whether they are the same or not. Timeouts are just the first symptom one will see if something goes wrong on a node. The root cause could be a wide array of things (including, of course, the semaphore itself).

denesb (Contributor) commented May 10, 2023

It is important not to conflate different issues just because they all print semaphore diagnostics to the logs. These issues can have wildly different causes, and conflating them all means we will never close any of them (we already have way too many unclosed timeout-related issues).

Based on #13322 (comment), I'm leaning towards closing this issue and focusing on the other ones that have not been investigated yet.

DoronArazii:

Thanks, Botond.

/Cc @roydahan I'm going with Botond ^^ and closing it. Please raise a flag if you have any objection.

DoronArazii closed this as not planned (won't fix, can't repro, duplicate, stale) May 10, 2023