Read query times out with reader_concurrency_semaphore during rolling upgrade on a mixed cluster #12552

Closed
1 of 2 tasks
KnifeyMoloko opened this issue Jan 18, 2023 · 34 comments · Fixed by #13843

@KnifeyMoloko
Contributor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

During rolling upgrade we:

  1. Upgrade 2 nodes in sequence.
  2. Rollback one of the upgraded nodes.
  3. Verify data on the mixed cluster.

During the last step we encountered a read query timeout caused by a reader_concurrency_semaphore timeout. What we're seeing is:

sct.log

< t:2023-01-12 15:21:27,359 f:fill_db_data.py l:3191 c:UpgradeTest          p:INFO  > Ran queries for test "range_tombstones_test: Test deletion by 'composite prefix' (range tombstones)" in 350 seconds
< t:2023-01-12 15:21:27,359 f:common.py       l:1328 c:utils                p:DEBUG > Executing CQL 'select * from keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log' ...
< t:2023-01-12 15:21:29,558 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 023-01-12T15:21:29+00:00 rolling-upgrade--centos-db-node-e0a206ec-2 !    INFO | eth0: DHCPv6 address 2a05:d018:eb8:3100:1687:9bb1:c53a:702d/128 timeout preferred 140 valid 450
< t:2023-01-12 15:21:30,044 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > total,      69006097,   49978,   49978,   49978,    20.0,     2.6,   201.1,   291.5,   341.6,   435.4, 1475.0,  0.01472,      0,      0,       0,       0,       0,       0
< t:2023-01-12 15:21:34,168 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 023-01-12T15:21:33+00:00 rolling-upgrade--centos-db-node-e0a206ec-3 !  NOTICE | scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl --no-pager --no-legend
< t:2023-01-12 15:21:34,168 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2023-01-12T15:21:33+00:00 rolling-upgrade--centos-db-node-e0a206ec-3 !    INFO | pam_unix(sudo:session): session opened for user root by (uid=0)
< t:2023-01-12 15:21:34,168 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2023-01-12T15:21:33+00:00 rolling-upgrade--centos-db-node-e0a206ec-3 !    INFO | pam_unix(sudo:session): session closed for user root
< t:2023-01-12 15:21:34,724 f:connectionpool.py l:228  c:urllib3.connectionpool p:DEBUG > Starting new HTTP connection (1): 10.0.3.10:9093
< t:2023-01-12 15:21:34,725 f:connectionpool.py l:456  c:urllib3.connectionpool p:DEBUG > http://10.0.3.10:9093 "GET /api/v2/alerts?active=1 HTTP/1.1" 200 None
< t:2023-01-12 15:21:34,728 f:connectionpool.py l:228  c:urllib3.connectionpool p:DEBUG > Starting new HTTP connection (1): 10.0.3.10:9093
< t:2023-01-12 15:21:34,729 f:connectionpool.py l:456  c:urllib3.connectionpool p:DEBUG > http://10.0.3.10:9093 "GET /api/v2/alerts HTTP/1.1" 200 None
< t:2023-01-12 15:21:35,032 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > total,      69258364,   50453,   50453,   50453,    19.7,     2.6,   201.9,   289.1,   321.4,   378.0, 1480.0,  0.01467,      0,      0,       0,       0,       0,       0
< t:2023-01-12 15:21:37,363 f:cluster.py      l:3013 c:sdcm.cluster         p:DEBUG > Retrying read after timeout. Attempt #0
< t:2023-01-12 15:21:37,775 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 023-01-12T15:21:37+00:00 rolling-upgrade--centos-db-node-e0a206ec-4 !    INFO |  [shard 7] reader_concurrency_semaphore - (rate limiting dropped 25153 similar messages) Semaphore _read_concurrency_sem with 1/100 count and 16664/149401108 memory resources: timed out, dumping permit diagnostics:

We retry the query 5 times in total. All retries end with the same timeout.
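
For reference, the failing verification read roughly corresponds to the loop below. This is a minimal sketch, not the actual SCT code: the query string is taken from the log above and the 5-attempt retry from this description, while the helper name, contact point and everything else are illustrative.

from cassandra.cluster import Cluster
from cassandra import OperationTimedOut

def read_cdc_log_rows(session, attempts=5):
    # Query string taken from the sct.log excerpt above.
    query = "select * from keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log"
    for attempt in range(attempts):
        try:
            return list(session.execute(query))
        except OperationTimedOut:
            # In this run every attempt hit the same server-side
            # reader_concurrency_semaphore timeout.
            print(f"Retrying read after timeout. Attempt #{attempt}")
            if attempt == attempts - 1:
                raise

cluster = Cluster(["10.0.0.219"])  # node 3's private address from this run
session = cluster.connect()
rows = read_cdc_log_rows(session)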

On the target nodes we see:

node 4

Jan 12 15:21:37 rolling-upgrade--centos-db-node-e0a206ec-4 scylla[9405]:  [shard 7] reader_concurrency_semaphore - (rate limiting dropped 25153 similar messages) Semaphore _read_concurrency_sem with 1/100 count and 16664/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        16K        keyspace1.standard1/data-query/active/used
                                                                         15048        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting

                                                                         15050        1        16K        total

                                                                         Total: 15050 permits with 1 count and 16K memory resources

node 1

Jan 12 15:19:52 rolling-upgrade--centos-db-node-e0a206ec-1 scylla[9385]:  [shard 3] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 1/100 count and 16384/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        16K        keyspace1.standard1/data-query/active/unused
                                                                         25547        0        0B        keyspace1.standard1/data-query/waiting

                                                                         25548        1        16K        total

                                                                         Total: 25548 permits with 1 count and 16K memory resources
Jan 12 15:21:57 rolling-upgrade--centos-db-node-e0a206ec-1 scylla[9385]:  [shard 4] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 1/100 count and 17391/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         11147        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting

                                                                         11149        1        17K        total

                                                                         Total: 11149 permits with 1 count and 17K memory resources
Jan 12 15:22:07 rolling-upgrade--centos-db-node-e0a206ec-1 scylla[9385]:  [shard 3] reader_concurrency_semaphore - (rate limiting dropped 24729 similar messages) Semaphore _read_concurrency_sem with 2/100 count and 36873/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        20K        keyspace1.standard1/data-query/active/blocked
                                                                         1        1        16K        keyspace1.standard1/data-query/active/unused
                                                                         1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting
                                                                         21922        0        0B        keyspace1.standard1/data-query/waiting

                                                                         21925        2        36K        total

                                                                         Total: 21925 permits with 2 count and 36K memory resources
Jan 12 15:22:07 rolling-upgrade--centos-db-node-e0a206ec-1 scylla[9385]:  [shard 1] mutation_reader - shard_reader::close(): read_ahead on shard 4 failed: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
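
The permit diagnostics dumps above can be tallied mechanically. Below is a minimal sketch (my own helper, not part of SCT or Scylla) that assumes the whitespace-separated permits/count/memory/table-description-state row format shown in these logs:

import re
from collections import Counter

# Matches rows like: "15048        0        0B        keyspace1.standard1/data-query/waiting"
ROW_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s*$")

def tally_permit_states(dump_lines):
    """Sum permits per state (used/unused/waiting/...) from a diagnostics dump."""
    per_state = Counter()
    for line in dump_lines:
        m = ROW_RE.match(line)
        if not m or m.group(4) == "total":
            continue
        state = m.group(4).rsplit("/", 1)[-1]  # last path component, e.g. "waiting"
        per_state[state] += int(m.group(1))
    return per_state

dump = [
    "1        1        16K        keyspace1.standard1/data-query/active/used",
    "15048        0        0B        keyspace1.standard1/data-query/waiting",
    "1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting",
]
print(tally_permit_states(dump))  # Counter({'waiting': 15049, 'used': 1})

Applied to the node 4 dump above, virtually all permits are in the waiting state behind a single active one.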

This leads to the query timing out on the client:

2023-01-12 15:22:48.140: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=48ab92ff-9972-4d7a-a252-4a5773d78194, source=UpgradeTest.test_rolling_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 630, in test_rolling_upgrade
self.fill_and_verify_db_data('after rollback the second node')
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 465, in fill_and_verify_db_data
self.verify_db_data()
File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3362, in verify_db_data
self.run_db_queries(session, session.default_fetch_size)
File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3304, in run_db_queries
self._read_cdc_tables(item, session)
File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3270, in _read_cdc_tables
actual_result = self.get_cdc_log_rows(session, cdc_table)
File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3313, in get_cdc_log_rows
return list(session.execute(f"select * from {self.base_ks}.{cdc_log_table}"))
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1329, in execute_verbose
return execute_orig(*args, **kwargs)
File "cassandra/cluster.py", line 2668, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 4953, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.0.219:9042

How frequently does it reproduce?

We've also seen this happen in a previous run with an upgrade from base version 5.0.8-20221221.874fa1520 to target version 5.0.9-20230111.94b8baa79

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla base version (or git commit hash): 5.0.8-20221221.874fa1520 with build-id a6d441e82e54f6facd370c0ce65938c48a993c15
Scylla upgrade target version: 5.0.9-20230111.94b8baa79 with build-id 520004ed6991c6969da975000ec4a2ea10defe99

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-db-node-e0a206ec-4 (54.75.118.221 | 10.0.2.21) (shards: 8)
  • rolling-upgrade--centos-db-node-e0a206ec-3 (54.246.172.109 | 10.0.0.219) (shards: 8)
  • rolling-upgrade--centos-db-node-e0a206ec-2 (52.215.207.130 | 10.0.0.143) (shards: 8)
  • rolling-upgrade--centos-db-node-e0a206ec-1 (54.75.31.75 | 10.0.3.159) (shards: 8)

OS / Image: ami-07d59029a97cce02e (aws: eu-west-1)

Test: rolling-upgrade-ami-test
Test id: e0a206ec-c4b3-470a-925a-a43b07f9d395
Test name: scylla-5.0/rolling-upgrade/rolling-upgrade-ami-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor e0a206ec-c4b3-470a-925a-a43b07f9d395
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e0a206ec-c4b3-470a-925a-a43b07f9d395

Logs:

Jenkins job URL

@avikivity
Member

@denesb the reader concurrency semaphore reports are odd; they don't show any overload.

@avikivity
Member

I remember we fixed the admit procedure; maybe the test is hitting this problem.

@denesb
Contributor

denesb commented Jan 18, 2023

Indeed, there are some strange reports here, like the one with one unused permit and many waiters. That is "illegal".

Yes, we backported some fixes that could be relevant, then reverted them because they were suspected of causing timeouts.

The question is whether this test was run on the version before or after the reverts. What do we blame: the fixes or the lack of them?

@avikivity
Member

The test was on mixed 5.0.8/5.0.9. @KnifeyMoloko which versions showed the error?

@KnifeyMoloko
Contributor Author

@avikivity @denesb Both of the nodes that showed errors were nodes not yet upgraded, i.e. they were on Scylla version 5.0.8-0.20221221.874fa1520 with build-id a6d441e82e54f6facd370c0ce65938c48a993c15

The 2 tests that failed with this were run on the 11th and 12th of January respectively, so after the reverts, if I'm not mistaken.

@DoronArazii

@fgelcer do we still see it in the latest version, i.e. 5.0.9 to 5.0.10?

@ShlomiBalalis

Issue description

As part of the upgrade test, we did the following steps:

  • Upgrade node 4
  • Upgrade node 1
  • Rollback node 1

Afterwards, we executed several queries for data validation. The connection, however, timed out:
2023-05-05 12:58:05.417: cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.0.138:9042
Looking at the nodes' logs, we can see that the two nodes that were not upgraded at all, nodes 2 and 3, both reported Semaphore _system_read_concurrency_sem timing out:
Node 2:

May 05 12:57:13 rolling-upgrade--centos-db-node-2a5039f2-2 scylla[9055]:  [shard 6] reader_concurrency_semaphore - (rate limiting dropped 2858 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 16752/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        16K        keyspace1.standard1/data-query/active/unused
                                                                         2        0        368B        ks1.table1/push-view-updates-2/active/unused
                                                                         1        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         2        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/shard-reader/waiting
                                                                         455        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         462        1        16K        total
                                                                         
                                                                         Total: 462 permits with 1 count and 16K memory resources
May 05 12:57:31 rolling-upgrade--centos-db-node-2a5039f2-2 scylla[9055]:  [shard 5] reader_concurrency_semaphore - (rate limiting dropped 4730 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17455/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         2        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         3019        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/shard-reader/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         3024        1        17K        total
                                                                         
                                                                         Total: 3024 permits with 1 count and 17K memory resources
May 05 12:57:43 rolling-upgrade--centos-db-node-2a5039f2-2 scylla[9055]:  [shard 6] reader_concurrency_semaphore - (rate limiting dropped 615 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 18695/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         7        0        1K        ks1.table1/push-view-updates-2/active/unused
                                                                         1        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/shard-reader/waiting
                                                                         3        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         3172        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         3185        1        18K        total
                                                                         
                                                                         Total: 3185 permits with 1 count and 18K memory resources

Node 3:

May 05 12:56:50 rolling-upgrade--centos-db-node-2a5039f2-3 scylla[9480]:  [shard 3] reader_concurrency_semaphore - (rate limiting dropped 595 similar messages) Semaphore _system_read_concurrency_sem with 93/100 count and 50159/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         1        1        16K        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/shard-reader/inactive
                                                                         1        1        16K        system_distributed.service_levels/shard-reader/inactive
                                                                         392        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         397        3        49K        total
                                                                         
                                                                         Total: 397 permits with 3 count and 49K memory resources
May 05 12:56:51 rolling-upgrade--centos-db-node-2a5039f2-3 scylla[9480]:  [shard 4] reader_concurrency_semaphore - (rate limiting dropped 7148 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17455/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         2        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         1        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/shard-reader/waiting
                                                                         1        0        0B        system_distributed.service_levels/shard-reader/evicted
                                                                         1160        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         1166        1        17K        total
                                                                         
                                                                         Total: 1166 permits with 1 count and 17K memory resources
May 05 12:56:53 rolling-upgrade--centos-db-node-2a5039f2-3 scylla[9480]:  [shard 0] reader_concurrency_semaphore - (rate limiting dropped 6596 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17639/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         1        0        184B        ks1.table1/push-view-updates-2/active/unused
                                                                         1        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         140        0        0B        system_auth.roles/data-query/waiting
                                                                         1        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/shard-reader/waiting
                                                                         1176        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/multishard-mutation-query/active/unused
                                                                         1        0        0B        system_distributed.service_levels/multishard-mutation-query/active/unused
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         1323        1        17K        total
                                                                         
                                                                         Total: 1323 permits with 1 count and 17K memory resources
May 05 12:56:56 rolling-upgrade--centos-db-node-2a5039f2-3 scylla[9480]:  [shard 1] reader_concurrency_semaphore - (rate limiting dropped 7366 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17911/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         3        0        520B        ks1.table1/push-view-updates-2/active/unused
                                                                         2270        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         2275        1        17K        total
                                                                         
                                                                         Total: 2275 permits with 1 count and 17K memory resources
May 05 12:57:10 rolling-upgrade--centos-db-node-2a5039f2-3 scylla[9480]:  [shard 5] reader_concurrency_semaphore - (rate limiting dropped 6383 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17575/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         1        0        184B        ks1.table1/push-view-updates-2/active/unused
                                                                         1        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         1971        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        keyspace_fill_db_data.static_cf_test_scylla_cdc_log/multishard-mutation-query/active/unused
                                                                         2        0        0B        system_distributed.service_levels/shard-reader/evicted
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         1978        1        17K        total
                                                                         
                                                                         Total: 1978 permits with 1 count and 17K memory resources

Impact

Lowers availability during rolling upgrade.

How frequently does it reproduce?

Only reproduces when upgrading from 5.0 (and not 2022.1.5 or 2021.1.19), and has occurred twice out of two runs.

Installation details

Kernel Version: 5.15.0-1033-aws
Scylla version (or git commit hash): 5.0.12-20230410.0b170192a with build-id fa4e79a3a17d58247f707fcbd1c87b40554c7398
Target version: 2022.1.6-0.20230419.a71b1b3f64 with build-id 358c991dc7de52ecb9cf2ff89f7efabb69b41e9e

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-db-node-2a5039f2-4 (63.35.237.16 | 10.0.0.138) (shards: 8)
  • rolling-upgrade--centos-db-node-2a5039f2-3 (3.250.13.30 | 10.0.2.137) (shards: 8)
  • rolling-upgrade--centos-db-node-2a5039f2-2 (63.33.190.210 | 10.0.3.135) (shards: 8)
  • rolling-upgrade--centos-db-node-2a5039f2-1 (3.250.33.9 | 10.0.1.133) (shards: 8)

OS / Image: ami-09cb9503f17db14ea (aws: eu-west-1)

Test: rolling-upgrade-ami-test
Test id: 2a5039f2-a8d0-432c-935f-9d4427fe88ba
Test name: enterprise-2022.1/rolling-upgrade/rolling-upgrade-ami-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 2a5039f2-a8d0-432c-935f-9d4427fe88ba
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 2a5039f2-a8d0-432c-935f-9d4427fe88ba

Logs:

Jenkins job URL

@mykaul
Contributor

mykaul commented May 7, 2023

5.0.12 is not very interesting - 5.0.18 or later is more interesting - and it doesn't happen from those versions, right?

@avikivity
Member

There is one "used" permit (I guess that means using CPU) and many blocked. This looks like CPU starvation.

@avikivity
Member

Kernel Version: 5.15.0-1033-aws

So this isn't the kernel regression?

@ShlomiBalalis

5.0.12 is not very interesting - 5.0.18 or later is more interesting - and it doesn't happen from those versions, right?

I'm pretty sure 5.0.12 is the latest release, unless 5.0.13 has been released already.

@mykaul
Contributor

mykaul commented May 7, 2023

5.0.12 is not very interesting - 5.0.18 or later is more interesting - and it doesn't happen from those versions, right?

I'm pretty sure 5.0.12 is the latest release, unless 5.0.13 has been released already.

Correct, mea culpa on the version mix-up. I wonder if they all have the reader semaphore fixes, though.

@mykaul
Contributor

mykaul commented May 7, 2023

Kernel Version: 5.15.0-1033-aws

So this isn't the kernel regression?

no, .19 had it.

@denesb
Contributor

denesb commented May 10, 2023

I think what we are seeing here is fallout from starting to use service-levels, which the OSS nodes don't recognize and hence fall back to the system scheduling group. User reads flood the system semaphore, and internal reads now have to compete with user reads. This is especially bad if there are system reads triggered by user reads (e.g. an auth query).

Furthermore, the system semaphore has far fewer resources available than the user read semaphore, although I'm only seeing CPU contention here. Basically, this is the same as a 2-lane road (both lanes in the same direction) suddenly being restricted to a single lane: even traffic that flows smoothly on 2 lanes can suddenly queue up on one.

@denesb
Contributor

denesb commented May 10, 2023

I think we should make an effort to recognize service-levels in OSS and map them to the statement scheduling group. This shouldn't be that hard, as all service levels start with the sg: prefix.

@mykaul
Contributor

mykaul commented May 10, 2023

I think what we are seeing here is fallout from starting to use service-levels, which the OSS nodes don't recognize and hence fall back to the system scheduling group. User reads flood the system semaphore, and internal reads now have to compete with user reads. This is especially bad if there are system reads triggered by user reads (e.g. an auth query).

Furthermore, the system semaphore has far fewer resources available than the user read semaphore, although I'm only seeing CPU contention here. Basically, this is the same as a 2-lane road (both lanes in the same direction) suddenly being restricted to a single lane: even traffic that flows smoothly on 2 lanes can suddenly queue up on one.

So such failures should not happen if we upgrade OSS->OSS or Enterprise->Enterprise, only with OSS->Enterprise?

@denesb
Contributor

denesb commented May 10, 2023

So such failures should not happen if we upgrade OSS->OSS or Enterprise->Enterprise, only with OSS->Enterprise?

Yes. Whether this is happening can be recognized from the fact that timeouts are from the system semaphore, rather than the user semaphore (which is just called semaphore, without a prefix).
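
A quick way to check this in the logs is to classify each reader_concurrency_semaphore timeout by the semaphore name it mentions. The sketch below is only a triage helper of my own, assuming the naming convention described above and the log format seen in this issue:

import re

SEM_RE = re.compile(r"Semaphore (\S+) with .*: timed out")

def classify(log_line):
    m = SEM_RE.search(log_line)
    if not m:
        return None
    name = m.group(1)
    if name.startswith("_system"):
        return "system semaphore (the symptom discussed here)"
    if name.startswith("sl:"):
        return "service-level (user) semaphore on an enterprise node"
    return "user semaphore"

line = "reader_concurrency_semaphore - Semaphore _system_read_concurrency_sem with 91/100 count and 16752/149401108 memory resources: timed out, dumping permit diagnostics:"
print(classify(line))  # system semaphore (the symptom discussed here)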

@mykaul
Contributor

mykaul commented May 10, 2023

@roydahan - based on the above, I think we should give lower priority to OSS->Enterprise upgrades - where we see this from time to time.

@denesb
Contributor

denesb commented May 10, 2023

I opened an issue for this: #13841.

@denesb
Contributor

denesb commented May 10, 2023

Fix here: #13843

@denesb
Contributor

denesb commented May 10, 2023

I think, BTW, that we should close this issue at some point. There are already two entirely unrelated problems included in it, linked only by the fact that both produce timeouts in a mixed cluster. If we keep it open, we will find more and more such cases, all of them conflated into a single issue that is impossible to track.

@mykaul
Contributor

mykaul commented May 10, 2023

I think, BTW, that we should close this issue at some point. There are already two entirely unrelated problems included in it, linked only by the fact that both produce timeouts in a mixed cluster. If we keep it open, we will find more and more such cases, all of them conflated into a single issue that is impossible to track.

Agreed. Let's close it with #13843 .

@denesb
Contributor

denesb commented May 10, 2023

Agreed. Let's close it with #13843 .

Updated the PR to close this issue when merged.

@ShlomiBalalis

I think, BTW, that we should close this issue at some point. There are already two entirely unrelated problems included in it, linked only by the fact that both produce timeouts in a mixed cluster. If we keep it open, we will find more and more such cases, all of them conflated into a single issue that is impossible to track.

Agreed. Let's close it with #13843 .

Will the fix enter 2022.1? Because 2022.1.7 fails with this issue again:

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.
  • This is not a regression

As part of the upgrade test, we did the following steps:

  • Upgrade node 4
  • Upgrade node 1
  • Rollback node 1

Afterwards, we ran several queries for data validation, and while doing so, the connection timed out:
cassandra.OperationTimedOut: errors={'10.0.1.136:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.1.136:9042

reader_concurrency_semaphore messages from node 4 (the only node that is upgraded at this point):

May 09 13:58:57 rolling-upgrade--centos-db-node-617008dc-4 scylla[33077]:  [shard 0] reader_concurrency_semaphore - Semaphore sl:default_read_concurrency_sem with 1/100 count and 80367/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          26        0        60K        keyspace_fill_db_data.limit_multiget_test_scylla_cdc_log/multishard-mutation-query/active/unused
                                                                          1        1        18K        keyspace1.standard1/data-query/active/used
                                                                          1        0        184B        ks1.table1/push-view-updates-2/active/unused
                                                                          20        0        0B        keyspace_fill_db_data.limit_multiget_test_scylla_cdc_log/shard-reader/waiting
                                                                          2630        0        0B        keyspace1.standard1/data-query/waiting
                                                                          
                                                                          2678        1        78K        total
                                                                          
                                                                          Total: 2678 permits with 1 count and 78K memory resources
May 09 13:58:57 rolling-upgrade--centos-db-node-617008dc-4 scylla[33077]:  [shard 7] reader_concurrency_semaphore - Semaphore sl:default_read_concurrency_sem with 12/100 count and 218145/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          11        11        196K        keyspace_fill_db_data.limit_multiget_test_scylla_cdc_log/shard-reader/inactive
                                                                          1        1        17K        keyspace1.standard1/data-query/active/used
                                                                          1        0        184B        ks1.table1/push-view-updates-2/active/unused
                                                                          13        0        0B        keyspace_fill_db_data.limit_multiget_test_scylla_cdc_log/shard-reader/waiting
                                                                          1258        0        0B        keyspace1.standard1/data-query/waiting
                                                                          
                                                                          1284        12        213K        total
                                                                          
                                                                          Total: 1284 permits with 12 count and 213K memory resources

It's important to note that this time the issue appeared when upgrading from 2021.1 and not when upgrading from 5.0 (like it did in the previous test).

Impact

Lowers availability during rolling upgrade.

How frequently does it reproduce?

In this run, the issue reproduced only when upgrading from 2021.1.19 (and not from 5.0.13 or 2022.1.6). This is the opposite of the previous reproduction, in the 2022.1.6 rolling upgrade, where the issue appeared when upgrading from the OSS version.

Installation details

Kernel Version: 5.15.0-1033-aws
Scylla version (or git commit hash): 2021.1.19-20230413.0c75931d2e with build-id e33e6faa25fc85927227e488480eaea3147f7ecb
Target version: 2022.1.7-0.20230508.7bd82bd3a3 with build-id b0bf2913e0212aa6390ae049bf57b5a4fdf211ba

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-db-node-617008dc-4 (34.244.54.75 | 10.0.3.222) (shards: 8)
  • rolling-upgrade--centos-db-node-617008dc-3 (34.246.250.225 | 10.0.3.76) (shards: 8)
  • rolling-upgrade--centos-db-node-617008dc-2 (3.252.205.115 | 10.0.0.95) (shards: 8)
  • rolling-upgrade--centos-db-node-617008dc-1 (52.213.7.123 | 10.0.1.136) (shards: 8)

OS / Image: ami-0ee6c67eb612b033d (aws: eu-west-1)

Test: rolling-upgrade-ami-test
Test id: 617008dc-d94b-4bfe-8234-653b1648b4d5
Test name: enterprise-2022.1/rolling-upgrade/rolling-upgrade-ami-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 617008dc-d94b-4bfe-8234-653b1648b4d5
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 617008dc-d94b-4bfe-8234-653b1648b4d5

Logs:

Jenkins job URL

@mykaul
Contributor

mykaul commented May 10, 2023

Will the fix enter 2022.1? Because 2022.1.7 fails with this issue again:

Unsure - I don't think it's an important enough use case.

@denesb
Contributor

denesb commented May 10, 2023

Will the fix enter 2022.1? Because 2022.1.7 fails with this issue again:

This seems to be something else, different from the issue in #12552 (comment).

This is closer to the issue in the opening report, but not quite the same. It is clear we have problems around mixed clusters; I have heard reports from the field too, with skyrocketing latencies when upgrading from 2021.1 -> 2022.1. But since this affects purely enterprise versions, let's open an enterprise issue for it, and close this issue with the PR which fixes the OSS->enterprise path.

avikivity added a commit that referenced this issue May 22, 2023
…tond Dénes

On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If a match is not found, it is assumed that the cluster is being upgraded and the remote node has a scheduling group the local node doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads now also share the semaphore, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it to the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service-level connections are matched to the default statement tenant after the PR, whereas before the PR they indeed matched the default scheduling group.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types
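
To make the described behavior concrete, here is a Python illustration of the matching logic; the real implementation is C++ inside Scylla's RPC connection setup, and every name below is illustrative rather than taken from the code:

# Before the fix, an unrecognized isolation cookie (e.g. an enterprise
# service level) fell back to the default/system scheduling group, so user
# reads on OSS nodes landed on the system concurrency semaphore.
KNOWN_STATEMENT_TENANTS = {"statement"}             # illustrative tenant names
KNOWN_SYSTEM_CONNECTIONS = {"gossip", "streaming"}  # illustrative system connection types

def pick_scheduling_group(isolation_cookie, fixed=True):
    if isolation_cookie in KNOWN_STATEMENT_TENANTS or isolation_cookie in KNOWN_SYSTEM_CONNECTIONS:
        return isolation_cookie
    if fixed:
        # After the fix: treat the unknown cookie as an unknown *tenant* and
        # map it to the default statement (user) tenant.
        return "statement"
    # Before the fix: unknown cookies were demoted to the default group.
    return "default"

print(pick_scheduling_group("sl:my_service_level", fixed=False))  # default
print(pick_scheduling_group("sl:my_service_level", fixed=True))   # statement
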
@DoronArazii DoronArazii modified the milestones: 5.x, 5.4 Jun 1, 2023
@ShlomiBalalis

ShlomiBalalis commented Jul 11, 2023

Issue description

@denesb The issue was possibly reproduced in the latest upgrade test run of 2022.1.8. Can you confirm whether this is, in fact, the same one?

The test stages were as follows:
2023-07-10 16:52:07,570 - 2023-07-10 17:00:05,042: node 2 was upgraded
2023-07-10 17:05:31,556 - 2023-07-10 17:11:36,294: node 1 was upgraded
2023-07-10 17:43:35,486 - 2023-07-10 18:04:01,156: node 1 was rolled back

At this stage, while the cluster was using mixed-version nodes, we verified the data with a long series of queries.
At 2023-07-10 18:12:48.113, we timed out:

2023-07-10 18:12:48.113: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=9478fed1-4d05-48af-8a6b-bb2cf738d6b2, source=UpgradeTest.test_rolling_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 633, in test_rolling_upgrade
    self.fill_and_verify_db_data('after rollback the second node')
  File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 468, in fill_and_verify_db_data
    self.verify_db_data()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3362, in verify_db_data
    self.run_db_queries(session, session.default_fetch_size)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3309, in run_db_queries
    item["cdc_tables"][cdc_table] = self.get_cdc_log_rows(session, cdc_table)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/fill_db_data.py", line 3313, in get_cdc_log_rows
    return list(session.execute(f"select * from {self.base_ks}.{cdc_log_table}"))
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1376, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2668, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4953, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.1.68:9042

Looking at the nodes' logs, several of the nodes reported that reader_concurrency_semaphore was timing out at the time:
node 2:

Jul 10 18:09:57 rolling-upgrade--centos-db-node-3b4b4015-2 scylla[44632]:  [shard 6] reader_concurrency_semaphore - Semaphore sl:default_read_concurrency_sem with 1/100 count and 194124/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          185        0        174K        ks1.table1/push-view-updates-2/active/unused
                                                                          1        1        16K        keyspace1.standard1/data-query/active/unused
                                                                          77        0        0B        keyspace1.standard1/data-query/waiting
                                                                          
                                                                          263        1        190K        total
                                                                          
                                                                          Total: 263 permits with 1 count and 190K memory resources

node 3:
Between Jul 10 18:02:43 and Jul 10 18:12:21, the node reported 16 reader_concurrency_semaphore timeouts, like the following:

Jul 10 18:11:43 rolling-upgrade--centos-db-node-3b4b4015-3 scylla[10136]:  [shard 1] reader_concurrency_semaphore - (rate limiting dropped 2295 similar messages) Semaphore _system_read_concurrency_sem with 92/100 count and 33959/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          1        1        17K        keyspace1.standard1/data-query/active/used
                                                                          1        1        16K        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/inactive
                                                                          1        0        184B        ks1.table1/push-view-updates-2/active/unused
                                                                          1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting
                                                                          37        0        0B        keyspace1.standard1/data-query/waiting
                                                                          2        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/multishard-mutation-query/active/unused
                                                                          1        0        0B        *.*/view_builder/active/unused
                                                                          
                                                                          44        2        33K        total
                                                                          
                                                                          Total: 44 permits with 2 count and 33K memory resources
Jul 10 18:11:43 rolling-upgrade--centos-db-node-3b4b4015-3 scylla[10136]:  [shard 1] mutation_reader - shard_reader::close(): read_ahead on shard 1 failed: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
Jul 10 18:11:45 rolling-upgrade--centos-db-node-3b4b4015-3 scylla[10136]:  [shard 7] reader_concurrency_semaphore - (rate limiting dropped 3963 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17391/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          1        1        17K        keyspace1.standard1/data-query/active/used
                                                                          1        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                          1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting
                                                                          2652        0        0B        keyspace1.standard1/data-query/waiting
                                                                          1        0        0B        *.*/view_builder/active/unused
                                                                          
                                                                          2656        1        17K        total
                                                                          
                                                                          Total: 2656 permits with 1 count and 17K memory resources
...
Jul 10 18:11:51 rolling-upgrade--centos-db-node-3b4b4015-3 scylla[10136]:  [shard 3] reader_concurrency_semaphore - (rate limiting dropped 1 similar messages) Semaphore _system_read_concurrency_sem with 93/100 count and 50543/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          2        2        32K        system_distributed.service_levels/shard-reader/inactive
                                                                          1        1        17K        keyspace1.standard1/data-query/active/used
                                                                          2        0        368B        ks1.table1/push-view-updates-2/active/unused
                                                                          1        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting
                                                                          2        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                          3491        0        0B        keyspace1.standard1/data-query/waiting
                                                                          1        0        0B        *.*/view_builder/active/unused
                                                                          
                                                                          3500        3        49K        total
                                                                          
                                                                          Total: 3500 permits with 3 count and 49K memory resources
...
Jul 10 18:12:21 rolling-upgrade--centos-db-node-3b4b4015-3 scylla[10136]:  [shard 3] reader_concurrency_semaphore - (rate limiting dropped 11566 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17911/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                          permits        count        memory        table/description/state
                                                                          1        1        17K        keyspace1.standard1/data-query/active/used
                                                                          3        0        520B        ks1.table1/push-view-updates-2/active/unused
                                                                          2        0        0B        keyspace_fill_db_data.range_tombstones_test_scylla_cdc_log/shard-reader/waiting
                                                                          4171        0        0B        keyspace1.standard1/data-query/waiting
                                                                          1        0        0B        *.*/view_builder/active/unused
                                                                          
                                                                          4178        1        17K        total
                                                                          
                                                                          Total: 4178 permits with 1 count and 17K memory resources

node 4:
Between Jul 10 18:03:06 and Jul 10 18:10:53, the node reported 6 reader_concurrency_semaphore timeouts, like the following:

Jul 10 18:03:25 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 7] reader_concurrency_semaphore - (rate limiting dropped 33 similar messages) Semaphore _system_read_concurrency_sem with 94/100 count and 66863/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         2        2        32K        system_auth.roles/shard-reader/inactive
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         1        1        16K        system_distributed.service_levels/shard-reader/inactive
                                                                         2        0        368B        ks1.table1/push-view-updates-2/active/unused
                                                                         2        0        0B        system_auth.roles/shard-reader/waiting
                                                                         616        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         625        4        65K        total
                                                                         
                                                                         Total: 625 permits with 4 count and 65K memory resources
Jul 10 18:03:25 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 4] reader_concurrency_semaphore - (rate limiting dropped 11794 similar messages) Semaphore _system_read_concurrency_sem with 94/100 count and 66743/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         2        2        32K        system_auth.roles/shard-reader/inactive
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         1        1        16K        system_distributed.service_levels/shard-reader/inactive
                                                                         1        0        184B        ks1.table1/push-view-updates-2/active/unused
                                                                         2        0        0B        system_auth.roles/shard-reader/waiting
                                                                         814        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         822        4        65K        total
                                                                         
                                                                         Total: 822 permits with 4 count and 65K memory resources
Jul 10 18:03:25 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 0] mutation_reader - shard_reader::close(): read_ahead on shard 6 failed: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
Jul 10 18:03:25 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 0] mutation_reader - shard_reader::close(): read_ahead on shard 6 failed: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
Jul 10 18:03:25 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 0] mutation_reader - shard_reader::close(): read_ahead on shard 7 failed: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
Jul 10 18:03:25 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 0] mutation_reader - shard_reader::close(): read_ahead on shard 7 failed: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
...
Jul 10 18:10:53 rolling-upgrade--centos-db-node-3b4b4015-4 scylla[9703]:  [shard 6] reader_concurrency_semaphore - (rate limiting dropped 8217 similar messages) Semaphore _system_read_concurrency_sem with 91/100 count and 17335/149401108 memory resources: timed out, dumping permit diagnostics:
                                                                         permits        count        memory        table/description/state
                                                                         1        1        17K        keyspace1.standard1/data-query/active/used
                                                                         3        0        0B        system_distributed.service_levels/shard-reader/waiting
                                                                         2125        0        0B        keyspace1.standard1/data-query/waiting
                                                                         1        0        0B        *.*/view_builder/active/unused
                                                                         
                                                                         2130        1        17K        total
                                                                         
                                                                         Total: 2130 permits with 1 count and 17K memory resources

Do note that there were several reader_concurrency_semaphore timeouts at earlier stages of the test run, but they did not cause any visible issue.

How frequently does it reproduce?

The issue was reproduced only when upgrading from 5.0.13, and not when upgrading from a different enterprise version.
Also, the issue was reproduced only in the AMI rolling-upgrade test, and not in any other upgrade test scenario (such as Ubuntu 20, CentOS 7, and CentOS 8).

Installation details

Kernel Version: 5.15.0-1034-aws
Scylla version (or git commit hash): 5.0.13-20230423.a0ca8abe4 with build-id 1196f46d1be44927c95ac67703df3ab2ef98a332
Target Scylla version: 2022.1.8-0.20230706.f1629bf9d7 with build-id 1801687008fa25d4287ce36b492671ee3d046734

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • rolling-upgrade--centos-db-node-3b4b4015-4 (34.243.190.150 | 10.0.3.29) (shards: 8)
  • rolling-upgrade--centos-db-node-3b4b4015-3 (34.246.173.29 | 10.0.0.155) (shards: 8)
  • rolling-upgrade--centos-db-node-3b4b4015-2 (54.171.248.244 | 10.0.1.68) (shards: 8)
  • rolling-upgrade--centos-db-node-3b4b4015-1 (54.75.122.40 | 10.0.3.116) (shards: 8)

OS / Image: ami-0a5b4324be630104d (aws: eu-west-1)

Test: rolling-upgrade-ami-test
Test id: 3b4b4015-738d-4a36-83b3-6de81b4529ee
Test name: enterprise-2022.1/rolling-upgrade/rolling-upgrade-ami-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 3b4b4015-738d-4a36-83b3-6de81b4529ee
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 3b4b4015-738d-4a36-83b3-6de81b4529ee

Logs:

Jenkins job URL
Argus

@denesb commented Jul 12, 2023

@ShlomiBalalis yes, it looks like the same issue. Note that #13841 (the fix) has not been backported to any version yet. Furthermore, it will not be backported to 5.0, so it will potentially always reproduce on 5.0 -> 2022.1 upgrades.

denesb pushed a commit that referenced this issue Jul 12, 2023
…tond Dénes

On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If no match is found, it is assumed that the cluster is being upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads compete directly with system ones for resources, but the two workloads also share the semaphore, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored: service-level connections are recognized as user connections and executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually by creating a cluster of 2 OSS nodes, upgrading one of them to enterprise, and verifying (with extra logging) that service-level connections are matched to the default statement tenant after the PR, whereas before they matched the default scheduling group.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types

(cherry picked from commit a7c2c9f)
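
The mechanism described in the commit message can be illustrated with a minimal sketch. This is not Scylla's actual code (which is C++); it is a hypothetical Python model of the cookie-to-scheduling-group matching, showing the pre-fix fallback to the default group versus the post-fix fallback to the default statement tenant. All names here are made up for illustration:

```python
# Hypothetical sketch only -- not Scylla source. All names are made up.
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingGroup:
    name: str


# Tenants and system connection types known to this (old) node. An upgraded
# peer may send an isolation cookie (e.g. "sl:my_service_level") that is not
# in either map.
STATEMENT_TENANTS = {
    "$user": SchedulingGroup("statement"),
    "$system": SchedulingGroup("system"),
}
SYSTEM_CONNECTIONS = {
    "gossip": SchedulingGroup("gossip"),
    "streaming": SchedulingGroup("streaming"),
}
DEFAULT_GROUP = SchedulingGroup("default")              # pre-fix fallback
DEFAULT_STATEMENT_TENANT = STATEMENT_TENANTS["$user"]   # post-fix fallback


def scheduling_group_for(isolation_cookie: str, apply_fix: bool) -> SchedulingGroup:
    """Match a connection's isolation cookie to a scheduling group."""
    if isolation_cookie in STATEMENT_TENANTS:
        return STATEMENT_TENANTS[isolation_cookie]
    if isolation_cookie in SYSTEM_CONNECTIONS:
        return SYSTEM_CONNECTIONS[isolation_cookie]
    # Unknown cookie: before the fix, fall back to the default scheduling
    # group (so service-level reads end up competing with system work);
    # after the fix, treat it as an unknown *tenant* and route it to the
    # default statement tenant instead.
    return DEFAULT_STATEMENT_TENANT if apply_fix else DEFAULT_GROUP


# A service-level cookie coming from an upgraded enterprise node:
print(scheduling_group_for("sl:marketing", apply_fix=False).name)  # -> default
print(scheduling_group_for("sl:marketing", apply_fix=True).name)   # -> statement
```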
@ShlomiBalalis commented Jul 12, 2023

@ShlomiBalalis yes, it looks like the same issue. Note that #13841 (the fix) was not yet backported to any versions yet. Furthermore, it will not be backported to 5.0, so it will potentially always reproduce on 5.0 -> 2022.1 upgrades.

Noted. We will keep that in mind for the next patch release.

@DoronArazii

Backported.
Removing the label.

@juliayakovlev commented Aug 2, 2023

Looks like the issue has been reproduced here.

Base scylla version: 5.2.5-20230716.02bc54d4b6f0 (Build id: e50ee38612a93978a76206ed14f7340298d71deb)
Upgraded scylla version: 2023.1.0~rc8-20230731.b6f7c5a6910c (Build id: f6e718548e76ccf3564ed2387b6582ba8d37793c)

Upgrade of node rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-2 starts:

2023-07-31 21:31:14.406: (InfoEvent Severity.NORMAL) period_type=not-set event_id=f74458c1-d598-4f67-a8e2-0f0e654d0169: message=Upgrade Node rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-2 begins

reader_concurrency_semaphore timeout on Semaphore sl:$user_read_concurrency_sem:

Jul 31 21:44:38 rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-2 scylla[222491]:  [shard 6] reader_concurrency_semaphore - Semaphore sl:$user_read_concurrency_sem with 4/100 count and 144213/144685 memory resources: timed out, dumping permit diagnostics:
                                                                                   permits        count        memory        table/description/state
                                                                                   4        4        141K        keyspace1.standard1/data-query/active/blocked
                                                                                   2597        0        0B        keyspace1.standard1/data-query/waiting
                                                                                   
                                                                                   2601        4        141K        total
                                                                                   
                                                                                   Total: 2601 permits with 4 count and 141K memory resources

New Scylla version initialization completed:

Jul 31 21:44:53 rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-2 scylla[222491]:  [shard 0] init - Scylla version 2023.1.0~rc8-0.20230731.b6f7c5a6910c initialization completed.

A SELECT from the truncated table started and did not finish:

2023-07-31 21:45:53,273 f:common.py       l:1734 c:utils                p:DEBUG > Executing CQL 'SELECT COUNT(*) FROM truncate_table0' ...
Jul 31 21:45:53 rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-2 scylla[222491]:  [shard 2] messaging_service - Service level $user is still unknown, will try to create it now and allow the RPC connection.

A session read timeout is received:

File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 121, in read_data_from_truncated_tables
count = self.rows_to_list(session.execute(truncate_query.format(table_name)))
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1727, in execute_verbose
return execute_orig(*args, **kwargs)
File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 5018, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.142.0.138:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.142.0.138:9042

Test failed.

@denesb is it the same issue?

Installation details

Kernel Version: 4.18.0-500.el8.x86_64
Scylla version (or git commit hash): 5.2.5-20230716.02bc54d4b6f0 with build-id e50ee38612a93978a76206ed14f7340298d71deb
Target Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c

Cluster size: 3 nodes (n1-highmem-8)

Scylla Nodes used in this run:

  • rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-3 (35.237.70.216 | 10.142.0.144) (shards: 7)
  • rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-2 (34.148.67.96 | 10.142.0.142) (shards: 7)
  • rolling-upgrade-sla--centos-8-db-node-7b5d0902-0-1 (35.237.68.179 | 10.142.0.138) (shards: 7)

OS / Image: https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/family/centos-stream-8 (gce: us-east1)

Test: rolling-upgrade-with-sla-test
Test id: 7b5d0902-a87a-47da-885e-0c5b50994d7d
Test name: enterprise-2023.1/rolling-upgrade/rolling-upgrade-with-sla-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 7b5d0902-a87a-47da-885e-0c5b50994d7d
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7b5d0902-a87a-47da-885e-0c5b50994d7d

Logs:

Jenkins job URL
Argus

@mykaul commented Aug 2, 2023

@juliayakovlev - did you have read query timeouts on the reproduction? What is this truncated table?

@juliayakovlev

@juliayakovlev - did you have read query timeouts on the reproduction? What is this truncated table?

Yes, a session read timeout is received:

File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 121, in read_data_from_truncated_tables
count = self.rows_to_list(session.execute(truncate_query.format(table_name)))
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1727, in execute_verbose
return execute_orig(*args, **kwargs)
File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 5018, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.142.0.138:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.142.0.138:9042
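
For context, here is a minimal sketch (not part of the test code; the keyspace name is a placeholder) of issuing the same read with an explicit per-request timeout via the Python driver, which is the timeout the OperationTimedOut message refers to:

```python
from cassandra.cluster import Cluster

# Contact point taken from the error above; keyspace name is a placeholder.
cluster = Cluster(["10.142.0.138"])
session = cluster.connect("my_keyspace")

# Session.execute() accepts a per-call timeout in seconds (None disables the
# client-side timeout); without it, the driver uses Session.default_timeout.
rows = session.execute("SELECT COUNT(*) FROM truncate_table0", timeout=120)
print(rows.one())

cluster.shutdown()
```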

@denesb commented Aug 2, 2023

This is not the same issue. Reads are no longer piling up on the system semaphore as before.
I do see there is some confusion in the mixed cluster, and we create a $user SL. @eliransin, is this expected on OSS->Ent upgrades?
