Read query times out with reader_concurrency_semaphore during rolling upgrade on a mixed cluster #12552
Comments
@denesb the reader concurrency semaphore reports are odd; they don't show any overload.
I remember we fixed the admission procedure; maybe the test is hitting this problem.
Indeed, there are some strange reports here, like the one with one unused permit and many waiters. That is "illegal". Yes, we backported some fixes that could be relevant, then reverted them because they were suspected of causing timeouts. The question is whether this test was run on a version from before or after the reverts: do we blame the fixes, or the lack of them?
The test was on mixed 5.0.8/5.0.9. @KnifeyMoloko, which versions showed the error?
@avikivity @denesb Both of the nodes that showed errors were nodes not yet upgraded, i.e. they were on Scylla version 5.0.8-0.20221221.874fa1520 with build-id a6d441e82e54f6facd370c0ce65938c48a993c15. The two tests that failed with this were run on the 11th and 12th of January respectively, so after the reverts, if I'm not mistaken.
@fgelcer do we still see it in the latest version? 5.0.9 to 5.0.10?
Issue description
As part of the upgrade test, we did the following steps:
Afterwards, we executed several queries for data validation. However, the connection timed out:
Node 3:
Impact
Lowers availability during rolling upgrade.
How frequently does it reproduce?
Only reproduces when upgrading from 5.0 (and not 2022.1.5 or 2021.1.19), and has occurred twice out of two runs.
Installation details
Kernel Version: 5.15.0-1033-aws
Cluster size: 4 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
5.0.12 is not very interesting - 5.0.18 or later is more interesting - and it doesn't happen from those versions, right?
There is one "used" permit (I guess that means using CPU) and many blocked. This looks like CPU starvation.
So this isn't the kernel regression?
I'm pretty sure 5.0.12 is the latest release, unless 5.0.13 has been released already.
Correct, mea culpa on the version mixup. I wonder if they all have the reader semaphore fixes, though.
No, .19 had it.
I think what we are seeing here is fallout from starting to use service-levels, which the OSS nodes don't recognize and hence fall back to the system scheduling group. User reads flood the system semaphore and internal reads now have to compete with user reads. This is especially bad if there are system reads triggered by user reads (e.g. an auth query). Furthermore, the system semaphore has far fewer resources available than the user read semaphore, although I'm only seeing CPU contention here. Basically, this is the same as a 2-lane road (both lanes in the same direction) suddenly being restricted to a single lane: traffic that flows smoothly on 2 lanes can suddenly queue up on a single one.
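To make the lane analogy concrete, here is a minimal, self-contained sketch (plain C++, not Scylla code; the permit and read counts are invented for illustration). It only shows the queueing effect: the same user reads that fit comfortably into a large per-workload semaphore pile up as waiters when they are funneled into a small shared one.

```cpp
// Toy model of the contention described above. Not Scylla code; the numbers
// are made up. A semaphore admits a read if a permit is free, otherwise the
// read queues (and in the real system may eventually time out).
#include <cstdio>
#include <deque>
#include <string>

struct toy_semaphore {
    std::string name;
    int permits;                        // free permits (stands in for CPU/memory budget)
    std::deque<std::string> waiters;    // reads waiting for admission

    void admit(const std::string& read) {
        if (permits > 0) {
            --permits;                  // admitted, runs immediately
        } else {
            waiters.push_back(read);    // queued behind everything else
        }
    }
};

int main() {
    toy_semaphore user{"user", /*permits=*/100};
    toy_semaphore system{"system", /*permits=*/10};

    // Healthy cluster: user reads go to the large user semaphore.
    for (int i = 0; i < 50; ++i) user.admit("user-read");

    // Mixed cluster with an unrecognized service level: the same user reads
    // fall back to the small system semaphore and compete with system reads.
    for (int i = 0; i < 50; ++i) system.admit("user-read");
    for (int i = 0; i < 5; ++i)  system.admit("auth-read");

    std::printf("user semaphore: %zu waiters\n", user.waiters.size());    // 0
    std::printf("system semaphore: %zu waiters\n", system.waiters.size()); // 45
}
```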
I think we should make an effort to recognize service-levels in OSS and map them to the
So such failures should not happen if we upgrade OSS->OSS or Enterprise->Enterprise, only with OSS->Enterprise?
Yes. Whether this is happening can be recognized from the fact that timeouts are from the system semaphore, rather than the user semaphore (which is just called semaphore, without a prefix).
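As a rough aid for triaging such reports, here is a hypothetical helper (not part of Scylla or the test framework). The exact semaphore names differ across versions, so the string checks below are assumptions based only on what is described in this thread: the system semaphore carries a "system" prefix, later reports show an "sl:" prefix for service-level semaphores, and the plain user semaphore has no prefix.

```cpp
// Hypothetical triage helper: given the semaphore name from a timeout /
// queue-dump log line, guess which workload is affected. The name patterns
// are assumptions taken from this thread, not an authoritative list.
#include <iostream>
#include <string>
#include <string_view>

std::string classify_semaphore(std::string_view name) {
    if (name.find("system") != std::string_view::npos) {
        // Timeouts here match the scenario above: internal reads, possibly
        // joined by misrouted user reads from unrecognized service levels.
        return "system semaphore";
    }
    if (name.rfind("sl:", 0) == 0) {
        // Seen in later reports in this thread, e.g. "sl:$user_read_concurrency_sem".
        return "service-level (user) semaphore";
    }
    return "user (statement) semaphore";
}

int main() {
    std::cout << classify_semaphore("sl:$user_read_concurrency_sem") << '\n';
    std::cout << classify_semaphore("_read_concurrency_sem") << '\n';
}
```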
@roydahan - based on the above, I think we should give lower priority to OSS->Enterprise upgrades, where we see this from time to time.
I opened an issue for this: #13841.
Fix here: #13843
I think, BTW, that we should close this issue at some point. There are already two entirely unrelated problems included in it, linked only by the fact that both produce timeouts in a mixed cluster. If we keep it open, we will find more and more such cases, all of them conflated into a single issue that is impossible to track.
Agreed. Let's close it with #13843.
Updated the PR to close this issue when merged.
Will the fix enter 2022.1? Because 2022.1.7 fails with this issue again:
Issue description
As part of the upgrade test, we did the following steps:
Afterwards, we ran several queries for data validation, and while doing so, the connection timed out:
It's important to note that this time the issue appeared when upgrading from 2021.1 and not when upgrading from 5.0 (like it did in the previous test).
Impact
Lowers availability during rolling upgrade.
How frequently does it reproduce?
In this run, the issue was only reproduced when upgrading from 2021.1.19 (and not from
Installation details
Kernel Version: 5.15.0-1033-aws
Cluster size: 4 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
Unsure - I don't think it's an important enough use case.
This seems to be something else, different from the issue in #12552 (comment). It is closer to the issue described in the opening post, but not quite. It is clear we have problems around mixed clusters; I have heard reports from the field too, with skyrocketing latencies when upgrading from 2021.1 -> 2022.1. But since this affects purely enterprise versions, let's open an enterprise issue for this, and close this issue with the PR which fixes the OSS->enterprise path.
…tond Dénes

On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name. If a match is not found, it is assumed that the cluster is being upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored: service-level connections are recognized as user connections and executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service-level connections are matched to the default statement tenant after the PR, and that they indeed match to the default scheduling group before it.

Fixes: #13841
Fixes: #12552
Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types
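For readers unfamiliar with the code, the control flow the PR description refers to can be sketched roughly as below. This is illustrative only; the type and function names are invented and this is not the actual Scylla implementation.

```cpp
// Simplified sketch of the connection-setup matching described above.
// Only the control flow mirrors the PR description; all names are invented.
#include <string>
#include <string_view>
#include <vector>

enum class scheduling_group { statement, system, dflt };

struct tenant { std::string name; scheduling_group sg; };

scheduling_group classify_connection(std::string_view isolation_cookie,
                                     const std::vector<tenant>& known_connection_types,
                                     bool with_fix) {
    // Iterate over the known statement tenants and system connection types,
    // picking the one whose name matches the cookie sent by the peer.
    for (const auto& t : known_connection_types) {
        if (t.name == isolation_cookie) {
            return t.sg;
        }
    }
    if (with_fix) {
        // After the fix: an unknown cookie is assumed to be a tenant this node
        // doesn't know yet (e.g. an enterprise service level), so the connection
        // is treated as a user connection and runs in the statement scheduling
        // group and the user concurrency semaphore.
        return scheduling_group::statement;
    }
    // Before the fix: fall back to the default scheduling group, which is what
    // pushed service-level (user) traffic next to system work on OSS nodes.
    return scheduling_group::dflt;
}

int main() {
    std::vector<tenant> known = {
        {"$user", scheduling_group::statement},   // invented cookie names
        {"$system", scheduling_group::system},
    };
    // An enterprise service-level cookie unknown to this OSS node:
    auto sg = classify_connection("service_level:sl_interactive", known, /*with_fix=*/true);
    return sg == scheduling_group::statement ? 0 : 1;
}
```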
Issue description
@denesb The issue was possibly reproduced in the latest upgrade test run of 2022.1.8. Can you confirm whether this is, in fact, the same one? The test stages were as follows:
At this stage, while the cluster was using mixed-version nodes, we verified the data with a long series of queries.
Looking at the nodes' logs, several of the nodes have reported that
node 3:
node 4:
Do note that there were several
How frequently does it reproduce?
The issue was only reproduced when upgrading from
Installation details
Kernel Version: 5.15.0-1034-aws
Cluster size: 4 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
@ShlomiBalalis yes, it looks like the same issue. Note that #13841 (the fix) has not been backported to any version yet. Furthermore, it will not be backported to 5.0, so it will potentially always reproduce on 5.0 -> 2022.1 upgrades.
…tond Dénes

Backport of the fix above: match unknown tenants to the default tenant; generalize per-tenant connection types.

Fixes: #13841
Fixes: #12552
Closes #13843

(cherry picked from commit a7c2c9f)
Noted. We will keep that in mind for the next patch release.
Backported.
Looks like the issue has been reproduced here.
Base Scylla version: 5.2.5-20230716.02bc54d4b6f0 (Build id: e50ee38612a93978a76206ed14f7340298d71deb)
Start node
reader_concurrency_semaphore - Semaphore sl:$user_read_concurrency_sem :
New Scylla version initialization completed:
Select from the truncated table started and did not finish:
A session read timeout is received:
Test failed. @denesb is it the same issue?
Installation details
Kernel Version: 4.18.0-500.el8.x86_64
Cluster size: 3 nodes (n1-highmem-8)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
@juliayakovlev - did you have read query timeouts on the reproduction? What is this truncated table?
Yes, a session read timeout is received:
This is not the same issue. Reads are no longer piling up on the system semaphore as before.
Issue description
During rolling upgrade we:
During the last step we encountered a read query timeout caused by a reader_concurrency_semaphore timeout. What we're seeing is:
sct.log
We retry the query 5 times in total. All retries end with the same timeout.
On the target nodes we see:
node 4
node 1
This leads to the query timing out on the client:
How frequently does it reproduce?
We've also seen this happen in a previous run with an upgrade from base version 5.0.8-20221221.874fa1520 to target version 5.0.9-20230111.94b8baa79
Installation details
Kernel Version: 5.15.0-1026-aws
Scylla base version (or git commit hash): 5.0.8-20221221.874fa1520 with build-id a6d441e82e54f6facd370c0ce65938c48a993c15
Scylla upgrade target version: 5.0.9-20230111.94b8baa79 with build-id 520004ed6991c6969da975000ec4a2ea10defe99
Cluster size: 4 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-07d59029a97cce02e (aws: eu-west-1)
Test: rolling-upgrade-ami-test
Test id: e0a206ec-c4b3-470a-925a-a43b07f9d395
Test name: scylla-5.0/rolling-upgrade/rolling-upgrade-ami-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor e0a206ec-c4b3-470a-925a-a43b07f9d395
$ hydra investigate show-logs e0a206ec-c4b3-470a-925a-a43b07f9d395
Logs:
Jenkins job URL