nodetool enablebinary starts the CQL server in the streaming group #15485
nice
I couldn't for the life of me dig out the schema pointer.
Also rephrase the messages a bit so they are more uniform. The goal of this change is to make semaphore mismatches easier to diagnose, by including the table name and the permit name in the printout. While at it, add a test for semaphore mismatch; it didn't have one. Refs: scylladb#15485
@denesb I think I've seen this coredump again (on the same test case):
Installation details
Kernel Version: 5.15.0-1042-gcp
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
Unfortunately #15508 was still not merged :(.
Got it again during the same test on master:
Installation details
Kernel Version: 5.15.0-1046-gcp
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
@bhalevy - we need to determine if this one is a blocker for 5.4 or not.
Yes, this is a blocker; we need to find out how readers are leaked to other scheduling groups.
Looking at the last reproducer. The error message looks like this:
We are attempting to continue a read -- started in the statement group -- in the streaming group. Looking at the task-queues:
We are indeed currently in the streaming group. This is definitely a bug.
Huh, looks like we started the CQL server in the streaming group:
Found the bug:
YAY!
The semaphore mismatch detection code is just a canary in the coal mine here. The bug has been in Scylla for who knows how long.
We have at least one customer, who shall remain nameless, that uses this functionality and will need this fixed (CC @fee-mendes).
Updated title and description. BTW I think we should do this more -- often the title/opening description of issues is that of the original bug report, and one has to scroll back and read multiple comments to understand the result of the investigation. |
Title - absolutely - I do it regularly.
I advocate for adding a |
If only GitHub comments had timestamps so you could repair history in hindsight... ;-)
Seems like a good business use case for AI to summarize an ongoing issue. |
Praise be to scylladb/seastar@42970cf. |
Yes, before seeing this in the logs, I was chasing a complete dead-end of possibly misbehaving |
Currently, it is started/stopped in the streaming/maintenance sg, which is what the API itself runs in. Starting the native transport in the streaming sg will lead to severely degraded performance, as the streaming sg has significantly fewer CPU/disk shares and reader concurrency semaphore resources. Furthermore, it will lead to multi-paged reads possibly switching between scheduling groups mid-way, triggering an internal error.

To fix, use `with_scheduling_group()` for both starting and stopping the native transport. Technically, it is only strictly necessary for starting, but I added it for stop as well, for consistency. Also apply the same treatment to RPC (Thrift). Although no one uses it, best to fix it, just to be on the safe side.

I think we need a more systematic approach for solving this once and for all, like passing the scheduling group to the protocol server and having it switch to it internally. This would allow the server to always run in the correct scheduling group, not depending on the caller to remember to use it. However, I think this is best done in a follow-up, to keep this critical patch small and easily backportable.

Fixes: scylladb#15485
Fix here: #16019.
@mykaul @fee-mendes FYI, I just checked and on enterprise, the effects are much reduced, because upon authentication, the connection will be moved to the SL-appropriate scheduling group anyway.
Currently, it is started/stopped in the streaming/maintenance sg, which is what the API itself runs in. Starting the native transport in the streaming sg will lead to severely degraded performance, as the streaming sg has significantly fewer CPU/disk shares and reader concurrency semaphore resources. Furthermore, it will lead to multi-paged reads possibly switching between scheduling groups mid-way, triggering an internal error.

To fix, use `with_scheduling_group()` for both starting and stopping the native transport. Technically, it is only strictly necessary for starting, but I added it for stop as well, for consistency. Also apply the same treatment to RPC (Thrift). Although no one uses it, best to fix it, just to be on the safe side.

I think we need a more systematic approach for solving this once and for all, like passing the scheduling group to the protocol server and having it switch to it internally. This would allow the server to always run in the correct scheduling group, not depending on the caller to remember to use it. However, I think this is best done in a follow-up, to keep this critical patch small and easily backportable.

Fixes: #15485
Closes #16019
(cherry picked from commit dfd7981)
Backported to 5.1, 5.2, 5.4.
Summary
The CQL server is supposed to run in the `statement` group. However, issuing `nodetool disablebinary`, followed by `nodetool enablebinary`, will result in the CQL server being started in the `streaming` (maintenance) group, in which the REST API itself runs. This results in degraded performance for all subsequent queries, as the `streaming` group has much reduced shares compared to the `statement` group. It will also result in semaphore mismatch errors from time to time, as reads arriving from remote nodes will have the correct `statement` scheduling group, so reads switching between coordinators will possibly experience switching between scheduling groups between different pages.

Original Description
Split out from #15373, for the full context see said issue.
Download the core:
I looked around in the core but I couldn't yet determine how this happened. This could be an old bug, only brought to the surface by #14736, which added semaphore check validation to the single-partition read path (before, only range scans had this check).