Constant Errors with Reactor Stalled on new Node -- Constant High Load #14008
@xemul can you help with triaging this issue?
Just a small note: when a signal handler is involved, the last address in the backtrace (before the handler) is offset by 1 with respect to the instruction that was actually executing; note the off-by-one address on the last frame.
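For what it's worth, a hedged illustration of that off-by-one (a hypothetical helper, not Seastar's actual decoder): ordinary frames hold return addresses, which point one instruction past the call, so symbolizers typically subtract 1 before lookup; the frame a signal handler captures holds the exact interrupted PC, so the same adjustment would land one byte early.

```cpp
#include <cstdint>

// Hypothetical helper: pick the address to feed a symbolizer.
// `interrupted` is true only for the frame captured by the signal
// handler (an exact PC); every other frame holds a return address.
static std::uintptr_t addr_to_symbolize(std::uintptr_t frame_addr,
                                        bool interrupted) {
    // Return addresses point one past the call instruction, so back up
    // by one; an exact PC already points at the instruction itself.
    return interrupted ? frame_addr : frame_addr - 1;
}
```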
@tynli how many nodes are in the cluster?
12 nodes, with 3 nodes in each of 4 data centers. Each data center has 1 rack. I was trying to replace the cluster one bit at a time, by adding higher-performance nodes and then decommissioning the lower-performance ones. I ended up setting up a separate cluster and migrating the data between clusters. I do not have access to a unit that is currently in this state.
Thanks for the info.
Since 6aa91c1 targets exactly the code path reported here, and we've seen nothing from this source after #12761, I suggest closing this issue (and reopening if it reproduces in 5.3 or later). A backport might be possible to 5.2, but not to 5.1, which doesn't include any of the locator/topology baseline changes.
@bhalevy how do we see such a huge stall with only 12 nodes?
It's more likely the problem is with query_partition_key_range_concurrent(). |
It's possible we never get to yield in the nested call path here: scylladb/service/storage_proxy.cc, line 4585 (at commit 5c9ecd5).
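For context, here is a minimal sketch of the shape of the eventual fix, assuming Seastar's `seastar::coroutine::maybe_yield()`; `range`, `process_range`, and `query_ranges` are illustrative stand-ins, not Scylla's actual types or functions:

```cpp
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

struct range {};  // stand-in for a partition key range

// Stand-in for per-range work that may complete without ever suspending
// (e.g. when all the data is available locally).
static seastar::future<> process_range(range&) { co_return; }

static seastar::future<> query_ranges(std::vector<range> ranges) {
    for (auto& r : ranges) {
        co_await process_range(r);
        // Without an explicit yield, a loop whose futures always resolve
        // synchronously never returns control to the reactor -- a stall.
        co_await seastar::coroutine::maybe_yield();
    }
}
```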
Add calls to `maybe_yield` in the per-range loops to prevent stalls caused by query_partition_key_range_concurrent nested calls when it never yields. Fixes scylladb#14008 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
… Benny Halevy

Prevent stalls caused by query_partition_key_range_concurrent nested calls when it never yields. Fixes #14008 Closes #14884

* github.com:scylladb/scylladb:
  storage_proxy: query_partition_key_range_concurrent: maybe_yield in loop
  storage_proxy: query_partition_key_range_concurrent: fixup indentation
  storage_proxy: query_partition_key_range_concurrent: turn tail recursion to iteration
  storage_proxy: coroutinize query_partition_key_range
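The "turn tail recursion to iteration" commit attacks the same stall from another angle. A hedged sketch of that transformation, reusing the illustrative `range`/`process_range` stand-ins from above (this is not Scylla's actual code):

```cpp
// Recursive continuation-passing form (simplified): if process_range()
// always returns a ready future, the .then() continuations run
// back-to-back and the chain never yields to the reactor.
static seastar::future<> query_recursive(std::vector<range> rs, std::size_t i) {
    if (i == rs.size()) {
        return seastar::make_ready_future<>();
    }
    return process_range(rs[i]).then([rs = std::move(rs), i]() mutable {
        return query_recursive(std::move(rs), i + 1);  // tail call
    });
}

// Iterative coroutine form: the same work as a plain loop, where the
// explicit maybe_yield() shown earlier fits naturally.
static seastar::future<> query_iterative(std::vector<range> rs) {
    for (auto& r : rs) {
        co_await process_range(r);
        co_await seastar::coroutine::maybe_yield();
    }
}
```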
@scylladb/scylla-maint - can you (if we need to) backport to 5.2? (@bhalevy - please ack)
Doesn't apply cleanly to 5.2.
I think we shouldn't backport to 5.2 unless we see the problem on 5.2. On 5.2 this method is very different; the stalls could have been introduced by the refactoring done since.
We do see stalls in older releases.
The code for compare_endpoints originates at the dawn of time (locator: Convert AbstractNetworkTopologySnitch.java to C++) and is called on the fast path from storage_proxy via `sort_by_proximity`. This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function to reduce the number of conditional branches (apparently the compiler isn't doing a good enough job of optimizing it in this case)
2. avoiding an sstring copy in topology::get_{datacenter,rack}
Closes scylladb#12761 Refs scylladb#14008 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit 6aa91c1)
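A hedged sketch of the optimization idea, with illustrative types and names (not Scylla's actual `compare_endpoints`): compute a small proximity score with arithmetic on boolean results instead of a chain of conditionals, and hold the DC/rack names by reference (here `std::string_view`) to avoid string copies:

```cpp
#include <string_view>

struct location {
    std::string_view dc;    // datacenter name, viewed, not copied
    std::string_view rack;  // rack name, viewed, not copied
};

// Proximity score relative to `me`: 0 = same DC and rack, 1 = same DC
// only, 2 = remote DC. Arithmetic on comparison results keeps the
// number of conditional branches low.
static int proximity(const location& me, const location& x) {
    int same_dc = int(x.dc == me.dc);
    int same_rack = same_dc & int(x.rack == me.rack);
    return 2 - same_dc - same_rack;
}

// Three-way compare: negative if `a` is closer to `me` than `b`,
// positive if farther, zero if equally close.
static int compare_endpoints(const location& me, const location& a,
                             const location& b) {
    return proximity(me, a) - proximity(me, b);
}
```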
I think this is already backported - @bhalevy?
It was backported to 2023.1.
This is Scylla's bug tracker, to be used for reporting bugs only. If you have a question about Scylla, and not a bug, please ask it in our mailing list at scylladb-dev@googlegroups.com or in our Slack channel.
Installation details
Scylla version (or git commit hash): 5.1.5
Cluster size: 12 nodes (3 nodes in each of 4 data centers)
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 20.04
Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): bare metal
Hardware: sockets=1 cores=96 hyperthreading=yes memory=768GB
Disks (SSD/HDD, count): 8x NVMe, RAID-0
After adding a node to my cluster, I have been consistently getting messages like the following in syslog:
The CPU load on this server has remained high even after the join completed, with these messages occurring every few seconds. This is the same server for which I had the issues mentioned in #13439, and I ended up modifying the contents of io_properties.yaml. The values I put in there reflect the results I got on this hardware with the fio tool. Here is the decoded backtrace: