Repairing a cluster after a restore causes severe reactor stalls throughout the cluster (due to expensive logging within do_repair_ranges() without yield) #14330
Comments
@ShlomiBalalis - please provide decoded stalls, at least for some of them. For example the last one: I've never seen it before (though I may simply be unfamiliar with it, of course):
The issue was reproduced in the AWS job as well:
Afterwards, we truncate the keyspace and execute a restore:
And then we repair the cluster, and all of the nodes in the cluster suffer from reactor stalls. The difference is that in this run, the repairs took significantly longer (over 80 minutes overall):
Logs:
@ShlomiBalalis - why is this under Manager? Should I move it to core?
Scylla manager did a
Even if the issue is somehow connected to the way we restore (which I doubt), we are not able to debug it without input from Scylla core explaining why these errors appear.
Decode of
I figured as much. I just wanted your input to be sure.
Shall I move the issue to the core, then?
I will decode each stall and add them to the issue in a separate file.
All of the backtraces from the initial Azure run, decoded:
Thanks, they all look the same to me (same area), but someone else will have to verify that.
They do, but there are slight differences here and there, so I included them all (minus the duplicates).
The issue is clear: do_repair_ranges() runs repair_range() on each range via parallel_for_each() (lines 948 to 950 in 643e69a):
But there are 25600 ranges, and parallel_for_each() does not preempt: it synchronously calls each of the passed objects to obtain futures, and only when all objects have completed or been turned into futures does parallel_for_each() return. Normally this wouldn't be much of a problem (I guess, since we haven't bumped into the issue before), but in this scenario each object passed to parallel_for_each() prints an expensive log before returning a future:
So there are 25600 expensive logs that have to be printed in a single reactor cycle, and this is the stall. In normal operation, the first several objects would start repairs and the rest would immediately block on the limited-concurrency semaphore without doing much work, so even though all 25600 function calls would be executed in one go, it wouldn't cause a (big) stall (I guess). But in this scenario the ranges are skipped (and finish synchronously), so the semaphore is released immediately, and each item gets to run some expensive work before being turned into a future, which causes the stall. Quick fix: add an explicit yield (sketches of the pattern and of the fix follow below).
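To make the mechanism above concrete, here is a minimal, self-contained sketch of the pattern being described. This is not the actual ScyllaDB code: the logger, range_is_skipped(), repair_one_range(), and the use of plain ints as "ranges" are hypothetical stand-ins.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>        // seastar::parallel_for_each
#include <seastar/core/semaphore.hh>   // seastar::semaphore, seastar::with_semaphore
#include <seastar/util/log.hh>
#include <vector>

static seastar::logger rlog("repair-sketch");

// Hypothetical placeholders, not the real ScyllaDB helpers.
static bool range_is_skipped(int) { return true; }   // in the restore scenario most ranges are skipped
static seastar::future<> repair_one_range(int) { return seastar::make_ready_future<>(); }

seastar::future<> do_repair_ranges_sketch(const std::vector<int>& ranges, seastar::semaphore& sem) {
    // parallel_for_each() invokes the lambda for *every* element up front, without
    // preemption, and only then waits on the futures it collected. With ~25600 ranges,
    // all 25600 invocations (including the log call below) run in a single reactor cycle.
    return seastar::parallel_for_each(ranges, [&sem] (int range) {
        rlog.info("repair of range {} started", range);   // expensive formatting, runs synchronously per range
        return seastar::with_semaphore(sem, 1, [range] () -> seastar::future<> {
            if (range_is_skipped(range)) {
                // Nothing to do: the body completes synchronously, so the semaphore is
                // released immediately and never gets a chance to throttle the loop.
                return seastar::make_ready_future<>();
            }
            return repair_one_range(range);
        });
    });
}
```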
@eliransin - Michal diagnosed the issue and there's a quick fix and longer term items above - can you assign it?
@eliransin - ping? |
@eliransin - ping? |
@denesb - please see what needs to be done here - is this 6.0 material? earlier? backlog?
This is such a simple fix, there is no point in pushing it out. I will send a PR with the fix today.
…nges We have observed do_repair_ranges() receiving tens of thousands of ranges to repair on occasion. do_repair_ranges() repairs all ranges in parallel, with parallel_for_each(). This is normally fine, as the lambda inside parallel_for_each() takes a semaphore and this will result in limited concurrency. However, in some instances, it is possible that most of these ranges are skipped. In this case the lambda will become synchronous, only logging a message. This can cause stalls because there are no opportunities to yield. Solve this by adding an explicit yield. Fixes: scylladb#14330
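For illustration, here is a sketch of the shape such a fix can take, building on the hypothetical helpers from the sketch above; it is not necessarily the exact change in the PR referenced below. With the per-range lambda turned into a coroutine, co_await-ing seastar::coroutine::maybe_yield() adds a preemption point per range; maybe_yield() only suspends when the reactor requests preemption, so the extra check stays cheap in the normal case.

```cpp
#include <seastar/core/coroutine.hh>          // co_await/co_return support for seastar::future
#include <seastar/coroutine/maybe_yield.hh>   // seastar::coroutine::maybe_yield

seastar::future<> do_repair_ranges_with_yield(const std::vector<int>& ranges, seastar::semaphore& sem) {
    co_await seastar::parallel_for_each(ranges, [&sem] (int range) -> seastar::future<> {
        // The explicit yield: if the reactor wants to preempt, suspend here so the
        // per-range work (including the log below) no longer piles up in one cycle.
        co_await seastar::coroutine::maybe_yield();
        rlog.info("repair of range {} started", range);
        co_await seastar::with_semaphore(sem, 1, [range] () -> seastar::future<> {
            if (range_is_skipped(range)) {
                return seastar::make_ready_future<>();   // still cheap, but now preceded by a preemption point
            }
            return repair_one_range(range);
        });
    });
}
```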
Fix here: #15879.
…nges We have observed do_repair_ranges() receiving tens of thousands of ranges to repair on occasion. do_repair_ranges() repairs all ranges in parallel, with parallel_for_each(). This is normally fine, as the lambda inside parallel_for_each() takes a semaphore and this will result in limited concurrency. However, in some instances, it is possible that most of these ranges are skipped. In this case the lambda will become synchronous, only logging a message. This can cause stalls because there are no opportunities to yield. Solve this by adding an explicit yield. Fixes: #14330 Closes #15879 (cherry picked from commit 90a8489)
Backported to 5.2 and 5.4. Skipped 5.1, as this is a performance-only fix.
scylla version: 5.2.1-0.20230508.f1c45553bc29 with build-id 88ac66b1719cc7c5b7e982aa34ba5dc95909b84a
Client version: 3.1.1-0.20230612.401edeb8
Server version: 3.1.1-0.20230612.401edeb8
At first, we execute a simple backup task:
Afterwards, we truncate the target keyspace and restore it:
Eventually, as part of the restore procedure, we repair the cluster once the restore task has ended. Once we did that, however, the entire cluster suffered multiple severe reactor stalls:
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/0d2064b9-8c10-4821-ae86-be6b71d050af/20230621_031502/db-cluster-0d2064b9.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/0d2064b9-8c10-4821-ae86-be6b71d050af/20230621_031502/loader-set-0d2064b9.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/0d2064b9-8c10-4821-ae86-be6b71d050af/20230621_031502/monitor-set-0d2064b9.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/0d2064b9-8c10-4821-ae86-be6b71d050af/20230621_031502/sct-runner-0d2064b9.tar.gz