667592 ms stall in ~data_sync_repair_task_impl #15095
Comments
For some reason, there was another long (but not THAT long) stall, from a completely different code path, on node1 at the same time. Could they both be related to networking somehow?
Decoded:
@xemul, are there known issues in that area?
Only 667592 ms?
We need a contest!
@avikivity, shouldn't we observe exponentially larger stall reports, since the threshold is doubled each time a stall is detected?
667 seconds? That's 11 minutes. It looks like an infinite loop, not a stall.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/249/artifact/logs-full.debug.041/1692521802039_compaction_additional_test.py%3A%3ATestCompactionAdditional%3A%3Atest_double_compaction_by_cleanup_and_major_compactions/node1.log
Could be related, too.
The stall is also visible in other tests running on the same machine at the same time:
So obviously it's an OS problem, not a Scylla problem.
Maybe related to the infamous three-digit stalls?
The two timestamps are 1370 seconds apart. That's a little over 2X the duration of the stall.
I think it rules out games with CLOCK_THREAD_CPUTIME_ID, since realtime agrees that something funny happened.
Yes. Log spam -> syslog stalled -> everything blocked. Does dtest/ccm run with scylladb logs directed to syslog? It shouldn't.
Probably everything is stalled, so the stall detector itself isn't getting CPU time.
That, and also the lengths of the stalls vary wildly across samples, even though the realtime "before" and "after" timestamps agree in all cases. The stalls are 667 s, 13 s, 115 s, 372 s, and 775 s (each in a different process).
But the log spam happens at a different time: the stall happens 1 hour after the log spam ends.
Isn't there a rate limit on the number of confusing things in a single issue?
Another option is a swapstorm more or less killing the node.
This fact doesn't seem very confusing: a natural explanation is that the log spam is unrelated to the stall. But 13 million INFO logs in a test is insane in its own right; this is obviously a bug that deserves its own fix.
Yes, please file one and assign it to @Deexie.
How can I check out the exact version used in Jenkins? The logs look pretty similar to #14966; I'm not sure yet whether they are related.
Scylla logs its git hash and build ID when it starts. The version used in this run is d57a951, which predates the fix for #14966 (e9d9489), so this run could have been affected.
Thanks! I checked it on current master, and the problem does not appear.
A swapstorm wouldn't register on CLOCK_THREAD_CPUTIME_ID, since time spent blocked waiting for a page to be read back in is not CPU time, though sometimes the kernel enlists a faulting thread to do work to free memory. Idea: make --overprovisioned (which should really be named --underprovisioned, or --overcommitted) disable transparent hugepages. I'll post a patch.
I'm out of my depth here. Is it really possible for a swapstorm to stall the process for 12 minutes? I understand that it can slow down everything to a crawl, but can it slow the process so much that it can't even handle a signal and look at the clock for minutes? |
--overprovisioned (which should really be renamed --overcommitted or --underprovisioned) means that Seastar should sacrifice performance in favor of playing well with other applications on the same system. One such sacrifice can be avoiding transparent hugepages. Transparent hugepages can waste memory if only part of a hugepage is used, waste kernel CPU time trying to coalesce a huge page, and make it harder to swap. Be more friendly by avoiding transparent hugepages when --overprovisioned is specified. This is a little complicated since we allocate memory before we know whether --overprovisioned is specified or not (even before main() runs). Since it's only a small amount of memory, conservatively (from the performance point of view) assume we want transparent hugepages, and back down and undo if it later turns out we don't. I verified with strace that madvise() calls happen as expected. Ref scylladb/scylladb#15095.
It's possible, but I admit it's unlikely. In the days of spinning disks it could easily happen: 12 minutes is 720 s, or about 72,000 seeks at roughly 10 ms each, so if there is that much paging activity before the code involved gets paged in, then nothing happens. In those days, reaching for the reset button was sometimes the only option. In these days of sub-millisecond flash, it's very rare.
Even with rotating disks (may they rest in peace), you had NCQ, which would reorder those seeks to complete faster.
--overprovisioned (which should really be renamed --overcommitted or --underprovisioned) means that Seastar should sacrifice performance in favor of playing well with other applications on the same system. One such sacrifice can be avoiding transparent hugepages. Transparent hugepages can waste memory if only part of a hugepage is used, waste kernel CPU time trying to coalesce a huge page, and make it harder to swap. Be more friendly by avoiding transparent hugepages when --overprovisioned is specified. This is a little complicated since we allocate memory before we know whether --overprovisioned is specified or not (even before main() runs). Since it's only a small amount of memory, conservatively (from the performance point of view) assume we want transparent hugepages, and back down and undo if it later turns out we don't. I verified with strace that madvise() calls happen as expected. Ref scylladb/scylladb#15095. Closes #1796
* seastar 0784da87...6e80e84a (29):
  > Revert "shared_token_bucket: Make duration->tokens conversion more solid"
  > Merge 'chunked_fifo: let incremetal operator return iterator not basic_iterator' from Kefu Chai
  > memory: diable transparent hugepages if --overprovisioned is specified
    Ref #15095
  > http/exception: s/<TAB>/ /
  > install-dependencies.sh: re-add protobuf
  > Merge 'Keep capacity on fair_queue_entry' from Pavel Emelyanov
  > Merge 'Fix server-side RPC stream shutdown' from Pavel Emelyanov
    Fixes #13100
  > smp: make service management semaphore thread local
  > tls_test: abort_accept() after getting server socket
  > Merge 'Print more IO info with ioinfo app' from Pavel Emelyanov
  > rpc: Fix client-side stream registration race
    Ref #13100
  > tests: perf: shard_token_bucket: avoid capturing unused variables in lambdas
  > build: pass -DBoost_NO_CXX98_FUNCTION_BASE to C++ compiler
  > reactor: Drop some dangling friend declarations
  > fair_queue: Do not re-evaluate request capacity twice
  > build: do not use serial number file when signing a cert
  > shared_token_bucket: Make duration->tokens conversion more solid
  > tests: Add perf test for shard_token_bucket
  > Merge 'Make make_file_impl() less yielding' from Pavel Emelyanov
  > fair_queue: Remove individual requests counting
  > reactor, linux-aio: print value of aio-max-nr on error
  > Merge 'build, net: disable implicit fallthough' from Kefu Chai
  > shared_token_bucket: Fix duration_for() underflow
  > rpc: Generalize get_stats_internal() method
  > doc/building-dpdk.md: fix invalid file path of README-DPDK.md
  > install-dependencies: add centos9
  > Merge 'log: report scheduling group along with shard id' from Kefu Chai
  > dns: handle exception in do_sendv for udp
  > Merge 'Add a stall detector histogram' from Amnon Heiman
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15218
91cc544 should reduce the likelihood of a swapstorm.
This hasn't reproduced.
Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/6417/artifact/dtest/logs-full.release.011/1692525345967_nodetool_additional_test.py%3A%3ATestNodetool%3A%3Atest_concurrent_drain/node2.log
Decoded: