logalloc::region_impl::object_descriptor::encode stalls (More than 100ms reactor stalls from memory allocation) - kernel issue? #8828
Comments
@avikivity FYI
In both cases it doesn't look like an allocator problem, but rather a large row/cell problem. The code is stalling while copying large amounts of data; there's no indication the allocator is slow.
@michoecho please take a look
Looking.
I can't figure anything out. It's just as @avikivity says: all four stalls in the log happen in object_descriptor::encode.
I have some vague memory of seeing object_descriptor::encode in other places too. Maybe it's some kind of artifact, like we're swapping and this write causes the page to be swapped in.
I see that scylla_setup calls scylla_memory_setup --lock-memory, but maybe s-c-t uses a custom script that bypasses that.
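(For reference, a minimal sketch of what locking memory amounts to at the syscall level; this is an illustration, not the actual scylla_memory_setup script, and Seastar's exact flag set may differ.)

```cpp
// Hedged sketch: pin all current and future mappings so the kernel can't
// swap them out. With MCL_ONFAULT (Linux 4.4+), pages are still faulted in
// lazily on first touch -- locking prevents swap-out, not first-touch faults.
#include <sys/mman.h>
#include <cstdio>

int main() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) != 0) {
        std::perror("mlockall");  // usually needs CAP_IPC_LOCK or a high RLIMIT_MEMLOCK
        return 1;
    }
    std::puts("memory locked; first-touch page faults can still occur");
    return 0;
}
```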
@fgelcer how does s-c-t set up nodes? Maybe it's really a question for the machine image.
Right, ...
Except that the 4th stall doesn't quite match the theory. It happens a bit later than ...
It cannot be a swap-in from disk, since we use CLOCK_THREAD_CPUTIME_ID for the timer (see ...). It could be kernel compaction/zeroing (that can happen in the thread that's faulting). As for the delays, here's an explanation:
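(To illustrate the CLOCK_THREAD_CPUTIME_ID point, here is a rough sketch in the spirit of such a stall timer, not Seastar's actual stall detector; the SIGALRM/sigevent wiring is a simplification. The clock only advances while the thread is on-CPU, so time spent blocked on disk I/O, such as a swap-in, cannot trigger it, while CPU time the kernel spends zeroing or compacting pages inside the fault handler is charged to the thread.)

```cpp
// Hedged sketch of a per-thread CPU-time stall timer (not Seastar's code).
#include <csignal>
#include <ctime>
#include <unistd.h>

static void on_stall(int) {
    // A real detector would capture a backtrace here.
    const char msg[] = "thread used >100ms of CPU without yielding\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

int main() {
    struct sigaction sa {};
    sa.sa_handler = on_stall;
    sigaction(SIGALRM, &sa, nullptr);

    struct sigevent sev {};
    sev.sigev_notify = SIGEV_SIGNAL;  // simplification; a real detector targets a specific thread
    sev.sigev_signo = SIGALRM;
    timer_t timer;
    timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timer);

    struct itimerspec its {};
    its.it_value.tv_nsec = 100'000'000;  // fires after 100ms of *CPU* time only
    timer_settime(timer, 0, &its, nullptr);

    // ... run the task; sleeping or blocking on I/O does not advance this clock.
    return 0;
}
```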
I don't believe this; it doesn't completely fit the reports. The non-encode() stall is on shard 8, the same shard as an encode stall, but at different times. Are the reported times correct (e.g. scylla time vs. parse time)? And why would allocating and zeroing a page, even a large page, take so much time? Why would there even be problems? We start a fresh machine with all its memory free.
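(An aside to quantify that question, with my own rough numbers rather than anything from this thread: zeroing a single 2 MiB transparent hugepage at a typical ~10 GB/s memset bandwidth takes about

$$ t \approx \frac{2\,\mathrm{MiB}}{10\,\mathrm{GB/s}} \approx 0.2\,\mathrm{ms}, $$

well over two orders of magnitude short of a 100 ms stall. So zeroing alone can't explain it; only extra in-kernel work during the fault, such as compaction/reclaim, plausibly could.)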
SCT first has a definition of the instance type; then it sends a post-boot script through the API, to run before we even have access to the node; and then we configure the ... If you need the exact parameters we send to these post-boot scripts, ...
I need the contents of /etc/sysconfig/scylla-server (the SCYLLA_ARGS stuff)
this test has ...
that specific job doesn't exist anymore, and AFAICT we don't collect ...
@slivne, also, the content of the ...
this is from the job where @asias reported the issue...
Well, I didn't believe in the theory anyway. But now I have no other idea about what could have gone wrong.
the specific reported instance is ...
Yes, so it's not the balloon.
It happened again, once, on the CentOS rolling upgrade:
Issue description
During the rolling upgrade job, before we started to upgrade the 1st node:

Impact
Describe the impact this issue causes to the user.

How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.

Installation details
Kernel Version: 4.18.0-489.el8.x86_64
Cluster size: 4 nodes (n1-highmem-8)
Scylla Nodes used in this run:
OS / Image:
Test:

Logs and commands
Logs:
I have ideas for a fix, I'll try to push out something soon.
"I have discovered a truly marvelous demonstration of this proposition that this Github comment is too narrow to contain." ? |
Here's an old version of the idea: avikivity/seastar@6fea38b
Filed scylladb/seastar#1702
Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity

To avoid latency spikes due to page faults on anonymous memory, we mlockall() all memory. This prevents memory from being swapped out, but we can still see a latency spike when faulting in the memory the first time it is touched. It's rare for this to be a problem, as faulting in an anonymous memory page on first use is cheap: the kernel just has to zero it. However, things are complicated by transparent hugepages, as the kernel first has to defragment memory. This can take quite a while, as was observed in [1].

The solution is to launch background threads that attempt to fault in the memory ahead of the application. We need to be careful that these threads don't themselves compete with the application and cause latency spikes, so we take the following steps:

1. We launch just one thread per NUMA node, reducing lock contention.
2. We place the threads in the SCHED_IDLE class, so the kernel will favor application threads.
3. We let the thread affinity float over the entire NUMA node, so it can find the least contended core.

Tested on an old Intel E5-2697: it was able to fault in 200 GB in 50 CPU seconds (25 wall-clock seconds), for a rate of 8 GB/s. This shows that even a large-memory application will be able to fault in memory faster than it will need it.

[1] scylladb/scylladb#8828

Closes #1702

* https://github.com/scylladb/seastar:
  smp: wire up memory prefaulter
  smp: introduce memory prefaulter
  resource: compute numa_node_id to cpuset mapping in resources::allocate()
  posix: add posix_thread attribute for thread affinity
  memory: return NUMA layout of allocated memory during initialization
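(A condensed sketch of the mechanism the commit message describes; the names, the memory-range plumbing, and the touch stride are illustrative assumptions, not Seastar's actual implementation.)

```cpp
// Hedged sketch of the prefaulter described above (illustrative only;
// NUMA range discovery and cpuset plumbing are omitted).
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <thread>
#include <vector>

struct mem_range { char* base; std::size_t size; };  // one NUMA node's memory (assumed known)

static void prefault(mem_range r) {
    // Step 2: SCHED_IDLE, so the kernel always favors the reactor threads.
    sched_param sp{};  // priority must be 0 for SCHED_IDLE
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &sp);
    // Step 3 would set affinity to the whole NUMA node's cpuset via
    // pthread_setaffinity_np(), letting the thread float to an idle core.
    const std::size_t stride = 4096;  // touching each 4 KiB page also faults in THPs
    for (std::size_t off = 0; off < r.size; off += stride) {
        *reinterpret_cast<volatile char*>(r.base + off) = 0;  // force the fault now
    }
}

void start_prefaulters(const std::vector<mem_range>& numa_ranges) {
    // Step 1: exactly one thread per NUMA node, limiting page-allocator lock
    // contention while still parallelizing across sockets.
    for (auto r : numa_ranges) {
        std::thread(prefault, r).detach();
    }
}
```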
locator/*_snitch.cc updated for http::reply losing the _status_code member without a deprecation notice.

* seastar 99d28ff057...2b7a341210 (23):
  > Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
    Fixes scylladb#8828.
  > reactor: use structured binding when appropriate
  > Simplify payload length and mask parsing.
  > memcached: do not used deprecated API
  > build: serialize calls to openssl certificate generation
  > reactor: epoll backend: initialize _highres_timer_pending
  > shared_ptr: deprecate lw_shared_ptr operator=(T&&)
  > tests: fail spawn_test if output is empty
  > Support specifying the "build root" in configure
  > Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
  > build: correct the syntax error in comment
  > util: print_safe: fix hex print functions
  > Add code examples for handling exceptions
  > smp: warn if --memory parameter is not supported
  > Merge 'gate: track holders' from Benny Halevy
  > file: call lambda with std::invoke()
  > deleter: Delete move and copy constructors
  > file: fix the indent
  > file: call close() without the syscall thread
  > reactor: use s/::free()/::io_uring_free_probe()/
  > Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
  > reactor: Don't re-evaliate local reactor for thread_pool
  > Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
@avikivity please evaluate for backport.
ping @avikivity
I think it's pretty safe by now. But it doesn't fix a regression, and it mostly shows up in tests. In production, nodes are long-lived, and after they fault in their memory they don't see this problem.
scylla: 76d7c76
test: 6 nodes in the cluster; run a replace operation, replacing one of the nodes, in a loop
Reactor stalls of more than 100 ms from memory allocation were seen.
For example:
Full log attached.
longevity-100gb-4h-ReplaceNode-test.run6.console.txt.gz