Tokio NUMA awareness #5076
Comments
I'll dump my reproduction code in this repo as soon as possible: https://github.com/crisidev/tokio-numa-perf-repro
I've been thinking about how to support this as well.
I think ideally, we may want either some stuff behind unstable or in tokio-util that allows you to construct a runtime such that it fits well within the NUMA arch. Kinda like some sort of advanced toolkit.
Yeah, this is what I was thinking. I could find some time to give it a try with some guidance @LucioFranco.
I agree that we should use the unstable flag for this so that we can iterate more quickly here. Regarding the actual design, I'd be interested in knowing where the bottlenecks are under NUMA. In particular, I'm interested in work-stealing and the IO driver. Another point which might be worth investigating is what perf looks like with a single non-NUMA-aware runtime.
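For a concrete sketch of that toolkit idea, something like the following could be a starting point. It assumes the core_affinity crate and a caller-supplied list of core IDs belonging to one NUMA node; node discovery itself (e.g. via hwloc) is out of scope here:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::runtime::{Builder, Runtime};

/// Build a multi-threaded runtime whose threads are pinned to `cores`,
/// which are assumed to all belong to a single NUMA node.
fn node_runtime(cores: Vec<usize>) -> std::io::Result<Runtime> {
    let next = AtomicUsize::new(0);
    Builder::new_multi_thread()
        .worker_threads(cores.len())
        .on_thread_start(move || {
            // Assign each new runtime thread (workers and blocking
            // threads alike) the next core on this node, round-robin.
            let i = next.fetch_add(1, Ordering::Relaxed) % cores.len();
            core_affinity::set_for_current(core_affinity::CoreId { id: cores[i] });
        })
        .enable_all()
        .build()
}
```

Building one such runtime per node keeps work-stealing and wakes within a single node; whether the IO driver should be per-node or shared is exactly the open design question above.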
Hypothesis: I think we might be bottlenecking on wakes. Our current architecture serializes waking from our driver. Theoretically, given enough sockets and tasks (say, in a high-throughput server), we could expect waking tasks to become a bottleneck, which would likely manifest as low CPU utilization due to threads blocking on receiving wakes. This is potentially worse on systems with high core counts, which often have NUMA, since NUMA can make cross-thread atomic operations such as wakes far more expensive.

@crisidev I have some asks for you here. Can you take a look at what CPU utilization looks like across your benchmarks? Theoretically, we should see overall underutilization, with plenty of worker capacity left on the table due to slow wakes. Can you also get flamegraphs of the running systems? I would expect to see a relatively high percentage of time spent parking relative to actually doing work.
@Noah-Kennedy sure, I'll get some data. I can already tell you that I was able to see the underutilization you describe. Regarding the utilization, do you have anything in particular in mind? I can grab a screenshot / terminal recording of htop to show the underutilization, but if you want something specific, I am happy to gather it.
I think just watching it in htop is enough.
I gathered some data running this code here:
Were debug symbols enabled on that flamegraph?
I ran the flamegraph in release mode with debug symbols enabled.
I have in the past found that flamegraphs of tokio executions provide more helpful output if you run with forced frame pointers (through rustflags). I think this particular flamegraph actually does have debug symbols, but most of the time is somehow spent in the kernel. You can tell Linux to allow resolving symbols into the kernel by doing:

$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

Alternatively, I think
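Concretely, that advice could look something like this; the binary name is a stand-in for whatever is being profiled, and rendering the recorded data into a flamegraph (e.g. with inferno or the FlameGraph scripts) is left out:

```console
# Build with forced frame pointers and debug symbols in release mode.
$ RUSTFLAGS="-C force-frame-pointers=yes" CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release

# Let perf resolve kernel symbols.
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

# Record call graphs via frame pointers while the benchmark runs.
$ perf record --call-graph fp -- ./target/release/tokio-numa-perf-repro
```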
Thanks Jon. I will rerun the tests as soon as I have some time, following your advice. I think you are right, it looks like a lot of time is just spent in the kernel. Let's see if we can figure out what it's doing!
@crisidev which Linux kernel version are you on?
I'll be 100% sure once I redo the tests, but the instance I used was running Ubuntu 22.04, which should ship with kernel 5.15.
Ah. I was curious because the Linux kernel, especially in some older versions, has often not been as well optimized for AMD chips as it is for Intel chips.
I just spun up a new machine and I can confirm it comes with kernel 5.15.
Here is the new flamegraph, filled with nice symbols. @jonhoo's advice worked perfectly!
Wait, why is there forking occurring?
This benchmark seems to be spending all of its time returning from a fork.
@crisidev can you link the benchmark code or provide steps so that I can try to replicate it?
All the code is here, if you look at
Could you also send a flamegraph from your Intel system? I'm curious if that spends all of its time in ret_from_fork as well.
Here it is. It is VERY different from the AMD NUMA machine. I did not notice at first, but looking at your picture here on GitHub, I think we met at RustConf 2022, so howdy, nice to talk with you again :)
The Intel flamegraph actually looks normal. I have no idea why, on the EPYC chip, you are stuck returning from a fork for much of the time.
I think fork is used to balance the thread load between different NUMA nodes. I am taking a wild guess here: on a non-NUMA system, spawning a thread is cheap, while here it involves a full fork every time the kernel balancer moves the thread to a different node. I have nothing to back up this theory; can anyone with more understanding of these systems tell me whether I am completely off? This LKML thread is interesting IMO: https://lore.kernel.org/lkml/20190605155922.17153-1-matt@codeblueprint.co.uk/T/
I have just retested on AMD EPYC, this time pinning the POC binary to NUMA node 0 (32 cores), for both CPU and memory (see the sketch below). I am now able to reach more or less the same performance as a non-NUMA machine with 32 cores. Core utilization is much higher than before. wrk result:
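For reference, that kind of pinning can be reproduced with numactl along these lines; the binary path is a stand-in for the actual POC binary, and node 0 matches the 4-node EPYC layout described in this issue:

```console
# Restrict both CPU scheduling and memory allocation to NUMA node 0.
$ numactl --cpunodebind=0 --membind=0 ./target/release/tokio-numa-perf-repro
```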
TIL
@crisidev this is quite interesting. Could you please try two more runs with NUMA, both with a 120-second duration? On one run, the only change from earlier should be the longer duration. On the other, could you also shrink the blocking thread pool to 16 threads? Sorry I keep having to ask you to run benchmarks, but I don't currently have access to an EPYC system of my own.
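If it helps, shrinking the blocking pool is a one-line change on the runtime builder. Everything here besides max_blocking_threads is just a plausible stand-in for however the benchmark currently constructs its runtime:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    let runtime = Builder::new_multi_thread()
        // Cap the pool used by spawn_blocking at 16 threads
        // (the default limit is much higher: 512).
        .max_blocking_threads(16)
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // ... run the benchmark server here ...
    });
    Ok(())
}
```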
It's ok, don't worry, I have the capacity and can run the benchmarks as much as you like. I'll rerun everything tomorrow following what you asked, and I'll see if I can give you access to a test machine where we can play around together.
Regarding
Is your feature request related to a problem? Please describe.
I was doing some performance testing of multiple tokio- and hyper-based web frameworks on different machines, and I found poor performance on large hosts with multiple NUMA nodes.
I ran the tests using smithy-rs, axum, warp, and actix-web, with very similar results.
The operation I am running is a very simple GET /, returning the static string "hello world", and I am driving it with wrk:
wrk -t16 -c1024 -d10s --latency http://localhost:8080/
On a 128-core AMD EPYC 7R13 with 4 NUMA nodes, I can reach an average of 700k requests/second, and I can see cores being underutilized, with an average of 20% utilization jumping from one group of cores to another. All the frameworks tested yield more or less the same result.
As a comparison, I ran the same tests on an Intel Xeon Platinum 8375C with 32 cores on a single NUMA domain, which yields 1.8M requests per second with all cores fully utilized.
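For context, the handler under test is about as minimal as it gets. The following is a hypothetical reconstruction with axum (using the axum 0.5-era Server API), not the actual benchmark code:

```rust
use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    // A single route returning a static string, matching the GET / benchmark.
    let app = Router::new().route("/", get(|| async { "hello world" }));

    axum::Server::bind(&"0.0.0.0:8080".parse().unwrap())
        .serve(app.into_make_service())
        .await
        .unwrap();
}
```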
Describe the solution you'd like
AFAIK tokio is not NUMA-aware, so ideally I need to run multiple runtimes, one per NUMA node (different processes), sharing a socket that hyper can bind to. It would be really nice to have some utilities in tokio that help deal with this complexity.
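One plausible building block for the socket-sharing part, sketched with the socket2 crate (with its "all" feature enabled): each per-node runtime or process builds its own listener on the same address with SO_REUSEPORT set, and the kernel distributes incoming connections among them. This illustrates the idea; it is not an API tokio provides today:

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::net::SocketAddr;

/// Create a std listener with SO_REUSEPORT, so several runtimes can
/// each bind the same address and accept independently (Linux).
fn reuseport_listener(addr: SocketAddr) -> std::io::Result<std::net::TcpListener> {
    let socket = Socket::new(Domain::for_address(addr), Type::STREAM, Some(Protocol::TCP))?;
    socket.set_reuse_port(true)?;
    socket.set_nonblocking(true)?; // required before handing it to tokio
    socket.bind(&addr.into())?;
    socket.listen(1024)?;
    Ok(socket.into())
}

// Inside each runtime, wrap it for tokio + hyper:
// let listener = tokio::net::TcpListener::from_std(reuseport_listener(addr)?)?;
```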