Tokio NUMA awareness #5076

Open
crisidev opened this issue Oct 5, 2022 · 31 comments
Labels
A-tokio (Area: The main tokio crate), C-feature-request (Category: A feature request)

Comments

@crisidev

crisidev commented Oct 5, 2022

Is your feature request related to a problem? Please describe.
I was doing some performance testing on multiple tokio- and hyper-based web frameworks on different machines, and I found poor performance on large hosts with multiple NUMA nodes.

I ran the tests using smithy-rs, axum, warp and actix-web, with very similar results.

The operation I am running is a very simple GET /, returning the static string "hello world", and I am using wrk: wrk -t16 -c1024 -d10s --latency http://localhost:8080/.

On a 128-core AMD Epyc 7R13 with 4 NUMA nodes, I can reach an average of 700k requests/second, and I can see cores being underutilized, with an average of 20% utilization jumping from one group of cores to another. All the frameworks tested yield more or less the same result.

As a comparison, I ran the same tests on an Intel Xeon Platinum 8375C with 32 cores on a single NUMA domain, which yields 1.8M requests per second with all cores fully utilized.

Describe the solution you'd like

AFAIK tokio is not NUMA aware, so ideally I need to run multiple runtimes, one per NUMA node (in different processes), sharing a socket that hyper can bind to. It would be really nice to have some utilities in tokio that would help deal with this complexity.
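As a rough sketch of the workaround I have in mind (one process per node, each with its own runtime, all binding the same port through SO_REUSEPORT; the hyper wiring and the actual node pinning are left out, and the names are illustrative):

use std::net::SocketAddr;
use tokio::net::TcpSocket;

// Run one of these per NUMA node, e.g. one process per node launched under numactl.
fn run_node(addr: SocketAddr) -> std::io::Result<()> {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?;
    rt.block_on(serve(addr))
}

async fn serve(addr: SocketAddr) -> std::io::Result<()> {
    let socket = TcpSocket::new_v4()?;
    // SO_REUSEPORT lets every node bind the same addr:port; the kernel
    // spreads incoming connections across the listeners.
    socket.set_reuseport(true)?;
    socket.bind(addr)?;
    let listener = socket.listen(1024)?;
    loop {
        let (stream, _peer) = listener.accept().await?;
        // hand `stream` off to hyper / the web framework here
        drop(stream);
    }
}

The missing piece is everything around this: discovering the nodes, pinning each process (and its memory) to one of them, and wiring it all together, which is where some tokio utilities could help.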

@crisidev added the A-tokio and C-feature-request labels on Oct 5, 2022
@crisidev
Author

crisidev commented Oct 5, 2022

I'll dump my reproduction code in this repo as soon as possible: https://github.com/crisidev/tokio-numa-perf-repro

@Noah-Kennedy
Contributor

I've been thinking about how to support this as well.

@LucioFranco
Member

I think ideally we may want either some stuff behind unstable or in tokio-util that allows you to construct a runtime such that it fits well within the NUMA arch. Kinda like some sort of advanced toolkit.
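For example, a minimal sketch of the kind of helper I mean, using today's Builder API plus the core_affinity crate (the list of cores belonging to a node would have to come from hwloc, /sys, or similar, and is assumed as an input here):

use std::sync::atomic::{AtomicUsize, Ordering};

// Build a runtime whose threads are pinned to the cores of a single NUMA node.
fn numa_pinned_runtime(node_cores: Vec<usize>) -> std::io::Result<tokio::runtime::Runtime> {
    let n = node_cores.len();
    let next = AtomicUsize::new(0);
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(n)
        .enable_all()
        .on_thread_start(move || {
            // Hand out the node's cores round-robin as runtime threads start.
            let i = next.fetch_add(1, Ordering::Relaxed) % n;
            core_affinity::set_for_current(core_affinity::CoreId { id: node_cores[i] });
        })
        .build()
}

Whether something like this belongs in tokio proper, tokio-util, or behind tokio_unstable is the open question.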

@crisidev
Author

crisidev commented Oct 6, 2022

Yeah.. This is what I was thinking. I could find some time to give it a try with some guidance @LucioFranco.

@Noah-Kennedy
Contributor

I agree that we should use the unstable flag for this so that we can iterate more quickly here.

Regarding the actual design, I'd be interested in knowing where the bottlenecks are under NUMA. In particular, I'm interested in work-stealing and the IO driver.

Another point which might be worth investigating is what perf looks like with a single non-NUMA-aware multi_thread runtime vs having multiple runtimes (one per NUMA node) that use SO_REUSEPORT to split incoming accepts between them all.

@Noah-Kennedy
Contributor

Hypothesis: I think we might be bottlenecking on wakes.

Our current architecture serializes waking from our driver. Theoretically, given enough sockets and tasks (say, in a high-throughput server), we could expect waking tasks to become a bottleneck, which would likely manifest as low CPU utilization due to threads blocking on receiving wakes.

This is potentially made worse by the fact that systems with high core counts often have NUMA, which can make cross-thread atomic operations, such as wakes, far more expensive.

@crisidev I have some asks from you here.

Can you take a look at what CPU utilization looks like across your benchmark runs? Theoretically, we should see overall underutilization here, as plenty of worker capacity would be left on the table due to slow wakes.

Also, can you get flamegraphs of the running systems? I would expect to see a relatively high percentage of time spent parking relative to actually doing work.

@crisidev
Author

crisidev commented Oct 7, 2022

@Noah-Kennedy sure, I'll get some data.

I can already tell you that I was able to see underutilization of CPUs, with plenty of unused capacity. Tomorrow I should have some time to redo the tests and produce a flamegraph that we can analyse.

Regarding the utilization, do you have anything in particular in mind? I can grab a screenshot / terminal recording of htop to show the underutilization, but if you want something specific, I am happy to gather it.

@Noah-Kennedy
Contributor

I think just watching it on htop is enough

@crisidev
Author

crisidev commented Oct 8, 2022

I gathered some data running this code here:

$ wrk -t16 -c1024 -d10s --latency http://localhost:3000/

Running 10s test @ http://localhost:3000/
  16 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.22ms   10.96ms 427.33ms   98.96%
    Req/Sec    48.87k    17.69k  104.40k    75.76%
  Latency Distribution
     50%    0.86ms
     75%    1.77ms
     90%    3.69ms
     99%   13.48ms
  7643207 requests in 10.10s, 0.90GB read
  Socket errors: connect 19, read 0, write 0, timeout 0
Requests/sec: 756721.48
Transfer/sec:     91.65MB

@Noah-Kennedy
Contributor

Were debug symbols enabled on that flamegraph?

@crisidev
Author

crisidev commented Oct 9, 2022

I ran the flamegraph in release mode with debug=true configured in Cargo.toml. I have little to no experience with flamegraph, so I have probably not run it the right way.
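For reference, the Cargo.toml setting I mean is just:

[profile.release]
debug = true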

@jonhoo
Sponsor Contributor

jonhoo commented Oct 9, 2022

I have in the past found that the flamegraphs over tokio executions provide more helpful output if you run with forced frame pointers (through rustflags).
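With current rustc that means building with something like:

$ RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release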

I think this particular flamegraph actually does have debug symbols, but most of the time is somehow spent in the kernel. You can tell Linux to allow resolving kernel symbols by doing:

$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

Alternatively, I think sudo perf record will also have the same effect. See if that helps!

@crisidev
Author

Thanks Jon. I will rerun the tests as soon as I have some time, following your advice. I think you are right, it looks like a lot of time is just spent in the kernel. Let's see if we can figure out what it is doing!

@Noah-Kennedy
Contributor

@crisidev which Linux kernel version are you on?

@crisidev
Author

I'll be 100% sure once I redo the tests, but the instance I used was with Ubuntu 22.04, which should ship with kernel 5.15.

@Noah-Kennedy
Contributor

Ah. I was curious because the Linux kernel, especially in some older versions, has often not been as well optimized for AMD chips as it is for Intel chips.

@crisidev
Author

I just spun up a new machine and I can confirm it comes with 5.15.0.

@crisidev
Author

Here is the new flamegraph, filled with nice symbols. @jonhoo's advice worked perfectly!

@Noah-Kennedy
Contributor

Wait, why is there forking occurring?

@Noah-Kennedy
Contributor

This benchmark seems to be spending all of its time returning from a fork.

@Noah-Kennedy
Contributor

@crisidev can you link the benchmark code or provide steps so that I can try and replicate the benchmark?

@crisidev
Author

All the code is here; if you look at setup.sh, flamegraph.sh and README.md, you should be able to reproduce what I did without issues.

@Noah-Kennedy
Contributor

Could you also send a flamegraph from your intel system? I'm curious if that spends all of its time in ret_from_fork as well.

@crisidev
Author

crisidev commented Oct 10, 2022

Here it is. It is VERY different from the AMD NUMA machine.

I did not notice at first, but looking at your picture here on GitHub, I think we met at RustConf 2022, so howdy, nice to talk with you again :)

@Noah-Kennedy
Contributor

The Intel flamegraph actually looks normal. I have no idea why on the EPYC chip you are stuck returning from a fork for much of the time.

@crisidev
Author

I think fork is used to balance the thread load between different NUMA nodes. I am taking a wild guess here: on a non-NUMA system, spawning a thread is cheap, while here it involves a full fork every time the kernel balancer moves a thread to a different node. I have nothing to back up this theory. Can anyone with more understanding of these systems tell me if I am completely off?

This LKML thread is interesting IMO: https://lore.kernel.org/lkml/20190605155922.17153-1-matt@codeblueprint.co.uk/T/

@crisidev
Author

I have just retested on the AMD Epyc, this time pinning the POC binary to NUMA node 0 (32 cores), both for CPU and memory.

I am now able to reach more or less the same performance as a non-NUMA machine with 32 cores. Core utilization is much higher than before.
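(For reference, that kind of pinning can be done with numactl, along these lines; the binary path is a placeholder:)

$ numactl --cpunodebind=0 --membind=0 ./target/release/<server-binary>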

wrk result:

Running 10s test @ http://localhost:3000/
  32 threads and 2048 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   725.42us  489.83us  13.22ms   85.70%
    Req/Sec    69.81k    31.11k  111.71k    75.01%
  Latency Distribution
     50%  568.00us
     75%    0.87ms
     90%    1.30ms
     99%    2.58ms
  14038225 requests in 10.10s, 1.66GB read
Requests/sec: 1390315.58
Transfer/sec:    168.39MB

@Noah-Kennedy
Contributor

I think fork is used to balance the thread load between different NUMA nodes. I am taking a wild guess here: on a non-NUMA system, spawning a thread is cheap, while here it involves a full fork every time the kernel balancer moves a thread to a different node. I have nothing to back up this theory. Can anyone with more understanding of these systems tell me if I am completely off?

TIL

@Noah-Kennedy
Contributor

Noah-Kennedy commented Oct 11, 2022

@crisidev this is quite interesting. Could you please try two more runs with NUMA, both with a 120 second duration? On one run, the only change from earlier should be the longer duration. On the other, could you shrink the blocking thread pool to 16 threads?

Sorry I keep having to ask you to run benchmarks, but I don't currently have access to an EPYC system of my own.
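For the blocking pool change, a minimal sketch, assuming the repro builds its runtime explicitly rather than through #[tokio::main]:

fn main() -> std::io::Result<()> {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .max_blocking_threads(16)
        .enable_all()
        .build()?;
    rt.block_on(async {
        // serve as before
    });
    Ok(())
}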

@crisidev
Author

It's ok, don't worry, I have the capacity and I can run the benchmarks as much as you like. I'll rerun everything tomorrow following what you asked. I'll figure out if I can give you access to a test machine where we can play around together.

@bobrik

bobrik commented Oct 11, 2022

Regarding __intel_pmu_enable_all on the flamegraph: it's something that's AMD specific (contrary to the name). You can make it go away if you use -e cpu-clock rather than the default -e cycles. The latter seems to count idle cycles on AMD.
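For example, with plain perf (the profiled command is a placeholder):

$ perf record -e cpu-clock -g -- ./target/release/<server-binary>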
