Tokio NUMA awareness #5076

Open
crisidev opened this issue Oct 5, 2022 · 31 comments
Labels
A-tokio (Area: The main tokio crate), C-feature-request (Category: A feature request)

Comments

@crisidev

crisidev commented Oct 5, 2022

Is your feature request related to a problem? Please describe.
I was doing some performance testing on multiple tokio- and hyper-based web frameworks on different machines, and I found poor performance on large hosts with multiple NUMA nodes.

I ran the tests using smithy-rs, axum, warp and actix-web, with very similar results.

The operation I am running is a very simple GET /, returning the static string "hello world", and I am using wrk: wrk -t16 -c1024 -d10s --latency http://localhost:8080/.

On a 128-core AMD Epyc 7R13 with 4 NUMA nodes, I can reach an average of 700k requests/second, and I can see cores being underutilized, with an average of 20% utilization jumping from one group of cores to another. All the frameworks tested yield more or less the same result.

As a comparison, I ran the same tests on an Intel Xeon Platinum 8375C with 32 cores on a single NUMA domain, which yields 1.8M requests per second with all cores fully utilized.

Describe the solution you'd like

AFAIK tokio is not NUMA aware, so ideally I need to run multiple runtimes, one per NUMA node (in different processes), sharing a socket that hyper can bind to. It would be really nice to have some utilities in tokio that would help deal with this complexity.
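As a rough sketch of the workaround I have in mind (one process per node, each with its own runtime, all binding the same port through SO_REUSEPORT; the hyper wiring and the actual node pinning are left out, and the names are illustrative):

use std::net::SocketAddr;
use tokio::net::TcpSocket;

// Run one of these per NUMA node, e.g. one process per node launched under numactl.
fn run_node(addr: SocketAddr) -> std::io::Result<()> {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?;
    rt.block_on(serve(addr))
}

async fn serve(addr: SocketAddr) -> std::io::Result<()> {
    let socket = TcpSocket::new_v4()?;
    // SO_REUSEPORT lets every node bind the same addr:port; the kernel
    // spreads incoming connections across the listeners.
    socket.set_reuseport(true)?;
    socket.bind(addr)?;
    let listener = socket.listen(1024)?;
    loop {
        let (stream, _peer) = listener.accept().await?;
        // hand `stream` off to hyper / the web framework here
        drop(stream);
    }
}

The missing piece is everything around this: discovering the nodes, pinning each process (and its memory) to one of them, and wiring it all together, which is where some tokio utilities could help.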

@crisidev added the A-tokio and C-feature-request labels on Oct 5, 2022
@crisidev
Author

crisidev commented Oct 5, 2022

I'll dump my reproduction code in this repo as soon as possible: https://github.com/crisidev/tokio-numa-perf-repro

@Noah-Kennedy
Contributor

I've been thinking about how to support this as well.

@LucioFranco
Member

I think ideally we may want either some stuff behind unstable or in tokio-util that allows you to construct a runtime such that it fits well within the NUMA arch. Kinda like some sort of advanced toolkit.
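For example, a minimal sketch of the kind of helper I mean, using today's Builder API plus the core_affinity crate (the list of cores belonging to a node would have to come from hwloc, /sys, or similar, and is assumed as an input here):

use std::sync::atomic::{AtomicUsize, Ordering};

// Build a runtime whose threads are pinned to the cores of a single NUMA node.
fn numa_pinned_runtime(node_cores: Vec<usize>) -> std::io::Result<tokio::runtime::Runtime> {
    let n = node_cores.len();
    let next = AtomicUsize::new(0);
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(n)
        .enable_all()
        .on_thread_start(move || {
            // Hand out the node's cores round-robin as runtime threads start.
            let i = next.fetch_add(1, Ordering::Relaxed) % n;
            core_affinity::set_for_current(core_affinity::CoreId { id: node_cores[i] });
        })
        .build()
}

Whether something like this belongs in tokio proper, tokio-util, or behind tokio_unstable is the open question.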

@crisidev
Author

crisidev commented Oct 6, 2022

Yeah.. This is what I was thinking. I could find some time to give it a try with some guidance @LucioFranco.

@Noah-Kennedy
Contributor

I agree that we should use the unstable flag for this so that we can iterate more quickly here.

Regarding the actual design, I'd be interested in knowing where the bottlenecks are under NUMA. In particular, I'm interested in work-stealing and the IO driver.

Another point which might be worth investigating is what perf looks like with a single non-NUMA-aware multi_thread runtime vs having multiple runtimes (one per NUMA node) that use SO_REUSEPORT to split incoming accepts between them all.

@Noah-Kennedy
Contributor

Hypothesis: I think we might be bottlenecking on wakes.

Our current architecture serializes waking from our driver. Theoretically, given enough sockets and tasks (say, in a high-throughput server), we could expect waking tasks to become a bottleneck, which would likely manifest as low CPU utilization due to threads blocking on receiving wakes.

This is potentially made worse by the fact that systems with high core counts often have NUMA, which can make cross-thread atomic operations, such as wakes, far more expensive.

@crisidev I have some asks from you here.

Can you take a look at what CPU utilization looks like across your benchmark runs? Theoretically, we should see overall underutilization here, as plenty of worker capacity would be left on the table due to slow wakes.

Also, can you get flamegraphs of the running systems? I would expect to see a relatively high percentage of time spent parking relative to actually doing work.

@crisidev
Author

crisidev commented Oct 7, 2022

@Noah-Kennedy sure, I'll get some data.

I can already tell you that I was able to see underutilization of CPUs, with plenty of unused capacity. Tomorrow I should have some time to redo the tests and produce a flamegraph that we can analyse.

Regarding the utilization, do you have anything in particular in mind? I can grab a screenshot / terminal recording of htop to show the underutilization, but if you want something specific, I am happy to gather it.

@Noah-Kennedy
Contributor

I think just watching it on htop is enough

@crisidev
Author

crisidev commented Oct 8, 2022

I gathered some data running this code here:

$ wrk -t16 -c1024 -d10s --latency http://localhost:3000/

Running 10s test @ http://localhost:3000/
  16 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.22ms   10.96ms 427.33ms   98.96%
    Req/Sec    48.87k    17.69k  104.40k    75.76%
  Latency Distribution
     50%    0.86ms
     75%    1.77ms
     90%    3.69ms
     99%   13.48ms
  7643207 requests in 10.10s, 0.90GB read
  Socket errors: connect 19, read 0, write 0, timeout 0
Requests/sec: 756721.48
Transfer/sec:     91.65MB

@Noah-Kennedy
Contributor

Were debug symbols enabled on that flamegraph?

@crisidev
Author

crisidev commented Oct 9, 2022

I ran the flamegraph in release mode with debug=true configured in Cargo.toml. I have little to no experience with flamegraph, so I have probably not run it the right way.
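For reference, the Cargo.toml setting I mean is just:

[profile.release]
debug = true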

@jonhoo
Sponsor Contributor

jonhoo commented Oct 9, 2022

I have in the past found that the flamegraphs over tokio executions provide more helpful output if you run with forced frame pointers (through rustflags).
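With current rustc that means building with something like:

$ RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release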

I think this particular flamegraph actually does have debug symbols, but most of the time is somehow spent in the kernel. You can tell Linux to allow resolving kernel symbols by doing:

$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

Alternatively, I think sudo perf record will also have the same effect. See if that helps!

@crisidev
Author

Thanks Jon. I will rerun the tests as soon as I have some time, following your advice. I think you are right, it looks like a lot of time is just spent in the kernel. Let's see if we can figure out what it is doing!

@Noah-Kennedy
Contributor

@crisidev which Linux kernel version are you on?

@crisidev
Author

I'll be 100% sure once I redo the tests, but the instance I used was with Ubuntu 22.04, which should ship with kernel 5.15.

@Noah-Kennedy
Contributor

Ah. I was curious because the Linux kernel, especially in some older versions, has often not been as well optimized for AMD chips as it is for Intel chips.

@crisidev
Author

I just spun up a new machine and I can confirm it comes with 5.15.0.

@crisidev
Author

Here is the new flamegraph, filled with nice symbols. @jonhoo's advice worked perfectly!

@Noah-Kennedy
Contributor

Wait, why is there forking occurring?

@Noah-Kennedy
Contributor

This benchmark seems to be spending all of its time returning from a fork.

@Noah-Kennedy
Contributor

@crisidev can you link the benchmark code or provide steps so that I can try and replicate the benchmark?

@crisidev
Author

All the code is here; if you look at setup.sh, flamegraph.sh and README.md, you should be able to reproduce what I did without issues.

@Noah-Kennedy
Contributor

Could you also send a flamegraph from your intel system? I'm curious if that spends all of its time in ret_from_fork as well.

@crisidev
Author

crisidev commented Oct 10, 2022

Here it is. It is VERY different from the AMD NUMA machine.

I did not notice at first, but looking at your picture here on GitHub, I think we met at RustConf 2022, so howdy, nice to talk with you again :)

@Noah-Kennedy
Contributor

The Intel flamegraph actually looks normal. I have no idea why on the EPYC chip you are stuck returning from a fork for much of the time.

@crisidev
Author

I think fork is used to balance the thread load between different NUMA nodes. I am taking a wild guess here: on a non-NUMA system, spawning a thread is cheap, while here it involves a full fork every time the kernel balancer moves a thread to a different node. I have nothing to back up this theory. Can anyone with more understanding of these systems tell me if I am completely off?

This LKML thread is interesting IMO: https://lore.kernel.org/lkml/20190605155922.17153-1-matt@codeblueprint.co.uk/T/

@crisidev
Author

I have just retested on the AMD Epyc, this time pinning the POC binary to NUMA node 0 (32 cores), both for CPU and memory.

I am now able to reach more or less the same performance as a non-NUMA machine with 32 cores. Core utilization is much higher than before.
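(For reference, that kind of pinning can be done with numactl, along these lines; the binary path is a placeholder:)

$ numactl --cpunodebind=0 --membind=0 ./target/release/<server-binary>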

wrk result:

Running 10s test @ http://localhost:3000/
  32 threads and 2048 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   725.42us  489.83us  13.22ms   85.70%
    Req/Sec    69.81k    31.11k  111.71k    75.01%
  Latency Distribution
     50%  568.00us
     75%    0.87ms
     90%    1.30ms
     99%    2.58ms
  14038225 requests in 10.10s, 1.66GB read
Requests/sec: 1390315.58
Transfer/sec:    168.39MB

@Noah-Kennedy
Contributor

I think fork is used to balance the thread load between different NUMA nodes. I am taking a wild guess here: on a non-NUMA system, spawning a thread is cheap, while here it involves a full fork every time the kernel balancer moves a thread to a different node. I have nothing to back up this theory. Can anyone with more understanding of these systems tell me if I am completely off?

TIL

@Noah-Kennedy
Contributor

Noah-Kennedy commented Oct 11, 2022

@crisidev this is quite interesting. Could you please try two more runs with NUMA, both with a 120 second duration? On one run, the only change from earlier should be the longer duration. On the other, could you shrink the blocking thread pool to 16 threads?

Sorry I keep having to ask you to run benchmarks, but I don't currently have access to an EPYC system of my own.
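For the blocking pool change, a minimal sketch, assuming the repro builds its runtime explicitly rather than through #[tokio::main]:

fn main() -> std::io::Result<()> {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .max_blocking_threads(16)
        .enable_all()
        .build()?;
    rt.block_on(async {
        // serve as before
    });
    Ok(())
}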

@crisidev
Author

It's ok, don't worry, I have the capacity and I can run the benchmarks as much as you like. I'll rerun everything tomorrow following what you asked. I'll figure out if I can give you access to a test machine where we can play around together.

@bobrik

bobrik commented Oct 11, 2022

Regarding __intel_pmu_enable_all on the flamegraph: it's something that's AMD specific (contrary to the name). You can make it go away if you use -e cpu-clock rather than the default -e cycles. The latter seems to count idle cycles on AMD.
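For example, with plain perf (the profiled command is a placeholder):

$ perf record -e cpu-clock -g -- ./target/release/<server-binary>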
