
Default Runtime configuration / Executor & Reactor on different threads is suboptimal #265

Closed · sdroege opened this issue Mar 29, 2018 · 7 comments


sdroege commented Mar 29, 2018

See the testcase linked below; it is a simplification of some real code but exposes the same behaviour.

The basic workload is lots of UDP sockets that receive packets, do some not-too-complicated work on them and send packets out again. In the testcase the packets are simply sent back and no actual work happens on them; in the real application, audio RTP packets are received, transcoded and forwarded.

The number of sockets in the testcase can be configured with the N_PORTS constant at the top, and the actual "runtime" can be configured with the N_THREADS constant.

-1 runs the Reactor and Executor on the very same thread, single-threaded. Values of 0 or higher basically work like the tokio Runtime: they run the Reactor on a background thread and use a thread-pool Executor. 0 uses the default thread-pool size (4 on my machine); any higher number uses exactly that many threads for the pool.

The code can be found here: https://github.com/sdroege/tokio-udp-benchmark
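
For illustration, the two shapes that N_THREADS selects between look roughly like the following sketch. Note that this is written against today's tokio 1.x builder API rather than the 0.1-era types the testcase actually uses, and the port and buffer size are arbitrary placeholders:

```rust
use tokio::net::UdpSocket;

// The testcase's per-socket workload: receive a datagram and send it
// straight back without doing any real work on it.
async fn echo(sock: UdpSocket) -> std::io::Result<()> {
    let mut buf = [0u8; 1500];
    loop {
        let (n, peer) = sock.recv_from(&mut buf).await?;
        sock.send_to(&buf[..n], peer).await?;
    }
}

fn main() -> std::io::Result<()> {
    // N_THREADS == -1: Reactor and Executor share a single thread.
    let single = tokio::runtime::Builder::new_current_thread()
        .enable_io()
        .build()?;

    // N_THREADS >= 0: a thread-pool Executor (0 picks the default pool
    // size, any higher number that exact number of worker threads).
    let _pool = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_io()
        .build()?;

    single.block_on(async {
        let sock = UdpSocket::bind("127.0.0.1:5000").await?;
        echo(sock).await
    })
}
```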

Results on my machine (built in release mode!) are around 40% CPU usage with the default number of threads (basically the tokio Runtime) and around 22% with everything single-threaded. This is on Linux; I expect results to be worse on macOS (which has notoriously large overhead for threading/synchronization primitives).

For this scenario this means that the number of concurrent streams that can be handled is about half with the default Runtime. And while the single-threaded case is bound at 100% usage of a single core, it can easily be extended to multiple cores by distributing the sockets over multiple Reactor/Executor pairs that each run on their own thread, as sketched below.
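
A minimal sketch of that multi-core extension, again in today's tokio 1.x terms (thread count, port range and address are placeholders): one single-threaded runtime per OS thread, with the sockets partitioned across them so each socket's packets are always handled on the same thread:

```rust
use std::thread;
use tokio::net::UdpSocket;

const N_THREADS: u16 = 4;
const PORTS_PER_THREAD: u16 = 64;

fn main() {
    let mut handles = Vec::new();

    for t in 0..N_THREADS {
        handles.push(thread::spawn(move || {
            // One single-threaded Reactor + Executor per OS thread.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_io()
                .build()
                .unwrap();

            rt.block_on(async move {
                let mut tasks = Vec::new();
                for p in 0..PORTS_PER_THREAD {
                    let port = 5000 + t * PORTS_PER_THREAD + p;
                    let sock = UdpSocket::bind(("127.0.0.1", port)).await.unwrap();
                    // The echo loop for this socket never leaves this
                    // thread, so there are no cross-thread wakeups.
                    tasks.push(tokio::spawn(async move {
                        let mut buf = [0u8; 1500];
                        loop {
                            match sock.recv_from(&mut buf).await {
                                Ok((n, peer)) => {
                                    let _ = sock.send_to(&buf[..n], peer).await;
                                }
                                Err(_) => break,
                            }
                        }
                    }));
                }
                for task in tasks {
                    let _ = task.await;
                }
            });
        }));
    }

    for h in handles {
        h.join().unwrap();
    }
}
```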

My guess is that for all scenarios that are mostly IO-bound (that is: lots of packets and very little work per packet), the default Runtime does not work very well and the threading overhead is killing performance. I expect the same can be observed with e.g. trust-dns.
For CPU-bound scenarios the default Runtime probably works well, though.

kpp commented Mar 29, 2018

Woah! Here is the code for a single-threaded runtime. This issue is related to #235.

@tanriol will you implement the single-threaded runtime, or shall I?

tanriol commented Mar 29, 2018

@sdroege Thank you for the example code, I was unable to grasp this from the documentation either!

@kpp Yeah, I'm planning to implement it some time soon. However, I first need to understand how to integrate it with the timer implementation from #249.
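
For reference, the shape this eventually took means the timer can be driven on the same single thread as the I/O reactor. A minimal sketch in today's tokio 1.x terms (not the #249-era API):

```rust
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Both the I/O and timer drivers run on the current-thread
    // runtime, so timeouts do not need an extra thread.
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_io()
        .enable_time()
        .build()?;

    rt.block_on(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("timer fired on the single-threaded runtime");
    });
    Ok(())
}
```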

carllerche commented Mar 29, 2018

@sdroege Thanks for the report. The test case you have provided is very helpful.

> my machine

Could you clarify the details of your machine (specifically the number of cores and whether it is NUMA-based or not)?

For the record, a performance degradation is currently expected for micro-benchmark situations that are unable to take advantage of concurrency. However, I was hoping that it would be no more than 20%.

I would also say that this performance degradation is (hopefully) a temporary issue. The current implementation of the stack has a number of known performance issues that will be fixed over time. I am trying to get the Tokio stack to a "feature complete" state first before we start tuning the implementation details; this will help guide performance-related changes. I believe that, once tuned, the Runtime with threading can either be close enough to a single-threaded setup that it doesn't matter, or be faster.

Also, having a set of workloads like the one you provided is going to be very helpful for the tuning process. So, the more the better!

However, it is understandable that users would like to avoid the performance penalty immediately, which is why #235 should get done.

The rough roadmap I'm looking at is:

  • Integrate timers + some additional thread pool features.
  • Solid test / benchmark suite covering cases we care about.
  • Tune

Spoiler: My hunch is that there will not be a dedicated reactor thread by default in the future.

sdroege commented Mar 29, 2018

> Could you clarify the details of your machine (specifically the number of cores and whether it is NUMA-based or not)?

Intel i7-4790K, a quad-core with 8 hyperthreads.

> a performance degradation is currently expected for micro-benchmark situations

Note that while the above is arguably a microbenchmark, the actual application exhibits exactly the same behaviour. It receives audio RTP packets, decodes them, encodes them again and sends them out via another socket. All of this requires almost no CPU by itself; the tricky part is only the relatively high packet rate and the large number of very small packets.

> before we start tuning the implementation

This also seems to be kind of an architectural problem, not an optimization: for some workloads the current default is great, for others it is not.

With regard to optimizations, there are a few other things I noticed while profiling that are independent of the threading architecture.

carllerche commented Mar 29, 2018

> This also seems to be kind of an architectural problem, not an optimization.

I guess I'm not sure where you draw the line between architectural vs. tuning, but I definitely plan on tweaking what runs on which threads.

sdroege commented Apr 2, 2018

> @kpp Yeah, I'm planning to implement it some time soon. However, I first need to understand how to integrate it with the timer implementation from #249.

@tanriol I've now integrated that into the testcase too, see sdroege/tokio-udp-benchmark@5ead278 (the very last part of the commit). Hope this helps!

tobz commented Nov 28, 2018

I'm going to close this, as it should be resolved by #660, even if performance parity with the current-thread runtime isn't quite there yet. On top of that, tokio-io-pool exists if the default multithreaded runtime is still too slow.

tobz closed this Nov 28, 2018
