Default Runtime configuration / Executor & Reactor on different threads is suboptimal #265
See testcase, which is a simplification from some real code but exposes the same behaviour.
The basic workload is lots of UDP sockets that receive packets, do some fairly simple work on them, and send packets out again. In the testcase the packets are just sent back and no actual work happens on them; in the real application, audio RTP packets are received, transcoded and forwarded.
The number of sockets in the testcase can be configured with the
The code can be found here: https://github.com/sdroege/tokio-udp-benchmark
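For readers without the repository at hand, the shape of the workload (many UDP sockets, each echoing small packets back to the sender) can be sketched with the standard library alone. This is not the testcase from the repository and does not use tokio: it uses blocking `std::net::UdpSocket` with one thread per socket, purely to illustrate the receive-then-send-back pattern. The function names `echo_packets` and `run_echo_round_trip`, the 160-byte payload and the loopback addresses are assumptions made for this sketch.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::thread;
use std::time::Duration;

/// Receive `count` packets on `socket` and send each one straight back
/// to its sender. Returns the total number of payload bytes echoed.
fn echo_packets(socket: &UdpSocket, count: usize) -> io::Result<usize> {
    let mut buf = [0u8; 1500];
    let mut bytes = 0;
    for _ in 0..count {
        let (len, peer) = socket.recv_from(&mut buf)?;
        // No work is done on the packet here; the real application
        // would transcode the audio before forwarding it.
        socket.send_to(&buf[..len], peer)?;
        bytes += len;
    }
    Ok(bytes)
}

/// Hypothetical driver: bind `n_sockets` echo sockets on loopback (one
/// blocking thread each), push `packets_per_socket` packets through each
/// in lock-step, and return the total number of bytes received back.
fn run_echo_round_trip(n_sockets: usize, packets_per_socket: usize) -> io::Result<usize> {
    // 160 bytes is an arbitrary small payload chosen for this sketch.
    let payload = [0xABu8; 160];

    let mut servers = Vec::new();
    let mut addrs: Vec<SocketAddr> = Vec::new();
    for _ in 0..n_sockets {
        // Port 0 lets the OS pick a free port for each socket.
        let socket = UdpSocket::bind("127.0.0.1:0")?;
        addrs.push(socket.local_addr()?);
        servers.push(thread::spawn(move || echo_packets(&socket, packets_per_socket)));
    }

    let client = UdpSocket::bind("127.0.0.1:0")?;
    // Fail with an error instead of blocking forever if a packet is lost.
    client.set_read_timeout(Some(Duration::from_secs(5)))?;

    let mut buf = [0u8; 1500];
    let mut total = 0;
    for addr in &addrs {
        for _ in 0..packets_per_socket {
            client.send_to(&payload, addr)?;
            let (len, _) = client.recv_from(&mut buf)?;
            total += len;
        }
    }

    for server in servers {
        server.join().expect("echo thread panicked")?;
    }
    Ok(total)
}
```

Calling `run_echo_round_trip(4, 10)` pushes 40 packets through four sockets. Since almost no CPU time is spent per packet, the cost is dominated by per-packet overhead, which is the property that makes this kind of workload sensitive to how the executor and reactor are arranged.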
Results on my machine (build in release mode!) are around 40% with the default number of threads (basically the tokio
For this scenario it means that the number of concurrent streams is two times lower with the default
My guess is that for all scenarios that are mostly IO-bound (or: lots of packets and very little work per packet), the default
@sdroege Thanks for the report. The test case you have provided is very helpful.
Could you clarify the details of your machine (specifically the number of cores and whether it is NUMA-based or not)?
For the record, a performance degradation is currently expected for microbenchmark situations that are unable to take advantage of concurrency. However, I was hoping that it would be no more than 20%.
I would also say that this performance degradation is (hopefully) a temporary issue. The current implementation of the stack has a number of known performance issues that will be fixed over time. I am trying to get the Tokio stack to a "feature complete" state first before we start tuning the implementation details. This will help guide performance related changes. I believe that, once tuned, the Runtime w/ threading can either be close enough that it doesn't matter or faster than a single threaded setup.
Also, having a set of workloads like the one you provided is going to be very helpful for the tuning process. So, the more the better!
However, it is understandable that users would like to avoid the performance penalty immediately, which is why #235 should get done.
The rough roadmap I'm looking at is:
Spoiler: My hunch is that there will not be a dedicated reactor thread by default in the future.
Intel i7-4790K, quad-core with 8 hyperthreads
Note that while the above is arguably a microbenchmark, the actual application exposes exactly the same behaviour. It receives audio RTP packets, decodes them, encodes them again and sends them out via another socket. All of this requires almost no CPU by itself; the tricky part is only the relatively high packet rate, with lots of very small packets.
This also seems to be an architectural problem rather than an optimization issue. For some workloads the current default is great, for others it is not.
With regard to optimizations, there are a few other things I noticed in passing while profiling that are independent of the threading architecture.