Troubleshooting Tokio stalling and overhead in multiple custom libraries #8176
Replies: 2 comments 1 reply
I would separate this into two experiments before changing much code: runtime ownership and per-operation wakeups.
The runtime-per-filesystem design is a very plausible source of the 4-5x loss. If you mount 40 filesystems and each one creates its own multi-thread Tokio runtime, you can easily end up with far more scheduler workers plus blocking-pool workers than useful CPU cores. For this kind of stack, I would prefer one application-owned runtime that every filesystem shares:
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(num_cpus::get_physical())
        .max_blocking_threads(/* intentionally bounded */)
        .enable_all()
        .build()?;
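To put rough numbers on the oversubscription argument, here is a small std-only sketch of the arithmetic. The function names and the 8-core machine are assumptions made up for illustration. The defaults it refers to (one worker per core and a blocking-pool limit of 512 for an unconfigured multi-thread runtime) are Tokio's documented defaults. Blocking threads are spawned on demand, so the second number is an upper bound, not a steady-state count.

```rust
/// Upper bound on the threads created by `runtimes` independent multi-thread
/// Tokio runtimes, returned as (scheduler workers, blocking-pool threads).
///
/// `workers_per_runtime` and `max_blocking_per_runtime` are whatever each
/// runtime was built with. An unconfigured Builder uses one worker per core
/// and allows up to 512 blocking threads.
fn thread_budget(
    runtimes: usize,
    workers_per_runtime: usize,
    max_blocking_per_runtime: usize,
) -> (usize, usize) {
    (
        runtimes * workers_per_runtime,
        runtimes * max_blocking_per_runtime,
    )
}

/// Number of scheduler workers competing for each core.
fn oversubscription(total_workers: usize, cores: usize) -> f64 {
    total_workers as f64 / cores as f64
}
```

With 40 runtimes on a hypothetical 8-core machine, `thread_budget(40, 8, 512)` gives 320 scheduler workers and a ceiling of 20480 blocking threads, so `oversubscription(320, 8)` is 40 workers per core. A single shared runtime with the same settings stays at 8 workers, which is the case the first experiment would compare against.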
A useful minimum benchmark matrix would be:
If this helps narrow it down, please mark the comment as the answer so others can find the diagnostic path faster.
@yudin-s I think there is another problem: the current FUSE implementation's event loop spawns short-lived handler tasks. I wonder whether Tokio is designed to handle that many, or whether I should follow an actor model instead. And yes, regarding oversubscription: although the flamegraph test activates only one running filesystem, running multiple stress tests at once seems to cause an even bigger slowdown (cargo may run into this, and an unfortunate combination can stretch the test suite to 13 minutes instead of the usual 2).
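For reference, the actor shape mentioned above could look roughly like this. It is a runtime-agnostic sketch built only on `std` so it runs without dependencies; in a Tokio version the analogous pieces would be a bounded `tokio::sync::mpsc::channel` and one long-lived `tokio::spawn` task. The `Request` type, `spawn_actor`, and `open_many` are hypothetical names for illustration, not anything taken from the libraries in this thread.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical request type standing in for a decoded FUSE operation.
enum Request {
    Open { inode: u64, reply: SyncSender<u64> },
    Shutdown,
}

/// Starts one long-lived actor that owns the handle table and serves
/// requests from a bounded queue. No task is created per request.
/// The join handle yields the number of entries in the table at shutdown.
fn spawn_actor(queue_depth: usize) -> (SyncSender<Request>, JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<Request>(queue_depth);
    let actor = thread::spawn(move || {
        // State is owned by the actor alone, so it needs no lock.
        let mut table: HashMap<u64, u64> = HashMap::new();
        let mut next_handle = 0u64;
        for request in rx {
            match request {
                Request::Open { inode, reply } => {
                    next_handle += 1;
                    table.insert(next_handle, inode);
                    // The caller may have gone away; ignore a failed reply.
                    let _ = reply.send(next_handle);
                }
                Request::Shutdown => break,
            }
        }
        table.len()
    });
    (tx, actor)
}

/// Opens `n` inodes through the actor, one at a time, and returns the
/// handles it issued plus the final table size.
fn open_many(n: u64) -> (Vec<u64>, usize) {
    let (tx, actor) = spawn_actor(64);
    let mut handles = Vec::new();
    for inode in 0..n {
        let (reply_tx, reply_rx) = sync_channel(1);
        tx.send(Request::Open { inode, reply: reply_tx }).unwrap();
        handles.push(reply_rx.recv().unwrap());
    }
    tx.send(Request::Shutdown).unwrap();
    (handles, actor.join().unwrap())
}
```

Note that each request still costs a queue push and a reply wakeup, so this is not automatically cheaper than spawning a task per request. It mainly removes lock sharing around the state and bounds the amount of in-flight work, which makes it something to measure against the current design instead of assuming it wins.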
I am currently writing and/or porting a stack of libraries that facilitate creating a fully asynchronous passthrough FUSE filesystem that supports contextual encrypted views, with most of the I/O communication relying on batched io_uring operations.
Here is the source code of the lower-level libraries, all of which are built upon Tokio:
I have been encountering snags where the overhead incurred by various asynchronous and inter-thread communication mechanisms is so large that it causes a 4-5x reduction in operation throughput compared to an older synchronous mechanism. Here is the flamegraph of the example, where I tried to open 64k handles across 40 partitions. futex_wake (due to lock contention or channel starvation) often occupies 13-20% of total CPU time.
I would like to make a few educated guesses first: