Troubleshooting Tokio stalling and overhead in multiple custom libraries #8176

khanhtranngoccva · 2026-05-29T07:07:29Z

I am currently writing and/or porting a stack of libraries that facilitate creating a fully asynchronous passthrough FUSE filesystem that supports contextual encrypted views, with most of the I/O communication relying on batched io_uring operations.

Here is the source code of the lower-level libraries, all of which are built on Tokio:

io_uring client with fallback that supports most major file operations
Asynchronous FUSE library port, based on cberner's original library

I keep running into cases where the overhead of the various asynchronous and inter-thread communication mechanisms is so large that operation throughput drops 4-5x compared to an older synchronous mechanism. Here is the flamegraph of the example, in which I tried to open 64k handles across 40 partitions. futex_wake (due to lock contention or channel starvation) often accounts for 13-20% of total CPU time.

I would like to make a few educated guesses first:

The FUSER library manages its own runtime per filesystem instead of letting the application create the runtime. This may lead to thread oversubscription and thrashing. I plan to fix this by allowing callers to pass runtime handles. However, I do not yet know whether the long execution time of transition_from_searching is caused by oversubscription.
spawn_blocking holds up threads significantly because of contention on a shared mutex; the fix for that contention was rolled back because it introduced a regression. I tried to mitigate this by skipping spawn_blocking for calls that are nearly instantaneous, such as fcntl (F_GETFL/F_SETFL), or that take very little time, such as a read of /proc/{pid}/stat.
There is starvation in my I/O library: not enough I/O requests enter the queue, so an idle penalty appears. There are a few possible reasons: my test cases may not be representative because I only had up to 40 simultaneous operations at any time, the round-robin algorithm may not be intelligent enough to stay on the same I/O queue, or spawn_blocking and transition penalties may be involved.
The I/O completion queue reaper in the io_uring library wakes only 1 waiting task at a time, which slows everything down because every call results in a futex_wake. I think this also contributes to the I/O starvation.
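
To make the round-robin guess concrete, here is a minimal std-only sketch contrasting plain round-robin queue selection with a partition-affine choice. The names `RoundRobin`, `pick_affine`, `partition_id`, and `num_queues` are hypothetical and not taken from the linked libraries; the sketch only illustrates what "staying on the same I/O queue" could mean.

```rust
/// Hypothetical selector that rotates through the queues on every request,
/// so consecutive requests from one partition land on different queues.
pub struct RoundRobin {
    next: usize,
    num_queues: usize,
}

impl RoundRobin {
    pub fn new(num_queues: usize) -> Self {
        assert!(num_queues > 0, "need at least one queue");
        Self { next: 0, num_queues }
    }

    /// Returns the index of the queue to use for the next request.
    pub fn pick(&mut self) -> usize {
        let queue = self.next;
        self.next = (self.next + 1) % self.num_queues;
        queue
    }
}

/// Hypothetical affine selector: the same partition always maps to the
/// same queue, so its requests can be batched on one submission queue.
pub fn pick_affine(partition_id: u64, num_queues: usize) -> usize {
    assert!(num_queues > 0, "need at least one queue");
    (partition_id % num_queues as u64) as usize
}
```

Whether affinity actually helps here is an open question; it trades load balance for locality, so it would need to be measured against the round-robin version.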

yudin-s · 2026-05-29T07:21:23Z

I would separate this into two experiments before changing much code: runtime ownership and per-operation wakeups.

The runtime-per-filesystem design is a very plausible source of the 4-5x loss. If you mount 40 filesystems and each one creates its own multi-thread Tokio runtime, you can easily end up with far more scheduler workers plus blocking-pool workers than useful CPU cores. For this kind of stack, I would prefer one application-owned runtime and pass a Handle/executor into the FUSE layer. Then benchmark with:

let runtime = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(num_cpus::get_physical())
    .max_blocking_threads(16) // intentionally bounded; 16 is only a placeholder to tune
    .enable_all()
    .build()?;
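
To put a rough number on the oversubscription concern, here is a small worked calculation. It assumes an 8-core machine (an illustrative figure, not from this thread) and relies on Tokio's documented default of one worker thread per core for a multi-thread runtime. Blocking-pool threads are spawned on demand on top of this, up to a default cap of 512 per runtime.

```rust
/// Scheduler worker threads that exist when `runtimes` runtimes each
/// start `workers_per_runtime` workers.
pub fn scheduler_threads(runtimes: usize, workers_per_runtime: usize) -> usize {
    runtimes * workers_per_runtime
}

/// How many threads compete for each core.
pub fn oversubscription_ratio(total_threads: usize, cores: usize) -> f64 {
    total_threads as f64 / cores as f64
}
```

With 40 runtimes on the assumed 8 cores, `scheduler_threads(40, 8)` is 320 workers, a ratio of 40 threads per core. One shared runtime gives `scheduler_threads(1, 8)`, which is 8 workers and a ratio of 1.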

For the spawn_blocking part, the decision rule I use is: do not use spawn_blocking for tiny syscalls just because the API is synchronous. The handoff, queueing, wakeup, and return path can cost more than the syscall. Reserve it for calls that can actually block for filesystem/device time, or move those to a small dedicated thread pool that is specific to the FUSE/io_uring bridge.
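
The handoff cost can be seen in a std-only sketch that runs the same operation inline and via another thread. This is not how Tokio's blocking pool is implemented (the pool reuses threads, while this sketch spawns one per call); it only shows the extra steps an offload adds around a tiny operation. `run_inline` and `run_offloaded` are hypothetical helper names.

```rust
use std::sync::mpsc;
use std::thread::{self, ThreadId};

/// Runs `op` on the calling thread: no queueing and no wakeups.
/// Returns the result and the id of the thread that executed it.
pub fn run_inline<T>(op: impl FnOnce() -> T) -> (T, ThreadId) {
    (op(), thread::current().id())
}

/// Hands `op` to another thread and waits for the result. The path is:
/// start/wake the worker, run `op`, send the result back, wake the caller.
pub fn run_offloaded<T: Send + 'static>(
    op: impl FnOnce() -> T + Send + 'static,
) -> (T, ThreadId) {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let out = op();
        tx.send((out, thread::current().id()))
            .expect("caller is waiting for the result");
    });
    let result = rx.recv().expect("worker sends exactly one result");
    worker.join().expect("worker did not panic");
    result
}
```

For something like an fcntl flag query, the two cross-thread wakeups in `run_offloaded` are likely to dominate the operation itself, which is the reasoning behind keeping such calls inline.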

The futex_wake profile also suggests that the completion path may be waking one waiter per operation. For io_uring-style batching, try to make the hot path batch at both ends:

submit several SQEs before notifying;
reap several CQEs per poll;
complete multiple waiters from one reaper pass;
avoid a shared mutex/channel for all partitions if the work can be sharded;
avoid per-request oneshot/Mutex traffic in the hottest path if an indexed slab plus Notify/batched wakeup can represent the same state.
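
As a rough illustration of "complete multiple waiters from one reaper pass", here is a std-only sketch of an indexed completion table. The slot index stands in for the CQE user_data, and a `Condvar` stands in for whatever async notification the real library uses; `CompletionTable` and its methods are hypothetical and not part of Tokio or any io_uring crate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

/// Hypothetical completion table: one slot per in-flight request.
pub struct CompletionTable {
    slots: Mutex<Vec<Option<i32>>>,
    ready: Condvar,
    notifications: AtomicUsize, // counts signals sent, for inspection only
}

impl CompletionTable {
    pub fn new(capacity: usize) -> Self {
        Self {
            slots: Mutex::new(vec![None; capacity]),
            ready: Condvar::new(),
            notifications: AtomicUsize::new(0),
        }
    }

    /// Publishes a whole batch of (slot, result) pairs under one lock
    /// acquisition, then signals the waiters once for the entire batch.
    pub fn complete_batch(&self, batch: &[(usize, i32)]) {
        if batch.is_empty() {
            return;
        }
        {
            let mut slots = self.slots.lock().unwrap();
            for &(slot, result) in batch {
                slots[slot] = Some(result);
            }
        } // lock released before signalling
        self.ready.notify_all();
        self.notifications.fetch_add(1, Ordering::Relaxed);
    }

    /// Blocks until the given slot has a result, then takes it.
    pub fn wait(&self, slot: usize) -> i32 {
        let mut slots = self.slots.lock().unwrap();
        loop {
            if let Some(result) = slots[slot].take() {
                return result;
            }
            slots = self.ready.wait(slots).unwrap();
        }
    }

    pub fn notifications(&self) -> usize {
        self.notifications.load(Ordering::Relaxed)
    }
}
```

Completing four requests through `complete_batch` costs one lock acquisition and one signal instead of four of each. The trade-off is that `notify_all` also wakes waiters whose slot is still empty, so an async version would more likely collect the wakers for the completed slots and wake only those after releasing the lock.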

A useful minimum benchmark matrix would be:

one shared runtime vs one runtime per filesystem;
1, 4, 16, 40 mounted filesystems with fixed worker thread count;
short syscalls inline vs through spawn_blocking;
one global queue vs queue per partition/reaper shard.
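
If the four dimensions above are read as a full cross product (an assumption; they could also be varied one at a time), the matrix can be enumerated mechanically. The type and function names below are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuntimeMode { Shared, PerFilesystem }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyscallPath { Inline, SpawnBlocking }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueLayout { Global, PerPartition }

/// One cell of the benchmark matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BenchConfig {
    pub runtime: RuntimeMode,
    pub mounts: usize,
    pub syscalls: SyscallPath,
    pub queues: QueueLayout,
}

/// Enumerates every combination of the four dimensions.
pub fn benchmark_matrix() -> Vec<BenchConfig> {
    let mut configs = Vec::new();
    for runtime in [RuntimeMode::Shared, RuntimeMode::PerFilesystem] {
        for mounts in [1, 4, 16, 40] {
            for syscalls in [SyscallPath::Inline, SyscallPath::SpawnBlocking] {
                for queues in [QueueLayout::Global, QueueLayout::PerPartition] {
                    configs.push(BenchConfig { runtime, mounts, syscalls, queues });
                }
            }
        }
    }
    configs
}
```

That is 2 x 4 x 2 x 2 = 32 configurations, so it may be worth starting with the runtime dimension alone before running the full grid.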

Use tokio-console for task latency/wakeup visibility and perf lock/flamegraphs for the futex side. If transition_from_searching drops sharply when the runtimes are consolidated, you have oversubscription/scheduler churn. If it stays high but futex_wake remains dominant, focus on reducing shared synchronization and batching completions.

If this helps narrow it down, please mark the comment as the answer so others can find the diagnostic path faster.

khanhtranngoccva · 2026-05-29T07:38:47Z

@yudin-s I think there is another problem: the current FUSE implementation's event loop spawns short-lived handler tasks. I wonder whether Tokio is designed to handle that many, or whether I should follow an actor model instead.

And yes, speaking of oversubscription: although the flamegraph test activates only 1 running filesystem, running multiple stress tests at once seems to cause an even bigger slowdown (cargo may run into this, and some unfortunate combinations can stretch the test suite to a very long 13 minutes instead of the usual 2 minutes).

yudin-s May 30, 2026

Tokio is generally fine with many short-lived tasks, so I would not treat task creation alone as the likely problem. The thing to measure is whether the FUSE workload is creating more runnable/blocking work than the runtime and io_uring side can drain, especially when several stress tests run at once.

I would not switch to an actor model just because many tasks are created. I would switch only if it gives you one of these concrete benefits: bounded concurrency, batching/coalescing, fewer shared locks, or clearer backpressure between FUSE and the io_uring side.

A good comparison benchmark would be three versions of the hot path:

1. current: spawn one Tokio task per FUSE request
2. bounded: spawn per request, but gate the expensive part behind a Semaphore
3. actor/worker: send requests over mpsc to N long-lived workers and batch completions where possible
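
Version 2 can be sketched with a counting gate. In Tokio the natural type is `tokio::sync::Semaphore`, whose `acquire().await` yields the task instead of blocking a thread. The std-only stand-in below (hypothetical `Gate` type, blocking threads) only demonstrates the concurrency bound itself.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical counting gate: at most `permits` callers run the
/// expensive section at once. A panic inside `work` leaks its permit;
/// a real version would return it from a drop guard.
pub struct Gate {
    permits: Mutex<usize>,
    freed: Condvar,
}

impl Gate {
    pub fn new(permits: usize) -> Self {
        assert!(permits > 0, "need at least one permit");
        Self { permits: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Waits for a permit, runs `work`, then returns the permit.
    pub fn run<T>(&self, work: impl FnOnce() -> T) -> T {
        {
            let mut available = self.permits.lock().unwrap();
            while *available == 0 {
                available = self.freed.wait(available).unwrap();
            }
            *available -= 1;
        } // lock released while the expensive part runs
        let out = work();
        *self.permits.lock().unwrap() += 1;
        self.freed.notify_one();
        out
    }
}
```

Every request still gets its own task or thread in this shape; only the expensive section is limited, which is what lets the benchmark separate spawn cost from oversubscription.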

If version 2 fixes most of the slowdown, the problem is probably oversubscription/backpressure rather than task creation itself. If version 3 is much better, the actor shape is buying batching, lock locality, or completion handling. tokio-console should also show whether those tasks spend time runnable, waiting on locks/channels, or stuck behind blocking work.
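
For version 3, a minimal std-only sketch of the actor/worker shape is shown below, with threads and `std::sync::mpsc` standing in for Tokio tasks and `tokio::sync::mpsc`. `Request`, `WorkerPool`, and the key-based sharding are hypothetical. Each worker owns its own channel so no receiver is shared, and it drains whatever is already queued to form a batch.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical request; `key` decides which worker handles it.
pub struct Request {
    pub key: u64,
    pub payload: u32,
}

/// Blocks for the first request, then drains up to `max_batch - 1` more
/// that are already queued. Returns the size of every batch it formed.
fn worker_loop(rx: Receiver<Request>, max_batch: usize) -> Vec<usize> {
    let mut batch_sizes = Vec::new();
    while let Ok(first) = rx.recv() {
        let mut batch = vec![first];
        while batch.len() < max_batch {
            match rx.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break, // queue empty or all senders gone
            }
        }
        // A real worker would submit and complete the whole batch here.
        batch_sizes.push(batch.len());
    }
    batch_sizes
}

pub struct WorkerPool {
    senders: Vec<Sender<Request>>,
    workers: Vec<JoinHandle<Vec<usize>>>,
}

impl WorkerPool {
    pub fn new(num_workers: usize, max_batch: usize) -> Self {
        assert!(num_workers > 0 && max_batch > 0);
        let mut senders = Vec::new();
        let mut workers = Vec::new();
        for _ in 0..num_workers {
            let (tx, rx) = mpsc::channel();
            senders.push(tx);
            workers.push(thread::spawn(move || worker_loop(rx, max_batch)));
        }
        Self { senders, workers }
    }

    /// Routes a request to the worker that owns its key.
    pub fn submit(&self, req: Request) {
        let shard = (req.key % self.senders.len() as u64) as usize;
        self.senders[shard].send(req).expect("worker is alive");
    }

    /// Closes the channels and returns each worker's batch sizes.
    pub fn shutdown(self) -> Vec<Vec<usize>> {
        drop(self.senders);
        self.workers
            .into_iter()
            .map(|w| w.join().expect("worker did not panic"))
            .collect()
    }
}
```

`mpsc::channel` is unbounded, so this sketch provides no backpressure. A bounded channel (`std::sync::mpsc::sync_channel` here, or a bounded `tokio::sync::mpsc::channel` in async code) would be needed to get the backpressure benefit mentioned earlier in this reply.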
