SQ_POLL perf issue

Below you can find our unexpected findings from benchmarking the asymmetric io_uring backend: using SQ_POLL leads to degraded performance for networking I/O.

What is SQ_POLL

SQ_POLL is an io_uring feature that tells the kernel to create a special kernel polling thread for an io_uring instance, which continuously polls the SQ for new submissions.

Instead of calling io_uring_enter (which liburing's io_uring_submit calls internally) after submitting each batch of SQEs, the program can skip this syscall while the polling thread is running.

The polling thread is paused if there have not been any new SQEs for some configurable period of time.
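The mechanism described above can be sketched with the raw kernel interface from `<linux/io_uring.h>`. This is a minimal illustration, not the backend's actual code: the helper names `make_sqpoll_params`, `sq_thread_needs_wakeup` and `enter_flags_after_queueing` are hypothetical, while the flags and struct fields are the kernel's own.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: parameters for io_uring_setup() that ask the kernel
 * to start an SQ polling thread which pauses after idle_ms milliseconds
 * without new SQEs. */
static struct io_uring_params make_sqpoll_params(unsigned idle_ms)
{
    /* An idle time of 0 makes the kernel fall back to its default,
     * so this sketch insists on an explicit value. */
    assert(idle_ms > 0);

    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms; /* milliseconds */
    return p;
}

/* sq_flags is the value read from the flags word of the mapped SQ ring.
 * Returns 1 when the polling thread has been paused and must be woken up. */
static int sq_thread_needs_wakeup(unsigned sq_flags)
{
    return (sq_flags & IORING_SQ_NEED_WAKEUP) != 0;
}

/* Flags to pass to io_uring_enter() after queueing SQEs.
 * A result of 0 means the polling thread is running and the syscall
 * can be skipped entirely. */
static unsigned enter_flags_after_queueing(unsigned sq_flags)
{
    return sq_thread_needs_wakeup(sq_flags) ? IORING_ENTER_SQ_WAKEUP : 0;
}
```

A program using liburing does not need to write this check itself, since io_uring_submit performs it internally. It is shown here only to make visible when the syscall is avoided.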

Why bother

The asymmetric io_uring backend was intended to leverage SQ_POLL, with the expectation that the polling thread could be shared across multiple io_uring instances.
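One way the kernel offers for sharing a polling thread is the IORING_SETUP_ATTACH_WQ flag, which attaches a new ring to an existing one through `wq_fd`. The sketch below only builds the setup parameters. The helper name `make_attached_sqpoll_params` is hypothetical, and whether the backend uses exactly this mechanism is an assumption.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: parameters for a ring that should reuse the SQ
 * polling thread of an already created ring (first_ring_fd) instead of
 * getting a polling thread of its own. */
static struct io_uring_params make_attached_sqpoll_params(int first_ring_fd,
                                                          unsigned idle_ms)
{
    assert(first_ring_fd >= 0); /* must be a valid io_uring fd */

    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    /* With both flags set, the kernel shares the polling thread
     * (and the async worker backend) of the ring named by wq_fd. */
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
    p.wq_fd = (unsigned)first_ring_fd;
    p.sq_thread_idle = idle_ms; /* milliseconds */
    return p;
}
```

The first ring would be created with plain IORING_SETUP_SQPOLL, and every later ring would pass the first ring's file descriptor here.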

Expectations

By using SQ_POLL, we anticipated improvements in both latency and throughput. The polling thread continuously checks the submission queue entries (SQEs), which inherently consumes CPU cycles due to busy-waiting when no work is available. This high CPU usage could potentially impact other threads running on the same core, particularly worker threads, which in our design are pinned to the same CPU.

However, worker threads are lightweight in terms of CPU consumption, and in practice they are expected to be invoked infrequently. The assumption was that the SQ_POLL thread would immediately handle SQEs as they arrive, providing low-latency processing. In reality, this did not hold for networking I/O.

Code

You can find the current version of the code probably here. Please note that this is an ongoing project, so some of the information here might be outdated.

Solution overview

Performance

In our benchmarks we found that the higher the ratio of io_uring instances per CPU, the worse this design behaves.

Setup

each shard is pinned to a separate physical CPU (not a vCPU)
there is 1 worker CPU per N shards
- apart from the N CPUs for shards, an (N+1)th CPU/vCPU is used for sq_thread_cpu and the workers' affinity
- see IORING_SETUP_SQPOLL and io_uring_register_iowq_aff
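The pinning of the polling thread described above can be sketched as follows. The CPU numbering is an assumption made for illustration (shards on CPUs 0..N-1, the extra CPU at index N), and the helper name `make_pinned_sqpoll_params` is hypothetical. The actual layout on the benchmark machine may differ.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: parameters that pin the SQ polling thread to the
 * CPU that follows the shard CPUs. Assumes shards occupy CPUs
 * 0..n_shards-1, so the (N+1)th CPU has zero-based index n_shards. */
static struct io_uring_params make_pinned_sqpoll_params(unsigned n_shards)
{
    assert(n_shards > 0);

    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    /* sq_thread_cpu is only honoured when IORING_SETUP_SQ_AFF is set. */
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = n_shards;
    return p;
}
```

The workers' affinity is a separate step: after the ring exists, io_uring_register_iowq_aff from liburing would be called with a CPU mask containing the same CPU.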

With SQ_POLL

1 shard, 1 worker CPU

In this scenario, the asymmetric io_uring backend is on par with other implementations.

io_tester overwrite 128KB parallelism 20

rpc_tester rpc_streaming unidirectional 128kB parallelism 1

7 shards, 1 worker CPU

io_tester overwrite 128KB parallelism 20

Disk I/O still seems to perform really well here

rpc_tester rpc_streaming unidirectional 128kB parallelism 1

Networking I/O via the rpc implementation performs much worse here (per shard)

Comparison

CPUs	benchmark	stats
1 shard + 1 CPU	io_throughput_writes	IOPS
1 shard + 1 CPU	io_latency_reads	p0.5 latencies
1 shard + 1 CPU	io_latency_reads	p0.99 latencies
1 shard + 1 CPU	io_latency_reads	p0.999 latencies
1 shard + 1 CPU	rpc_throughput_uni_128kB	Messages
1 shard + 1 CPU	rpc_latency_write_16kB	p0.5 latencies
1 shard + 1 CPU	rpc_latency_write_16kB	p0.99 latencies
1 shard + 1 CPU	rpc_latency_write_16kB	p0.999 latencies
7 shards + 1 CPU	io_throughput_writes	IOPS
7 shards + 1 CPU	io_latency_reads	p0.5 latencies
7 shards + 1 CPU	io_latency_reads	p0.99 latencies
7 shards + 1 CPU	io_latency_reads	p0.999 latencies
7 shards + 1 CPU	rpc_throughput_uni_128kB	Messages
7 shards + 1 CPU	rpc_latency_write_16kB	p0.5 latencies
7 shards + 1 CPU	rpc_latency_write_16kB	p0.99 latencies
7 shards + 1 CPU	rpc_latency_write_16kB	p0.999 latencies

Benchmarks

used io_tester and rpc_tester with various suites.
run nightly on a shared machine, which might have skewed the results.
multiple iterations were run. Bars are the mean value, error bars are the standard deviation.
networking was tested on the loopback interface.
run on the potwor2 machine; below you can find some of its properties.

vCPUs

Benchmarks were also run with sq_thread_cpu and the workers' affinity pinned to a vCPU instead of a physical CPU. They can be found among the other results.

Full results

nopoll-all.zip poll-all.zip

Questions

is sq_thread somehow starving the workers?
maybe we're hitting maximum number of workers, especially the unbounded ones?
- https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/
