
SQ_POLL perf issue

Marcin Szopa edited this page Mar 3, 2026 · 6 revisions

Below are our unexpected findings from benchmarking the asymmetric io_uring backend: using SQ_POLL leads to degraded performance for networking I/O.

What is SQ_POLL

SQ_POLL is an io_uring feature that tells the kernel to create a dedicated kernel polling thread for an io_uring instance, which continuously polls the SQ for new submissions.

Instead of calling io_uring_enter (which liburing's io_uring_submit calls internally) after submitting each batch of SQEs, the program can skip this syscall while the polling thread is running.

The polling thread is paused if there have been no new SQEs for a configurable period of time.
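As a rough illustration of the setup described above, the sketch below fills in an io_uring_params structure the way a program requesting SQ_POLL might. The helper names make_sqpoll_params and sq_thread_needs_wakeup are hypothetical and not taken from our backend; the flags and fields come from the kernel's <linux/io_uring.h> header, where sq_thread_idle is expressed in milliseconds.

```cpp
#include <linux/io_uring.h>  // io_uring_params, IORING_SETUP_SQPOLL, IORING_SQ_NEED_WAKEUP

// Hypothetical helper: setup parameters for a ring with a kernel SQ polling thread.
// idle_ms is how long the thread keeps polling without new SQEs before it is paused.
inline io_uring_params make_sqpoll_params(unsigned idle_ms) {
    io_uring_params p{};             // zero-initialize every field
    p.flags = IORING_SETUP_SQPOLL;   // ask the kernel for a polling thread
    p.sq_thread_idle = idle_ms;      // the configurable idle period, in milliseconds
    return p;
}

// Hypothetical helper: given the flags word read from the mapped SQ ring,
// report whether the polling thread is paused and must be woken by a syscall.
inline bool sq_thread_needs_wakeup(unsigned sq_ring_flags) {
    return (sq_ring_flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

The resulting structure would be passed to io_uring_setup(2) or to liburing's io_uring_queue_init_params. Once the polling thread has been paused, the kernel sets IORING_SQ_NEED_WAKEUP in the SQ ring flags, and the next submission needs an io_uring_enter call with IORING_ENTER_SQ_WAKEUP. liburing's io_uring_submit performs this check itself.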

Why bother

The asymmetric io_uring backend was intended to leverage SQ_POLL, with the expectation that the polling thread could be shared across multiple io_uring instances.
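One mechanism the kernel offers for this kind of sharing is the IORING_SETUP_ATTACH_WQ flag: a ring created with it, and with wq_fd set to the file descriptor of an existing ring, shares that ring's backend, including its SQ polling thread when IORING_SETUP_SQPOLL is also set. The sketch below only shows how the parameters could be filled in. The helper name is hypothetical, and this is not necessarily how our backend does it.

```cpp
#include <linux/io_uring.h>  // io_uring_params, IORING_SETUP_SQPOLL, IORING_SETUP_ATTACH_WQ

// Hypothetical helper: parameters for a new ring that attaches to an existing
// ring (identified by its file descriptor) instead of getting its own poller.
inline io_uring_params make_attached_sqpoll_params(int existing_ring_fd, unsigned idle_ms) {
    io_uring_params p{};                                      // zero-initialize every field
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;   // poll, and share the backend
    p.wq_fd = static_cast<__u32>(existing_ring_fd);           // ring whose poller is shared
    p.sq_thread_idle = idle_ms;                               // idle period in milliseconds
    return p;
}
```

The first ring would be created with IORING_SETUP_SQPOLL alone, and every further ring with the parameters above, passing the first ring's descriptor.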

Expectations

By using SQ_POLL, we anticipated improvements in both latency and throughput. The polling thread continuously checks the submission queue entries (SQEs), which inherently consumes CPU cycles due to busy-waiting when no work is available. This high CPU usage could potentially impact other threads running on the same core, particularly worker threads, which in our design are pinned to the same CPU.

However, worker threads are lightweight in terms of CPU consumption and, in practice, are expected to be invoked infrequently. The assumption was that the SQ_POLL thread would handle SQEs immediately as they arrive, providing low-latency processing, though in reality the results below did not bear this out for networking I/O.

Code

The current version of the code can probably be found here. Please note that this is an ongoing project, so some of the information on this page might be outdated.

Solution overview

Performance

In our benchmarks we found that the higher the ratio of io_uring instances per CPU, the worse this design behaves.

Setup

  • each shard is pinned to a separate CPU (not vCPU)
  • there is 1 worker CPU per n shards.
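For readers unfamiliar with pinning, the sketch below shows one common way to bind the calling thread to a single CPU on Linux with sched_setaffinity. Both helper names are hypothetical, and the code is not taken from the benchmarked backend, which may pin its threads differently.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, cpu_set_t (glibc; g++ defines _GNU_SOURCE)

// Hypothetical helper: restrict the calling thread to one CPU. Returns true on success.
inline bool pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 means the calling thread
}

// Hypothetical helper: lowest-numbered CPU the calling thread may run on, or -1 on error.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}
```

A shard thread would call pin_current_thread_to_cpu with its own CPU number, while all worker threads would be given the single shared worker CPU.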

With SQ_POLL

1 shard, 1 worker CPU

In this scenario, the asymmetric io_uring backend is on par with other implementations.

io_tester overwrite 128KB parallelism 20
image
rpc_tester rpc_streaming unidirectional 128kB parallelism 1
image

7 shards, 1 worker CPU

io_tester overwrite 128KB parallelism 20

Disk I/O still seems to perform really well here.

image
rpc_tester rpc_streaming unidirectional 128kB parallelism 1

Networking I/O via the RPC implementation performs much worse here (per shard).

image

Comparison

| CPUs | benchmark | stats | with SQ_POLL | without SQ_POLL |
| --- | --- | --- | --- | --- |
| 1 shard + 1 CPU | io_throughput_writes | IOPS | image | image |
| 1 shard + 1 CPU | io_latency_reads | p0.5 latencies | image | image |
| 1 shard + 1 CPU | io_latency_reads | p0.99 latencies | image | image |
| 1 shard + 1 CPU | io_latency_reads | p0.999 latencies | image | image |
| 1 shard + 1 CPU | rpc_throughput_uni_128kB | Messages | image | image |
| 1 shard + 1 CPU | rpc_latency_write_16kB | p0.5 latencies | image | image |
| 1 shard + 1 CPU | rpc_latency_write_16kB | p0.99 latencies | image | image |
| 1 shard + 1 CPU | rpc_latency_write_16kB | p0.999 latencies | image | image |
| 7 shards + 1 CPU | io_throughput_writes | IOPS | image | image |
| 7 shards + 1 CPU | io_latency_reads | p0.5 latencies | image | image |
| 7 shards + 1 CPU | io_latency_reads | p0.99 latencies | image | image |
| 7 shards + 1 CPU | io_latency_reads | p0.999 latencies | image | image |
| 7 shards + 1 CPU | rpc_throughput_uni_128kB | Messages | image | image |
| 7 shards + 1 CPU | rpc_latency_write_16kB | p0.5 latencies | image | image |
| 7 shards + 1 CPU | rpc_latency_write_16kB | p0.99 latencies | image | image |
| 7 shards + 1 CPU | rpc_latency_write_16kB | p0.999 latencies | image | image |

Benchmarks

  • io_tester and rpc_tester were used with various suites.
  • The benchmarks ran nightly on a shared machine, which might have skewed the results.
  • Multiple iterations were run. Bars are the mean value, error bars are std_dev.
  • Networking was tested on the loopback interface.
  • The benchmarks ran on the potwor2 machine; below you can find some of its properties.
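To make the bars and error bars mentioned in the list above concrete, here is a minimal sketch of how a mean and a standard deviation can be computed over the per-iteration results. The summary type and summarize function are hypothetical. The sketch uses the sample standard deviation (n - 1 denominator), which is an assumption, since the page does not say which variant the plots use.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type: what one bar and its error bar would show.
struct summary {
    double mean;
    double std_dev;
};

// Mean and sample standard deviation (n - 1 denominator) of the iteration results.
inline summary summarize(const std::vector<double>& samples) {
    assert(!samples.empty());
    double sum = 0.0;
    for (double s : samples) {
        sum += s;
    }
    const double mean = sum / samples.size();

    double squared_deviations = 0.0;
    for (double s : samples) {
        squared_deviations += (s - mean) * (s - mean);
    }
    // A single iteration has no spread to estimate, so report 0 in that case.
    const double variance =
        samples.size() > 1 ? squared_deviations / (samples.size() - 1) : 0.0;
    return {mean, std::sqrt(variance)};
}
```

With few iterations on a shared machine, a wide error bar relative to the mean is a hint that the run was disturbed by other load.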

vCPUs

Benchmarks were also run with sq_thread_cpu and the workers' affinity pinned to a vCPU rather than a real CPU. They can be found among the other results.
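For reference, the kernel honors sq_thread_cpu only when IORING_SETUP_SQ_AFF is set together with IORING_SETUP_SQPOLL. The sketch below shows how such parameters could be filled in; the helper name is hypothetical and the values used in our runs are not reproduced here.

```cpp
#include <linux/io_uring.h>  // io_uring_params, IORING_SETUP_SQPOLL, IORING_SETUP_SQ_AFF

// Hypothetical helper: parameters for a ring whose SQ polling thread is bound to one CPU.
inline io_uring_params make_pinned_sqpoll_params(unsigned cpu, unsigned idle_ms) {
    io_uring_params p{};                                   // zero-initialize every field
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;   // poll, and honor sq_thread_cpu
    p.sq_thread_cpu = cpu;                                 // CPU the polling thread is bound to
    p.sq_thread_idle = idle_ms;                            // idle period in milliseconds
    return p;
}
```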

Full results

nopoll-all.zip poll-all.zip

Questions