SQ_POLL perf issue
Below are our unexpected findings from benchmarking the asymmetric io_uring backend: using SQ_POLL leads to degraded performance for networking I/O.
SQ_POLL is an io_uring feature that tells the kernel to create a dedicated kernel polling thread for an io_uring instance, which continuously polls the SQ for new submissions.
Normally, io_uring_enter is called (internally, by liburing's io_uring_submit) after each batch of SQEs is submitted. With SQ_POLL, the program can skip this syscall as long as the polling thread is running.
The polling thread is paused if there have been no new SQEs for a configurable period of time.
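The submit-side behaviour described above can be illustrated with a small toy model. This is only a sketch: `ToySqPollRing`, its fields, and the millisecond clock are hypothetical names invented for illustration, and nothing here talks to the kernel. The only real identifier is the `IORING_SQ_NEED_WAKEUP` flag, which the kernel sets in the SQ ring flags when the polling thread goes to sleep, and which tells the submitter that an `io_uring_enter` call is needed again.

```python
from dataclasses import dataclass

# Real flag value from <linux/io_uring.h>: the kernel sets it in the SQ ring
# flags when the polling thread has gone to sleep.
IORING_SQ_NEED_WAKEUP = 1 << 0


@dataclass
class ToySqPollRing:
    """Toy model of one SQ_POLL ring; it performs no real I/O or syscalls."""

    idle_ms: int               # plays the role of sq_thread_idle
    sq_flags: int = 0          # the polling thread starts out awake
    last_activity_ms: int = 0  # last time the poller saw new SQEs
    enter_calls: int = 0       # how many io_uring_enter calls were needed

    def submit(self, now_ms: int) -> bool:
        """Publish a batch of SQEs at now_ms; return True if a syscall was needed."""
        # Model the poller going to sleep after idle_ms without new SQEs.
        if now_ms - self.last_activity_ms >= self.idle_ms:
            self.sq_flags |= IORING_SQ_NEED_WAKEUP

        needed_syscall = bool(self.sq_flags & IORING_SQ_NEED_WAKEUP)
        if needed_syscall:
            # Stands in for io_uring_enter(..., IORING_ENTER_SQ_WAKEUP).
            self.enter_calls += 1
            self.sq_flags &= ~IORING_SQ_NEED_WAKEUP

        self.last_activity_ms = now_ms
        return needed_syscall


if __name__ == "__main__":
    ring = ToySqPollRing(idle_ms=1000)
    for t in (0, 100, 200, 5000, 5100):
        print(t, "syscall" if ring.submit(t) else "no syscall")
```

In this model, closely spaced submissions never pay for a syscall, while a submission that arrives after the idle period pays for exactly one wakeup.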
The asymmetric io_uring backend was intended to leverage SQ_POLL, with the expectation that the polling thread could be shared across multiple io_uring instances.
By using SQ_POLL, we anticipated improvements in both latency and throughput. The polling thread continuously checks the submission queue entries (SQEs), which inherently consumes CPU cycles due to busy-waiting when no work is available. This high CPU usage could potentially impact other threads running on the same core, particularly worker threads, which in our design are pinned to the same CPU.
However, worker threads are lightweight in terms of CPU consumption, and in practice they are expected to be invoked infrequently. The assumption was that the SQ_POLL thread would handle SQEs immediately as they arrive, providing low-latency processing. In reality, the benchmarks did not bear this out for networking I/O.
The current version of the code should be available here. Please note that this is an ongoing project, so some of the information on this page may be outdated.
In our benchmarks we found that the higher the ratio of io_uring instances per CPU, the worse this design behaves.
- each shard is pinned to a separate CPU (not vCPU)
- there is 1 worker CPU per N shards.
- apart from the N CPUs for shards, the (N+1)th CPU/vCPU is used for `sq_thread_cpu` and workers' affinity; see IORING_SETUP_SQPOLL and io_uring_register_iowq_aff
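The CPU layout in the list above can be sketched as a small helper. This is a minimal illustration under assumptions: `CpuLayout`, `plan_layout`, and the `first_cpu` parameter are hypothetical names, and the contiguous, zero-based CPU numbering is an assumption, not something the benchmark setup states. It does not set any affinity itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CpuLayout:
    shard_cpus: Tuple[int, ...]   # one CPU per shard
    sq_thread_cpu: int            # CPU that would be passed as sq_thread_cpu
    worker_cpus: Tuple[int, ...]  # CPUs that would form the workers' affinity mask


def plan_layout(n_shards: int, first_cpu: int = 0) -> CpuLayout:
    """Assign N CPUs to shards and the (N+1)th CPU to the SQ thread and workers."""
    if n_shards < 1:
        raise ValueError("at least one shard is required")
    shard_cpus = tuple(range(first_cpu, first_cpu + n_shards))
    shared_cpu = first_cpu + n_shards  # the (N+1)th CPU, shared by poller and workers
    return CpuLayout(shard_cpus, shared_cpu, (shared_cpu,))


if __name__ == "__main__":
    print(plan_layout(1))
    print(plan_layout(7))
```

With 7 shards, for example, this puts the polling thread and all workers for seven io_uring instances on a single extra CPU, which is the configuration where the per-CPU ratio of io_uring instances is highest.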
In this scenario, the asymmetric io_uring backend is on par with other implementations.
Disk I/O still seems to perform really well here.
Networking I/O via the rpc implementation performs much worse here (per shard).
| CPUs | benchmark | statistic | with SQ_POLL | without SQ_POLL |
|---|---|---|---|---|
| 1 shard + 1 CPU | io_throughput_writes | IOPS | (plot) | (plot) |
| 1 shard + 1 CPU | io_latency_reads | p0.5 latencies | (plot) | (plot) |
| 1 shard + 1 CPU | io_latency_reads | p0.99 latencies | (plot) | (plot) |
| 1 shard + 1 CPU | io_latency_reads | p0.999 latencies | (plot) | (plot) |
| 1 shard + 1 CPU | rpc_throughput_uni_128kB | Messages | (plot) | (plot) |
| 1 shard + 1 CPU | rpc_latency_write_16kB | p0.5 latencies | (plot) | (plot) |
| 1 shard + 1 CPU | rpc_latency_write_16kB | p0.99 latencies | (plot) | (plot) |
| 1 shard + 1 CPU | rpc_latency_write_16kB | p0.999 latencies | (plot) | (plot) |
| 7 shards + 1 CPU | io_throughput_writes | IOPS | (plot) | (plot) |
| 7 shards + 1 CPU | io_latency_reads | p0.5 latencies | (plot) | (plot) |
| 7 shards + 1 CPU | io_latency_reads | p0.99 latencies | (plot) | (plot) |
| 7 shards + 1 CPU | io_latency_reads | p0.999 latencies | (plot) | (plot) |
| 7 shards + 1 CPU | rpc_throughput_uni_128kB | Messages | (plot) | (plot) |
| 7 shards + 1 CPU | rpc_latency_write_16kB | p0.5 latencies | (plot) | (plot) |
| 7 shards + 1 CPU | rpc_latency_write_16kB | p0.99 latencies | (plot) | (plot) |
| 7 shards + 1 CPU | rpc_latency_write_16kB | p0.999 latencies | (plot) | (plot) |
- used `io_tester` and `rpc_tester` with various suites.
- run nightly, on a shared machine, which might have skewed the results.
- multiple iterations were run. Bars are the mean value, error bars are std_dev.
- networking was tested on the loopback interface.
- run on: potwor2 machine, below you can find some of its properties.
Benchmarks were also run with `sq_thread_cpu` and workers' affinity pinned to a vCPU rather than a real CPU. They can be found among the other results.
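For reference, the bar height and error bar mentioned in the benchmark notes (mean and std_dev across iterations) could be computed as in the sketch below. `summarize` is a hypothetical helper, and the use of the sample standard deviation is an assumption: the notes do not say whether the sample or population form was used.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) for one benchmark's iterations."""
    if len(samples) < 2:
        raise ValueError("need at least two iterations to estimate a deviation")
    return mean(samples), stdev(samples)


if __name__ == "__main__":
    # Made-up values for illustration only; these are not benchmark results.
    bar, error_bar = summarize([100.0, 110.0, 90.0])
    print(f"bar={bar}, error bar={error_bar}")
```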
- is `sq_thread` somehow starving the workers?
- maybe we're hitting the maximum number of workers, especially the unbounded ones?