Asymmetric io_uring backend
- move I/O handling away from the shards
- currently, some of the I/O operations are performed during syscalls in the reactor loop
- the io_uring API can offload operations to the configured CPUs, relieving the shards so they can focus on computation
io_uring is a Linux API for asynchronous I/O. It is built around two ring buffers, called the Submission Queue (SQ) and the Completion Queue (CQ). A program schedules an operation by placing a request on the Submission Queue, and reaps the results that the kernel posts to the Completion Queue. The API offers many optimizations, including ways to reduce the number of syscalls. See the io_uring(7) Linux manual page.
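The submit/reap cycle over two rings can be illustrated with a toy, single-threaded model. This is only a sketch: `ring`, `request`, `completion`, and `fake_kernel_run` are hypothetical names invented for illustration, and the layout does not match the real kernel structures, which live in shared memory and need memory barriers.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Toy ring buffer (NOT the kernel layout). head_ and tail_ grow monotonically;
// the slot index is the counter masked by (N - 1), so N must be a power of two.
template <typename T, std::uint32_t N>
class ring {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> slots_{};
    std::uint32_t head_ = 0;  // next entry to consume
    std::uint32_t tail_ = 0;  // next free slot to fill
public:
    bool full() const { return tail_ - head_ == N; }
    bool push(const T& v) {
        if (full()) return false;
        slots_[tail_ & (N - 1)] = v;
        ++tail_;
        return true;
    }
    std::optional<T> pop() {
        if (head_ == tail_) return std::nullopt;  // empty
        T v = slots_[head_ & (N - 1)];
        ++head_;
        return v;
    }
};

// Hypothetical request/result types; user_data ties a result to its request.
struct request { std::uint64_t user_data; int value; };
struct completion { std::uint64_t user_data; int result; };

// Stand-in for the kernel side: consume requests from the SQ, "execute" them
// (here: double the value) and post one completion per request to the CQ.
template <std::uint32_t N>
int fake_kernel_run(ring<request, N>& sq, ring<completion, N>& cq) {
    int done = 0;
    while (!cq.full()) {
        auto r = sq.pop();
        if (!r) break;
        cq.push({r->user_data, r->value * 2});
        ++done;
    }
    return done;
}
```

The application only ever pushes to the SQ and pops from the CQ; everything in between belongs to the other side, which is what makes it possible to run that side on a different CPU.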
Application cores are the CPUs where Seastar apps run. In the typical scenario, each shard is pinned to one of the application cores, so there is an obvious correspondence: shard ~ core.
On large machines (with a huge number of CPUs), some of the vCPUs aren't used as application cores, mainly because there are not enough NIC queues. These are referred to below as networking cores. Let's assume that they make up about 1/8 of the total number of CPUs + vCPUs.
The new backend aims to use the networking cores for I/O operations. The io_uring instances would be configured so that the kernel performs all operations only on the networking cores, leaving the application cores for app logic rather than I/O.
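As a quick worked example of the assumed split, the sketch below computes how many cores fall on each side when 1/8 of the CPUs are networking cores. `core_split` and `split_cores` are hypothetical names, and the 1/8 share is only the assumption stated above, not a measured value.

```cpp
#include <cstddef>

// Hypothetical result type: how many cores end up in each role.
struct core_split {
    std::size_t networking;
    std::size_t application;
};

// Assumes the 1/8 share from the text; integer division rounds down,
// so machines with fewer than 8 CPUs get no networking core here.
constexpr core_split split_cores(std::size_t total_cpus) {
    const std::size_t networking = total_cpus / 8;
    return {networking, total_cpus - networking};
}
```

For 64 CPUs this gives 8 networking cores and 56 application cores, that is, a 1 : 7 ratio.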
- io_uring makes it possible to specify where the kernel spawns and runs the workers that are responsible for performing the operations. See io_uring_register_iowq_aff.
- we will associate a single networking core with a single set of (sq_poll_thread, workers affinity)
- as the ratio of networking to application cores is 1 : 7, multiple io_uring instances would have to "share" the CPUs where the workers run. However, io_uring offers an option to share a single worker thread pool among multiple io_uring instances: IORING_SETUP_ATTACH_WQ. This prevents workers from different thread pools from stealing time from each other, as there is only one pool per CPU.
- each shard has its own io_uring instance
- a shard has access only to its own instance and communicates with it directly
flowchart TD
subgraph urings0
subgraph uring0["io_uring 0"]
SQ0["SQ"]
CQ0["CQ"]
end
subgraph uring1["io_uring 1"]
SQ1["SQ"]
CQ1["CQ"]
end
subgraph uring2["io_uring 2"]
SQ2["SQ"]
CQ2["CQ"]
end
subgraph uring3["io_uring 3"]
SQ3["SQ"]
CQ3["CQ"]
end
end
subgraph SHARDS0 ["Shards"]
S0["Shard0"]
S1["Shard1"]
S2["Shard2"]
S3["Shard3"]
end
S0 --> |submit| SQ0
SQ0 --> |result| CQ0
CQ0 --> |reap| S0
S1 --> |submit| SQ1
SQ1 --> |result| CQ1
CQ1 --> |reap| S1
S2 --> |submit| SQ2
SQ2 --> |result| CQ2
CQ2 --> |reap| S2
S3 --> |submit| SQ3
SQ3 --> |result| CQ3
CQ3 --> |reap| S3
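The per-ring configuration described above could be prepared roughly as follows. This is a sketch under assumptions, not the backend's actual code: `make_shard_ring_params` and `make_worker_mask` are hypothetical helpers, and pinning the polling thread with IORING_SETUP_SQPOLL plus IORING_SETUP_SQ_AFF is one reading of "sq_poll_thread" in the list. The code only fills in the structures and does not call into the kernel.

```cpp
#include <linux/io_uring.h>  // struct io_uring_params, IORING_SETUP_* flags
#include <sched.h>           // cpu_set_t, CPU_ZERO, CPU_SET

// Hypothetical helper: setup parameters for one shard's ring.
//   net_cpu   - networking core that should host the SQ polling thread
//   leader_fd - fd of an already created ring in the same group,
//               or -1 when this is the first ring of the group
inline io_uring_params make_shard_ring_params(unsigned net_cpu, int leader_fd) {
    io_uring_params p{};
    // Kernel-side SQ polling thread, pinned to the networking core.
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = net_cpu;
    if (leader_fd >= 0) {
        // Share the backend of the group's first ring instead of creating a new one.
        p.flags |= IORING_SETUP_ATTACH_WQ;
        p.wq_fd = static_cast<__u32>(leader_fd);
    }
    return p;
}

// Hypothetical helper: affinity mask that restricts io_uring workers
// to a single networking core.
inline cpu_set_t make_worker_mask(unsigned net_cpu) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(net_cpu, &mask);
    return mask;
}
```

With liburing, the parameters would then be passed to `io_uring_queue_init_params(entries, &ring, &params)` and the mask to `io_uring_register_iowq_aff(&ring, sizeof(mask), &mask)`; error handling and the choice of `entries` are left out here.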
- number of networking cores with workers running on them: min(shards, |networking_cores|)
- each utilized networking core has exactly one thread pool pinned to it
- io_urings are grouped and evenly distributed between the networking cores. The io_urings within a group are attached to each other and share both the worker thread pool and the polling thread.
flowchart TD
subgraph NETWORKING_CORES["Networking cores"]
subgraph URING_WORKERS["IO_URING Workers"]
W0["io_uring Thread pool W0"]
W1["io_uring Thread pool W1"]
end
end
subgraph Group0
subgraph urings0
subgraph uring0["io_uring 0"]
SQ0["SQ"]
CQ0["CQ"]
end
subgraph uring1["io_uring 1"]
SQ1["SQ"]
CQ1["CQ"]
end
end
subgraph SHARDS0 ["Shards"]
S0["Shard0"]
S1["Shard1"]
end
end
subgraph Group1
subgraph urings1
subgraph uring2["io_uring 2"]
SQ2["SQ"]
CQ2["CQ"]
end
subgraph uring3["io_uring 3"]
SQ3["SQ"]
CQ3["CQ"]
end
end
subgraph SHARDS1 ["Shards"]
S2["Shard2"]
S3["Shard3"]
end
end
W0 --> |Work on| urings0
W1 --> |Work on| urings1
S0 --> |work with| uring0
S1 --> |work with| uring1
S2 --> |work with| uring2
S3 --> |work with| uring3
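The grouping rule can be sketched as a small mapping from shard index to group index. `used_networking_cores` and `assign_groups` are hypothetical names, and the contiguous-block split is an assumption: the text only requires the distribution to be even.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of networking cores that actually host a worker thread pool.
inline std::size_t used_networking_cores(std::size_t shards,
                                         std::size_t networking_cores) {
    return std::min(shards, networking_cores);
}

// Returns, for every shard, the index of its group (one group per utilized
// networking core). Shards are split into contiguous blocks whose sizes
// differ by at most one.
inline std::vector<std::size_t> assign_groups(std::size_t shards,
                                              std::size_t networking_cores) {
    std::vector<std::size_t> group_of(shards);
    const std::size_t groups = used_networking_cores(shards, networking_cores);
    for (std::size_t s = 0; s < shards; ++s) {
        group_of[s] = s * groups / shards;
    }
    return group_of;
}
```

For 4 shards and 2 networking cores this yields groups {0, 0, 1, 1}, matching the diagram: shards 0 and 1 share the W0 pool, shards 2 and 3 share W1.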