Asymmetric io_uring backend

Marcin Szopa edited this page Mar 3, 2026 · 7 revisions

Idea

Move I/O handling away from the shards:
- Currently, some I/O operations are performed synchronously, as syscalls made inside the reactor loop.
- The io_uring API can offload operations to specific CPUs.
- With I/O offloaded, shards can focus on computation.

No heuristics for networking

The existing backend implementations have two main paths for I/O requests:

- Networking I/O: A heuristic speculates whether the operation will complete without blocking. If so, the syscall is made immediately. If the operation is expected to block, or the speculation turns out to be wrong, the respective backend mechanism is used instead.
- Disk I/O: Always uses the respective backend mechanism (poll, then syscall, for the epoll/linux-aio backends, or SQE submission to io_uring for the io_uring backend).

Following the principle of offloading work to dedicated CPUs, the asymmetric io_uring backend eliminates the speculative fast-track path for networking I/O. All I/O operations, both networking and disk, are consistently handled through the io_uring mechanism, ensuring that application cores remain focused on computation rather than I/O speculation and handling.
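For contrast, the speculative fast path that this backend removes can be pictured roughly as follows. This is a simplified illustration, not Seastar's actual code: the function name `try_speculative_recv` is hypothetical, and the real heuristic that decides when to attempt the call is not shown.

```cpp
#include <cerrno>
#include <cstddef>
#include <optional>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper, for illustration only.
// Attempts a non-blocking receive on the calling (application) core.
// Returns the byte count if the call completed right away, or std::nullopt
// if it would have blocked and the caller must fall back to the backend
// mechanism (poll and retry, or SQE submission).
std::optional<ssize_t> try_speculative_recv(int fd, void* buf, std::size_t len) {
    ssize_t n = ::recv(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0) {
        return n;             // speculation succeeded: data (or EOF) was ready
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return std::nullopt;  // speculation failed: nothing ready yet
    }
    return n;                 // -1: a real error, left for the caller to inspect
}
```

In the asymmetric backend, neither branch runs on the application core: the request is always queued as an SQE and the kernel performs it elsewhere.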

io_uring

io_uring is a Linux API for asynchronous I/O. It is designed around two cyclic buffers: the Submission Queue and the Completion Queue. Programs schedule operations by placing requests on the Submission Queue, and reap the results that the kernel posts to the Completion Queue. The API offers many optimizations, including a reduction in the number of syscalls required. For more details, see the io_uring(7) Linux manual page.

Application cores

Application cores are the CPUs where Seastar applications run. Typically, each shard is pinned to a single application core, establishing a one-to-one mapping between shards and cores.

Networking cores

On large machines with many CPUs, some vCPUs cannot be used as application cores because there are not enough NIC queues. These leftover CPUs are the networking cores. We assume that approximately 1/8 of the total CPUs (or vCPUs) fall into this category.

Asymmetric io_uring

The new backend leverages networking cores to perform I/O operations. The io_uring instances are configured so that the kernel performs all I/O operations exclusively on networking cores, leaving application cores free for application logic.

How

io_uring provides the ability to specify where the kernel spawns and runs workers that perform I/O operations. See io_uring_register_iowq_aff.
- We associate each networking core with its own (sq_poll_thread, worker affinity) pair.
Since the networking-to-application core ratio is 1:7, multiple io_uring instances must share the CPUs where workers run. However, io_uring provides the option to unify worker thread pools across multiple io_uring instances using IORING_SETUP_ATTACH_WQ. This prevents contention between workers from different thread pools by maintaining a single unified pool per CPU.
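As a minimal sketch of the per-core association described above, the pair could be represented as below. The names `core_binding` and `make_core_binding` are hypothetical and not taken from the implementation. The sketch only builds the affinity mask, and the comment shows where the liburing call would go.

```cpp
#include <sched.h>  // cpu_set_t, CPU_ZERO, CPU_SET (glibc; g++ defines _GNU_SOURCE)

// Hypothetical type: what one networking core contributes to its group.
struct core_binding {
    int sq_poll_cpu;            // CPU on which the SQ polling thread runs
    cpu_set_t worker_affinity;  // CPUs on which the kernel workers may run
};

// Hypothetical helper: pins both the poller and the workers to a single
// networking core.
core_binding make_core_binding(int networking_cpu) {
    core_binding b;
    b.sq_poll_cpu = networking_cpu;
    CPU_ZERO(&b.worker_affinity);
    CPU_SET(networking_cpu, &b.worker_affinity);
    return b;
}

// With liburing, the mask would then be applied to a ring roughly as:
//   io_uring_register_iowq_aff(&ring, sizeof(b.worker_affinity),
//                              &b.worker_affinity);
```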

Overview

Each shard has its own io_uring instance
- Shards have exclusive access to their instance and communicate with it directly

flowchart TD
    subgraph urings
        subgraph uring0["io_uring 0"]
            SQ0["SQ"]
            CQ0["CQ"]
        end

        subgraph uring1["io_uring 1"]
            SQ1["SQ"]
            CQ1["CQ"]
        end

        subgraph uring2["io_uring 2"]
            SQ2["SQ"]
            CQ2["CQ"]
        end

        subgraph uring3["io_uring 3"]
            SQ3["SQ"]
            CQ3["CQ"]
        end
    end

    subgraph SHARDS0 ["Shards"]
        S0["Shard0"]
        S1["Shard1"]
        S2["Shard2"]
        S3["Shard3"]
    end

  S0 --> |SQE submitted| SQ0
  SQ0 --> |some result available, CQE posted| CQ0
  CQ0 --> |CQE reaped by the shard| S0

  S1 --> |SQE submitted| SQ1
  SQ1 --> |some result available, CQE posted| CQ1
  CQ1 --> |CQE reaped by the shard| S1

  S2 --> |SQE submitted| SQ2
  SQ2 --> |some result available, CQE posted| CQ2
  CQ2 --> |CQE reaped by the shard| S2

  S3 --> |SQE submitted| SQ3
  SQ3 --> |some result available, CQE posted| CQ3
  CQ3 --> |CQE reaped by the shard| S3

- Number of networking cores with workers running: min(shards, |networking_cores|).
- Each utilized networking core has exactly one thread pool pinned to it.
- io_uring instances are grouped and evenly distributed across networking cores. All io_uring instances within a group are attached to each other, sharing both the worker thread pool and the polling thread.

flowchart TD
  subgraph NETWORKING_CORES["Networking cores"]
    subgraph URING_WORKERS["IO_URING Workers"]
            W0["io_uring Thread pool W0"]
            W1["io_uring Thread pool W1"]
    end
  end

  subgraph Group0
    subgraph urings0
        subgraph uring0["io_uring 0"]
            SQ0["SQ"]
            CQ0["CQ"]
        end

        subgraph uring1["io_uring 1"]
            SQ1["SQ"]
            CQ1["CQ"]
        end
    end

    subgraph SHARDS0 ["Shards"]
        S0["Shard0"]
        S1["Shard1"]
    end
  end

  subgraph Group1
    subgraph urings1
        subgraph uring2["io_uring 2"]
            SQ2["SQ"]
            CQ2["CQ"]
        end

        subgraph uring3["io_uring 3"]
            SQ3["SQ"]
            CQ3["CQ"]
        end
    end

    subgraph SHARDS1 ["Shards"]
        S2["Shard2"]
        S3["Shard3"]
    end
  end


  W0 --> |Work on| urings0
  W1 --> |Work on| urings1

  S0 --> |work with| uring0
  S1 --> |work with| uring1
  S2 --> |work with| uring2
  S3 --> |work with| uring3

We use the SQPOLL feature, which causes the kernel to spawn a thread that polls the Submission Queue for new submissions. This polling thread is shared among all io_uring instances within the same group. See IORING_SETUP_SQPOLL.

flowchart TD
  subgraph NETWORKING_CORES["Networking cores"]
    subgraph CORE0["Networking core 0"]
            POLL0["io_uring SQ poll thread SP0"]
            W0["io_uring Thread pool W0"]
    end
    subgraph CORE1["Networking core 1"]
            POLL1["io_uring SQ poll thread SP1"]
            W1["io_uring Thread pool W1"]
    end
  end

  subgraph Group0
    subgraph urings0
        subgraph uring0["io_uring 0"]
            SQ0["SQ"]
            CQ0["CQ"]
        end

        subgraph uring1["io_uring 1"]
            SQ1["SQ"]
            CQ1["CQ"]
        end
    end
  end

  subgraph Group1
    subgraph urings1
        subgraph uring2["io_uring 2"]
            SQ2["SQ"]
            CQ2["CQ"]
        end

        subgraph uring3["io_uring 3"]
            SQ3["SQ"]
            CQ3["CQ"]
        end
    end
  end


  W0 --> |Work on| urings0

  POLL0 --> |Poll| SQ0
  POLL0 --> |Poll| SQ1

  W1 --> |Work on| urings1

  POLL1 --> |Poll| SQ2
  POLL1 --> |Poll| SQ3
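
The setup flags behind this sharing can be sketched as follows. This is an illustration under assumptions, not the backend's code: `make_ring_params` is a hypothetical helper, and the convention that a negative `leader_ring_fd` means "first ring in the group" is invented for the example. The struct fields and flags come from `<linux/io_uring.h>`.

```cpp
#include <linux/io_uring.h>  // struct io_uring_params, IORING_SETUP_* flags

// Hypothetical helper: builds setup parameters for one ring of a group.
// leader_ring_fd < 0  : this ring is the first of its group and creates the
//                       SQ polling thread and the worker pool.
// leader_ring_fd >= 0 : this ring attaches to the ring with that fd and
//                       shares its polling thread and worker pool.
io_uring_params make_ring_params(unsigned networking_cpu, int leader_ring_fd) {
    io_uring_params p{};  // zero-initialise every field
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = networking_cpu;  // honoured because SQ_AFF is set
    if (leader_ring_fd >= 0) {
        p.flags |= IORING_SETUP_ATTACH_WQ;
        p.wq_fd = static_cast<__u32>(leader_ring_fd);
    }
    return p;
}
```

The resulting struct would be passed to io_uring_setup(2), or to liburing's `io_uring_queue_init_params`, when the ring is created.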

Implementation

The implementation uses the --async-workers-cpuset CLI parameter to specify the CPU set for async workers. The number of networking cores utilized is determined by min(shards, |async_workers_cores|). io_uring instances are evenly distributed among the available worker cores using a simple round-robin algorithm, ensuring balanced workload distribution across the networking cores.
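
A minimal sketch of such a round-robin assignment is shown below. The function name `assign_worker_cpus` is hypothetical, and the exact order in which shards are mapped to cores is an assumption. The diagrams above draw contiguous groups, which the real implementation may or may not produce.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not Seastar's actual code.
// For each shard index, returns the worker CPU that hosts the group its
// io_uring instance belongs to.
std::vector<unsigned> assign_worker_cpus(std::size_t shards,
                                         const std::vector<unsigned>& worker_cpus) {
    std::vector<unsigned> assignment;
    // Only min(shards, |worker_cpus|) cores are ever used.
    const std::size_t used = std::min(shards, worker_cpus.size());
    if (used == 0) {
        return assignment;  // no shards or no worker CPUs: nothing to assign
    }
    assignment.reserve(shards);
    for (std::size_t shard = 0; shard < shards; ++shard) {
        assignment.push_back(worker_cpus[shard % used]);  // round-robin
    }
    return assignment;
}
```

With 4 shards and worker CPUs {8, 9}, this assigns the shards to CPUs 8, 9, 8, 9, so each core serves a group of two io_uring instances. With 2 shards and worker CPUs {8, 9, 10}, only CPUs 8 and 9 are used.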

Additional improvements

TODO: buffers