Asymmetric io_uring backend


Marcin Szopa edited this page Mar 3, 2026 · 7 revisions

Idea

Move I/O handling off the shards.
Currently, some I/O operations are performed inside syscalls made from the reactor loop.
The io_uring API can offload these operations to a configured set of CPUs.
This relieves the shards and lets them focus on computation.

io_uring

io_uring is a Linux API for asynchronous I/O. It is built around two ring buffers, the Submission Queue (SQ) and the Completion Queue (CQ). A program schedules an operation by placing a request on the Submission Queue, and reaps the results posted by the kernel from the Completion Queue. The API offers many optimizations, including ways to reduce the number of syscalls. See the io_uring(7) Linux manual page.
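
To make the two-queue flow concrete, here is a minimal user-space model of it. This is only a sketch: `ring`, `request`, `completion` and `process` are hypothetical names invented for illustration, and the model does not use the real kernel ABI, shared memory or memory barriers. It only shows the submit, process and reap steps described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Fixed-capacity ring indexed by free-running head/tail counters that are
// masked to the capacity (which therefore must be a power of two).
template <typename T, std::size_t N>
class ring {
    static_assert(N > 0 && (N & (N - 1)) == 0, "capacity must be a power of two");
    std::array<T, N> _slots{};
    std::uint32_t _head = 0;  // next entry to consume
    std::uint32_t _tail = 0;  // next free slot to produce into
public:
    std::size_t size() const { return _tail - _head; }

    bool push(const T& v) {
        if (size() == N) {
            return false;  // ring is full
        }
        _slots[_tail & (N - 1)] = v;
        ++_tail;
        return true;
    }

    std::optional<T> pop() {
        if (_head == _tail) {
            return std::nullopt;  // ring is empty
        }
        T v = _slots[_head & (N - 1)];
        ++_head;
        return v;
    }
};

struct request { std::uint64_t user_data; int value; };      // stand-in for an SQE
struct completion { std::uint64_t user_data; int result; };  // stand-in for a CQE

// Plays the kernel's role in this model: drains the submission ring,
// "executes" each request (here: doubles the value) and posts a completion
// carrying the same user_data. Stops early if the completion ring is full.
template <std::size_t N, std::size_t M>
std::size_t process(ring<request, N>& sq, ring<completion, M>& cq) {
    std::size_t done = 0;
    while (cq.size() < M) {
        auto req = sq.pop();
        if (!req) {
            break;
        }
        cq.push({req->user_data, req->value * 2});
        ++done;
    }
    return done;
}
```

In this model the application pushes `request` entries, calls `process` where the kernel would do the work, and then pops `completion` entries, matching them to requests through `user_data`.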

Application cores

CPUs on which Seastar apps run. In the typical scenario, each shard is pinned to one of the application cores, so there is an obvious one-to-one correspondence: shard ~ core.

Networking cores

On a large machine (with a huge number of CPUs), some of the vCPUs aren't used as application cores, mainly because there are not enough NIC queues. Let's assume that these make up about 1/8 of the total number of CPUs + vCPUs.

Asymmetric io_uring

The new backend aims to use the networking cores for I/O operations. The io_uring instances would be configured in such a way that the kernel performs all operations only on the networking cores, leaving the application cores for app logic rather than I/O.

How

io_uring makes it possible to specify where the kernel spawns and runs the workers that perform the operations. See io_uring_register_iowq_aff.
- we will associate a single networking core with a single set of (sq_poll_thread, workers affinity)
As the ratio of networking to application cores is 1 : 7, multiple io_uring instances have to share the CPUs where the workers run. However, io_uring offers an option to share one worker thread pool between multiple io_uring instances: IORING_SETUP_ATTACH_WQ. This way we prevent time stealing between workers from different thread pools, as there is only one pool per CPU.
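
The configuration described above could be prepared roughly as follows. This is a sketch under assumptions, not the backend's actual code: the helper names `leader_params`, `follower_params` and `worker_affinity` are hypothetical, and it assumes every ring in a group requests SQPOLL so that the polling thread can be shared. Only the kernel UAPI header is used, so the block builds the arguments without creating any ring.

```cpp
#include <linux/io_uring.h>  // io_uring_params, IORING_SETUP_* (kernel UAPI header)
#include <sched.h>           // cpu_set_t, CPU_ZERO, CPU_SET

// Setup parameters for the first ("leader") ring of a group: request a
// kernel-side SQ polling thread and pin it to the given networking core.
inline io_uring_params leader_params(unsigned networking_cpu) {
    io_uring_params p{};
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = networking_cpu;
    return p;
}

// Setup parameters for every other ring in the group: attach to the
// backend of the leader ring, identified by its file descriptor.
inline io_uring_params follower_params(int leader_fd) {
    io_uring_params p{};
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
    p.wq_fd = static_cast<__u32>(leader_fd);
    return p;
}

// CPU mask containing only the networking core. With liburing it would be
// passed as io_uring_register_iowq_aff(&ring, sizeof(mask), &mask).
inline cpu_set_t worker_affinity(unsigned networking_cpu) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(networking_cpu, &mask);
    return mask;
}
```

With liburing, the parameter structs would typically be handed to `io_uring_queue_init_params`, with the leader ring created first so that its file descriptor is available to the followers.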

Overview

Each shard has its own instance of io_uring.
- a shard has access only to its own instance and communicates with it directly

flowchart TD
    subgraph urings0
        subgraph uring0["io_uring 0"]
            SQ0["SQ"]
            CQ0["CQ"]
        end

        subgraph uring1["io_uring 1"]
            SQ1["SQ"]
            CQ1["CQ"]
        end

        subgraph uring2["io_uring 2"]
            SQ2["SQ"]
            CQ2["CQ"]
        end

        subgraph uring3["io_uring 3"]
            SQ3["SQ"]
            CQ3["CQ"]
        end
    end

    subgraph SHARDS0 ["Shards"]
        S0["Shard0"]
        S1["Shard1"]
        S2["Shard2"]
        S3["Shard3"]
    end

  S0 --> |submit| SQ0
  SQ0 --> |result| CQ0
  CQ0 --> |reap| S0

  S1 --> |submit| SQ1
  SQ1 --> |result| CQ1
  CQ1 --> |reap| S1

  S2 --> |submit| SQ2
  SQ2 --> |result| CQ2
  CQ2 --> |reap| S2

  S3 --> |submit| SQ3
  SQ3 --> |result| CQ3
  CQ3 --> |reap| S3

Number of networking cores with workers running on them: min(shards, |networking_cores|).
Each utilized networking core has exactly one thread pool pinned to it.
The io_urings are grouped and evenly distributed between the networking cores. All io_urings within a group are attached to each other. They share both the worker thread pool and the polling thread.

flowchart TD
  subgraph NETWORKING_CORES["Networking cores"]
    subgraph URING_WORKERS["IO_URING Workers"]
            W0["io_uring Thread pool W0"]
            W1["io_uring Thread pool W1"]
    end
  end

  subgraph Group0
    subgraph urings0
        subgraph uring0["io_uring 0"]
            SQ0["SQ"]
            CQ0["CQ"]
        end

        subgraph uring1["io_uring 1"]
            SQ1["SQ"]
            CQ1["CQ"]
        end
    end

    subgraph SHARDS0 ["Shards"]
        S0["Shard0"]
        S1["Shard1"]
    end
  end

  subgraph Group1
    subgraph urings1
        subgraph uring2["io_uring 2"]
            SQ2["SQ"]
            CQ2["CQ"]
        end

        subgraph uring3["io_uring 3"]
            SQ3["SQ"]
            CQ3["CQ"]
        end
    end

    subgraph SHARDS1 ["Shards"]
        S2["Shard2"]
        S3["Shard3"]
    end
  end


  W0 --> |Work on| urings0
  W1 --> |Work on| urings1

  S0 --> |work with| uring0
  S1 --> |work with| uring1
  S2 --> |work with| uring2
  S3 --> |work with| uring3
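
The grouping rules above can be sketched as a small placement function. The names `ring_placement` and `place_rings` are hypothetical, and splitting the shards into contiguous blocks is an assumption chosen to match the flowchart (Shard0 and Shard1 in Group0, Shard2 and Shard3 in Group1). Any other even distribution would satisfy the description equally well.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ring_placement {
    unsigned shard;           // shard that owns the io_uring
    unsigned networking_cpu;  // core hosting the group's workers and poller
    unsigned leader_shard;    // first shard of the group; the others attach to its ring
};

// Splits the shards into min(shards, |networking_cpus|) contiguous groups
// whose sizes differ by at most one, and assigns one networking core to
// each group.
std::vector<ring_placement> place_rings(unsigned shards,
                                        const std::vector<unsigned>& networking_cpus) {
    std::vector<ring_placement> out;
    const std::size_t groups = std::min<std::size_t>(shards, networking_cpus.size());
    if (groups == 0) {
        return out;  // no shards or no networking cores
    }
    out.reserve(shards);
    for (unsigned s = 0; s < shards; ++s) {
        // Group index of shard s: floor(s * groups / shards).
        const std::size_t g = static_cast<std::size_t>(s) * groups / shards;
        // First shard that maps to group g: ceil(g * shards / groups).
        const std::size_t leader = (g * shards + groups - 1) / groups;
        out.push_back({s, networking_cpus[g], static_cast<unsigned>(leader)});
    }
    return out;
}
```

For the four shards and two networking cores in the flowchart, this puts shards 0 and 1 on the first core and shards 2 and 3 on the second. With 56 shards and 8 networking cores (the 1 : 7 ratio assumed earlier), each group contains 7 shards.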