
xNVMe ioengine Part 2


This blog post is co-authored by Ankit Kumar and Vincent Fu. We are grateful to Kanchan Joshi, Adam Manzanares, and Krishna Kanth Reddy for their helpful comments, but of course all remaining errors and omissions are ours alone.

In Part 1 we covered the basics of building and using xNVMe, provided a few examples, and compared the performance of xNVMe backends with fio ioengines. In this entry we will cover some xNVMe ioengine options, provide more performance comparisons (including the new io_uring_cmd backend), and discuss xNVMe's SPDK backend.

xNVMe ioengine options

The xNVMe ioengine provides options for polling and vectored I/O and allows selection of the submission path for NVMe administrative commands. Support for polling and vectored I/O options varies by backend. These options are:

hipri
If this option is set, fio will attempt to use polled I/O completions. Normal I/O completions generate interrupts to signal the completion of I/O, but polled completions do not and hence require active reaping by the application. The benefits are more efficient I/O for high-IOPS scenarios and lower latencies for low queue depth I/O; these benefits come at the cost of potentially higher CPU utilization.
Availability: only xnvme_async={libaio, io_uring, io_uring_cmd}

sqthread_poll
Normally fio submits I/O by issuing a system call to notify the kernel of available items in the submission queue. If this option is set, a kernel polling thread performs the submission instead. This reduces context switches from user space to kernel space and frees up cycles for fio, but comes at the cost of using more CPU in the system.
Availability: only xnvme_async={io_uring, io_uring_cmd}

xnvme_iovec
If set, xNVMe will use vectored read/write commands. For non-vectored I/O the submission queue entry addr field contains the source or target buffer address and the len field contains the byte count for the I/O transfer. For vectored I/O, the len field instead specifies the length of the vector of buffers at the addr field. Because fio can currently only send vectors of length 1, this option is admittedly not useful.
Availability: all sync and async backends
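As a quick illustration, a command line exercising completion polling through the xNVMe ioengine could look like the following sketch (the device path and runtime are placeholders, and poll_queues must be set as described later in this post):

fio --name=polled --ioengine=xnvme --xnvme_async=io_uring --hipri=1 \
    --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k \
    --thread --time_based --runtime=10s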

To select the NVMe administrative command submission pathway, specify:

xnvme_admin={nvme, block}

The two pathways are described below:

nvme
This is the default option and uses the Linux/FreeBSD NVMe driver ioctl().

block
If specified, xNVMe will instead use the Linux/FreeBSD block layer ioctl() and sysfs entries for administrative commands. This option is useful when using xNVMe to access a device (such as null_blk) that is not NVMe and for which the NVMe driver ioctls are not available. For more details see the discussion in the xNVMe documentation.
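For instance, to exercise the block pathway against a null_blk device, a sketch along these lines should work (the module parameter and resulting device name are assumptions about your system):

modprobe null_blk nr_devices=1   # typically creates /dev/nullb0
fio --name=nullb --ioengine=xnvme --xnvme_admin=block \
    --filename=/dev/nullb0 --thread --rw=randread --bs=4k --size=1G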

Example: test platform

Let us examine the impact of xNVMe's polling options with some performance data. The table below lists the key hardware and software components of our evaluation system. It comprises low-cost commodity hardware, so the environment can be reconstructed and the experiments reproduced with little effort or expense. We use a low latency SSD as the target device.

Item Description
CPU AMD Ryzen 9 5900X 4.9GHz
Memory 16GB DDR4 2667 MHz
Motherboard MEG X570 GODLIKE (MS-7C34)
Linux Ubuntu 22.04 LTS
fio 3.32
xNVMe 0.5.0
gcc 11.2.0
kernel 6.0.0-rc2

This is the same hardware platform that we used in Part 1 although we are now using a later build of fio, a later release of xNVMe, and a newer kernel.

Example: xNVMe polling options

Let us explore the two polling options for the xNVMe ioengine with the xnvme_async=io_uring backend. We will focus on CPU utilization and the number of context switches.

To effectively use completion polling (hipri) we need to set the NVMe driver's poll_queues parameter, which specifies the number of queues the driver dedicates to polled I/O. The default is zero. Instantiate one queue for polled I/O with the following commands:

cat /sys/block/nvme0n1/queue/io_poll
0
 
rmmod nvme
rmmod nvme-core
insmod drivers/nvme/host/nvme-core.ko
insmod drivers/nvme/host/nvme.ko poll_queues=1
 
cat /sys/block/nvme0n1/queue/io_poll
1
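If the driver modules are installed in the standard location, modprobe accomplishes the same thing; and if the NVMe driver is built into your kernel, the parameter can instead go on the kernel command line. Both alternatives below are sketches, assuming no other users of the device:

modprobe -r nvme
modprobe nvme poll_queues=1

# for a built-in driver, add this to the kernel command line and reboot:
nvme.poll_queues=1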

Below is the fio job file used for our tests:

xnvme-compare1.fio
[global]
thread=1
filename=/dev/nvme0n1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
cpus_allowed=1
norandommap
direct=1
 
[default]
ioengine=xnvme
xnvme_async=io_uring
 
[hipri]
ioengine=xnvme
xnvme_async=io_uring
hipri=1
 
[sqpoll]
ioengine=xnvme
xnvme_async=io_uring
sqthread_poll=1

Here is the output for the three different configurations we tested:

Default configuration
$ fio xnvme-compare1.fio --section=default
default: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=446MiB/s][r=114k IOPS][eta 00m:00s]
default: (groupid=0, jobs=1): err= 0: pid=519771: Thu Oct 13 01:07:36 2022
  read: IOPS=114k, BW=445MiB/s (467MB/s)(26.1GiB/60001msec)
    slat (nsec): min=800, max=30210, avg=873.33, stdev=59.77
    clat (nsec): min=110, max=975201, avg=7694.55, stdev=836.92
     lat (nsec): min=7370, max=976301, avg=8567.89, stdev=839.26
    clat percentiles (nsec):
     |  1.00th=[ 7520],  5.00th=[ 7584], 10.00th=[ 7584], 20.00th=[ 7584],
     | 30.00th=[ 7648], 40.00th=[ 7648], 50.00th=[ 7648], 60.00th=[ 7648],
     | 70.00th=[ 7648], 80.00th=[ 7648], 90.00th=[ 7712], 95.00th=[ 7840],
     | 99.00th=[ 8512], 99.50th=[11968], 99.90th=[14272], 99.95th=[14400],
     | 99.99th=[15936]
   bw (  KiB/s): min=444448, max=458208, per=100.00%, avg=455874.96, stdev=1297.00, samples=119
   iops        : min=111112, max=114552, avg=113968.74, stdev=324.25, samples=119
  lat (nsec)   : 250=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.29%, 20=0.71%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=7.48%, sys=21.07%, ctx=6836163, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=6836299,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s), io=26.1GiB (28.0GB), run=60001-60001msec
Polled completions with hipri
$ fio xnvme-compare1.fio --section=hipri
hipri: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=595MiB/s][r=152k IOPS][eta 00m:00s]
hipri: (groupid=0, jobs=1): err= 0: pid=519927: Thu Oct 13 01:11:25 2022
  read: IOPS=152k, BW=595MiB/s (624MB/s)(34.9GiB/60001msec)
    slat (nsec): min=750, max=104598, avg=816.08, stdev=67.95
    clat (nsec): min=80, max=23389, avg=5606.79, stdev=470.59
     lat (nsec): min=5960, max=110448, avg=6422.86, stdev=474.30
    clat percentiles (nsec):
     |  1.00th=[ 5472],  5.00th=[ 5472], 10.00th=[ 5536], 20.00th=[ 5536],
     | 30.00th=[ 5536], 40.00th=[ 5536], 50.00th=[ 5536], 60.00th=[ 5600],
     | 70.00th=[ 5600], 80.00th=[ 5600], 90.00th=[ 5600], 95.00th=[ 5664],
     | 99.00th=[ 6560], 99.50th=[ 9024], 99.90th=[12224], 99.95th=[12224],
     | 99.99th=[12608]
   bw (  KiB/s): min=607336, max=611000, per=100.00%, avg=609484.77, stdev=1179.68, samples=119
   iops        : min=151834, max=152750, avg=152371.19, stdev=294.92, samples=119
  lat (nsec)   : 100=0.01%, 250=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.59%, 20=0.40%, 50=0.01%
  cpu          : usr=5.91%, sys=94.09%, ctx=243, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=9140795,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=595MiB/s (624MB/s), 595MiB/s-595MiB/s (624MB/s-624MB/s), io=34.9GiB (37.4GB), run=60001-60001msec
Submission queue polling with sqthread_poll
$ fio xnvme-compare1.fio --section=sqpoll
sqpoll: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=499MiB/s][r=128k IOPS][eta 00m:00s]
sqpoll: (groupid=0, jobs=1): err= 0: pid=519964: Thu Oct 13 01:14:31 2022
  read: IOPS=128k, BW=498MiB/s (523MB/s)(29.2GiB/60000msec)
    slat (nsec): min=120, max=2680, avg=241.33, stdev=42.61
    clat (usec): min=6, max=23381, avg= 7.47, stdev= 8.59
     lat (usec): min=7, max=23383, avg= 7.71, stdev= 8.59
    clat percentiles (nsec):
     |  1.00th=[ 7328],  5.00th=[ 7328], 10.00th=[ 7328], 20.00th=[ 7392],
     | 30.00th=[ 7392], 40.00th=[ 7456], 50.00th=[ 7456], 60.00th=[ 7456],
     | 70.00th=[ 7456], 80.00th=[ 7456], 90.00th=[ 7520], 95.00th=[ 7520],
     | 99.00th=[ 8160], 99.50th=[ 9792], 99.90th=[14272], 99.95th=[14272],
     | 99.99th=[15936]
   bw (  KiB/s): min=482216, max=511152, per=100.00%, avg=510294.45, stdev=2935.98, samples=119
   iops        : min=120554, max=127788, avg=127573.61, stdev=733.99, samples=119
  lat (usec)   : 10=99.51%, 20=0.49%, 50=0.01%
  lat (msec)   : 10=0.01%, 50=0.01%
  cpu          : usr=99.98%, sys=0.00%, ctx=112, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7653842,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=498MiB/s (523MB/s), 498MiB/s-498MiB/s (523MB/s-523MB/s), io=29.2GiB (31.3GB), run=60000-60000msec

The table below summarizes the results for the three different configurations, listing IOPS, latency, user-space CPU utilization, system CPU utilization, and the number of context switches experienced by fio. Note that fio's reported user-space and system (kernel) CPU usage and context switches are obtained from a call to getrusage with the RUSAGE_SELF flag. This covers only resource usage by the calling process (fio) itself; CPU usage and context switches in other contexts, such as kernel threads working on fio's behalf, are not included.

                         default    hipri      sqthread_poll
Mean total latency (ns)  8568       6423       7710
IOPS                     114K       152K       128K
CPU usr                  7.48%      5.91%      99.98%
CPU sys                  21.07%     94.09%     0%
Context switches         6836163    243        112

Comparing the results yields the following observations:

  • default: The most striking result is the large number of context switches, essentially one for each read command. These come from completion interrupts as well as system calls as fio submits and reaps completions. Compared to the other jobs, the default configuration has much lower CPU utilization, slightly lower IOPS, and slightly higher latency.
  • hipri: This enables completion polling via the IORING_SETUP_IOPOLL flag. This configuration busy-waits for I/O completions, providing lower latency but much higher CPU utilization than the default. CPU consumption falls mostly in kernel context because most cycles are spent by the kernel polling the device for completions. As the NVMe driver raises no interrupts for I/O completions, the context switches measured by fio are few.
  • sqthread_poll: This enables submission queue polling via the IORING_SETUP_SQPOLL flag. Because a kernel thread polls the submission queue, no context switch is needed during I/O submission. Furthermore, because this kernel thread submits I/O commands from a different CPU than the one running the fio process, the fio process is not disturbed by completion interrupts. Moreover, when submission queue polling is enabled, xNVMe (and fio's io_uring ioengine as well) essentially polls the completion queue from user space. This explains the high user-space CPU consumption: cycles are spent submitting requests and polling the completion queue. To inspect system CPU usage and observe the resource usage of the kernel polling thread, use either htop or the sar utility. For sar, the following reports CPU usage for each core at one-second intervals:
sar -P ALL 1

For more information see cpu-utilization.
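On recent kernels the submission queue polling thread is named iou-sqp-<pid> (an assumption about your kernel's thread naming), so it can also be located directly:

ps -eLf | grep iou-sqp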

Also see the note at the end for a libaio polling example.

Example: xNVMe vs other fio ioengines

Now let us continue the comparison between the xNVMe ioengine's backends and fio ioengines from Part 1.

We show a few more workloads to illustrate the performance penalty introduced by xNVMe. The first series of experiments compares xNVMe's io_uring backend against fio's io_uring ioengine with polled completions. Then we will compare the xNVMe io_uring_cmd backend with fio's new io_uring_cmd ioengine. Later on we will also see how the xNVMe ioengine using the SPDK driver fares against the default SPDK fio ioengine.

The job file we used for the experiments is below.

xnvme-compare2.fio
[global]
thread=1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
cpus_allowed=1
norandommap
 
[uring]
ioengine=io_uring
direct=1
hipri=1
nonvectored=1

[xnvme_uring]
ioengine=xnvme
xnvme_async=io_uring
direct=1
hipri=1

[ucmd]
ioengine=io_uring_cmd
nonvectored=1

[xnvme_ucmd]
ioengine=xnvme
xnvme_async=io_uring_cmd

[ucmd_sqthread]
ioengine=io_uring_cmd
nonvectored=1
sqthread_poll=1

[xnvme_ucmd_sqthread]
ioengine=xnvme
xnvme_async=io_uring_cmd
sqthread_poll=1

Here is the output for the six different configurations:

io_uring ioengine with polled completions
$ fio xnvme-compare2.fio --section=uring --filename=/dev/nvme0n1
uring: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.32-39-g0574e
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=597MiB/s][r=153k IOPS][eta 00m:00s]
uring: (groupid=0, jobs=1): err= 0: pid=520027: Thu Oct 13 01:21:39 2022
  read: IOPS=153k, BW=597MiB/s (626MB/s)(35.0GiB/60001msec)
    slat (nsec): min=700, max=18049, avg=762.65, stdev=59.67
    clat (nsec): min=80, max=26059, avg=5632.44, stdev=471.87
     lat (nsec): min=5920, max=29739, avg=6395.10, stdev=474.23
    clat percentiles (nsec):
     |  1.00th=[ 5472],  5.00th=[ 5536], 10.00th=[ 5536], 20.00th=[ 5536],
     | 30.00th=[ 5536], 40.00th=[ 5600], 50.00th=[ 5600], 60.00th=[ 5600],
     | 70.00th=[ 5600], 80.00th=[ 5600], 90.00th=[ 5664], 95.00th=[ 5792],
     | 99.00th=[ 6560], 99.50th=[ 9152], 99.90th=[12224], 99.95th=[12224],
     | 99.99th=[12864]
   bw (  KiB/s): min=610264, max=611352, per=100.00%, avg=611085.98, stdev=280.43, samples=119
   iops        : min=152566, max=152838, avg=152771.50, stdev=70.11, samples=119
  lat (nsec)   : 100=0.01%, 250=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.59%, 20=0.41%, 50=0.01%
  cpu          : usr=4.99%, sys=95.00%, ctx=219, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=9163979,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=597MiB/s (626MB/s), 597MiB/s-597MiB/s (626MB/s-626MB/s), io=35.0GiB (37.5GB), run=60001-60001msec
 
Disk stats (read/write):
  nvme0n1: ios=9148582/0, merge=0/0, ticks=53935/0, in_queue=53935, util=99.87%
io_uring backend (xNVMe ioengine) with polled completions
$ fio xnvme-compare2.fio --section=xnvme_uring --filename=/dev/nvme0n1
xnvme_uring: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=595MiB/s][r=152k IOPS][eta 00m:00s]
xnvme_uring: (groupid=0, jobs=1): err= 0: pid=520074: Thu Oct 13 01:23:07 2022
  read: IOPS=152k, BW=595MiB/s (623MB/s)(34.8GiB/60000msec)
    slat (nsec): min=770, max=29609, avg=830.05, stdev=59.40
    clat (nsec): min=80, max=106778, avg=5595.02, stdev=471.78
     lat (nsec): min=5940, max=107878, avg=6425.06, stdev=474.11
    clat percentiles (nsec):
     |  1.00th=[ 5472],  5.00th=[ 5472], 10.00th=[ 5472], 20.00th=[ 5536],
     | 30.00th=[ 5536], 40.00th=[ 5536], 50.00th=[ 5536], 60.00th=[ 5536],
     | 70.00th=[ 5536], 80.00th=[ 5600], 90.00th=[ 5600], 95.00th=[ 5664],
     | 99.00th=[ 6496], 99.50th=[ 9024], 99.90th=[12224], 99.95th=[12224],
     | 99.99th=[12608]
   bw (  KiB/s): min=606960, max=610960, per=100.00%, avg=608883.36, stdev=1037.71, samples=119
   iops        : min=151740, max=152740, avg=152220.87, stdev=259.42, samples=119
  lat (nsec)   : 100=0.01%, 250=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.59%, 20=0.40%, 50=0.01%
  lat (usec)   : 250=0.01%
  cpu          : usr=5.91%, sys=94.09%, ctx=231, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=9131935,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=595MiB/s (623MB/s), 595MiB/s-595MiB/s (623MB/s-623MB/s), io=34.8GiB (37.4GB), run=60000-60000msec
io_uring_cmd ioengine
$ fio xnvme-compare2.fio --section=ucmd --filename=/dev/ng0n1
ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring_cmd, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=452MiB/s][r=116k IOPS][eta 00m:00s]
ucmd: (groupid=0, jobs=1): err= 0: pid=520104: Thu Oct 13 01:24:59 2022
  read: IOPS=116k, BW=451MiB/s (473MB/s)(26.4GiB/60001msec)
    slat (nsec): min=670, max=13370, avg=731.30, stdev=54.58
    clat (nsec): min=90, max=2017.0k, avg=7712.97, stdev=1107.87
     lat (usec): min=7, max=2017, avg= 8.44, stdev= 1.11
    clat percentiles (nsec):
     |  1.00th=[ 7520],  5.00th=[ 7584], 10.00th=[ 7584], 20.00th=[ 7648],
     | 30.00th=[ 7648], 40.00th=[ 7648], 50.00th=[ 7648], 60.00th=[ 7712],
     | 70.00th=[ 7712], 80.00th=[ 7712], 90.00th=[ 7712], 95.00th=[ 7776],
     | 99.00th=[ 8512], 99.50th=[11584], 99.90th=[14272], 99.95th=[14400],
     | 99.99th=[15552]
   bw (  KiB/s): min=455120, max=464448, per=100.00%, avg=462195.13, stdev=1356.73, samples=119
   iops        : min=113780, max=116112, avg=115548.78, stdev=339.19, samples=119
  lat (nsec)   : 100=0.01%, 250=0.01%
  lat (usec)   : 4=0.01%, 10=99.32%, 20=0.68%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=7.06%, sys=21.48%, ctx=6930490, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=6930622,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=451MiB/s (473MB/s), 451MiB/s-451MiB/s (473MB/s-473MB/s), io=26.4GiB (28.4GB), run=60001-60001msec
io_uring_cmd backend (xNVMe ioengine)
$ fio xnvme-compare2.fio --section=xnvme_ucmd --filename=/dev/ng0n1
xnvme_ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=449MiB/s][r=115k IOPS][eta 00m:00s]
xnvme_ucmd: (groupid=0, jobs=1): err= 0: pid=520153: Thu Oct 13 01:31:06 2022
  read: IOPS=115k, BW=448MiB/s (470MB/s)(26.3GiB/60001msec)
    slat (nsec): min=710, max=23210, avg=775.53, stdev=58.47
    clat (nsec): min=100, max=137437, avg=7731.31, stdev=492.69
     lat (nsec): min=7920, max=152877, avg=8506.83, stdev=496.66
    clat percentiles (nsec):
     |  1.00th=[ 7520],  5.00th=[ 7584], 10.00th=[ 7648], 20.00th=[ 7648],
     | 30.00th=[ 7648], 40.00th=[ 7648], 50.00th=[ 7712], 60.00th=[ 7712],
     | 70.00th=[ 7712], 80.00th=[ 7712], 90.00th=[ 7712], 95.00th=[ 7840],
     | 99.00th=[ 8512], 99.50th=[11712], 99.90th=[14528], 99.95th=[14528],
     | 99.99th=[14912]
   bw (  KiB/s): min=446616, max=460656, per=100.00%, avg=459226.22, stdev=1388.44, samples=119
   iops        : min=111654, max=115164, avg=114806.55, stdev=347.11, samples=119
  lat (nsec)   : 250=0.01%
  lat (usec)   : 10=99.31%, 20=0.69%, 50=0.01%, 250=0.01%
  cpu          : usr=7.87%, sys=21.32%, ctx=6886793, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=6886881,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=448MiB/s (470MB/s), 448MiB/s-448MiB/s (470MB/s-470MB/s), io=26.3GiB (28.2GB), run=60001-60001msec
io_uring_cmd ioengine with sqthread_poll enabled
$ fio xnvme-compare2.fio --section=ucmd_sqthread --filename=/dev/ng0n1
ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring_cmd, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=546MiB/s][r=140k IOPS][eta 00m:00s]
ucmd: (groupid=0, jobs=1): err= 0: pid=520218: Thu Oct 13 01:34:59 2022
  read: IOPS=139k, BW=545MiB/s (571MB/s)(31.9GiB/60001msec)
    clat (msec): min=1990.5M, max=1990.6M, avg=1990535777.15, stdev=17308.19
     lat (usec): min=6, max=23643, avg= 7.04, stdev= 8.31
    clat percentiles (msec):
     |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
     | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
     | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
     | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=526874, max=571016, per=100.00%, avg=557953.97, stdev=3962.98, samples=119
   iops        : min=131718, max=142754, avg=139488.49, stdev=990.78, samples=119
  lat (msec)   : >=2000=100.00%
  cpu          : usr=99.98%, sys=0.00%, ctx=265, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8366947,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=545MiB/s (571MB/s), 545MiB/s-545MiB/s (571MB/s-571MB/s), io=31.9GiB (34.3GB), run=60001-60001msec
io_uring_cmd backend (xNVMe ioengine) with sqthread_poll enabled
$ fio xnvme-compare2.fio --section=xnvme_ucmd_sqthread --filename=/dev/ng0n1
xnvme_ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=543MiB/s][r=139k IOPS][eta 00m:00s]
xnvme_ucmd: (groupid=0, jobs=1): err= 0: pid=520189: Thu Oct 13 01:33:00 2022
  read: IOPS=138k, BW=540MiB/s (566MB/s)(31.7GiB/60001msec)
    slat (nsec): min=60, max=7060, avg=91.35, stdev= 9.58
    clat (usec): min=6, max=23244, avg= 7.01, stdev= 8.08
     lat (usec): min=6, max=23245, avg= 7.10, stdev= 8.08
    clat percentiles (nsec):
     |  1.00th=[ 6816],  5.00th=[ 6880], 10.00th=[ 6880], 20.00th=[ 6944],
     | 30.00th=[ 6944], 40.00th=[ 6944], 50.00th=[ 6944], 60.00th=[ 6944],
     | 70.00th=[ 7008], 80.00th=[ 7008], 90.00th=[ 7008], 95.00th=[ 7264],
     | 99.00th=[ 7776], 99.50th=[ 9408], 99.90th=[13632], 99.95th=[13760],
     | 99.99th=[16192]
   bw (  KiB/s): min=487264, max=556344, per=100.00%, avg=553311.80, stdev=9104.91, samples=119
   iops        : min=121816, max=139086, avg=138327.95, stdev=2276.23, samples=119
  lat (usec)   : 10=99.55%, 20=0.45%, 50=0.01%
  lat (msec)   : 50=0.01%
  cpu          : usr=99.98%, sys=0.00%, ctx=268, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8298319,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=540MiB/s (566MB/s), 540MiB/s-540MiB/s (566MB/s-566MB/s), io=31.7GiB (34.0GB), run=60001-60001msec

The table below summarizes the results of the six jobs.

Configuration                               Mean total latency (ns)  IOPS  CPU usr  CPU sys  Context switches
xNVMe io_uring backend w/hipri              6425                     152K  5.91%    94.09%   231
io_uring ioengine w/hipri                   6395                     153K  4.99%    95.00%   219
xNVMe io_uring_cmd backend                  8507                     115K  7.87%    21.32%   6886793
io_uring_cmd ioengine                       8440                     116K  7.06%    21.48%   6930490
xNVMe io_uring_cmd backend w/sqthread_poll  7100                     138K  99.98%   0.00%    268
io_uring_cmd ioengine w/sqthread_poll       7040                     139K  99.98%   0.00%    265

xNVMe's overhead is 30 ns (6425 ns vs 6395 ns total latency) for io_uring with polled completions. For io_uring_cmd with default options, xNVMe's overhead is 67 ns (8507 ns vs 8440 ns total latency). Finally, for io_uring_cmd with submission queue polling, xNVMe's overhead is 60 ns (7100 ns vs 7040 ns total latency). These differences are all in the same ballpark as what we observed in Part 1.

These results also provide another opportunity to see the effect of submission queue polling, this time for the io_uring_cmd ioengine and backend. With sqthread_poll=1, latency falls by approximately 1400 ns and IOPS increases by slightly more than 22K. However, CPU utilization increases as well: without polling, fio's user and system CPU usage totals about 30 percent, whereas with polling enabled user CPU usage is nearly 100 percent. Note also the difference in the number of context switches: over six million without polling, but a negligible number with polling enabled.

From kernel 6.1 onward, Linux supports polled completions (hipri) for uring passthrough commands (io_uring_cmd). This was not available in the kernel we used for testing.
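On a 6.1 or later kernel, with the character device and the driver's poll_queues parameter set as shown earlier, a job section along the lines of the sketch below should exercise that path; we have not run it on our 6.0 test kernel:

[xnvme_ucmd_hipri]
ioengine=xnvme
xnvme_async=io_uring_cmd
hipri=1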

SPDK backend

xNVMe bundles the SPDK user-space NVMe driver with the xNVMe library by default, so if you have built and installed xNVMe with the default options the SPDK driver should also be available. The xNVMe user-space section provides an in-depth guide to using SPDK with xNVMe. We will cover some of the highlights here.
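To verify which backends your xNVMe build includes, the xnvme command-line tool can print the library's build configuration (the subcommand name is our assumption from the xNVMe CLI; run xnvme --help to confirm on your version):

xnvme library-info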

Unbinding and binding

With the kernel and hardware configured appropriately, the xnvme-driver script can unbind devices from the kernel NVMe driver and bind them to the vfio-pci driver for use by SPDK.

root@ubuntu:~# xnvme-driver
0000:24:00.0 (8086 4140): nvme -> vfio-pci
 
Current user memlock limit: 1989 MB
 
This is the maximum amount of memory you will be
able to use with DPDK and VFIO if run as current user.
To change this, please adjust limits.conf memlock limit for current user.

The above command unbinds the device from the kernel NVMe driver and binds it to vfio-pci for use by SPDK.

By default the xnvme-driver script allocates 2048 MB of memory using hugepages. To reserve 4096 MB instead, run:

HUGEMEM=4096 xnvme-driver
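The reservation can be confirmed from the kernel's hugepage counters:

grep Huge /proc/meminfo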

To unbind devices from vfio-pci and return them to the kernel NVMe driver run:

root@ubuntu:~# xnvme-driver reset
0000:24:00.0 (8086 4140): vfio-pci -> nvme

Selecting NVMe devices for xNVMe's SPDK backend requires specifying the options filename (PCIe address) and xnvme_dev_nsid (namespace). Because fio splits filenames on ":", the colons in the PCIe address must be escaped. On the command line this requires two backslash characters, since the shell consumes one of them; inside a job file a single backslash suffices.
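Concretely, for the device at PCIe address 0000:24:00.0 used below, the two forms look like this:

# on the command line (the shell consumes one backslash from each pair)
fio jobfile.fio --ioengine=xnvme --filename=0000\\:24\\:00.0 --xnvme_dev_nsid=1

# in a job file
filename=0000\:24\:00.0
xnvme_dev_nsid=1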

Example

Let us compare the xNVMe ioengine using the SPDK backend against the SPDK external fio ioengine from the SPDK repository, using the job file xnvme-spdk.fio below.

[test]
thread=1
time_based
size=500G
runtime=1m
numjobs=1
rw=randread
bs=4k
iodepth=1
cpus_allowed=1
norandommap

To use the SPDK external ioengine, follow the detailed guide at spdk_fio_plugin. However, note that loading the external SPDK ioengine into a fio binary built with the xNVMe ioengine fails with the circular dependency error message below.

 fio: IFUNC symbol 'spdk_u64log2' referenced in '../spdk/build/fio/spdk_nvme' is defined in the executable and creates an unsatisfiable circular dependency.

To avoid this, build SPDK and its accompanying external fio ioengine with a fio build that does not include the xNVMe ioengine by adding the --disable-xnvme option when running fio's configure script as in the following:

./configure --disable-xnvme

Here is the output with the SPDK external ioengine and the xnvme ioengine using the SPDK backend:

SPDK external ioengine
$ LD_PRELOAD=../spdk/build/fio/spdk_nvme fio xnvme-spdk.fio --ioengine=spdk --filename="trtype=PCIe traddr=0000.24.00.0 ns=1"
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=spdk, iodepth=1
fio-3.32
Starting 1 thread
TELEMETRY: No legacy callbacks, legacy socket not created
Jobs: 1 (f=0): [f(1)][100.0%][r=661MiB/s][r=169k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=374623: Thu Oct  6 05:10:52 2022
  read: IOPS=170k, BW=665MiB/s (697MB/s)(39.0GiB/60000msec)
    slat (nsec): min=69, max=4140, avg=76.87, stdev= 6.57
    clat (nsec): min=5090, max=23010, avg=5521.69, stdev=177.06
     lat (nsec): min=5170, max=23090, avg=5598.56, stdev=177.11
    clat percentiles (nsec):
     |  1.00th=[ 5472],  5.00th=[ 5472], 10.00th=[ 5472], 20.00th=[ 5472],
     | 30.00th=[ 5472], 40.00th=[ 5472], 50.00th=[ 5536], 60.00th=[ 5536],
     | 70.00th=[ 5536], 80.00th=[ 5536], 90.00th=[ 5536], 95.00th=[ 5600],
     | 99.00th=[ 5792], 99.50th=[ 6048], 99.90th=[ 8640], 99.95th=[ 9664],
     | 99.99th=[10944]
   bw (  KiB/s): min=677232, max=682512, per=100.00%, avg=680885.71, stdev=2277.69, samples=119
   iops        : min=169308, max=170628, avg=170221.43, stdev=569.42, samples=119
  lat (usec)   : 10=99.97%, 20=0.03%, 50=0.01%
  cpu          : usr=100.00%, sys=0.00%, ctx=237, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10212059,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=665MiB/s (697MB/s), 665MiB/s-665MiB/s (697MB/s-697MB/s), io=39.0GiB (41.8GB), run=60000-60000msec
SPDK backend (xNVMe ioengine)
$ fio xnvme-spdk.fio --ioengine=xnvme --filename=0000\\:24\\:00.0 --xnvme_dev_nsid=1
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=676MiB/s][r=173k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=359864: Thu Oct  6 05:03:12 2022
  read: IOPS=173k, BW=674MiB/s (707MB/s)(39.5GiB/60000msec)
    slat (nsec): min=59, max=6170, avg=74.41, stdev= 9.60
    clat (nsec): min=230, max=23570, avg=5427.06, stdev=193.68
     lat (nsec): min=5040, max=23640, avg=5501.47, stdev=193.65
    clat percentiles (nsec):
     |  1.00th=[ 5280],  5.00th=[ 5344], 10.00th=[ 5344], 20.00th=[ 5344],
     | 30.00th=[ 5344], 40.00th=[ 5344], 50.00th=[ 5408], 60.00th=[ 5408],
     | 70.00th=[ 5408], 80.00th=[ 5408], 90.00th=[ 5664], 95.00th=[ 5664],
     | 99.00th=[ 5728], 99.50th=[ 5984], 99.90th=[ 7840], 99.95th=[ 9280],
     | 99.99th=[10560]
   bw (  KiB/s): min=685056, max=692544, per=100.00%, avg=690275.50, stdev=2826.15, samples=119
   iops        : min=171264, max=173136, avg=172568.87, stdev=706.54, samples=119
  lat (nsec)   : 250=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.98%, 20=0.02%, 50=0.01%
  cpu          : usr=100.00%, sys=0.00%, ctx=220, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10353900,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=674MiB/s (707MB/s), 674MiB/s-674MiB/s (707MB/s-707MB/s), io=39.5GiB (42.4GB), run=60000-60000msec
                         SPDK external ioengine  xNVMe ioengine w/SPDK backend
Mean total latency (ns)  5599                    5501
IOPS                     170K                    173K
CPU usr                  100%                    100%
CPU sys                  0%                      0%
Context switches         237                     220

The average latencies are similar at 5599 ns for the SPDK external ioengine and 5501 ns for the xNVMe ioengine with the SPDK backend. For this workload xNVMe with the SPDK backend has marginally higher IOPS than the SPDK external ioengine (173K vs 170K). The CPU utilization and context switch statistics for both cases are similar.

Part 3

This concludes Part 2 of our fio xNVMe ioengine blog post series. In Part 3 we plan to cover topics such as NVMeoF and Windows support.

Notes

  • Fio's libaio ioengine and xNVMe's libaio backend use a system call to wait for a minimum number of I/Os to complete, with the maximum wait time specified by a timespec argument. The libaio ioengine and xNVMe's libaio backend respectively provide the userspace_reap and hipri flags to actively poll for completions instead. This provides higher IOPS and lower latencies at the cost of higher CPU utilization. Interested readers can find a comparison of the libaio ioengine with userspace_reap and xNVMe's libaio backend with hipri here.
  • For the xNVMe ioengine with the io_uring backend we also measured performance with both the hipri and sqthread_poll options enabled. This combination produced unexpectedly poor performance which is under investigation.
  • For more background information on topics discussed here see these references: