
std: add io_uring library #6356

Merged
merged 50 commits into from Oct 29, 2020

Conversation

@jorangreef (Sponsor Contributor)

This brings io_uring helper methods to Zig for kernels >= 5.4.

We follow liburing's design decisions so that anyone who is comfortable with
liburing (https://unixism.net/loti/ref-liburing/index.html) will feel at home.

Thanks to @daurnimator for the first draft.

Refs: #3083
Signed-off-by: Joran Dirk Greef <joran@coil.com>

@jorangreef (Sponsor Contributor, Author)

This is my first time coding in Zig, and it's been great. Would appreciate as many eyes on this as possible.

@jorangreef (Sponsor Contributor, Author)

Code coverage should be close to 100%, with the exception of queue_accept().

@jorangreef (Sponsor Contributor, Author)

jorangreef commented Sep 16, 2020

The build is failing at linux.io_uring_setup() with errno=1, i.e. EPERM. This is probably a Docker seccomp issue where the relevant syscalls need to be whitelisted.

@Rocknest (Contributor) left a comment


I'm pretty sure that access through unnamed unions can be solved with compile-time reflection or, if that's not enough, with a builtin. By the way, C-style x->y can be translated into Zig simply as x.y instead of x.*.y.

@jorangreef (Sponsor Contributor, Author)

By the way c style x->y can be translated into Zig simply as x.y instead of x.*.y

Thanks for the tip @Rocknest, done!

Another by the way, both x->y and x.y always bugged me in C. Awesome to see Zig solves it.
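To illustrate the tip: in Zig, field access through a pointer uses the same dot syntax as direct access, so no separate `->` operator is needed. A minimal sketch (the `Point` type here is hypothetical, purely for illustration):

```zig
const std = @import("std");

const Point = struct { y: i32 };

fn demo() void {
    var p = Point{ .y = 42 };
    const x: *Point = &p; // in C this binding would be a Point*
    // C's x->y (explicitly x.*.y in Zig) is written simply as x.y:
    std.debug.assert(x.y == 42);
    x.y += 1; // works for stores through the pointer too
    std.debug.assert(p.y == 43);
}
```

Zig resolves the one level of pointer indirection automatically, which is why the same `x.y` syntax works whether `x` is a struct or a pointer to one.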

@andrewrk (Member)

This looks very useful, but also looks like a pretty apt candidate for being a third party package. It's quite clean, only depending on the std lib, and the std lib has no dependencies on it. (The addition of IORING_SQ_CQ_OVERFLOW alone would be merged immediately.) I didn't look too closely, but I get the impression that it makes some implementation decisions on behalf of the API user, which makes it both (1) useful and (2) a good candidate for being a third party package. For example, I think when we rework the std lib event loop implementation to additionally support io_uring, it will likely duplicate parts of this code rather than using it strictly as an API user.

Any objections to maintaining this outside the std lib?

@FireFox317 (Contributor)

Well, but if you consider that the Linux kernel is moving a lot of its syscalls to this new io_uring interface, then it should be part of the Zig std, since this is also the case for the current syscall interface.

@andrewrk (Member)

Can you explain that a little bit more? We already have the syscalls in the zig std lib:

zig/lib/std/os/linux.zig

Lines 1194 to 1204 in 3672a18

pub fn io_uring_setup(entries: u32, p: *io_uring_params) usize {
    return syscall2(.io_uring_setup, entries, @ptrToInt(p));
}

pub fn io_uring_enter(fd: i32, to_submit: u32, min_complete: u32, flags: u32, sig: ?*sigset_t) usize {
    return syscall6(.io_uring_enter, @bitCast(usize, @as(isize, fd)), to_submit, min_complete, flags, @ptrToInt(sig), NSIG / 8);
}

pub fn io_uring_register(fd: i32, opcode: IORING_REGISTER, arg: ?*const c_void, nr_args: u32) usize {
    return syscall4(.io_uring_register, @bitCast(usize, @as(isize, fd)), @enumToInt(opcode), @ptrToInt(arg), nr_args);
}

@andrewrk (Member)

oh I see, you're saying that newer syscalls are being added that are exposed only via io_uring. So a convenient way to call those syscalls would be needed.

@FireFox317 (Contributor)

FireFox317 commented Sep 17, 2020

Yeah, so basically it allows you to put syscalls like read, write, etc. into the io_uring queue; the kernel then processes this queue and executes the syscalls while the process can do other work (asynchronous syscalls, basically). There is an article regarding this; @daurnimator might know what I'm referring to.

Edit: I think it is this link: https://lwn.net/Articles/810414/

@jorangreef (Sponsor Contributor, Author)

jorangreef commented Sep 18, 2020

It's quite clean

Thanks @andrewrk!

I didn't look too closely, but I get the impression that it makes some implementation decisions on behalf of the API user, which makes it both (1) useful and (2) good candidate for being a third party package.

No, in fact, the idea was to follow the interface and implementation decisions taken by liburing. liburing is not just any third party package but the de facto userland implementation of io_uring maintained by Jens Axboe, also serving as the test suite for the kernel.

This PR contains only the core of what you would need to use io_uring safely, with correct memory barriers and consideration for SQ and CQ overflow and different poll modes, but without exposing the entire surface area of liburing. This is the bare minimum. The io_uring syscalls are not enough.

I literally worked through the kernel source and liburing's source full time over three weeks, so I don't think we make any implementation decisions beyond liburing, except for copy_cqes(), which has an open issue in liburing already and which I plan to submit there. But if you look closely and think we do, please let me know and I can always unravel them!

For example, I think when we rework the std lib event loop implementation to additionally support io_uring, it will likely duplicate parts of this code rather than using it strictly as an API user.

Again, this is almost exactly what the event loop would need and nothing more. For example, with this, you could drop the ugly std lib code needed for the I/O threadpool on linux and make single-threaded mode event loops non-blocking to fix #1908, also solving #5962.

Any objections to maintaining this outside the std lib?

No, but io_uring is the future of I/O in linux. I believe it makes sense to have a first-class io_uring implementation in the std lib, and furthermore something that follows liburing's design decisions.

@Rocknest (Contributor)

@jorangreef make every field in the struct a union. That would be future-proof. Be creative instead of waiting for some feature to be added to Zig so you can do it the C way.

@jorangreef (Sponsor Contributor, Author)

@Rocknest the new pattern does not require #985.

@Rocknest (Contributor)

It does not, but it is ugly, and it does not accomplish what you claim it can.

Ensures that the wakeup flag is read after the tail pointer has been
written. It's important to use memory load acquire semantics for the
flags read, otherwise the application and the kernel might not agree on
the consistency of the wakeup flag, leading to I/O starvation.

Refs: axboe/liburing@6768ddc
Refs: axboe/liburing#219
Decouples SQE queueing and SQE prepping methods to allow for non-sequential
SQE allocation schemes as suggested by @daurnimator.

Adds essential SQE prepping methods from liburing to reduce boilerplate.

Removes non-essential .link_with_next_sqe() and .use_registered_fd().
Removes non-essential .hardlink_with_next_sqe() and .drain_previous_sqes().
@andrewrk (Member)

andrewrk commented Oct 4, 2020

I'm looking forward to reviewing this within a couple days, now that I finished the big branch I was focusing on :-)

@jorangreef (Sponsor Contributor, Author)

For anyone interested in how this performs, the Coil team put together a range of file system and networking benchmarks, comparing syscalls through io_uring with blocking syscalls or epoll, and specifically benchmarking various Zig implementations as well as C contenders:

https://github.com/coilhq/tiger-beetle/tree/master/demos/io_uring

Some highlights:

  • 2x write/fsync/read syscall throughput for sector-sized IOPs
  • an improvement over epoll for networking when the server is under load

Thanks to @MasterQ32 for writing the blocking networking echo server candidate.

Please take these benchmarks with a pinch of salt and let us know what can be improved!

@jorangreef (Sponsor Contributor, Author)

This now supports the io_uring syscall equivalents of everything required by zig/lib/std/event/loop.zig:

open
openat
close
read
readv
pread
preadv
write
writev
pwritev

With the exception of faccessat since I think that is not yet available in the kernel for io_uring.

But we also go further with networking syscalls that no longer need epoll thanks to IORING_FEAT_FAST_POLL:

accept
connect
send
recv

And tests for everything.
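To give a feel for the liburing-style flow described above — prep an SQE, submit, then reap the CQE — here is a hedged usage sketch. The method names (`nop`, `submit`, `copy_cqe`) are assumed from the PR's liburing-following conventions; the exact signatures may differ, so consult lib/std/os/linux/io_uring.zig:

```zig
const std = @import("std");
const IO_Uring = std.os.linux.IO_Uring;

pub fn main() !void {
    // 8 submission queue entries, no setup flags:
    var ring = try IO_Uring.init(8, 0);
    defer ring.deinit();

    // Queue a no-op SQE, tagging it with user_data so the completion
    // can be matched back to this submission:
    _ = try ring.nop(0xaaaaaaaa);

    // Tell the kernel about the pending SQE(s):
    _ = try ring.submit();

    // Reap the completion and check it is ours:
    const cqe = try ring.copy_cqe();
    std.debug.assert(cqe.user_data == 0xaaaaaaaa);
}
```

The same prep/submit/reap shape applies to the real operations listed above (read, write, accept, ...), with the CQE's res field carrying the syscall's return value.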

pub fn cq_advance(self: *IO_Uring, count: u32) void {
    if (count > 0) {
        // Ensure the kernel only sees the new head value after the CQEs have been read.
        @atomicStore(u32, self.cq.head, self.cq.head.* +% count, .Release);
    }
}
Collaborator

Might want an atomic rmw here?

@jorangreef (Sponsor Contributor, Author)

I believe this is already exactly what @axboe does in liburing? Could you explain why you would want something else? If so, then that's probably a bug in liburing.

@jorangreef (Sponsor Contributor, Author)

To understand why we use @atomicStore here:

The cq.head pointer is owned by the application, not the kernel.

The way this works is that:

  1. the kernel only ever pushes to the end of the CQ ring by incrementing the cq.tail pointer, which is owned by the kernel, and
  2. the application only ever shifts from the front of the CQ ring by incrementing the cq.head pointer, again owned by the application.

Thus, the kernel only reads cq.head (and never writes), and the application only reads cq.tail (and never writes). It's symmetric, and the same logic is true for the SQ ring, but inverted.

This means that the application is free to read and then increment cq.head here anytime without an atomic read/modify/write, since the application is the only process that will write to cq.head when it shifts from the queue.

The reason we then use @atomicStore here is that the CPU can reorder memory accesses, i.e. the kernel might read the newly written cq.head and then overwrite CQEs whose memory we are still reading.

What we are saying is that the kernel should only see the store to cq.head after the CQEs involved have been read.
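The single-writer protocol described above can be sketched as a generic single-producer/single-consumer ring, independent of io_uring (all names here are hypothetical, for illustration only): the consumer pairs an acquire load of the producer's tail with a release store of its own head, which is exactly the pairing cq_advance relies on.

```zig
// Consumer side of a hypothetical SPSC ring. The application plays this
// role for the CQ ring; the kernel is the producer.
var head: u32 = 0; // written only by the consumer
var tail: u32 = 0; // written only by the producer
var slots: [4]u32 = undefined;

fn consume() ?u32 {
    const h = head; // plain read: we are the only writer of head
    // Acquire: ensure we see the slot contents the producer published
    // before it advanced tail.
    const t = @atomicLoad(u32, &tail, .Acquire);
    if (h == t) return null; // ring is empty
    const item = slots[h % slots.len];
    // Release: the producer may only reuse this slot after observing the
    // new head, i.e. after our read of slots[...] has completed.
    @atomicStore(u32, &head, h +% 1, .Release);
    return item;
}
```

No read-modify-write is needed because each index has exactly one writer; the atomics are there purely for ordering and visibility, not for mutual exclusion.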

If an older kernel fails the `openat` test because of `AT_FDCWD`
then we don't want to skip the `close` test.
@asafyish

asafyish commented Oct 7, 2020

This is super important for writing a web server that can take 1st place in techempower benchmarks.

@andrewrk (Member)

Thanks @jorangreef and everyone who helped review. My goal is to do whatever fixups are needed to this today and get it merged into master branch.

@andrewrk (Member)

Nice, this is already mergeable. Great work everyone.

@andrewrk andrewrk merged commit a41c0b6 into ziglang:master Oct 29, 2020
@jorangreef (Sponsor Contributor, Author)

Thanks @andrewrk. Awesome.

jorangreef added a commit to jorangreef/zig that referenced this pull request Oct 30, 2020
As per: lib/libc/musl/arch/mips/bits/syscall.h.in

...and as promised: ziglang#6356 (comment)

Thanks @daurnimator again for the help with ziglang#6356.
7 participants