
std: add io_uring library #6356

Merged
merged 50 commits into from Oct 29, 2020

Conversation

@jorangreef (Sponsor Contributor)

This brings io_uring helper methods to Zig for kernels >= 5.4.

We follow liburing's design decisions so that anyone who is comfortable with
liburing (https://unixism.net/loti/ref-liburing/index.html) will feel at home.

Thanks to @daurnimator for the first draft.

Refs: #3083
Signed-off-by: Joran Dirk Greef <joran@coil.com>

@jorangreef (Sponsor Contributor, Author)

This is my first time coding in Zig, and it's been great. Would appreciate as many eyes on this as possible.

@jorangreef (Sponsor Contributor, Author)

Code coverage should be close to 100%, with the exception of queue_accept().

@jorangreef (Sponsor Contributor, Author)

jorangreef commented Sep 16, 2020

The build is failing at linux.io_uring_setup() with errno=1, i.e. EPERM. This is probably a Docker seccomp issue where the relevant syscalls need to be whitelisted.

@Rocknest (Contributor) left a comment


I'm pretty sure that access through unnamed unions can be solved with compile-time reflection or, if that's not enough, with a builtin. By the way, C-style x->y can be translated into Zig simply as x.y instead of x.*.y.

@jorangreef (Sponsor Contributor, Author)

By the way c style x->y can be translated into Zig simply as x.y instead of x.*.y

Thanks for the tip @Rocknest, done!

Another by the way, both x->y and x.y always bugged me in C. Awesome to see Zig solves it.
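To illustrate the tip: in Zig, field access through a pointer uses the same dot syntax as direct access, so no separate `->` operator is needed. A minimal sketch (the `Point` type here is hypothetical, purely for illustration):

```zig
const std = @import("std");

const Point = struct { y: i32 };

fn demo() void {
    var p = Point{ .y = 42 };
    const x: *Point = &p; // in C this binding would be a Point*
    // C's x->y (explicitly x.*.y in Zig) is written simply as x.y:
    std.debug.assert(x.y == 42);
    x.y += 1; // works for stores through the pointer too
    std.debug.assert(p.y == 43);
}
```

Zig resolves the one level of pointer indirection automatically, which is why the same `x.y` syntax works whether `x` is a struct or a pointer to one.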

@andrewrk (Member)

This looks very useful, but also looks like a pretty apt candidate for being a third party package. It's quite clean, only depending on the std lib, and the std lib has no dependencies on it. (The addition of IORING_SQ_CQ_OVERFLOW alone would be merged immediately.) I didn't look too closely, but I get the impression that it makes some implementation decisions on behalf of the API user, which makes it both (1) useful and (2) a good candidate for being a third party package. For example, I think when we rework the std lib event loop implementation to additionally support io_uring, it will likely duplicate parts of this code rather than using it strictly as an API user.

Any objections to maintaining this outside the std lib?

@FireFox317 (Contributor)

Well, but if you consider that the Linux kernel is moving a lot of its syscalls to this new io_uring interface, then it should be part of the Zig std, since this is also the case for the current syscall interface.

@andrewrk (Member)

Can you explain that a little bit more? We already have the syscalls in the zig std lib:

zig/lib/std/os/linux.zig

Lines 1194 to 1204 in 3672a18

pub fn io_uring_setup(entries: u32, p: *io_uring_params) usize {
    return syscall2(.io_uring_setup, entries, @ptrToInt(p));
}

pub fn io_uring_enter(fd: i32, to_submit: u32, min_complete: u32, flags: u32, sig: ?*sigset_t) usize {
    return syscall6(.io_uring_enter, @bitCast(usize, @as(isize, fd)), to_submit, min_complete, flags, @ptrToInt(sig), NSIG / 8);
}

pub fn io_uring_register(fd: i32, opcode: IORING_REGISTER, arg: ?*const c_void, nr_args: u32) usize {
    return syscall4(.io_uring_register, @bitCast(usize, @as(isize, fd)), @enumToInt(opcode), @ptrToInt(arg), nr_args);
}

@andrewrk (Member)

oh I see, you're saying that newer syscalls are being added that are exposed only via io_uring. So a convenient way to call those syscalls would be needed.

@FireFox317 (Contributor)

FireFox317 commented Sep 17, 2020

Yeah, so basically it allows you to put syscalls like read, write, etc. into the io_uring queue; the kernel then processes this queue and executes the syscalls while the process can do other work (asynchronous syscalls, basically). There is an article regarding this; @daurnimator might know what I'm referring to.

Edit: I think it is this link: https://lwn.net/Articles/810414/

@jorangreef (Sponsor Contributor, Author)

jorangreef commented Sep 18, 2020

It's quite clean

Thanks @andrewrk!

I didn't look too closely, but I get the impression that it makes some implementation decisions on behalf of the API user, which makes it both (1) useful and (2) good candidate for being a third party package.

No, in fact, the idea was to follow the interface and implementation decisions taken by liburing. liburing is not just any third party package but the de facto userland implementation of io_uring maintained by Jens Axboe, also serving as the test suite for the kernel.

This PR contains only the core of what you would need to use io_uring safely, with correct memory barriers and consideration for SQ and CQ overflow and different poll modes, but without exposing the entire surface area of liburing. This is the bare minimum. The io_uring syscalls are not enough.

I literally worked through the kernel source and liburing's source full time over three weeks, so I don't think we make any implementation decisions beyond liburing, except for copy_cqes(), which has an open issue in liburing already and which I plan to submit there. But if you look closely and think we do, please let me know and I can always unravel them!

For example, I think when we rework the std lib event loop implementation to additionally support io_uring, it will likely duplicate parts of this code rather than using it strictly as an API user.

Again, this is almost exactly what the event loop would need and nothing more. For example, with this, you could drop the ugly std lib code needed for the I/O threadpool on linux and make single-threaded mode event loops non-blocking to fix #1908, also solving #5962.

Any objections to maintaining this outside the std lib?

No, but io_uring is the future of I/O in linux. I believe it makes sense to have a first-class io_uring implementation in the std lib, and furthermore something that follows liburing's design decisions.

@Rocknest (Contributor)

@jorangreef make every field in the struct a union. That would be future-proof. Be creative instead of waiting for some feature to be added to Zig so you can do it the C way.

@jorangreef (Sponsor Contributor, Author)

@Rocknest the new pattern does not require #985.

@Rocknest (Contributor)

It does not, but it is ugly, and it does not accomplish what you claim it can.

Ensures that the wakeup flag is read after the tail pointer has been
written. It's important to use memory load acquire semantics for the
flags read, otherwise the application and the kernel might not agree on
the consistency of the wakeup flag, leading to I/O starvation.

Refs: axboe/liburing@6768ddc
Refs: axboe/liburing#219
Decouples SQE queueing and SQE prepping methods to allow for non-sequential
SQE allocation schemes as suggested by @daurnimator.

Adds essential SQE prepping methods from liburing to reduce boilerplate.

Removes non-essential .link_with_next_sqe() and .use_registered_fd().
Removes non-essential .hardlink_with_next_sqe() and .drain_previous_sqes().
@andrewrk (Member)

andrewrk commented Oct 4, 2020

I'm looking forward to reviewing this within a couple days, now that I finished the big branch I was focusing on :-)

@jorangreef (Sponsor Contributor, Author)

For anyone interested in how this performs, the Coil team put together a range of file system and networking benchmarks, comparing syscalls through io_uring with blocking syscalls or epoll, and specifically benchmarking various Zig implementations as well as C contenders:

https://github.com/coilhq/tiger-beetle/tree/master/demos/io_uring

Some highlights:

  • 2x write/fsync/read syscall throughput for sector-sized IOPs
  • an improvement over epoll for networking when the server is under load

Thanks to @MasterQ32 for writing the blocking networking echo server candidate.

Please take these benchmarks with a pinch of salt and let us know what can be improved!

@jorangreef (Sponsor Contributor, Author)

This now supports the io_uring syscall equivalents of everything required by zig/lib/std/event/loop.zig:

open
openat
close
read
readv
pread
preadv
write
writev
pwritev

With the exception of faccessat since I think that is not yet available in the kernel for io_uring.

But we also go further with networking syscalls that no longer need epoll thanks to IORING_FEAT_FAST_POLL:

accept
connect
send
recv

And tests for everything.
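To give a feel for the liburing-style flow described above — prep an SQE, submit, then reap the CQE — here is a hedged usage sketch. The method names (`nop`, `submit`, `copy_cqe`) are assumed from the PR's liburing-following conventions; the exact signatures may differ, so consult lib/std/os/linux/io_uring.zig:

```zig
const std = @import("std");
const IO_Uring = std.os.linux.IO_Uring;

pub fn main() !void {
    // 8 submission queue entries, no setup flags:
    var ring = try IO_Uring.init(8, 0);
    defer ring.deinit();

    // Queue a no-op SQE, tagging it with user_data so the completion
    // can be matched back to this submission:
    _ = try ring.nop(0xaaaaaaaa);

    // Tell the kernel about the pending SQE(s):
    _ = try ring.submit();

    // Reap the completion and check it is ours:
    const cqe = try ring.copy_cqe();
    std.debug.assert(cqe.user_data == 0xaaaaaaaa);
}
```

The same prep/submit/reap shape applies to the real operations listed above (read, write, accept, ...), with the CQE's res field carrying the syscall's return value.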

pub fn cq_advance(self: *IO_Uring, count: u32) void {
    if (count > 0) {
        // Ensure the kernel only sees the new head value after the CQEs have been read.
        @atomicStore(u32, self.cq.head, self.cq.head.* +% count, .Release);
    }
}
Collaborator

Might want an atomic rmw here?

@jorangreef (Sponsor Contributor, Author)

I believe this is already exactly what @axboe does in liburing? Could you explain why you would want something else? If so, then that's probably a bug in liburing.

@jorangreef (Sponsor Contributor, Author)

To understand why we use @atomicStore here:

The cq.head pointer is owned by the application, not the kernel.

The way this works is that:

  1. the kernel only ever pushes to the end of the CQ ring by incrementing the cq.tail pointer, which is owned by the kernel, and
  2. the application only ever shifts from the front of the CQ ring by incrementing the cq.head pointer, again owned by the application.

Thus, the kernel only reads cq.head (and never writes), and the application only reads cq.tail (and never writes). It's symmetric, and the same logic is true for the SQ ring, but inverted.

This means that the application is free to read and then increment cq.head here anytime without an atomic read/modify/write, since the application is the only process that will write to cq.head when it shifts from the queue.

The reason we then use @atomicStore here is that the CPU can reorder memory accesses, i.e. the kernel might read the newly written cq.head and then overwrite CQEs whose memory we are still reading.

What we are saying is that the kernel should only see the store to cq.head after the CQEs involved have been read.
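The single-writer protocol described above can be sketched as a generic single-producer/single-consumer ring, independent of io_uring (all names here are hypothetical, for illustration only): the consumer pairs an acquire load of the producer's tail with a release store of its own head, which is exactly the pairing cq_advance relies on.

```zig
// Consumer side of a hypothetical SPSC ring. The application plays this
// role for the CQ ring; the kernel is the producer.
var head: u32 = 0; // written only by the consumer
var tail: u32 = 0; // written only by the producer
var slots: [4]u32 = undefined;

fn consume() ?u32 {
    const h = head; // plain read: we are the only writer of head
    // Acquire: ensure we see the slot contents the producer published
    // before it advanced tail.
    const t = @atomicLoad(u32, &tail, .Acquire);
    if (h == t) return null; // ring is empty
    const item = slots[h % slots.len];
    // Release: the producer may only reuse this slot after observing the
    // new head, i.e. after our read of slots[...] has completed.
    @atomicStore(u32, &head, h +% 1, .Release);
    return item;
}
```

No read-modify-write is needed because each index has exactly one writer; the atomics are there purely for ordering and visibility, not for mutual exclusion.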

If an older kernel fails the `openat` test because of `AT_FDCWD`
then we don't want to skip the `close` test.
@asafyish

asafyish commented Oct 7, 2020

This is super important for writing a web server that can take 1st place in techempower benchmarks.

@andrewrk (Member)

Thanks @jorangreef and everyone who helped review. My goal is to do whatever fixups are needed to this today and get it merged into master branch.

@andrewrk (Member)

Nice, this is already mergeable. Great work everyone.

@andrewrk andrewrk merged commit a41c0b6 into ziglang:master Oct 29, 2020
@jorangreef (Sponsor Contributor, Author)

Thanks @andrewrk. Awesome.

jorangreef added a commit to jorangreef/zig that referenced this pull request Oct 30, 2020
As per: lib/libc/musl/arch/mips/bits/syscall.h.in

...and as promised: ziglang#6356 (comment)

Thanks @daurnimator again for the help with ziglang#6356.
7 participants