Use eventfd for wakeup and enable io_uring performance flags #166
Force-pushed from 07f1949 to dc2593b.
Benchmark results (Linux 6.19, Ruby 3.4.8, dedicated machine). Methodology: two git worktrees built independently.

Wakeup microbenchmark:
| Backend | main | PR | Δ |
|---|---|---|---|
| URing | 9.32 µs | 10.21 µs | +9% |
| EPoll | 11.19 µs | 10.84 µs | noise |
| Select | 21.06 µs | 18.36 µs | noise |
Idle wakeup cost (calling wakeup() when selector is not blocking):
| Backend | main | PR |
|---|---|---|
| URing | 1.54 µs | 1.53 µs |
The ~1 µs roundtrip gap vs NOP is the irreducible cost of the eventfd write going through the kernel and completing a pending async read, rather than a NOP arriving at io_uring_wait_cqe_timeout directly. DEFER_TASKRUN closes most of this gap (was ~2.6 µs without it).
HTTP benchmark (`benchmark/server/event.rb`, fiber-per-connection):
| Run | main (NOP) | PR | Δ |
|---|---|---|---|
| Run 1 | 14,854 | 14,821 | |
| Run 2 | 14,085 | 14,592 | |
| Run 3 | 14,171 | 14,240 | |
| Run 4 | 13,774 | 14,768 | |
| Run 5 | 14,057 | 14,544 | |
| Run 6 | 14,717 | 14,627 | |
| Run 7 | 14,965 | 14,566 | |
| Average | 14,375 req/s | 14,594 req/s | +1.5% |
The DEFER_TASKRUN flag benefits the entire completion path, not just wakeup — which is why the HTTP throughput improves even though the pure wakeup latency is similar.
Force-pushed from 8e7e932 to c9ba733.
```c
// io_uring_enter with IORING_ENTER_GETEVENTS (without blocking) to flush that
// work into the CQ so the non-blocking select_process_completions below sees them.
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
	io_uring_get_events(&selector->ring);
}
```
I don't think you need to do this; it should already be done for you by `io_uring_wait_cqe_timeout`.
If we do need to do this, it's worth trying `IORING_SETUP_TASKRUN_FLAG` so you don't need this unconditional syscall. But I don't think we need this.
The `io_uring_get_events` call here is at the top of `select()` — before any blocking wait — to flush deferred completions for the non-blocking `select_process_completions` peek that runs next:

```c
io_uring_submit_flush(selector);

#ifdef IORING_SETUP_DEFER_TASKRUN
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
	io_uring_get_events(&selector->ring); // <- here
}
#endif

int ready = IO_Event_Selector_ready_flush(&selector->backend);
int result = select_process_completions(selector); // non-blocking peek

if (!ready && !result && !selector->backend.ready) {
	// ... only now do we optionally enter the blocking io_uring_wait_cqe_timeout ...
}
```

`io_uring_wait_cqe_timeout` does flush deferred work, but only when we actually go into the blocking wait. The non-blocking peek above runs first, and with `DEFER_TASKRUN` it would see an empty CQ even though completions have happened — fibers would stall until the next blocking wait forced the kernel into `GETEVENTS`. The upfront `io_uring_get_events` is what makes those completions visible to the peek path.
That said, you're right that an unconditional `io_uring_enter(GETEVENTS)` syscall every `select()` iteration is wasteful when nothing is actually deferred. Going to take your second suggestion and gate on `IORING_SETUP_TASKRUN_FLAG` so we only call into the kernel when `IORING_SQ_TASKRUN` is set — will follow up with a benchmark on hana once it lands.
Done in 9c85783. Enabled `IORING_SETUP_TASKRUN_FLAG` alongside `DEFER_TASKRUN` and gated the `io_uring_get_events` call on `*ring.sq.kflags & IORING_SQ_TASKRUN` (relaxed atomic load — no syscall).
Benchmark on hana (Linux 6.19.11, Ruby 3.4.9, dedicated):
| Workload | Baseline (48e1437) | With TASKRUN_FLAG (9c85783) | Δ |
|---|---|---|---|
| `select(0)` with no pending work (200k tight loop, n=5) | ~570 ns/call (1.75M/s) | ~58 ns/call (17M/s) | −90% / ~10× |
| Cross-thread wakeup roundtrip (n=12) | 10.75 µs ± 0.44 | 10.66 µs ± 0.74 | within noise |
| Idle wakeup (n=12) | 1.61 µs ± 0.19 | 1.55 µs ± 0.11 | within noise |
`strace -e io_uring_enter` against 10 `selector.select(0)` calls with nothing pending:

| Before | After |
|---|---|
| `io_uring_enter(IORING_ENTER_GETEVENTS)` × 10 | 0 |
The wakeup microbenchmark doesn't move because every iteration has a deferred eventfd-read completion in flight — IORING_SQ_TASKRUN is set, so we still do the (necessary) get_events. The big win is for select() iterations that genuinely have nothing to flush, which the tight-loop microbenchmark isolates.
Force-pushed from 261821b to 0e6f2fe.
… syscall.

When `DEFER_TASKRUN` is active, the kernel holds completions as deferred task work and the CQ stays empty until something calls into io_uring with `GETEVENTS`. We were calling `io_uring_get_events()` unconditionally at the top of `select()` to flush deferred work into the CQ before the non-blocking peek — a real `io_uring_enter` syscall every iteration, regardless of whether anything was actually deferred.

`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+, always available alongside `DEFER_TASKRUN`) asks the kernel to set `IORING_SQ_TASKRUN` in the SQ flags whenever task work is pending. A relaxed atomic load is enough to check, so we can skip the `get_events` syscall when nothing is deferred.

Suggested by @tavianator in #166.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the cross-thread NOP-SQE wakeup with an async read on an `IO_Event_Interrupt` (eventfd on Linux, pipe elsewhere) so the waking thread never touches the ring's submission queue. This unlocks two kernel performance flags:

- `IORING_SETUP_SINGLE_ISSUER` (kernel 6.0+): tells the kernel only the owner thread will ever submit SQEs, letting it skip internal SQ locking on every submit.
- `IORING_SETUP_DEFER_TASKRUN` (kernel 6.1+, requires `SINGLE_ISSUER`): defers io_uring task work to the application thread rather than firing `TWA_SIGNAL` and running it at arbitrary userspace re-entries. Removes the cross-CPU IPI and produces deterministic completion ordering.

Wakeup path:

1. Before each blocking wait the owner thread submits an async read on the interrupt descriptor (`io_uring_cqe_get_data(cqe) == &selector->interrupt` identifies it).
2. `wakeup()` on any thread just calls `IO_Event_Interrupt_signal()` — a plain `write()` that never touches the SQ.
3. The kernel completes the read and `io_uring_wait_cqe_timeout` returns; the read atomically consumed the counter, so there is no separate drain step.

`DEFER_TASKRUN` means the kernel holds completions as deferred task work and the CQ stays empty until something enters io_uring with `GETEVENTS`. To keep the non-blocking peek in `select()` correct, we explicitly call `io_uring_get_events()` at the top of `select()` so deferred completions are flushed into the CQ before `select_process_completions()` runs.

Both flags are guarded by `#ifdef` and degrade gracefully on older kernels / liburing.

Co-authored-by: Cursor <cursoragent@cursor.com>
With `DEFER_TASKRUN` we were calling `io_uring_get_events()` unconditionally at the top of `select()` — a real `io_uring_enter(GETEVENTS)` syscall every iteration, regardless of whether there was anything deferred to flush.

`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+, always available alongside `DEFER_TASKRUN`) asks the kernel to set `IORING_SQ_TASKRUN` in `sq.flags` whenever task work is pending. A relaxed atomic load on the mmap'd flag word is enough to check — no syscall — so we only enter the kernel when there is actually deferred work to flush. Suggested by @tavianator in #166.

Microbenchmark (hana, Linux 6.19.11, Ruby 3.4.9): a tight `select(0)` loop with nothing pending drops from ~570 ns/call to ~58 ns/call (~10×). `strace` over 10 calls: 10 `io_uring_enter(GETEVENTS)` syscalls → 0. The wakeup roundtrip is unchanged because that path always has a deferred eventfd-read completion in flight (so the flag is set and we still flush — correctly).

Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 9c85783 to 7bd340f.
After "Handle short io_uring submissions" (b57476a) moved the `selector->pending += 1` into `io_get_sqe()`, the manual increment in `select_internal_without_gvl` introduced by socketry#166 became a double count. The duplicated count left `pending > 0` after `io_uring_wait_cqe_timeout` drained the SQ internally, sending the next `io_uring_submit_flush` into a busy loop (submit returns 0 because the ring is empty, but `pending > 0` keeps the new while-loop spinning).

Co-authored-by: Cursor <cursoragent@cursor.com>
The current cross-thread `wakeup()` submits a NOP SQE directly into the ring's submission queue from the waking thread. That cross-thread SQ access prevents enabling several kernel performance flags that require single-issuer semantics, and forces a sub-optimal completion path.

This PR rewires `wakeup()` to use an `eventfd` (Linux) / pipe (other Unix) via `IO_Event_Interrupt`, so the waking thread never touches the ring. With that constraint satisfied, three kernel flags become safe to set, each gated by `#ifdef` and degrading gracefully on older kernels / liburing.

Changes

**eventfd-based wakeup**

- Before blocking in `io_uring_wait_cqe_timeout`, the owner thread submits a one-shot async `read` on the interrupt descriptor (tagged with `&selector->interrupt`).
- `wakeup()` — called from any thread — calls `IO_Event_Interrupt_signal()`, a plain `write()` that bumps the eventfd counter. No ring access from the waking thread.
- When `wait_cqe_timeout` returns, the read atomically consumed the counter, so no separate drain step is needed.

**`IORING_SETUP_SINGLE_ISSUER` (kernel 6.0+)**

Tells the kernel only the owner thread will ever submit SQEs, letting it skip internal SQ locking on every submit call.

**`IORING_SETUP_DEFER_TASKRUN` (kernel 6.1+, requires `SINGLE_ISSUER`)**

Defers io_uring task work to the application thread instead of running it via `TWA_SIGNAL` at arbitrary userspace re-entries. Eliminates the cross-CPU IPI and the unpredictable preemption window, and lets all completions drain in a single kernel pass at the next `GETEVENTS` boundary.

Because the CQ stays empty until something explicitly enters io_uring with `GETEVENTS`, `select()` calls `io_uring_get_events()` at the top — before the non-blocking `select_process_completions()` peek — so deferred completions are visible without forcing a blocking wait.

**`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+) — follow-up**

The unconditional `io_uring_get_events()` is a real syscall every `select()` iteration. `TASKRUN_FLAG` asks the kernel to set `IORING_SQ_TASKRUN` in `sq.flags` whenever task work is pending; we read the bit via a relaxed atomic load on the mmap'd flag word and only enter the kernel when there is something to flush. Suggested by @tavianator in review.

**Benchmark CI**

`benchmark.yaml` was missing `liburing-dev`, so benchmarks silently fell back to EPoll. Added the apt step so the URing benchmark job actually exercises URing.
`hana.oriontransfer.co.nz`, Linux 6.19.11, Ruby 3.4.9, dedicated machine.
Wakeup latency (`benchmark/selector_wakeup.rb`):

The ~1 µs roundtrip gap vs the NOP wakeup is the irreducible cost of going through `eventfd` + the kernel completing a pending async read; `DEFER_TASKRUN` closes most of what would otherwise be a ~2.6 µs gap.

TASKRUN_FLAG isolation — tight `select(0)` loop with nothing pending (200k iterations, n=5):

| DEFER_TASKRUN only | With TASKRUN_FLAG |
|---|---|
| ~570 ns/call | ~58 ns/call |

`strace -e io_uring_enter` over 10 such calls: 10 syscalls → 0.

HTTP (`benchmark/server/event.rb`, 8 connections, 2s wrk runs, 7 reps averaged):

The HTTP improvement comes mostly from `DEFER_TASKRUN` (task work no longer interrupts the loop at arbitrary points); the wakeup path itself is slightly slower, but it's amortised across every event.

Testing
The existing `wakeup` correctness tests in `test/io/event/selector.rb` cover the cross-thread wakeup path. CI runs them on Ubuntu with `liburing-dev` installed. The benchmark workflow now also runs against URing so we get a continuous signal.