
Use eventfd for wakeup and enable io_uring performance flags #166

Merged
samuel-williams-shopify merged 2 commits into main from uring-eventfd-wakeup on May 12, 2026

Conversation

@samuel-williams-shopify (Contributor) commented May 8, 2026

The current cross-thread wakeup() submits a NOP SQE directly into the ring's submission queue from the waking thread. That cross-thread SQ access prevents enabling several kernel performance flags that require single-issuer semantics, and forces a sub-optimal completion path.

This PR rewires wakeup() to use an eventfd (Linux) / pipe (other Unix) via IO_Event_Interrupt, so the waking thread never touches the ring. With that constraint satisfied, three kernel flags become safe to set, each gated by #ifdef and degrading gracefully on older kernels / liburing.

Changes

eventfd-based wakeup

  1. Before each blocking io_uring_wait_cqe_timeout, the owner thread submits a one-shot async read on the interrupt descriptor (tagged with &selector->interrupt).
  2. wakeup() — called from any thread — calls IO_Event_Interrupt_signal(), a plain write() that bumps the eventfd counter. No ring access from the waking thread.
  3. The kernel completes the read and wait_cqe_timeout returns; because the read atomically consumed the eventfd counter, no separate drain step is needed.
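
A minimal sketch of the two halves of this flow. All names here (`struct selector`, `interrupt_fd`, `interrupt_arm`, `interrupt_signal`) are illustrative stand-ins, not the PR's actual identifiers; the real code tags the CQE with `&selector->interrupt`:

```c
#include <liburing.h>
#include <stdint.h>
#include <unistd.h>

// Illustrative layout only; the real selector struct in uring.c differs.
struct selector {
    struct io_uring ring;
    int interrupt_fd;          // eventfd on Linux, pipe read end elsewhere
    uint64_t interrupt_buffer; // target of the one-shot async read
};

// Owner thread, before each blocking wait: arm a one-shot read on the
// interrupt descriptor, tagged so select() can recognise its completion.
static void interrupt_arm(struct selector *selector) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&selector->ring);
    // (A full implementation would handle a NULL sqe when the SQ is full.)
    io_uring_prep_read(sqe, selector->interrupt_fd,
        &selector->interrupt_buffer, sizeof(selector->interrupt_buffer), 0);
    io_uring_sqe_set_data(sqe, &selector->interrupt_buffer);
}

// Any thread: a plain write(2) bumps the eventfd counter; the kernel then
// completes the pending read and io_uring_wait_cqe_timeout returns.
static void interrupt_signal(struct selector *selector) {
    uint64_t value = 1;
    ssize_t result = write(selector->interrupt_fd, &value, sizeof(value));
    (void)result; // best-effort wakeup; no ring access from this thread
}
```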

IORING_SETUP_SINGLE_ISSUER (kernel 6.0+)

Tells the kernel only the owner thread will ever submit SQEs, letting it skip internal SQ locking on every submit call.

IORING_SETUP_DEFER_TASKRUN (kernel 6.1+, requires SINGLE_ISSUER)

Defers io_uring task work to the application thread instead of running it via TWA_SIGNAL at arbitrary userspace re-entries. Eliminates the cross-CPU IPI and the unpredictable preemption window, and lets all completions drain in a single kernel pass at the next GETEVENTS boundary.
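
Taken together, the setup amounts to a few gated lines. A sketch, assuming a retry-without-flags fallback (the PR only states the flags are #ifdef-guarded and degrade gracefully on older kernels / liburing):

```c
#include <errno.h>
#include <liburing.h>

// Request the performance flags when the headers define them, and fall
// back to a plain ring if the running kernel rejects them.
static int ring_init(struct io_uring *ring, unsigned entries) {
    unsigned flags = 0;
#ifdef IORING_SETUP_SINGLE_ISSUER
    flags |= IORING_SETUP_SINGLE_ISSUER;  // kernel 6.0+: owner-only submission
#endif
#ifdef IORING_SETUP_DEFER_TASKRUN
    flags |= IORING_SETUP_DEFER_TASKRUN;  // kernel 6.1+: task work at GETEVENTS
#endif
    int result = io_uring_queue_init(entries, ring, flags);
    if (result < 0 && flags) {
        // Older kernel (e.g. -EINVAL): retry without the optional flags.
        result = io_uring_queue_init(entries, ring, 0);
    }
    return result;
}
```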

Because the CQ stays empty until something explicitly enters io_uring with GETEVENTS, select() calls io_uring_get_events() at the top — before the non-blocking select_process_completions() peek — so deferred completions are visible without forcing a blocking wait.

IORING_SETUP_TASKRUN_FLAG (kernel 5.19+) — follow-up

The unconditional io_uring_get_events() is a real syscall every select() iteration. TASKRUN_FLAG asks the kernel to set IORING_SQ_TASKRUN in sq.flags whenever task work is pending; we read the bit via a relaxed atomic load on the mmap'd flag word and only enter the kernel when there is something to flush. Suggested by @tavianator in review.
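
Roughly what the gated flush looks like, as a sketch based on the description above (`IO_URING_READ_ONCE` is liburing's relaxed-load macro over the shared ring memory):

```c
// Only enter the kernel when it has flagged pending task work.
#if defined(IORING_SETUP_DEFER_TASKRUN) && defined(IORING_SQ_TASKRUN)
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
    // Relaxed atomic load on the mmap'd SQ flag word; no syscall.
    if (IO_URING_READ_ONCE(*selector->ring.sq.kflags) & IORING_SQ_TASKRUN) {
        io_uring_get_events(&selector->ring); // flush deferred completions
    }
}
#endif
```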

Benchmark CI

benchmark.yaml was missing liburing-dev, so benchmarks silently fell back to EPoll. Added the apt step so the URing benchmark job actually exercises URing.

Benchmark

hana.oriontransfer.co.nz, Linux 6.19.11, Ruby 3.4.9, dedicated machine.

Wakeup latency (benchmark/selector_wakeup.rb):

| Backend | main | This PR |
|---|---|---|
| URing cross-thread roundtrip | 9.32 µs | 10.21 µs |
| URing idle wakeup | 1.54 µs | 1.53 µs |

The ~1 µs roundtrip gap vs the NOP wakeup is the irreducible cost of going through eventfd + the kernel completing a pending async read; DEFER_TASKRUN closes most of what would otherwise be a ~2.6 µs gap.

TASKRUN_FLAG isolation — tight select(0) loop with nothing pending (200k iterations, n=5):

| | ns/call | M calls/s |
|---|---|---|
| DEFER_TASKRUN only | ~570 ns | 1.75 |
| + TASKRUN_FLAG | ~58 ns | 17 |

strace -e io_uring_enter over 10 such calls: 10 syscalls → 0.

HTTP (benchmark/server/event.rb, 8 connections, 2s wrk runs, 7 reps averaged):

| | main | This PR | Δ |
|---|---|---|---|
| Average req/s | 14,375 | 14,594 | +1.5% |

The HTTP improvement comes mostly from DEFER_TASKRUN (task work no longer interrupts the loop at arbitrary points); the wakeup path itself is slightly slower, but it's amortised across every event.

Testing

The existing wakeup correctness tests in test/io/event/selector.rb cover the cross-thread wakeup path. CI runs them on Ubuntu with liburing-dev installed. The benchmark workflow now also runs against URing so we get a continuous signal.

@samuel-williams-shopify force-pushed the uring-eventfd-wakeup branch 4 times, most recently from 07f1949 to dc2593b on May 9, 2026 at 07:07
@samuel-williams-shopify (Contributor, Author) commented:

Benchmark results (Linux 6.19, Ruby 3.4.8, dedicated machine)

Methodology: two git worktrees built independently — main (NOP wakeup) vs this PR (IO_Event_Interrupt async read + SINGLE_ISSUER + DEFER_TASKRUN). Each HTTP run is 2s wrk, 7 runs averaged.

Wakeup microbenchmark (benchmark/selector_wakeup.rb)

Cross-thread roundtrip: time from wakeup() call on another thread to select() returning on the owner thread.

| Backend | main | PR | Δ |
|---|---|---|---|
| URing | 9.32 µs | 10.21 µs | +9% |
| EPoll | 11.19 µs | 10.84 µs | noise |
| Select | 21.06 µs | 18.36 µs | noise |

Idle wakeup cost (calling wakeup() when selector is not blocking):

| Backend | main | PR |
|---|---|---|
| URing | 1.54 µs | 1.53 µs |

The ~1 µs roundtrip gap vs NOP is the irreducible cost of the eventfd write going through the kernel and completing a pending async read, rather than a NOP arriving at io_uring_wait_cqe_timeout directly. DEFER_TASKRUN closes most of this gap (was ~2.6 µs without it).

HTTP benchmark (benchmark/server/event.rb, fiber-per-connection)

| | main (NOP) | PR | Δ |
|---|---|---|---|
| Run 1 | 14,854 | 14,821 | |
| Run 2 | 14,085 | 14,592 | |
| Run 3 | 14,171 | 14,240 | |
| Run 4 | 13,774 | 14,768 | |
| Run 5 | 14,057 | 14,544 | |
| Run 6 | 14,717 | 14,627 | |
| Run 7 | 14,965 | 14,566 | |
| Average | 14,375 req/s | 14,594 req/s | +1.5% |

The DEFER_TASKRUN flag benefits the entire completion path, not just wakeup — which is why the HTTP throughput improves even though the pure wakeup latency is similar.

@samuel-williams-shopify force-pushed the uring-eventfd-wakeup branch 2 times, most recently from 8e7e932 to c9ba733 on May 9, 2026 at 08:00
Comment thread on ext/io/event/selector/uring.c (Outdated):
// io_uring_enter with IORING_ENTER_GETEVENTS (without blocking) to flush that
// work into the CQ so the non-blocking select_process_completions below sees them.
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
    io_uring_get_events(&selector->ring);
}
@tavianator (Contributor):

I don't think you need to do this, it should be done for you by io_uring_wait_cqe_timeout already

@tavianator (Contributor):

If we do need to do this it's worth trying IORING_SETUP_TASKRUN_FLAG so you don't need this unconditional syscall. But I don't think we need this

@samuel-williams-shopify (Contributor, Author):

The io_uring_get_events call here is at the top of select() — before any blocking wait — to flush deferred completions for the non-blocking select_process_completions peek that runs next:

io_uring_submit_flush(selector);

#ifdef IORING_SETUP_DEFER_TASKRUN
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
    io_uring_get_events(&selector->ring);   // <- here
}
#endif

int ready = IO_Event_Selector_ready_flush(&selector->backend);
int result = select_process_completions(selector);   // non-blocking peek

if (!ready && !result && !selector->backend.ready) {
    // ... only now do we optionally enter the blocking io_uring_wait_cqe_timeout ...
}

io_uring_wait_cqe_timeout does flush deferred work, but only when we actually go into the blocking wait. The non-blocking peek above runs first, and with DEFER_TASKRUN it would see an empty CQ even though completions have happened — fibers would stall until the next blocking wait forced the kernel into GETEVENTS. The upfront io_uring_get_events is what makes those completions visible to the peek path.

That said, you're right that an unconditional io_uring_enter(GETEVENTS) syscall every select() iteration is wasteful when nothing is actually deferred. Going to take your second suggestion and gate on IORING_SETUP_TASKRUN_FLAG so we only call into the kernel when IORING_SQ_TASKRUN is set — will follow up with a benchmark on hana once it lands.

@samuel-williams-shopify (Contributor, Author):

Done in 9c85783. Enabled IORING_SETUP_TASKRUN_FLAG alongside DEFER_TASKRUN and gated the io_uring_get_events call on *ring.sq.kflags & IORING_SQ_TASKRUN (relaxed atomic load — no syscall).

Benchmark on hana (Linux 6.19.11, Ruby 3.4.9, dedicated):

| Workload | Baseline (48e1437) | With TASKRUN_FLAG (9c85783) | Δ |
|---|---|---|---|
| select(0) with no pending work (200k tight loop, n=5) | ~570 ns/call (1.75M/s) | ~58 ns/call (17M/s) | −90% / ~10× |
| Cross-thread wakeup roundtrip (n=12) | 10.75 µs ± 0.44 | 10.66 µs ± 0.74 | within noise |
| Idle wakeup (n=12) | 1.61 µs ± 0.19 | 1.55 µs ± 0.11 | within noise |

strace -e io_uring_enter against 10 selector.select(0) calls with nothing pending:

| Before | After |
|---|---|
| io_uring_enter(IORING_ENTER_GETEVENTS) × 10 | 0 |

The wakeup microbenchmark doesn't move because every iteration has a deferred eventfd-read completion in flight — IORING_SQ_TASKRUN is set, so we still do the (necessary) get_events. The big win is for select() iterations that genuinely have nothing to flush, which the tight-loop microbenchmark isolates.

samuel-williams-shopify added a commit that referenced this pull request May 12, 2026
… syscall.

When DEFER_TASKRUN is active the kernel holds completions as deferred task work and the CQ stays empty until something calls into io_uring with GETEVENTS. We were calling io_uring_get_events() unconditionally at the top of select() to flush deferred work into the CQ before the non-blocking peek — a real io_uring_enter syscall every iteration regardless of whether anything was actually deferred.

IORING_SETUP_TASKRUN_FLAG (kernel 5.19+, always available alongside DEFER_TASKRUN) asks the kernel to set IORING_SQ_TASKRUN in the SQ flags whenever task work is pending. A relaxed atomic load is enough to check, so we can skip the get_events syscall when nothing is deferred.

Suggested by @tavianator in #166.

Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify and others added 2 commits May 12, 2026 13:32
Replace the cross-thread NOP-SQE wakeup with an async read on an `IO_Event_Interrupt` (eventfd on Linux, pipe elsewhere) so the waking thread never touches the ring's submission queue. This unlocks two kernel performance flags:

- `IORING_SETUP_SINGLE_ISSUER` (kernel 6.0+): tells the kernel only the owner thread will ever submit SQEs, letting it skip internal SQ locking on every submit.
- `IORING_SETUP_DEFER_TASKRUN` (kernel 6.1+, requires `SINGLE_ISSUER`): defers io_uring task work to the application thread rather than firing `TWA_SIGNAL` and running it at arbitrary userspace re-entries. Removes the cross-CPU IPI and produces deterministic completion ordering.

Wakeup path:

1. Before each blocking wait the owner thread submits an async read on the interrupt descriptor (`io_uring_cqe_get_data(cqe) == &selector->interrupt` identifies it).
2. `wakeup()` on any thread just calls `IO_Event_Interrupt_signal()` — a plain `write()` that never touches the SQ.
3. The kernel completes the read and `io_uring_wait_cqe_timeout` returns; the read atomically consumed the counter, so there is no separate drain step.

`DEFER_TASKRUN` means the kernel holds completions as deferred task work and the CQ stays empty until something enters io_uring with `GETEVENTS`. To keep the non-blocking peek in `select()` correct, we explicitly call `io_uring_get_events()` at the top of `select()` so deferred completions are flushed into the CQ before `select_process_completions()` runs. Both flags are guarded by `#ifdef` and degrade gracefully on older kernels / liburing.

Co-authored-by: Cursor <cursoragent@cursor.com>
With `DEFER_TASKRUN` we were calling `io_uring_get_events()` unconditionally at the top of `select()` — a real `io_uring_enter(GETEVENTS)` syscall every iteration regardless of whether there was anything deferred to flush.

`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+, always available alongside `DEFER_TASKRUN`) asks the kernel to set `IORING_SQ_TASKRUN` in `sq.flags` whenever task work is pending. A relaxed atomic load on the mmap'd flag word is enough to check — no syscall — so we only enter the kernel when there is actually deferred work to flush.

Suggested by @tavianator in #166.

Microbenchmark (hana, Linux 6.19.11, Ruby 3.4.9): tight `select(0)` loop with nothing pending drops from ~570 ns/call to ~58 ns/call (~10×). `strace` over 10 calls: 10 `io_uring_enter(GETEVENTS)` syscalls → 0. The wakeup roundtrip is unchanged because that path always has a deferred eventfd-read completion in flight (so the flag is set and we still flush — correctly).

Co-authored-by: Cursor <cursoragent@cursor.com>
@samuel-williams-shopify changed the title from "Use eventfd for wakeup and enable IORING_SETUP_SINGLE_ISSUER" to "Use eventfd for wakeup and enable io_uring performance flags" on May 12, 2026
@samuel-williams-shopify merged commit 6c39050 into main on May 12, 2026
54 of 66 checks passed
@samuel-williams-shopify deleted the uring-eventfd-wakeup branch on May 12, 2026 at 04:54
samuel-williams-shopify added a commit to tavianator/io-event that referenced this pull request May 12, 2026
After "Handle short io_uring submissions" (b57476a) moved the `selector->pending += 1` into `io_get_sqe()`, the manual increment in `select_internal_without_gvl` introduced by socketry#166 became a double-count.  The duplicated count left `pending > 0` after `io_uring_wait_cqe_timeout` drained the SQ internally, sending the next `io_uring_submit_flush` into a busy loop (submit returns 0 because the ring is empty, but `pending > 0` keeps the new while-loop spinning).

Co-authored-by: Cursor <cursoragent@cursor.com>
@samuel-williams-shopify added this to the v1.16.0 milestone on May 12, 2026