Use eventfd for wakeup and enable io_uring performance flags #166
Force-pushed from 07f1949 to dc2593b.
Benchmark results (Linux 6.19, Ruby 3.4.8, dedicated machine). Methodology: two git worktrees built independently.

Wakeup microbenchmark:
| Backend | main | PR | Δ |
|---|---|---|---|
| URing | 9.32 µs | 10.21 µs | +9% |
| EPoll | 11.19 µs | 10.84 µs | noise |
| Select | 21.06 µs | 18.36 µs | noise |
Idle wakeup cost (calling wakeup() when selector is not blocking):
| Backend | main | PR |
|---|---|---|
| URing | 1.54 µs | 1.53 µs |
The ~1 µs roundtrip gap vs NOP is the irreducible cost of the eventfd write going through the kernel and completing a pending async read, rather than a NOP arriving at io_uring_wait_cqe_timeout directly. DEFER_TASKRUN closes most of this gap (was ~2.6 µs without it).
HTTP benchmark (`benchmark/server/event.rb`, fiber-per-connection):
| Run | main (NOP) | PR | Δ |
|---|---|---|---|
| Run 1 | 14,854 | 14,821 | |
| Run 2 | 14,085 | 14,592 | |
| Run 3 | 14,171 | 14,240 | |
| Run 4 | 13,774 | 14,768 | |
| Run 5 | 14,057 | 14,544 | |
| Run 6 | 14,717 | 14,627 | |
| Run 7 | 14,965 | 14,566 | |
| Average | 14,375 req/s | 14,594 req/s | +1.5% |
The DEFER_TASKRUN flag benefits the entire completion path, not just wakeup — which is why the HTTP throughput improves even though the pure wakeup latency is similar.
Force-pushed from 8e7e932 to c9ba733.
```c
// io_uring_enter with IORING_ENTER_GETEVENTS (without blocking) to flush that
// work into the CQ so the non-blocking select_process_completions below sees them.
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
	io_uring_get_events(&selector->ring);
}
```
I don't think you need to do this; it should already be done for you by `io_uring_wait_cqe_timeout`.
If we do need to do this, it's worth trying `IORING_SETUP_TASKRUN_FLAG` so you don't need this unconditional syscall. But I don't think we need this.
The `io_uring_get_events` call here is at the top of `select()` — before any blocking wait — to flush deferred completions for the non-blocking `select_process_completions` peek that runs next:

```c
io_uring_submit_flush(selector);

#ifdef IORING_SETUP_DEFER_TASKRUN
if (selector->ring.flags & IORING_SETUP_DEFER_TASKRUN) {
	io_uring_get_events(&selector->ring); // <- here
}
#endif

int ready = IO_Event_Selector_ready_flush(&selector->backend);
int result = select_process_completions(selector); // non-blocking peek

if (!ready && !result && !selector->backend.ready) {
	// ... only now do we optionally enter the blocking io_uring_wait_cqe_timeout ...
}
```

`io_uring_wait_cqe_timeout` does flush deferred work, but only when we actually go into the blocking wait. The non-blocking peek above runs first, and with `DEFER_TASKRUN` it would see an empty CQ even though completions have happened — fibers would stall until the next blocking wait forced the kernel into `GETEVENTS`. The upfront `io_uring_get_events` is what makes those completions visible to the peek path.
That said, you're right that an unconditional `io_uring_enter(GETEVENTS)` syscall every `select()` iteration is wasteful when nothing is actually deferred. Going to take your second suggestion and gate on `IORING_SETUP_TASKRUN_FLAG` so we only call into the kernel when `IORING_SQ_TASKRUN` is set — will follow up with a benchmark on hana once it lands.
Done in 9c85783. Enabled `IORING_SETUP_TASKRUN_FLAG` alongside `DEFER_TASKRUN` and gated the `io_uring_get_events` call on `*ring.sq.kflags & IORING_SQ_TASKRUN` (relaxed atomic load — no syscall).
Benchmark on hana (Linux 6.19.11, Ruby 3.4.9, dedicated):
| Workload | Baseline (48e1437) | With TASKRUN_FLAG (9c85783) | Δ |
|---|---|---|---|
| `select(0)` with no pending work (200k tight loop, n=5) | ~570 ns/call (1.75M/s) | ~58 ns/call (17M/s) | −90% / ~10× |
| Cross-thread wakeup roundtrip (n=12) | 10.75 µs ± 0.44 | 10.66 µs ± 0.74 | within noise |
| Idle wakeup (n=12) | 1.61 µs ± 0.19 | 1.55 µs ± 0.11 | within noise |
`strace -e io_uring_enter` against 10 `selector.select(0)` calls with nothing pending:

| Before | After |
|---|---|
| `io_uring_enter(IORING_ENTER_GETEVENTS)` × 10 | 0 |
The wakeup microbenchmark doesn't move because every iteration has a deferred eventfd-read completion in flight — IORING_SQ_TASKRUN is set, so we still do the (necessary) get_events. The big win is for select() iterations that genuinely have nothing to flush, which the tight-loop microbenchmark isolates.
Force-pushed from 261821b to 0e6f2fe.
… syscall.

When `DEFER_TASKRUN` is active, the kernel holds completions as deferred task work and the CQ stays empty until something calls into io_uring with `GETEVENTS`. We were calling `io_uring_get_events()` unconditionally at the top of `select()` to flush deferred work into the CQ before the non-blocking peek — a real `io_uring_enter` syscall every iteration, regardless of whether anything was actually deferred.

`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+, always available alongside `DEFER_TASKRUN`) asks the kernel to set `IORING_SQ_TASKRUN` in the SQ flags whenever task work is pending. A relaxed atomic load is enough to check, so we can skip the `get_events` syscall when nothing is deferred.

Suggested by @tavianator in #166.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the cross-thread NOP-SQE wakeup with an async read on an `IO_Event_Interrupt` (eventfd on Linux, pipe elsewhere) so the waking thread never touches the ring's submission queue. This unlocks two kernel performance flags:

- `IORING_SETUP_SINGLE_ISSUER` (kernel 6.0+): tells the kernel only the owner thread will ever submit SQEs, letting it skip internal SQ locking on every submit.
- `IORING_SETUP_DEFER_TASKRUN` (kernel 6.1+, requires `SINGLE_ISSUER`): defers io_uring task work to the application thread rather than firing `TWA_SIGNAL` and running it at arbitrary userspace re-entries. Removes the cross-CPU IPI and produces deterministic completion ordering.

Wakeup path:

1. Before each blocking wait the owner thread submits an async read on the interrupt descriptor (`io_uring_cqe_get_data(cqe) == &selector->interrupt` identifies it).
2. `wakeup()` on any thread just calls `IO_Event_Interrupt_signal()` — a plain `write()` that never touches the SQ.
3. The kernel completes the read and `io_uring_wait_cqe_timeout` returns; the read atomically consumed the counter, so there is no separate drain step.

`DEFER_TASKRUN` means the kernel holds completions as deferred task work and the CQ stays empty until something enters io_uring with `GETEVENTS`. To keep the non-blocking peek in `select()` correct, we explicitly call `io_uring_get_events()` at the top of `select()` so deferred completions are flushed into the CQ before `select_process_completions()` runs.

Both flags are guarded by `#ifdef` and degrade gracefully on older kernels / liburing.

Co-authored-by: Cursor <cursoragent@cursor.com>
With `DEFER_TASKRUN` we were calling `io_uring_get_events()` unconditionally at the top of `select()` — a real `io_uring_enter(GETEVENTS)` syscall every iteration, regardless of whether there was anything deferred to flush.

`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+, always available alongside `DEFER_TASKRUN`) asks the kernel to set `IORING_SQ_TASKRUN` in `sq.flags` whenever task work is pending. A relaxed atomic load on the mmap'd flag word is enough to check — no syscall — so we only enter the kernel when there is actually deferred work to flush. Suggested by @tavianator in #166.

Microbenchmark (hana, Linux 6.19.11, Ruby 3.4.9): a tight `select(0)` loop with nothing pending drops from ~570 ns/call to ~58 ns/call (~10×). `strace` over 10 calls: 10 `io_uring_enter(GETEVENTS)` syscalls → 0. The wakeup roundtrip is unchanged because that path always has a deferred eventfd-read completion in flight (so the flag is set and we still flush — correctly).

Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 9c85783 to 7bd340f.
After "Handle short io_uring submissions" (b57476a) moved the `selector->pending += 1` into `io_get_sqe()`, the manual increment in `select_internal_without_gvl` introduced by socketry#166 became a double count. The duplicated count left `pending > 0` after `io_uring_wait_cqe_timeout` drained the SQ internally, sending the next `io_uring_submit_flush` into a busy loop (submit returns 0 because the ring is empty, but `pending > 0` keeps the new while-loop spinning).

Co-authored-by: Cursor <cursoragent@cursor.com>
The current cross-thread `wakeup()` submits a NOP SQE directly into the ring's submission queue from the waking thread. That cross-thread SQ access prevents enabling several kernel performance flags that require single-issuer semantics, and forces a sub-optimal completion path.

This PR rewires `wakeup()` to use an `eventfd` (Linux) / pipe (other Unix) via `IO_Event_Interrupt`, so the waking thread never touches the ring. With that constraint satisfied, three kernel flags become safe to set, each gated by `#ifdef` and degrading gracefully on older kernels / liburing.

Changes

**eventfd-based wakeup**

- Before blocking in `io_uring_wait_cqe_timeout`, the owner thread submits a one-shot async `read` on the interrupt descriptor (tagged with `&selector->interrupt`).
- `wakeup()` — called from any thread — calls `IO_Event_Interrupt_signal()`, a plain `write()` that bumps the eventfd counter. No ring access from the waking thread.
- When `wait_cqe_timeout` returns, the read atomically consumed the counter, so no separate drain step is needed.

**`IORING_SETUP_SINGLE_ISSUER` (kernel 6.0+)**

Tells the kernel only the owner thread will ever submit SQEs, letting it skip internal SQ locking on every submit call.

**`IORING_SETUP_DEFER_TASKRUN` (kernel 6.1+, requires `SINGLE_ISSUER`)**

Defers io_uring task work to the application thread instead of running it via `TWA_SIGNAL` at arbitrary userspace re-entries. Eliminates the cross-CPU IPI and the unpredictable preemption window, and lets all completions drain in a single kernel pass at the next `GETEVENTS` boundary.

Because the CQ stays empty until something explicitly enters io_uring with `GETEVENTS`, `select()` calls `io_uring_get_events()` at the top — before the non-blocking `select_process_completions()` peek — so deferred completions are visible without forcing a blocking wait.

**`IORING_SETUP_TASKRUN_FLAG` (kernel 5.19+) — follow-up**

The unconditional `io_uring_get_events()` is a real syscall every `select()` iteration. `TASKRUN_FLAG` asks the kernel to set `IORING_SQ_TASKRUN` in `sq.flags` whenever task work is pending; we read the bit via a relaxed atomic load on the mmap'd flag word and only enter the kernel when there is something to flush. Suggested by @tavianator in review.

**Benchmark CI**

`benchmark.yaml` was missing `liburing-dev`, so benchmarks silently fell back to EPoll. Added the apt step so the URing benchmark job actually exercises URing.
`hana.oriontransfer.co.nz`, Linux 6.19.11, Ruby 3.4.9, dedicated machine.
Wakeup latency (`benchmark/selector_wakeup.rb`):

The ~1 µs roundtrip gap vs the NOP wakeup is the irreducible cost of going through `eventfd` + the kernel completing a pending async read; `DEFER_TASKRUN` closes most of what would otherwise be a ~2.6 µs gap.

TASKRUN_FLAG isolation — tight `select(0)` loop with nothing pending (200k iterations, n=5):

| DEFER_TASKRUN only | With TASKRUN_FLAG |
|---|---|
| ~570 ns/call | ~58 ns/call |

`strace -e io_uring_enter` over 10 such calls: 10 syscalls → 0.

HTTP (`benchmark/server/event.rb`, 8 connections, 2s wrk runs, 7 reps averaged):

The HTTP improvement comes mostly from `DEFER_TASKRUN` (task work no longer interrupts the loop at arbitrary points); the wakeup path itself is slightly slower, but it's amortised across every event.

Testing
The existing `wakeup` correctness tests in `test/io/event/selector.rb` cover the cross-thread wakeup path. CI runs them on Ubuntu with `liburing-dev` installed. The benchmark workflow now also runs against URing so we get a continuous signal.