Add io_close to TestScheduler#171

Open
samuel-williams-shopify wants to merge 3 commits into main from test-ruby-gc-bug

Contributor

@samuel-williams-shopify samuel-williams-shopify commented May 9, 2026

Problem

In Ruby 4.1, IO#close calls fptr_finalize_flush which falls through to maygvl_close with RB_NOGVL_OFFLOAD_SAFE when the scheduler's io_close hook is absent or returns Qundef. This causes rb_fiber_scheduler_blocking_operation_wait to be called.

With a worker-pool scheduler (like TestScheduler) this triggers a fiber switch inside blocking_operation_wait. The GC can then compact and move the blocking_operation VALUE while the calling fiber is suspended — crashing in get_blocking_operation with a wrong-type segfault.
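
The decision described above can be sketched as a small Ruby model. This is only an illustration of the control flow, not Ruby's C source: `UNDEFINED`, `scheduler_io_close`, `close_path` and both scheduler classes are hypothetical names, and treating a falsy hook result as "not done" is an assumption.

```ruby
# Hypothetical model of the close decision; not Ruby's actual implementation.
UNDEFINED = Object.new.freeze # stands in for Qundef

# Stand-in for rb_fiber_scheduler_io_close: reports UNDEFINED when the
# scheduler does not implement the io_close hook.
def scheduler_io_close(scheduler, io)
  return UNDEFINED unless scheduler.respond_to?(:io_close)

  scheduler.io_close(io)
end

# Returns a symbol naming the path the close would take.
def close_path(scheduler, io)
  result = scheduler_io_close(scheduler, io)

  # Assumption: only a defined, truthy result marks the close as done.
  done = !result.equal?(UNDEFINED) && !!result

  done ? :scheduler_io_close : :blocking_operation_wait
end

# A scheduler without the hook, like TestScheduler before this change.
class SchedulerWithoutHook
end

# A scheduler that implements the hook and reports success.
class SchedulerWithHook
  def io_close(io)
    true
  end
end

p close_path(SchedulerWithoutHook.new, :io) # prints :blocking_operation_wait
p close_path(SchedulerWithHook.new, :io)    # prints :scheduler_io_close
```

Only the first path reaches blocking_operation_wait, which is where the fiber switch and the crash window come from.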

Root cause

TestScheduler didn't implement the io_close scheduler hook. Without it, Ruby falls back to maygvl_close → rb_fiber_scheduler_blocking_operation_wait, which is unnecessary because the selector already handles async close via io_uring_prep_close.

Fix

Add io_close to TestScheduler delegating to @selector.io_close. This makes rb_fiber_scheduler_io_close return truthy, Ruby sets done=1, and maygvl_close is skipped entirely — blocking_operation_wait is never called for IO closes.
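
A minimal sketch of the delegation, assuming the selector exposes an `io_close` method that returns a truthy value on success. `RecordingSelector` and `SketchScheduler` are hypothetical stand-ins for the real selector and for TestScheduler, and the argument passed to the hook is illustrative only.

```ruby
# Hypothetical stand-in for the real selector; it only records what it
# was asked to close and reports success.
class RecordingSelector
  attr_reader :closed

  def initialize
    @closed = []
  end

  def io_close(io)
    @closed << io
    true
  end
end

# Reduced sketch of a scheduler; only the hook discussed here is shown.
class SketchScheduler
  def initialize(selector)
    @selector = selector
  end

  # Delegate the close to the selector so the hook returns a truthy
  # value and the blocking fallback is not needed.
  def io_close(io)
    @selector.io_close(io)
  end
end

selector = RecordingSelector.new
scheduler = SketchScheduler.new(selector)
result = scheduler.io_close(:some_io)
```

The hook adds no behaviour of its own. It exists so that the close is handled by the selector instead of being offloaded through blocking_operation_wait.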

Investigation

The crash was diagnosed by:

  1. Building Ruby from source with the same flags as ruby-dev-builder (--enable-shared --disable-install-doc --enable-yjit)
  2. Confirming it crashes with test/io/event/selector/io_close.rb present (from PR #169, "Fix io_close crash when called with an Integer fd") and passes without it; the IO close tests change the heap layout enough for GC compaction to reliably hit the crash window
  3. Eliminating the sus-fixtures-benchmark gem as a factor (red herring)
  4. Tracing the crash via the C stack to fptr_finalize_flush → rb_nogvl(RB_NOGVL_OFFLOAD_SAFE) → rb_fiber_scheduler_blocking_operation_wait
  5. Confirming the fix makes all combinations pass against unfixed Ruby

See also: ruby/ruby#16908 — a latent GC bug in rb_fiber_scheduler_blocking_operation_wait (the blocking_operation VALUE is not a precise GC root during rb_funcall, so fiber-switching schedulers can see it collected/moved). This fix avoids triggering that path; the Ruby PR provides defense-in-depth for other schedulers.

samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 9, 2026
…eration_wait

rb_funcall(scheduler, :blocking_operation_wait, 1, blocking_operation) can
cause a fiber switch if the scheduler calls rb_fiber_scheduler_block. When
the fiber is suspended, the C frame of rb_fiber_scheduler_blocking_operation_wait
is no longer active. In optimised builds (-O3 --enable-shared), blocking_operation
may be held only in a machine register not saved/scanned by the conservative GC,
allowing it to be collected. get_blocking_operation() at line 1104 then reads
freed/reused memory, crashing with rb_unexpected_object_type.

Confirmed by reproducing the crash using:
  ./configure --enable-shared --disable-install-doc --enable-yjit cppflags=-DENABLE_PATH_CHECK=0

RB_GC_GUARD(blocking_operation) after rb_funcall forces the compiler to keep
the VALUE on the stack (volatile read), ensuring the GC always finds it.

See: socketry/io-event#170
     socketry/io-event#171
Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 9, 2026
…eration_wait

rb_funcall(scheduler, :blocking_operation_wait, 1, blocking_operation) can
cause a fiber switch if the scheduler calls rb_fiber_scheduler_block. When
the fiber is suspended, blocking_operation may only be in a machine register
not scanned by the conservative GC, allowing collection. Confirmed by
reproducing the crash (segfault in get_blocking_operation) with:
  ./configure --enable-shared --disable-install-doc --enable-yjit
RB_GC_GUARD forces the VALUE onto the stack ensuring the GC always finds it.

See: socketry/io-event#171
Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 10, 2026
…eration_wait

rb_funcall(scheduler, :blocking_operation_wait, 1, blocking_operation) can
cause a fiber switch if the scheduler calls rb_fiber_scheduler_block. When
the fiber is suspended, blocking_operation may not be reachable via the
conservative GC scan of the suspended fiber's C stack.

rb_gc_register_address pins blocking_operation in the global GC root list,
which is always walked regardless of fiber state. The address is kept
registered through the last implicit use of the VALUE (including all accesses
via the raw C pointer derived from it) so that a compacting GC
cannot move the object and leave that pointer dangling.

Confirmed by reproducing the crash in io-event CI:
  ./configure --enable-shared --disable-install-doc --enable-yjit
See: socketry/io-event#171
     ruby#16908

Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 10, 2026
…eration_wait

Use rb_gc_register_address to pin blocking_operation as a precise GC root
during rb_funcall. The scheduler's blocking_operation_wait may cause a fiber
switch via rb_fiber_scheduler_block, which suspends the calling fiber. The
conservative GC does not find the VALUE on the suspended fiber's C stack
(possibly due to it being in a machine register not captured in the saved
context), so the object can be collected or moved without updating the local
VALUE. rb_gc_register_address ensures the object is a precise root that is
always found and properly handled by both the regular and compacting GC.
rb_gc_unregister_address is called after the last use of the raw
pointer (which is derived from blocking_operation) to avoid a dangling
registered address.

Confirmed by io-event CI which reliably crashes without this fix and passes
with it: socketry/io-event#171

Co-authored-by: Cursor <cursoragent@cursor.com>
@samuel-williams-shopify samuel-williams-shopify force-pushed the test-ruby-gc-bug branch 2 times, most recently from f1696e5 to 31cc39f Compare May 10, 2026 02:13
In Ruby 4.1, IO#close without a scheduler io_close hook falls through to
maygvl_close with RB_NOGVL_OFFLOAD_SAFE, which calls
rb_fiber_scheduler_blocking_operation_wait. With a worker-pool scheduler
this causes a fiber switch, and GC compaction can then move the
blocking_operation VALUE — crashing in get_blocking_operation.

Fix: add io_close to TestScheduler delegating to @selector.io_close so
Ruby marks done=1 and skips maygvl_close entirely.

Co-authored-by: Cursor <cursoragent@cursor.com>
@samuel-williams-shopify samuel-williams-shopify changed the title Test io-event against Ruby GC bug (blocking_operation safety) Add io_close to TestScheduler May 10, 2026
samuel-williams-shopify and others added 2 commits May 10, 2026 12:24
Three related fixes:

1. GC compaction safety for raw C pointers (the main bug)
   worker_pool_call extracted a raw C pointer from the blocking_operation
   TypedData object before the fiber switch. The compacting GC could move
   that object while the calling fiber was suspended, leaving a stale pointer
   that the worker thread would then dereference — crashing. Fix: store the
   Ruby VALUE alongside the raw pointer and register all four Work struct
   VALUEs (blocking_operation_value, scheduler, blocker, fiber) as precise
   GC roots. The compacting GC updates these in-place when objects move.
   The worker thread re-extracts the raw pointer from the updated VALUE.

2. RUBY_TYPED_WB_PROTECTED + compact function
   Without WB_PROTECTED, RB_OBJ_WRITE (already present at worker creation)
   installed no write barrier. Adding the flag makes it functional, and a
   new compact function updates worker->thread via rb_gc_location.

3. rb_gc_mark_movable for thread objects
   Changed from rb_gc_mark (which pins) to rb_gc_mark_movable so threads
   can be moved by the compacting GC.

Co-authored-by: Cursor <cursoragent@cursor.com>