Add io_close to TestScheduler #171
Open
samuel-williams-shopify wants to merge 3 commits into main
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 9, 2026

…eration_wait

rb_funcall(scheduler, :blocking_operation_wait, 1, blocking_operation) can cause a fiber switch if the scheduler calls rb_fiber_scheduler_block. When the fiber is suspended, the C frame of rb_fiber_scheduler_blocking_operation_wait is no longer active. In optimised builds (-O3 --enable-shared), blocking_operation may be held only in a machine register not saved/scanned by the conservative GC, allowing it to be collected. get_blocking_operation() at line 1104 then reads freed/reused memory, crashing with rb_unexpected_object_type.

Confirmed by reproducing the crash using:

./configure --enable-shared --disable-install-doc --enable-yjit cppflags=-DENABLE_PATH_CHECK=0

RB_GC_GUARD(blocking_operation) after rb_funcall forces the compiler to keep the VALUE on the stack (volatile read), ensuring the GC always finds it.

See: socketry/io-event#170 socketry/io-event#171

Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 9, 2026

…eration_wait

rb_funcall(scheduler, :blocking_operation_wait, 1, blocking_operation) can cause a fiber switch if the scheduler calls rb_fiber_scheduler_block. When the fiber is suspended, blocking_operation may only be in a machine register not scanned by the conservative GC, allowing collection.

Confirmed by reproducing the crash (segfault in get_blocking_operation) with:

./configure --enable-shared --disable-install-doc --enable-yjit

RB_GC_GUARD forces the VALUE onto the stack, ensuring the GC always finds it.

See: socketry/io-event#171

Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 10, 2026

…eration_wait

rb_funcall(scheduler, :blocking_operation_wait, 1, blocking_operation) can cause a fiber switch if the scheduler calls rb_fiber_scheduler_block. When the fiber is suspended, blocking_operation may not be reachable via the conservative GC scan of the suspended fiber's C stack.

rb_gc_register_address pins blocking_operation in the global GC root list, which is always walked regardless of fiber state. The address is kept registered through the last implicit use of the VALUE, including all accesses via the raw C pointer derived from it, so that a compacting GC cannot move the object and leave the pointer dangling.

Confirmed by reproducing the crash in io-event CI:

./configure --enable-shared --disable-install-doc --enable-yjit

See: socketry/io-event#171 ruby#16908

Co-authored-by: Cursor <cursoragent@cursor.com>
samuel-williams-shopify added a commit to samuel-williams-shopify/ruby that referenced this pull request May 10, 2026

…eration_wait

Use rb_gc_register_address to pin blocking_operation as a precise GC root during rb_funcall.

The scheduler's blocking_operation_wait may cause a fiber switch via rb_fiber_scheduler_block, which suspends the calling fiber. The conservative GC does not find the VALUE on the suspended fiber's C stack (possibly due to it being in a machine register not captured in the saved context), so the object can be collected or moved without updating the local VALUE.

rb_gc_register_address ensures the object is a precise root that is always found and properly handled by both the regular and compacting GC. rb_gc_unregister_address is called after the last use of the raw pointer (which is derived from blocking_operation) to avoid a dangling registered address.

Confirmed by io-event CI, which reliably crashes without this fix and passes with it: socketry/io-event#171

Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from f1696e5 to 31cc39f
In Ruby 4.1, IO#close without a scheduler io_close hook falls through to maygvl_close with RB_NOGVL_OFFLOAD_SAFE, which calls rb_fiber_scheduler_blocking_operation_wait. With a worker-pool scheduler this causes a fiber switch, and GC compaction can then move the blocking_operation VALUE, crashing in get_blocking_operation.

Fix: add io_close to TestScheduler, delegating to @selector.io_close, so Ruby marks done=1 and skips maygvl_close entirely.

Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 31cc39f to 10807b5
Three related fixes:

1. GC compaction safety for raw C pointers (the main bug)

   worker_pool_call extracted a raw C pointer from the blocking_operation TypedData object before the fiber switch. The compacting GC could move that object while the calling fiber was suspended, leaving a stale pointer that the worker thread would then dereference, crashing.

   Fix: store the Ruby VALUE alongside the raw pointer and register all four Work struct VALUEs (blocking_operation_value, scheduler, blocker, fiber) as precise GC roots. The compacting GC updates these in-place when objects move. The worker thread re-extracts the raw pointer from the updated VALUE.

2. RUBY_TYPED_WB_PROTECTED + compact function

   Without WB_PROTECTED, RB_OBJ_WRITE (already present at worker creation) installed no write barrier. Adding the flag makes it functional, and a new compact function updates worker->thread via rb_gc_location.

3. rb_gc_mark_movable for thread objects

   Changed from rb_gc_mark (which pins) to rb_gc_mark_movable so threads can be moved by the compacting GC.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Problem
In Ruby 4.1, IO#close calls fptr_finalize_flush, which falls through to maygvl_close with RB_NOGVL_OFFLOAD_SAFE when the scheduler's io_close hook is absent or returns Qundef. This causes rb_fiber_scheduler_blocking_operation_wait to be called.

With a worker-pool scheduler (like TestScheduler) this triggers a fiber switch inside blocking_operation_wait. The GC can then compact and move the blocking_operation VALUE while the calling fiber is suspended, crashing in get_blocking_operation with a wrong-type segfault.

Root cause

TestScheduler didn't implement the io_close scheduler hook. Without it, Ruby falls back to maygvl_close → rb_fiber_scheduler_blocking_operation_wait, which is unnecessary because the selector already handles async close via io_uring_prep_close.

Fix

Add io_close to TestScheduler, delegating to @selector.io_close. This makes rb_fiber_scheduler_io_close return truthy, Ruby sets done=1, and maygvl_close is skipped entirely: blocking_operation_wait is never called for IO closes.

Investigation

The crash was diagnosed by:

- ruby-dev-builder (--enable-shared --disable-install-doc --enable-yjit)
- test/io/event/selector/io_close.rb present (from PR "Fix io_close crash when called with an Integer fd" #169) and passes without it; the IO close tests change the heap layout enough for GC compaction to reliably hit the crash window
- sus-fixtures-benchmark gem as a factor (red herring)
- fptr_finalize_flush → rb_nogvl(RB_NOGVL_OFFLOAD_SAFE) → rb_fiber_scheduler_blocking_operation_wait

See also: ruby/ruby#16908, a latent GC bug in rb_fiber_scheduler_blocking_operation_wait (the blocking_operation VALUE is not a precise GC root during rb_funcall, so fiber-switching schedulers can see it collected/moved). This fix avoids triggering that path; the Ruby PR provides defense-in-depth for other schedulers.
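
For readers who want to see the shape of the change, here is a minimal Ruby-level sketch of the behaviour described under Problem and Fix. It is not the actual TestScheduler or the C code in Ruby: RecordingSelector, SchedulerWithoutHook, SchedulerWithHook, and close_path are hypothetical names invented for illustration. The single-argument form of io_close is an assumption, and the "truthy result means done" rule is a simplification of what the description states.

```ruby
# Hypothetical stand-in for the real selector: it only records what it was
# asked to close, instead of issuing an asynchronous close.
class RecordingSelector
  attr_reader :closed

  def initialize
    @closed = []
  end

  def io_close(io)
    @closed << io
    true
  end
end

# A scheduler that, like TestScheduler before this change, has no io_close hook.
class SchedulerWithoutHook
  def initialize(selector)
    @selector = selector
  end
end

# The same scheduler with the hook added; it simply delegates to the selector.
class SchedulerWithHook < SchedulerWithoutHook
  def io_close(io)
    @selector.io_close(io)
  end
end

# Simplified model of the dispatch described above (the real logic lives in C):
# if the hook exists and reports success, the close is done; otherwise the
# close falls back to the blocking_operation_wait path.
def close_path(scheduler, io)
  if scheduler.respond_to?(:io_close) && scheduler.io_close(io)
    :scheduler_hook
  else
    :blocking_operation_wait
  end
end

selector = RecordingSelector.new
puts close_path(SchedulerWithoutHook.new(selector), 5)
puts close_path(SchedulerWithHook.new(selector), 5)
```

In this model only the scheduler that defines io_close keeps the close on the selector; the one without the hook takes the fallback path, which is the path this PR avoids for IO closes.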