Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a real Ruby mutex in rb_io_close_wait_list #7884

Merged

Conversation

KJTsanaktsidis
Copy link
Contributor

@KJTsanaktsidis KJTsanaktsidis commented Jun 1, 2023

Because a thread calling IO#close now blocks in a native condvar wait, it's possible for there to be no threads left to actually handle incoming signals/ubf calls/etc.

This manifested as failing tests on Solaris 10 (SPARC), because:

  • One thread called IO#close, which sent a SIGVTALRM to the other thread to interrupt it, and then waited on the condvar to be notified that the reading thread was done.
  • One thread was calling IO#read, but it hadn't yet reached the actual call to select(2) when the SIGVTALRM arrived, so it never unblocked itself.

This results in a deadlock.

The fix is to use a real Ruby mutex for the close lock; that way, the closing thread goes into sigwait-sleep and can keep trying to interrupt the select(2) thread.

See the discussion in: #7865


In that linked PR above, I also said "nogvl_wait_for should wait not only on the actual FD that was requested, but also on the sigwait FD (if it's available).". I think I want to do that as well. But this patch seems sufficient to solve the deadlock I introduced, and the tests pass on the Solaris VM i'm using.

CC @mame and @ioquatix

Because a thread calling IO#close now blocks in a native condvar wait,
it's possible for there to be _no_ threads left to actually handle
incoming signals/ubf calls/etc.

This manifested as failing tests on Solaris 10 (SPARC), because:

* One thread called IO#close, which sent a SIGVTALRM to the other
  thread to interrupt it, and then waited on the condvar to be notified
  that the reading thread was done.
* One thread was calling IO#read, but it hadn't yet reached the actual
  call to select(2) when the SIGVTALRM arrived, so it never unblocked
  itself.

This results in a deadlock.

The fix is to use a real Ruby mutex for the close lock; that way, the
closing thread goes into sigwait-sleep and can keep trying to interrupt
the select(2) thread.

See the discussion in: ruby#7865
rb_native_mutex_initialize(&busy->mu);
rb_native_cond_initialize(&busy->cv);
wakeup_mutex = rb_mutex_new();
RBASIC_CLEAR_CLASS(wakeup_mutex); /* hide from ObjectSpace */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK to use a Ruby mutex like this inside thread.c/io.c? It works, and does what I want, but I don't know if this kind of thing is considered wrong for some reason?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay, and I'd even say you don't need to worry about exposing it to object space.

@ioquatix ioquatix merged commit edee9b6 into ruby:master Jun 1, 2023
94 of 96 checks passed
@ioquatix
Copy link
Member

ioquatix commented Jun 1, 2023

Thanks for your work on this difficult issue.

@ko1
Copy link
Contributor

ko1 commented Jun 1, 2023

It doesn't work on Ractors so I'll change them later.

@KJTsanaktsidis
Copy link
Contributor Author

Is the issue that two Ractors could be doing IO to the same fd (e.g. by both creating an IO with IO.for_fd), and the mutex would wind up being shared between Ractors unsafely? Would the fix for that be to just protect the rb_io_close_wait_list with the RB_VM_LOCK_ENTER() lock?

@luanzeba
Copy link
Contributor

luanzeba commented Jun 1, 2023

Thanks for this change. We were seeing deadlocks when running cc698c6 but this change fixed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants