mutex: Raise a ThreadError when detecting a fiber deadlock #6680
Looks good to me!
@@ -194,7 +194,7 @@ def test_queue_pop_waits
   end

   def test_mutex_deadlock
-    error_pattern = /No live threads left. Deadlock\?/
+    error_pattern = /lock already owned by another fiber/
FWIW the `No live threads left. Deadlock?` error is actually an eFatal IIRC.
But I think using ThreadError as you did seems fine too.
There is some other detection too using that message though, maybe it's good to be consistent with it in the exception class?
Probably we also need to respect `GET_THREAD()->vm->thread_ignore_deadlock` (which defaults to false) like in `rb_check_deadlock()`, i.e., the lock could be interrupted by a signal and unblock things.
> FWIW the `No live threads left. Deadlock?` error is actually an eFatal IIRC.

Interesting, the other deadlock-related errors in this function are all `rb_eThreadError`.

> Probably we also need to respect `GET_THREAD()->vm->thread_ignore_deadlock`

Done.
> Interesting, the other deadlock-related errors in this function are all `rb_eThreadError`.

Right, ThreadError seems fair then (especially since it's quite similar to the "deadlock; recursive locking" case).
7940290
to
f1e895b
Compare
[Bug #19105] If no fiber scheduler is registered and the fiber that owns the lock and the one that tries to acquire it both belong to the same thread, we're in a deadlock case.
f1e895b
to
73885b2
Compare
I would also suggest adding a test using `Fiber.blocked{}`. We don't always go through the scheduler code path even if a scheduler is present, if the fiber is blocking. I believe in this case, it should still error out, but without writing the code & test myself, I am not 100% certain.
I need to read on that one. I'm not familiar with
Would it be possible to get a 3.0 and/or 3.1 release with this fix in it? This is blocking our upgrade from 2.7. Happy to help how I can!
@technicalpickles The ticket is already marked for backport. But I don't see how this would unblock an upgrade, it will raise instead of deadlock, but that means the code trying to relock the Mutex on the same thread still won't work correctly.
@eregon Apologies! I'm not familiar with how backporting and releases happen. Thanks for confirming! This will help us flush out where deadlock is happening and help confirm when we've fixed it.
mutex: Raise a ThreadError when detecting a fiber deadlock (#6680)

[Bug #19105]

If no fiber scheduler is registered and the fiber that owns the lock and the one that tries to acquire it both belong to the same thread, we're in a deadlock case.

Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
---
test/fiber/test_mutex.rb | 22 +++++++++++++++++++++-
thread_sync.c | 4 ++++
2 files changed, 25 insertions(+), 1 deletion(-)
This broke rspec-support, which has its own reentrant mutex implementation and a corresponding spec. Any hints on how to deal with it?
@ojab could you provide a small reproduction script? I'll happily look at it.
Also isn't
It is AFAIK. But RSpec tries to support very very old Ruby versions (1.8), see rspec/rspec-support#552 (comment). I think we just need to change the RSpec spec here, so it accepts that outcome (in addition to the existing one). Another option would be to use
Self-note: link to my PR fixing ReentrantMutex on Ruby 3.0: rspec/rspec-support#503
Based on the RSpec thread the reason seem to be that they don't want to require anything from
Right, we'd need to make
@casperisfine hopefully reproducer is not needed since

    class ReentrantMutex
      def initialize
        @owner = nil
        @count = 0
        @mutex = Mutex.new
      end

      def synchronize
        enter
        yield
      ensure
        exit
      end

      private

      def enter
        @mutex.lock unless @mutex.owned?
        @count += 1
      end

      def exit
        unless @mutex.owned?
          raise ThreadError, "Attempt to unlock a mutex which is locked by another thread/fiber"
        end
        @count -= 1
        @mutex.unlock if @count == 0
      end
    end

Could you please elaborate why fiber would be stuck? Something like

    mutex = ReentrantMutex.new
    mutex.synchronize do
      f = Fiber.new do
        mutex.synchronize { do_stuff }
      end
      f.resume
      do_other_stuff
    end

looks reasonable for me because we're not trying to acquire the mutex for the second time. And I guess it could work if AFAIU currently there is no way to know if we could lock or unlock the mutex right now without causing ThreadError.
@ojab your repro already fails on 3.0.3. What causes it to fail is not this change but https://bugs.ruby-lang.org/issues/17827
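For readers following along, here is a minimal sketch of why the nested `synchronize` in the repro does try to take the lock a second time. It assumes Ruby 3.0 or later, where `Mutex` ownership is tracked per fiber rather than per thread; the variable names are only illustrative.

```ruby
mutex = Mutex.new
owned_in_root = nil
owned_in_fiber = nil

mutex.synchronize do
  # The root fiber holds the lock here.
  owned_in_root = mutex.owned?
  # A new fiber on the same thread does not own it (assuming Ruby >= 3.0),
  # so a guard like `@mutex.lock unless @mutex.owned?` goes on to call #lock.
  owned_in_fiber = Fiber.new { mutex.owned? }.resume
end

puts "owned in root fiber:   #{owned_in_root}"
puts "owned in nested fiber: #{owned_in_fiber}"
```

Because `owned?` is false inside the nested fiber, `ReentrantMutex#enter` falls through to `@mutex.lock` while the root fiber still holds the lock.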
oh, right, now I got that rspec-support spec is hanging because there is no
It raises `ThreadError` since ruby/ruby#6680
Please accept my apologies if I'm saying some complete nonsense. I wonder if it's an irresolvable deadlock? PS It seems to be a difference when the mutex is locked by the root Fiber, and there are no other Fibers, it might never be resumed. I understand the tradeoff of a possibility of that Fiber may get stuck, instead of failing quickly with a
@pirj The description already mentions that case with the Fiber scheduler, there is no change when there is a Fiber scheduler.
[Bug #19105]
If no fiber scheduler is registered and the fiber that owns the lock and the one that tries to acquire it
both belong to the same thread, we're in a deadlock case.
Ref: https://bugs.ruby-lang.org/issues/17827#note-10
cc @eregon @ioquatix WDYT?
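A minimal sketch of the case described above, assuming a Ruby build that includes this change, no fiber scheduler set on the thread, and `Thread.ignore_deadlock` left at its default of false. The message pattern is the one the updated test expects; the variable names are illustrative.

```ruby
mutex = Mutex.new

result = mutex.synchronize do
  # The root fiber owns the lock. A second fiber on the same thread now
  # tries to acquire it; with no scheduler it could never be resumed by the
  # owner, so (assuming this change is present) #lock raises instead of hanging.
  Fiber.new do
    begin
      mutex.lock
      :locked
    rescue ThreadError => e
      e.message
    end
  end.resume
end

puts result
```

On a Ruby without this change, the same script is expected to block or abort rather than reach the `rescue`, so it is only meaningful on a build that carries the fix.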