
mutex: Raise a ThreadError when detecting a fiber deadlock #6680

Merged: 1 commit merged into ruby:master on Nov 8, 2022

Conversation

@casperisfine (Contributor) commented Nov 5, 2022:

[Bug #19105]

If no fiber scheduler is registered and the fiber that owns the lock and the one trying to acquire it both belong to the same thread, we're in a deadlock case.

Ref: https://bugs.ruby-lang.org/issues/17827#note-10
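The failure mode can be sketched with a minimal script (illustrative, not taken from the PR's test suite; assumes a Ruby that includes this change, i.e. 3.2+):

```ruby
# No fiber scheduler is registered, so both fibers share one thread and
# nothing can ever release the lock while the child fiber blocks on it.
mutex = Mutex.new
mutex.lock # owned by the root fiber

error = nil
Fiber.new do
  mutex.lock # with this patch: raises ThreadError instead of deadlocking
rescue ThreadError => e
  error = e
end.resume

p error.class #=> ThreadError
```

Before this change, the second `lock` would block the only thread forever (or trip the global deadlock detector) instead of failing fast.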

cc @eregon @ioquatix WDYT?

@eregon (Member) left a comment:

Looks good to me!

test/fiber/test_mutex.rb (outdated)
@@ -194,7 +194,7 @@ def test_queue_pop_waits
end

def test_mutex_deadlock
-    error_pattern = /No live threads left. Deadlock\?/
+    error_pattern = /lock already owned by another fiber/
@eregon (Member):

FWIW the No live threads left. Deadlock? error is actually an eFatal IIRC.
But I think using ThreadError as you did seems fine too.

There is some other detection using that message too, though; maybe it would be good to be consistent with it in the exception class?

@eregon (Member):

Probably we also need to respect GET_THREAD()->vm->thread_ignore_deadlock (which defaults to false) like in rb_check_deadlock(), i.e., the lock could be interrupted by a signal and unblock things.

@casperisfine (Contributor Author):

FWIW the No live threads left. Deadlock? error is actually an eFatal IIRC.

Interesting, the other deadlock-related errors in this function are all rb_eThreadError.

Probably we also need to respect GET_THREAD()->vm->thread_ignore_deadlock

Done.

@eregon (Member) commented Nov 5, 2022:

Interesting, the other deadlock-related errors in this function are all rb_eThreadError.

Right, ThreadError seems fair then (especially since it's quite similar to the "deadlock; recursive locking" case).

[Bug #19105]

If no fiber scheduler is registered and the fiber that owns the lock
and the one trying to acquire it both belong to the same thread,
we're in a deadlock case.
@ioquatix (Member) left a comment:

I would also suggest adding a test using Fiber.blocking{}. We don't always go through the scheduler code path, even when a scheduler is present, if the fiber is blocking. I believe in this case it should still error out, but without writing the code & test myself, I am not 100% certain.

@casperisfine (Contributor Author):

I would also suggest adding a test using Fiber.blocking{}.

I need to read up on that one. I'm not familiar with Fiber.blocking.

@ioquatix (Member) commented Nov 6, 2022:

Fiber.blocking disables the fiber scheduler hooks (by disabling the fiber scheduler for the duration of that call).

@ioquatix ioquatix merged commit eacedcf into ruby:master Nov 8, 2022
@technicalpickles (Contributor):

Would it be possible to get a 3.0 and/or 3.1 release with this fix in it? This is blocking our upgrade from 2.7. Happy to help however I can!

@eregon (Member) commented Nov 8, 2022:

@technicalpickles The ticket is already marked for backport. But I don't see how this would unblock an upgrade: it will raise instead of deadlocking, but the code trying to relock the Mutex on the same thread still won't work correctly.

@technicalpickles (Contributor):

@eregon Apologies! I'm not familiar with how backporting and releases happen. Thanks for confirming!

This will help us flush out where the deadlock is happening and confirm when we've fixed it.

@casperisfine casperisfine deleted the monitor-deadlock branch November 8, 2022 21:17
@casperisfine (Contributor Author):

Yeah, backporting is tracked on Redmine and handled by the release maintainer for each version:

[screenshot: Redmine backport status, 2022-11-09]

A new version will be cut eventually, but probably not anytime soon. I don't know how hard it is with your infra, but I suggest applying the patch yourself when you compile your Ruby.

matzbot pushed a commit that referenced this pull request Nov 13, 2022
	mutex: Raise a ThreadError when detecting a fiber deadlock (#6680)

	[Bug #19105]

	If no fiber scheduler is registered and the fiber that owns the lock
	and the one trying to acquire it both belong to the same thread,
	we're in a deadlock case.

	Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
	---
	 test/fiber/test_mutex.rb | 22 +++++++++++++++++++++-
	 thread_sync.c            |  4 ++++
	 2 files changed, 25 insertions(+), 1 deletion(-)
@ojab commented Nov 17, 2022:

This broke rspec-support: it has its own implementation of a reentrant mutex and a corresponding spec.
Right now it fails on @mutex.lock unless @mutex.owned? inside the Fiber, because it now raises an exception instead of waiting for the lock.

Any hint on how to deal with it?

@casperisfine (Contributor Author):

@ojab could you provide a small reproduction script? I'll happily look at it.

@casperisfine (Contributor Author):

Also isn't ReentrantMutex basically a Monitor?
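For context, Monitor from the stdlib is indeed reentrant, which is what ReentrantMutex reimplements. An illustrative snippet (not from the PR):

```ruby
require "monitor"

mon = Monitor.new
# A Monitor may be re-acquired by the fiber that already holds it,
# unlike a plain Mutex, whose relock raises ThreadError.
result = mon.synchronize do
  mon.synchronize { :reentered }
end
p result #=> :reentered
```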

@eregon (Member) commented Nov 17, 2022:

Also isn't ReentrantMutex basically a Monitor?

It is AFAIK. But RSpec tries to support very very old Ruby versions (1.8), see rspec/rspec-support#552 (comment).

I think we just need to change the RSpec spec here so that it accepts this outcome (in addition to the existing one).
That spec would be stuck forever, except that it gets itself unstuck via Thread#raise.
It seems a pretty artificial case, which makes sense as a test but not in real code.

Another option would be to use Thread.ignore_deadlock = true in that spec, that's designed for the case a deadlock is solved by Thread#raise or a signal, so it seems fair enough.
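A sketch of that option (variable names and timing are illustrative; assumes a Ruby where the fiber deadlock check respects Thread.ignore_deadlock, i.e. 3.2+ with this patch):

```ruby
# With Thread.ignore_deadlock = true, a fiber blocking on a Mutex held by
# another fiber of the same thread waits instead of raising ThreadError,
# on the premise that something external (Thread#raise, a signal) will
# unstick it.
Thread.ignore_deadlock = true

mutex = Mutex.new
mutex.lock # held by the root fiber

main = Thread.current
rescuer = Thread.new do
  sleep 0.2
  main.raise(Interrupt) # unstick the blocked fiber from outside
end

result =
  begin
    Fiber.new { mutex.lock }.resume # blocks rather than raising
    :locked
  rescue Interrupt
    :interrupted
  end

rescuer.join
Thread.ignore_deadlock = false
p result #=> :interrupted
```

The Interrupt is delivered to the thread's currently blocked fiber and propagates out of Fiber#resume, which is exactly the "deadlock resolved externally" scenario this flag is designed for.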

Self-note: link to my PR fixing ReentrantMutex on Ruby 3.0: rspec/rspec-support#503

@casperisfine (Contributor Author):

RSpec tries to support very very old Ruby versions (1.8)

Monitor was present in 1.4: https://github.com/ruby/ruby/blame/a36e0c78c90917c4d5cc78f67b3808913795f264/lib/monitor.rb

Based on the RSpec thread, the reason seems to be that they don't want to require anything from the stdlib, so as not to impact the tested code.

@eregon (Member) commented Nov 17, 2022:

Right, we'd need to make Monitor core then for RSpec to use it. But anyway RSpec::Support::ReentrantMutex is fine and fairly simple (in fact much simpler than Monitor).

@ojab commented Nov 18, 2022:

@casperisfine hopefully a reproducer is not needed, since ReentrantMutex is pretty simple:

class ReentrantMutex
  def initialize
    @owner = nil
    @count = 0
    @mutex = Mutex.new
  end

  def synchronize
    enter
    yield
  ensure
    exit
  end

  private

  def enter
    @mutex.lock unless @mutex.owned?
    @count += 1
  end

  def exit
    unless @mutex.owned?
      raise ThreadError, "Attempt to unlock a mutex which is locked by another thread/fiber"
    end
    @count -= 1
    @mutex.unlock if @count == 0
  end
end

Could you please elaborate on why the fiber would be stuck? Something like

mutex = ReentrantMutex.new

mutex.synchronize do 
  f = Fiber.new do 
    mutex.synchronize { do_stuff }
  end
  f.resume
  do_other_stuff
end

looks reasonable to me, because we're not trying to acquire the mutex a second time. And I guess it could work if Mutex#owned? returned true inside Fiber.new, which is not the case now.

AFAIU there is currently no way to know whether we could lock or unlock the mutex right now without causing a ThreadError.
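The behavior described here follows from the per-fiber ownership introduced for [Bug #17827]: since Ruby 3.0, Mutex#owned? is scoped to the current fiber, so it returns false inside a child fiber even on the thread that holds the lock. An illustrative check (not from the thread):

```ruby
mutex = Mutex.new
mutex.lock # owned by the root fiber

inside = nil
Fiber.new { inside = mutex.owned? }.resume

p mutex.owned? #=> true  (we are the owning fiber)
p inside       #=> false (ownership does not extend to other fibers)
```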

@casperisfine (Contributor Author):

@ojab your repro already fails on 3.0.3. What causes it to fail is not this change but https://bugs.ruby-lang.org/issues/17827

@ojab commented Nov 18, 2022:

Oh, right: now I get that the rspec-support spec is hanging because there is no f.resume there. Thanks for the explanation, and sorry for the noise.

ojab added a commit to ojab/rspec-support that referenced this pull request Nov 18, 2022
pirj pushed a commit to ojab/rspec-support that referenced this pull request Nov 19, 2022
@pirj commented Dec 29, 2022:

Please accept my apologies if I'm saying some complete nonsense, but I wonder: is it really an irresolvable deadlock?
What if an attempt to lock a mutex owned by a different Fiber of the same Thread would just yield?

PS There seems to be a difference when the mutex is locked by the root Fiber: if there are no other Fibers, it might never be resumed.

I understand the tradeoff: the possibility that the Fiber gets stuck, instead of failing quickly with a ThreadError.

@eregon (Member) commented Dec 29, 2022:

@pirj The description already mentions that case with the Fiber scheduler; there is no change when a Fiber scheduler is present.
When there is no Fiber scheduler, switching Fibers on Mutex#lock would be a semantically illegal change.
