Thread#backtrace deadlock number 2 #1952

Closed
thedarkone opened this Issue · 3 comments

2 participants

@thedarkone

Since #1940 has now been fixed, the code in #1888 stopped throwing exceptions, so the deadlocks were no longer reproducible. I've pushed a simple code change that adds an explicit raise call, so the deadlocks are still triggerable on the current Rubinius master.

The code is now in a new thread_safe/rbx-deadlock branch.

The bug is not 100% reproducible, so multiple test runs are now necessary (luckily they are really short this time around).

Relevant information from the previous ticket:

Rubinius version: rubinius 2.0.0dev (1.8.7 897570e8 yyyy-mm-dd JI) [x86_64-apple-darwin10.2.0] (1.9 mode doesn't help).

Rubinius gets stuck during a run of the test_cache_loops.rb stress test of the thread_safe gem. It looks like some kind of deadlock, as the process drops to 0% CPU usage. Since there's no way to get a thread dump from a running Rubinius, I'm using a poor man's version implemented in pure Ruby: here and here. This usually works, but in this case the VM goes into an infinite loop (pegging a single core at 100%) on the t.backtrace.join("\n") line.
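For context, a poor man's thread dump along the lines the report describes can be sketched in a few lines of pure Ruby (the actual helpers linked in the ticket are not reproduced here; this is just the general shape). On the affected Rubinius build, the `t.backtrace.join("\n")` call below is where the VM spins forever:

```ruby
# Minimal pure-Ruby "thread dump": walk every live thread and print its
# backtrace. Thread#backtrace returns nil for threads that haven't started
# or have already died, so guard against that.
def dump_all_threads(io = $stderr)
  Thread.list.each do |t|
    io.puts t.inspect
    bt = t.backtrace
    io.puts(bt ? bt.join("\n") : "  <no backtrace available>")
  end
end
```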

What happens during a test run is this: an exception is thrown by compute_if_absent here and caught by this code; as soon as the rescuing thread tries to get e.backtrace, that thread gets deadlocked (Bug A). All the other worker threads then finish and Rubinius drops to 0% CPU usage. There is a watchdog thread that wakes up every 20 seconds to check that the worker threads are making progress; it discovers that no progress is being made, prints out _compute_if_absent_loop_outer: STUCK!!!!, dumps the contents of the concurrent cache, and then iterates over all existing threads trying to print their backtraces. It gets stuck right at the first thread (#<Thread:0x2f98 id=1 sleep>), going into an infinite loop chewing 100% of a single core (Bug B).
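The watchdog pattern described above can be sketched as follows (the names `watch_progress`, `reader` and `on_stuck` are illustrative, not the gem's actual test code): a monitor thread wakes up periodically, samples a progress counter, and fires a callback when two consecutive samples are equal.

```ruby
# Illustrative watchdog sketch: `reader` returns a monotonically increasing
# progress counter maintained by the worker threads; `on_stuck` is fired when
# no progress happened between two consecutive samples. In the report this is
# where "STUCK!!!!" is printed and all thread backtraces are dumped (which is
# what then triggers Bug B on the affected Rubinius build).
def watch_progress(interval:, reader:, on_stuck:)
  Thread.new do
    last = nil
    loop do
      sleep interval
      current = reader.call
      # Skip the very first sample; after that, equal samples mean "stuck".
      on_stuck.call if !last.nil? && current == last
      last = current
    end
  end
end
```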

To reproduce:
git clone git://github.com/thedarkone/thread_safe.git
cd thread_safe
git checkout rbx-deadlock
ruby -rubygems -Ilib ./test/test_cache_loops.rb

For a speedier run the THREAD_COUNT const can be bumped down somewhat; I'm still getting full reproducibility with THREAD_COUNT = 8 on a dual-core machine.

@dbussink dbussink added a commit that referenced this issue
@dbussink dbussink Fix race condition when shutting down
When we request an auxiliary thread to shut down, another thread may still
be waiting to do GC. Therefore we need to guard here so we don't deadlock
when somebody requested a GC while we're also shutting down.

Related to #1952
3fc1656
@dbussink
Rubinius member

Do you still see the issue after this latest change? I don't get a hang anymore here locally.

@thedarkone

It is still hanging for me:

$ ../../rubinius/bin/rbx --version -Ilib ./test/test_cache_loops.rb
rubinius 2.0.0dev (1.8.7 3fc16567 yyyy-mm-dd JI) [x86_64-apple-darwin10.2.0]
Loaded suite ./test/test_cache_loops
Started
_add_remove_loop_outer <--- gets deadlocked on `Exception#backtrace`, CPU usage drops to 0%
_add_remove_loop_outer: 0
_add_remove_loop_outer: STUCK!!!!
table.size 16, size_control: 12 adder: #<ThreadSafe::Util::Adder:0x1f3c @__base=#<Rubinius::AtomicReference:0x1f40> @__busy=#<Rubinius::AtomicReference:0x1f44> @__cells=#<Rubinius::AtomicReference:0x1f48>>[0, ]
nil
nil
<...snip>
nil
nil
#<Thread:0x1fd4 id=1 sleep> <---- rbx's CPU usage goes to 100%

I also get this crash: https://gist.github.com/79e295f51de96d56ec01

@thedarkone

@dbussink good news, the deadlocks appear to be completely gone on the current 15acea6 master!

There is still a very hard to reproduce VM crash (about 1 in 50 runs): https://gist.github.com/bc68e3cf3e45f9498db9

There also seems to be a concurrent autoload bug (as indicated by the NoMethodError: undefined method `get' on ThreadSafe::Util::XorShiftRandom (Module) error); rbx's autoload should be thread-safe, right?
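The failure mode hinted at above can be sketched like this (a hypothetical repro, not the gem's code: `LazyMod` and the temp file are mine). Several threads race to reference a constant registered via Kernel#autoload; on a VM whose autoload is not thread-safe, some threads can observe the module before its methods are defined, raising a NoMethodError much like the `get' one above. On MRI (where autoload is thread-safe) this passes cleanly:

```ruby
require "tempfile"

# Write a module definition to a temp file and register it for autoload.
file = Tempfile.new(["lazy_mod", ".rb"])
file.write(<<~RUBY)
  module LazyMod
    def self.get
      42
    end
  end
RUBY
file.close

autoload :LazyMod, file.path

# All threads hit the not-yet-loaded constant at once, so they all race to
# trigger the same require. A racy autoload can hand some of them a
# half-defined LazyMod.
results = 8.times.map do
  Thread.new { LazyMod.get }
end.map(&:value)

file.unlink
```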

As to the deadlock fix, git bisecting tells me this:

  • it started to complain about Mutex invalid (Thread corrupt?) (usually printed 2-4 times) at some point, though I don't have the exact commit,
  • then 640480d replaced the deadlocks with pretty frequent VM crashes,
  • 15acea6 fixed everything, except that very hard to repro crash mentioned above.

[EDIT] fixed commit SHAs

@dbussink dbussink added a commit that closed this issue
@dbussink dbussink Fix mutex busy when shutting down a thread
When a thread was cleaned up, it could be that the VM lock object was
still locked in VM::wakeup. We now lock around the discard code to grab
this lock.

This means we have to change the code that releases contention, since if
it's called when the VM lock is still taken, it might deadlock.

Fixes #1952
5cc67c6
@dbussink dbussink closed this in 5cc67c6