Completely fix deadlocks of Thread#raise #1885

Merged
merged 1 commit on Sep 27, 2012

3 participants

@ryoqun
Rubinius member

For details see the commit message.

I have a question. As far as I understand, waiting_channel_ seems to be doing something wrong, so I just removed it, and there are no spec failures. However, is it really OK to remove it, or should I take a different approach? Also, I can't really tell why waiting_channel_ exists in the first place or what use cases it's for. I'd like to be enlightened :)

@dbussink
Rubinius member

Actually, one of the things we should probably do is stop using Channel as a locking mechanism there and instead switch to using an object as a lock via Rubinius.lock / Rubinius.unlock / Rubinius.synchronized. These work pretty much the same as object locks in Java.

The reason that this is still using Channel is that when this was written, we didn't have the concept of being able to use an object as a lock yet.
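
Roughly, the object-as-lock pattern looks like the sketch below. The Mailbox class and its instance variables are made up purely for illustration; only Rubinius.lock / Rubinius.unlock are taken from the comment above:

  class Mailbox
    def initialize
      @lock = Object.new   # any object can act as the lock, much like in Java
      @messages = []
    end

    def deliver(message)
      Rubinius.lock(@lock)
      begin
        @messages << message
      ensure
        Rubinius.unlock(@lock)  # always release, even if the code above raises
      end
    end
  end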

@ryoqun
Rubinius member

Thanks a lot for the explanation!

Then, after this is merged, I'll transition from Channel to Rubinius.lock and friends. Is that OK, or should I do the transition first?

@dbussink
Rubinius member

I don't know whether doing this first would be worth it then, especially if it adds primitives etc. that would have to be removed afterwards. I'd rather focus effort on making a good implementation using the mechanisms we have available now :).

Please do the transition as a pull request too, so we can have some more eyeballs on it before we merge it.

@travisbot

This pull request fails (merged 03eb5e65 into f926ae4).

@ryoqun
Rubinius member

I see. Then I'll open a new pull request for the transition and update this pull request after that. :D

@dbussink
Rubinius member

Do you want to update this pull request or should we open a new one for this? Or is this no longer an issue because of the already merged locking changes?

@ryoqun ryoqun Completely fix deadlocks of Thread#raise
In some cases, an exception isn't correctly raised by Thread#raise. When that
happens, Thread#join on that thread leads to a deadlock.

Consider the following example:

  thread = Thread.new do
  end
  thread.raise(RuntimeError) # interrupts thread incorrectly
  begin
    thread.join # thread is dead, but thread.join never returns; deadlock
  rescue
  end

There are two problems in the current code:

  1. raise_prim in Thread#raise isn't inside Rubinius.{lock,unlock}.
  2. The thread execution cleanup doesn't lock correctly.

In the end, there should be only the following three good timings:

  Note: 'prepare' means 'prepare to raise in the new thread from the main thread'

               Time
               ------------------------------------------------------->
  main thread  prepare - wakeup |
  new thread                    | call (interrupted) | lock - cleanup

               Time
               ------------------------------------------------------->
  main thread       | prepare - wakeup |
  new thread   call |                  | lock - cleanup

               Time
               ------------------------------------------------------->
  main thread                        | prepare (no-op) - wakeup (no-op)
  new thread   call | lock - cleanup |

To fix these two problems, this commit does the following, respectively:

  1. Put raise_prim in Thread#raise under Rubinius.{lock,unlock} (see the
     sketch below).
  2. Introduce Rubinius.uninterrupted_lock and Rubinius.check_interrupts to
     make the thread execution cleanup lock correctly.
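
As a rough sketch of what fix 1 amounts to (argument handling and the rest of
the kernel code are omitted; only raise_prim and Rubinius.{lock,unlock} are
taken from the points above):

  class Thread
    # Heavily simplified; argument checks and edge cases omitted
    def raise(exc)
      Rubinius.lock(self)     # take the same lock the cleanup path uses
      begin
        raise_prim exc        # interrupt the target thread while holding the lock
      ensure
        Rubinius.unlock(self)
      end
    end
  end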
227e976
@ryoqun
Rubinius member

@dbussink This isn't resolved by the locking changes.

Sorry for the delay. I was going slowly with the updates because, while updating and reviewing again, I noticed three problems:

  1. I'm fairly sure my reasoning about why raise_prim exc needs to be moved is correct; however, I couldn't reproduce a situation that verifies it. The timing requirements are too strict... If I'm missing something, please let me know.

  2. Well, there are still deadlocks in extreme cases. Specifically, I'm aware that if a thread's ID exceeds around 2^24, a deadlock happens: the thread ID is crammed into ObjectHeader, whose bit width for it is exactly 24 bits, so it overflows, and thread IDs just keep incrementing as new threads are created. To fix this, we should probably recycle unused thread IDs (see the sketch after this comment).

  3. There seems to be some performance degradation in Rubinius.{lock,unlock} compared to Channel#{receive,send}.

I don't think these problems are worth blocking this pull request over. Maybe we are better off merging this and tackling them later on... :)
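
A hypothetical sketch of the ID recycling mentioned in point 2 (the class, its names, and the Ruby-level placement are made up; a real fix would live in the VM):

  # Hands out IDs that fit in the 24-bit ObjectHeader field and reuses freed ones.
  class ThreadIdAllocator
    MAX_ID = (1 << 24) - 1

    def initialize
      @next_id = 1
      @free_ids = []
    end

    def allocate
      return @free_ids.pop unless @free_ids.empty?
      raise "thread id space exhausted" if @next_id > MAX_ID
      id = @next_id
      @next_id += 1
      id
    end

    def release(id)
      @free_ids << id   # make the ID available again when its thread dies
    end
  end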

@dbussink
Rubinius member

Please don't worry about being late; we welcome any contribution :). That was just me being curious / impatient to see where this was going :).

The performance regression sounds strange to me. The Rubinius.lock / unlock path should be pretty fast, especially in the non-contended case. We have optimized that case, so it should perform well. Sounds like something we'd want to investigate separately to see where the regression comes from.

We do indeed have a problem with thread id reuse. I've looked into this and was able to hack in a basic version that does it cheaply, but I didn't finish that work. We actually have the same problem with #object_id, which can overflow too. It might be a good idea to finish it so we can reuse both thread ids and object ids.

@ryoqun
Rubinius member

You're right about the performance regression. What I found is that Rubinius.lock / unlock is slower than Channel in the contended case. I created a script to show this (https://gist.github.com/3728542) and here is the result (https://gist.github.com/3728559).

I found this with some convoluted code that runs a very tight loop to hunt for deadlocks, so I don't think it's a problem in the wild.
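
The script in the gist isn't reproduced here, but a contended micro-benchmark along these lines (a hypothetical sketch, not the actual gist) shows the shape of the measurement:

  require 'benchmark'

  lock = Object.new
  counter = 0

  # Several threads fighting over a single lock, so most acquisitions are contended.
  time = Benchmark.realtime do
    threads = 4.times.map do
      Thread.new do
        100_000.times do
          Rubinius.lock(lock)
          counter += 1
          Rubinius.unlock(lock)
        end
      end
    end
    threads.each(&:join)
  end

  puts "contended Rubinius.lock/unlock: #{time}s"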

Also, thanks for the info about the id reuse issue!

@dbussink dbussink merged commit ad9cf4a into rubinius:master Sep 27, 2012

1 check failed: The Travis build failed
@dbussink
Rubinius member

Ok, merged this. I do think that we should look into why Rubinius.lock is slower than Channel, but that's a different issue.

@dbussink
Rubinius member

I wonder if this build failure is related to this or to the changes to use Rubinius.lock:

http://travis-ci.org/#!/rubinius/rubinius/jobs/2579709

@ryoqun
Rubinius member

Hehe, thanks for merging!

I'll look into this if this isn't solved yet.

@dbussink
Rubinius member

I only just noticed it after a build failure alert from Travis

@ryoqun
Rubinius member

I couldn't reproduce the error on my local machine... Also, looking at the source code around the error, I couldn't learn more than that pthread_mutex_lock failed...

I think we need to see whether this occurs frequently on Travis...

@dbussink
Rubinius member

Hmm, looks like I've caught a deadlock:

https://gist.github.com/76847b577ccce22c95a5
