Possible deadlock when expiring #60

eric · 2011-08-30T17:40:15Z

I was running some tests and ran into what looks like a deadlock while expiring.

Here's a dump of all the threads: https://gist.github.com/ded3fce993ec91bbc2bf

Here's my kestrel config: https://gist.github.com/a757e1eb26198e56e989

jeffstyr · 2011-09-01T16:28:50Z

I have a fix for this (I think it's the same issue)--I ran into it recently as well. I'll get the analysis/fix written up today.

robey · 2011-09-01T17:59:05Z

doh, i think i even fixed this once before. i see the problem: flushExpired is grabbing a lock it doesn't really need.

robey · 2011-09-01T18:02:36Z

i put a potential fix on branch "deadlock". jeff, is that similar to what you did?

jeffstyr · 2011-09-01T19:55:45Z

Heh it's funny, I have that exact same commit off on a branch (came up with it while looking into this too), but I don't think that fixes this problem actually (though it's a good change anyway). In this case the deadlock involves the lock on the PersistentQueue, not the one on the QueueCollection.

Here's my analysis:

The problem comes from three different things colliding, basically:

If you have syncJournal set (not zero and not Duration.MaxValue), then after writing to the journal you block on a future waiting for the fsync to happen. You are holding a lock on the PersistentQueue (appropriately) while this is happening.
If the expiration timer fires at this point, it's going to block on the PersistentQueue lock (again, appropriately), since the above thread is holding the lock.
Since HashedWheelTimer is single-threaded, getting blocked in 2 is preventing it from ever doing the fsync. (Even though expiration has nothing to do with fsync-ing, it's the same single timer thread which is doing both jobs.)

So thread 1 is holding a lock and blocking on a CountDownLatch, and thread 2 is blocked on that lock and it's the thread that would decrement the CountDownLatch.

It's easy to reproduce actually--if you set syncJournal to larger value that expirationTimerFrequency (say, 2 seconds and 1 second, respectively), then if you do a few adds and then a few removes, you'll hit it right away. (That not the typical config setup of course, but it makes it reproduce basically every time, rather than sporadically.)

My fix was to comment out the future() line in Journal.write(); this parallels the similar commented-out line in PersistentQueue.add(). (This is on the "sync-journal-deadlock-fix" branch on my jeffstyr/kestrel fork.) Before commenting this out, I think we're in the peculiar position of blocking on the sync for removes but not for adds. (IIRC that's what I concluded--that's why you get stuck in remove but not in add.)

I think that's the right fix, but an alternative fix is just to use a multi-threaded timer. I have an implementation of that on my "scheduled-thread-pool-executor" branch, and it seems to work too, but it also seems like a bigger change. That implements a org.jboss.netty.util.Timer via a java.util.concurrent.ScheduledThreadPoolExecutor (to replace org.jboss.netty.util.HashedWheelTimer). I'm not sure of the point of HashedWheelTimer, so I don't know what the other implications might be.

jeffstyr · 2011-09-01T19:58:27Z

I'll send pull requests for those two branches, to make them easier to get to.

robey · 2011-09-02T00:55:05Z

oof, that certainly explains things.

HashedWheelTimer is just a relatively fast timer -- it sleeps a constant time (10 msec in our case) between each check, so it fuzzes out accuracy in exchange for speed.

the ScheduledThreadPoolExecutor looks like a pretty big change, you're right. :)

maybe the expiration timer should be in its own thread. then it wouldn't conflict with the timer events that are just doing an fsync.

jeffstyr · 2011-09-02T01:25:51Z

Yeah that would work too I'd expect.

In concept I'm not sure sure if it makes sense to ever wait on a future for journal fsync. My thinking is that if you, say, set the sync period to be something like 1 minute, with the intention of picking a compromise of performance v. durability favoring the performance side, you'd actually end up going slower than with a fsync period of 0 (if anything waits on the future), because each transaction would wait (up to) 1 minute for its fsync. I'd think that any non-0 fsync period implies not waiting on the fsync, with ever-higher values giving you ever-higher performance. (Right now I think that intermediate values between 0 and Duration.MaxValue actually end up being slower that either extreme--if you wait on the future anywhere.)

Does that make sense?

(I wasn't sure if HashedWheelTimer is supposed to perform better than the java.util.concurrent timer mechanisms, or if it just predated them. It certainly sounds fancy though.) But yeah the ScheduledThreadPoolExecutor was mostly an experiment to see if it would work.

robey · 2011-09-02T01:33:24Z

easily reproduced this by configuring queue "slow" to have

syncJournal = 10.milliseconds

and running

$ sbt "flood -q slow -n 100"

eric · 2011-09-02T01:43:49Z

I believe I agree that I wouldn't expect to be waiting on an fsync if I set it to non-0.

robey · 2011-09-02T01:50:36Z

it's true that the journal shouldn't force you to wait for the fsync on remove since it doesn't do that on add.

robey · 2011-09-02T02:36:42Z

okay, i moved the timer to a separate thread, and fixed the remaining journal operations to avoid waiting if the fsync is happening on a timer.

the timer thread move fixed the "flood" test above for me. the fsync change is a little riskier, but i convinced myself that this is remainder code from before you could set fsync on a timer, so it shouldn't hurt anything.

the branch is "futures". i'd love comments/review.

eric · 2011-09-02T06:50:40Z

These changes make a lot of sense to me. I like that the journal fsyncing decision is now consistently made in PeriodicSyncFile.

👍

jeffstyr · 2011-09-02T07:33:59Z

It all looks good to me. And PeriodicBackgroundProcess sounds cool, I'll have to check that out.

(On a side note I just noticed com.twitter.util.ScheduledThreadPoolTimer. It's not the same thing I did, since it doesn't implement the Netty timer interface, but similar in spirit.)

jeffstyr · 2011-09-02T17:21:50Z

One question/thought: How does PeriodicBackgroundProcess play when you are using PersistentQueue in the embedded case (kestrel-as-a-library)?

Actually, looking at the code I guess in the embedded case you don't get background (while-queue-idle) expiration currently anyway, which I think makes sense.

robey · 2011-09-02T19:41:44Z

yeah, i think the background thread only makes sense if you have a whole server. my assumption is that people using kestrel as a library are really just using PersistentQueue directly.

robey · 2011-09-02T21:32:16Z

merged!

eric · 2011-09-02T21:39:21Z

Sweet. Is this going to be 2.1.1?

robey · 2011-09-02T22:50:43Z

definitely.

jeffstyr · 2011-09-12T19:11:06Z

BTW I ran into a deadlock (https://gist.github.com/1212071) which I'm pretty sure is fixed by your first fix above (on branch "deadlock", about not holding onto a lock during QueueCollection.flushExpired()). We'll see if it stops happening now (I hadn't deployed that fix yet).

The way it manifested was that kestrel would run out of FDs (because Netty kept accepting connections but the processor threads where blocked, I guess).

robey closed this as completed Sep 2, 2011

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible deadlock when expiring #60

Possible deadlock when expiring #60

eric commented Aug 30, 2011

jeffstyr commented Sep 1, 2011

robey commented Sep 1, 2011

robey commented Sep 1, 2011

jeffstyr commented Sep 1, 2011

jeffstyr commented Sep 1, 2011

robey commented Sep 2, 2011

jeffstyr commented Sep 2, 2011

robey commented Sep 2, 2011

eric commented Sep 2, 2011

robey commented Sep 2, 2011

robey commented Sep 2, 2011

eric commented Sep 2, 2011

jeffstyr commented Sep 2, 2011

jeffstyr commented Sep 2, 2011

robey commented Sep 2, 2011

robey commented Sep 2, 2011

eric commented Sep 2, 2011

robey commented Sep 2, 2011

jeffstyr commented Sep 12, 2011

Possible deadlock when expiring #60

Possible deadlock when expiring #60

Comments

eric commented Aug 30, 2011

jeffstyr commented Sep 1, 2011

robey commented Sep 1, 2011

robey commented Sep 1, 2011

jeffstyr commented Sep 1, 2011

jeffstyr commented Sep 1, 2011

robey commented Sep 2, 2011

jeffstyr commented Sep 2, 2011

robey commented Sep 2, 2011

eric commented Sep 2, 2011

robey commented Sep 2, 2011

robey commented Sep 2, 2011

eric commented Sep 2, 2011

jeffstyr commented Sep 2, 2011

jeffstyr commented Sep 2, 2011

robey commented Sep 2, 2011

robey commented Sep 2, 2011

eric commented Sep 2, 2011

robey commented Sep 2, 2011

jeffstyr commented Sep 12, 2011