
Make connection pool fair #6488

Closed
wants to merge 8 commits

4 participants

@pmahoney

This is a second attempt at #6416.

It makes the connection pool "fair" with respect to waiting threads. I've done some more measurements here: http://polycrystal.org/2012/05/24/activerecord_connection_pool_fairness.html

The patch is also cleaned up compared to the first attempt; the code is much more readable.

It includes some test fixes from @yahonda for failures that this patch triggered (though the failures seem unrelated to the pool code itself).

I am still getting test failures, but I see the same failures against master: https://gist.github.com/2788538 And none of these seem related to the connection pool.

@jrochkind

Awesome. This definitely deals with some troubles i've been having.

  1. We don't need strict fairness: if two connections become available at the same time, it's fine if two threads that were waiting acquire the connections out of order.

    What keeps you from being strictly fair? Strict fairness (or close to it, barring edge cases) would work out even better for me, although this will still be a huge improvement. In our previous conversation I think you said that if multiple threads were waiting, #signal would always wake up the one that had been waiting longest (verified in both JRuby and MRI?), so what prevents strict fairness from being implemented? (The kind of fairness enforced here is still much better, and possibly good enough even for my use case; I'm just curious whether it can be improved further.)

    • I notice that even though your comments say you don't care about strict fairness, your test actually does verify strict order with the order test, no? Are the comments outdated, and is strict fairness really being guaranteed by the test?
  2. What's the @cond.broadcast needed for? What was failing without it? Is it related to the first point above? I ask because previous implementations (3-2-stable as well as master) did not use a broadcast, and that didn't seem to cause any problems: the threads that ended up waiting indefinitely in master were not caused by a lack of broadcast; they were caused by the situation you fixed with @num_waiting and your semi-fair guarantee, as well as by code that didn't keep track of total time waited, so threads would keep loop-waiting indefinitely when other threads 'stole' connections.

@pmahoney

I think you said that if multiple threads were waiting, #signal would always wake up the one that was waiting the longest (verified in both jruby and mri?) -- what prevents strict fairness from being implemented?

Yes, that is true. By "not strict" I mean that if two connections become available at the same time, and thread1 and thread2 are waiting in line, the order in which they re-acquire the monitor is not guaranteed (but thread3 will not be able to "steal" because the @num_waiting check forces it to wait).
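Here's a minimal sketch of that protocol (illustrative only, not the literal patch; it reuses the names @num_waiting and @cond from this discussion):

```ruby
require 'monitor'

# Sketch of a semi-fair connection queue: a latecomer may only skip the
# line when there are more free connections than waiting threads.
class FairQueue
  include MonitorMixin

  def initialize(conns = [])
    super()
    @queue       = conns
    @num_waiting = 0
    @cond        = new_cond
  end

  def add(conn)
    synchronize do
      @queue.push conn
      @cond.signal            # wakes the longest-waiting thread (per this thread)
    end
  end

  def poll
    synchronize do
      # Fast path: only allowed when taking a connection cannot starve a waiter.
      return @queue.pop if @queue.size > @num_waiting

      @num_waiting += 1
      begin
        loop do
          @cond.wait                       # always queue up at least once
          return @queue.pop if @queue.any?
        end
      ensure
        @num_waiting -= 1
      end
    end
  end
end
```

The real patch also tracks total time waited so a timeout can't be dodged by repeated wake-ups; that bookkeeping (and error handling) is omitted here.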

What's the @cond.broadcast needed for?

I don't see this in the diff. There was a broadcast in the original patch, but this new one should have removed it, unless I missed one.

@jrochkind

Aha, so if two connections become available more or less at once, it's not guaranteed whether thread1 or thread2 goes first, but they are both guaranteed to get a connection ahead of thread3? If that's so, that's plenty good enough.

I don't see this in the diff. There was a broadcast in the original patch,

here is where I see it. I see now it's actually only in the #reap implementation. I already don't trust the semantics of the reap implementation, and lack of fairness when reaping (which, if it works right, ought to apply only to code that violates AR's contract in the first place) is not too much of a concern.

But I think it could be replaced by counting up how many times the reaper reaped, and doing that many signals instead of a broadcast. Would that be better?
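Sketched out, that suggestion might look something like this (hypothetical, not code from this patch; dead? is a stand-in for whatever staleness check the reaper uses):

```ruby
def reap
  synchronize do
    dead = @connections.select { |conn| dead?(conn) } # dead? is a stand-in
    dead.each { |conn| remove conn }
    dead.size.times { @cond.signal } # one wake-up per reaped connection,
  end                                # rather than broadcasting to everyone
end
```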

@jrochkind

I'm actually still confused about the fairness.

  • If thread1, thread2, and thread3 are all waiting (yep, thread3 is already waiting too)
  • and then two connections become avail at more or less the same time
  • are both thread1 and thread2 guaranteed to get a connection before thread3 (which was also waiting?)

Your order check in the test seems to guarantee this is in fact true, I think?

@pmahoney

Ah. It was removed here. The combined diff for the pull request is better: https://github.com/rails/rails/pull/6488/files

... counting up how many times the reaper reaped ...

That's what @available.add checkout_new_connection if @available.any_waiting? (in #remove, which is called by #reap) is supposed to do, though I admit I have not done any testing of the reaper. The reaper attempts to remove stale connections, so I then attempt to create new ones to replace those that have been removed. But what happens if someone checks in a presumed-leaked connection that has already been removed? Ugh.
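Roughly this shape, in other words (a sketch of what I'm describing, not the literal diff):

```ruby
# Called by #reap when a stale connection is evicted.
def remove(conn)
  synchronize do
    @connections.delete conn
    @available.delete conn
    # If threads are queued, hand the queue a fresh connection so a
    # waiter isn't stranded by the eviction.
    @available.add checkout_new_connection if @available.any_waiting?
  end
end
```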

@jrochkind

Ugh, sorry, ignore the broadcast question; I see I wasn't looking at the final version, which has no broadcast at all in the reap. Okay then.

Still curious about the nature of the guarantees, but this is a good patch regardless, I think.

I actually run into this problem in MRI, not JRuby. I'll try to run your test in MRI 1.9.3 next week, because I'm curious; based on my experience, I fully expect it will show a similar benefit in MRI.

@jrochkind

I am suspicious of the reaper in general, personally, although @tenderlove may disagree.

But I personally don't think the reaper does anything particularly useful at the moment, so making sure it does what it does properly for fairness... I dunno.

The reaper right now will reap a connection only if it is still checked out, was last checked out more than @dead_connection_timeout seconds ago (default 5), and has been closed by the actual RDBMS (I think that's what active? checks?).
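Roughly, as I read it (a sketch of that check, not the exact source):

```ruby
def reap
  synchronize do
    stale = Time.now - @dead_connection_timeout # default 5 seconds
    @connections.dup.each do |conn|
      # still checked out, untouched past the timeout, and no longer
      # live at the RDBMS level
      remove conn if conn.in_use? && conn.last_use < stale && !conn.active?
    end
  end
end
```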

Most RDBMSes have a very long idle timeout; MySQL by default (with AR's mysql2) will wait hours before closing an idle connection. Which is in fact ordinarily what you want, I think?

So I'm not sure how the reaper does anything useful: under normal conditions, it won't reap a 'leaked' connection for many minutes or even hours after it was leaked.

I may be missing something? Maybe you're expected to significantly reduce the RDBMS's idle timeout to make use of the reaper?

There are of course times when a connection may be closed because of network or server problems or an RDBMS restart, unrelated to leaked connections. But the reaper's not meant to deal with that, I don't think, and probably isn't the right way to anyway. (There's already an automatic_reconnect key for some of AR's adapters, although its semantics aren't entirely clear.)

Anyhow, this is really a different ticket; I just mention it before you dive into making things 'right' with the reaper, and because you seem to understand this stuff well and I hadn't gotten anyone else to confirm or deny my suspicions about the current reaper functionality yet. :)

@pmahoney

I'm actually still confused about the fairness.

  • If thread1, thread2, and thread3 are all waiting (yep, thread3 is already waiting too)
  • and then two connections become avail at more or less the same time
  • are both thread1 and thread2 guaranteed to get a connection before thread3 (which was also waiting?)

Your order check in the test seems to guarantee this is in fact true, I think?

test_checkout_fairness_by_group is a better test of this. What happens (I think) is that a ConditionVariable does guarantee that the longest-waiting thread is the first to wake up. But the first thing a thread does after being woken is re-acquire the monitor, and it's this second step that is a free-for-all. So, yes, thread1 and thread2 will get connections ahead of thread3 in your example, because thread3 will not be woken up at all.
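Here's a tiny standalone script illustrating both behaviors (assuming #signal wakes waiters in FIFO order, as discussed above; the re-acquisition race is why only set equality holds between the first two waiters):

```ruby
require 'monitor'

lock    = Monitor.new
cond    = lock.new_cond
arrived = [] # order in which threads began waiting
woken   = [] # order in which threads got through after waking

threads = 3.times.map do |i|
  Thread.new do
    lock.synchronize do
      arrived << i
      cond.wait          # queue up; wait releases the monitor
      woken << i         # runs only after re-acquiring the monitor
    end
  end
end

sleep 0.2                                    # let all three threads queue up
2.times { lock.synchronize { cond.signal } } # two "connections" freed
sleep 0.2

# The two longest waiters got through (possibly swapped between
# themselves); the third waiter was never woken.
p woken.sort == arrived.first(2).sort # => true
lock.synchronize { cond.broadcast }   # release the last waiter so we can exit
threads.each(&:join)
```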

@pmahoney

I just mention it before you dive into making things 'right' with the reaper

I was planning on just ignoring that :-P

@pmahoney

@jrochkind Oh, and thanks a bunch for taking a look at this. I greatly appreciate the second set of eyes.

@jrochkind

Okay, I think I understand the fairness issue, and it seems pretty damn good. I definitely understand the issue where it's unpredictable which thread will get the lock first; that's what requires the @num_waiting in the first place. And I understand how that guards against a thread that wasn't waiting at all 'stealing' a connection from one that was. (Yes, I have had this problem too.)

I think I'm understanding correctly that your code will be pretty close to fair: if there are multiple threads waiting, there's no way the oldest waiter will get continually bumped in favor of newer waiters. The issue arises only when N>1 connections are checked in at very close to the same time, and even then the first N waiters will all get connections before the N+1st and subsequent waiters. That seems totally good enough.

On the reaper... man, looking at the mysql2 adapter code specifically, I don't think the reaper will ever reap anything. I can't figure out how a connection could ever fail to be active?, even if the RDBMS has closed it for idleness. active? is implemented per adapter, but in mysql2 it seems to me a connection will always be active? unless manually disconnected.

That's really a different issue, and for @tenderlove to consider I guess, since he added the code. Personally, I would never use the reaper at all, which fortunately is easy to do by making sure reaping_frequency is unset.

@jrochkind

@pmahoney thanks a lot for doing it, man! I've been struggling with this stuff for a while without managing to solve it, and without managing to find anyone else to review my ideas/code for AR either! (I am not a committer, in case that was not clear; definitely not.)

Concurrency is def confusing.

@rafaelfranca
Ruby on Rails member

@pmahoney I think you will need to squash your commits.

@tenderlove could you review this one?

@pmahoney

Here's a mostly squashed version: #6492

pmahoney closed this May 25, 2012
pmahoney referenced this pull request May 25, 2012
Merged

Fair connection pool2 #6492
