
opening a connection will block if the pool is full

1 parent 9a4bfd0 commit b8f748269584c1475f6e06aec48fce627f603880 @tenderlove tenderlove committed Apr 16, 2012
@@ -2,6 +2,7 @@
require 'monitor'
require 'set'
require 'active_support/core_ext/module/deprecation'
+require 'timeout'
module ActiveRecord
# Raised when a connection could not be obtained within the connection
@@ -11,9 +12,6 @@ class ConnectionTimeoutError < ConnectionNotEstablished
# Raised when a connection pool is full and another connection is requested
class PoolFullError < ConnectionNotEstablished
- def initialize size, timeout
- super("Connection pool of size #{size} and timeout #{timeout}s is full")
- end
end
module ConnectionAdapters
@@ -94,6 +92,21 @@ def run
attr_accessor :automatic_reconnect, :timeout
attr_reader :spec, :connections, :size, :reaper
+ class Latch # :nodoc:
+ def initialize
+ @mutex = Mutex.new
+ @cond = ConditionVariable.new
+ end
+
+ def release
+ @mutex.synchronize { @cond.broadcast }
+ end
+
+ def await
+ @mutex.synchronize { @cond.wait @mutex }
+ end
+ end
+
# Creates a new ConnectionPool object. +spec+ is a ConnectionSpecification
# object which describes database connection information (e.g. adapter,
# host name, username, password, etc), as well as the maximum size for
@@ -115,6 +128,7 @@ def initialize(spec)
# default max pool size to 5
@size = (spec.config[:pool] && spec.config[:pool].to_i) || 5
+ @latch = Latch.new
@connections = []
@automatic_reconnect = true
end
@@ -139,8 +153,10 @@ def active_connection?
# #release_connection releases the connection-thread association
# and returns the connection to the pool.
def release_connection(with_id = current_connection_id)
- conn = @reserved_connections.delete(with_id)
- checkin conn if conn
+ synchronize do
+ conn = @reserved_connections.delete(with_id)
+ checkin conn if conn
+ end
end
# If a connection already exists yield it to the block. If no connection
@@ -205,23 +221,23 @@ def clear_stale_cached_connections! # :nodoc:
# Raises:
# - PoolFullError: no connection can be obtained from the pool.
def checkout
- # Checkout an available connection
- synchronize do
- # Try to find a connection that hasn't been leased, and lease it
- conn = connections.find { |c| c.lease }
-
- # If all connections were leased, and we have room to expand,
- # create a new connection and lease it.
- if !conn && connections.size < size
- conn = checkout_new_connection
- conn.lease
- end
+ loop do
+ # Checkout an available connection
+ synchronize do
+ # Try to find a connection that hasn't been leased, and lease it
+ conn = connections.find { |c| c.lease }
+
+ # If all connections were leased, and we have room to expand,
+ # create a new connection and lease it.
+ if !conn && connections.size < size
+ conn = checkout_new_connection
+ conn.lease
+ end
- if conn
- checkout_and_verify conn
- else
- raise PoolFullError.new(size, timeout)
+ return checkout_and_verify(conn) if conn
end
+
+ Timeout.timeout(@timeout, PoolFullError) { @latch.await }
end
end
@@ -238,6 +254,7 @@ def checkin(conn)
release conn
end
+ @latch.release
end
# Remove a connection from the connection pool. The connection will
@@ -250,6 +267,7 @@ def remove(conn)
# from the reserved hash will be a little easier.
release conn
end
+ @latch.release
end
# Removes dead connections from the pool. A dead connection can occur
@@ -262,6 +280,7 @@ def reap
remove conn if conn.in_use? && stale > conn.last_use && !conn.active?
end
end
+ @latch.release
end
private
@@ -96,6 +96,30 @@ def test_full_pool_exception
end
end
+ def test_full_pool_blocks
+ cs = @pool.size.times.map { @pool.checkout }
+ t = Thread.new { @pool.checkout }
+
+ # make sure our thread is in the timeout section
+ Thread.pass until t.status == "sleep"
+
+ connection = cs.first
+ connection.close
+ assert_equal connection, t.join.value
+ end
+
+ def test_removing_releases_latch
+ cs = @pool.size.times.map { @pool.checkout }
+ t = Thread.new { @pool.checkout }
+
+ # make sure our thread is in the timeout section
+ Thread.pass until t.status == "sleep"
+
+ connection = cs.first
+ @pool.remove connection
+ assert_respond_to t.join.value, :execute
+ end
+
def test_reap_and_active
@pool.checkout
@pool.checkout

8 comments on commit b8f7482

Contributor

jrochkind replied Apr 27, 2012

wait, did you just reintroduce the behavior we were arguing about before, where if a connection lease is requested and the pool is full, the client will wait up to some timeout before giving up, instead of giving up immediately?

If so, I will of course be pleased. But I may be misunderstanding? Thanks!

Owner

tenderlove replied Apr 27, 2012

Yes! I have been thinking about your arguments for a while, and I think you were right. So I added it back. :-)

Contributor

jrochkind replied Apr 27, 2012

Yeah, it looks like you did, awesome.

Two thoughts:

  1. The same @timeout value (from spec.config[:wait_timeout], default 5 seconds) is used for two distinct things here:
     • The reaper: if a connection is leased, but the db has closed the connection, and it was leased more than @timeout seconds ago, reap it.
     • Blocking on checkout: if you are trying to lease and the pool is full, wait @timeout seconds before giving up.
     Does it make sense to use the same @timeout value for both things? Would you always, or even commonly, want the same @timeout for both? I think it would be common to want a longer timeout for the "reap" functionality than for the "wait for a connection" functionality.
  2. The patch you accepted from me into 3-2-stable worked around a race condition having to do with how MRI schedules threads, where a thread might give up on waiting for its connection long before the timeout. I think your implementation here is safe from this race condition, because of how you use loop do and Timeout.timeout in #checkout. But it's not tested. It's a difficult thing to test for, since it's such an unpredictable race condition. I did propose a very hacky test in my patch to 3-2-stable; I forget if my test made it into 3-2-stable or not. Would it make sense to take that test and include it here, or something else testing this?

Thanks man!

Contributor

jrochkind replied Apr 27, 2012

Actually, I think your code may not be safe from that race condition. I'll try to explain it again as simply as I can! (And this is something I did actually run into, not just theoretical).

I believe this will be true in both MRI and jruby.

  1. Thread A wants a connection. But the pool is full. So thread A waits.
  2. Say just half a second later, Thread B checks a connection back in, which signals the condition variable so that thread A can be woken up. Okay, thread A is 'schedulable' again.
  3. But the thread scheduler schedules thread X first, and thread X takes out that connection that was just freed.
  4. Now the thread scheduler gets around to scheduling thread A again, maybe only microseconds later, and thread A goes around its loop again -- but it finds the connection pool is still full! Because thread X stole it! In 3.2.1, it would then raise a timeout error even though 5 seconds had not even come close to elapsing.

I did actually run into this in MRI. Then I did research and understood why, and my research suggests it should be the same problem in Java too. It occurs only under fairly heavy contention for the connections. MRI and Java both use a thread synchronization model where, when a thread becomes schedulable again due to a semaphore/mutex, it is not guaranteed to be immediately scheduled -- it's merely put into the 'schedulable' set, and it is not predictable which thread from amongst the schedulable set will be scheduled first.

In your implementation here, rather than raising right away (long before @timeout) in this race condition, it's going to go through the loop again and Timeout.timeout(@timeout) again -- potentially waiting much longer than @timeout before giving up, in some cases waiting indefinitely.
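To make that concrete, here's a minimal sketch of the failure mode (a plain Mutex/ConditionVariable stands in for the pool's Latch; the thread names and numbers are illustrative, not ActiveRecord's actual code). The waiter is woken "just in time" on every pass, so each Timeout.timeout restarts with a full budget and the total wait sails past the nominal limit without ever raising:

```ruby
require 'timeout'

mutex = Mutex.new
cond  = ConditionVariable.new

waiter = Thread.new do
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  4.times do
    # Each pass re-arms a fresh 1-second budget, like the loop in #checkout.
    Timeout.timeout(1.0) { mutex.synchronize { cond.wait(mutex) } }
    # Imagine thread X stole the freed connection here, so we loop again.
  end
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Play the part of other threads checking connections back in: wake the
# waiter shortly before its timer fires, but never let it make progress.
4.times do
  sleep 0.4
  Thread.pass until waiter.status == "sleep" || !waiter.alive?
  mutex.synchronize { cond.broadcast }
end

elapsed = waiter.value
```

With these numbers the waiter blocks for roughly 1.6 seconds, well past its nominal 1-second timeout, and no Timeout::Error is ever raised; stretch the outer loop and the wait is unbounded.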

The only way I could figure out how to deal with this was what my patch to 3-2-stable (which you accepted) did -- you have to actually keep track of how much time you've spent blocked waiting, and throw the timeout exception if and only if you find the pool is full and you've exceeded your timeout.

Even though it's inelegant, I think you will need a similar thing here, as it was needed in 3-2-stable. Again, this is something I really truly did run into (and spent a couple days debugging and researching to figure out what was going on), not just theoretical.

Owner

tenderlove replied Apr 27, 2012

I don't think the timeout should be considered a "total time waiting timeout". It should be considered a "every time I try to get a connection, I'll block for n seconds" timeout.

I think it's a potential liveness problem that is inherent to the system. If you're going to have more threads attempting to access the pool than you have connections available, there will always be thread liveness issues. I think the best approach is to document that issue and move on with our lives.

Contributor

jrochkind replied Apr 27, 2012

Okay, the problem though, is that under this implementation, a thread could wait indefinitely.

Even though you've set timeout to 5 seconds, the thread might wait 8 seconds, 10 seconds, 20 seconds, a minute, who knows, and never throw the exception.

Because each time it waits, it comes back from waiting in under 5 seconds, but then discovers there's still no connection available, and waits again. Wash, rinse, repeat.

I don't think that's acceptable behavior. I think the timeout has got to be an upper bound on how long the thing will wait.

If there were just plain no good way to deal with this, that might be one thing. But I think there is a decent way to deal with it, which my patch implemented.

In my patch, a thread might still give up (after 5 seconds), even if, were there a way to really make it "first in first out", it would have gotten a connection. That part is indeed a 'liveness problem inherent to the system' that just needs to be accepted. But allowing a thread to wait indefinitely, with no upper bound, for a connection -- that's something I think we can and should deal with.

Contributor

jrochkind replied Apr 27, 2012

seriously, I'm pretty darn sure about this.

The key detail is that when you wait on the condition variable, you can sometimes wake up and find the connection is not actually available to you. From the waiting thread's point of view, it should never have been woken up, but it was.

If under some circumstances the thread can end up waiting with an unbounded upper time limit -- and it can -- then that means the @timeout that is implemented is pretty unpredictable, and arguably might as well not be there. Why only let the thread wait for 5 seconds at a go, if under some circumstances it'll end up repeatedly doing that and waiting indefinitely? Might as well have no timeout at all.

I've spent way too much time dealing with, investigating, and thinking about this; I'm pretty sure that's what's going to happen. I can prepare a new patch for master that does what my patch did in 3-2-stable, if you don't want to do it, although your Latch abstraction adds an extra step: I'd want to understand the point of the Latch compared to just using the already existing mutex before writing code against it, and right now I don't.

Contributor

jrochkind replied Apr 27, 2012

FWIW, the generic ConnectionPool gem uses the approach I'm suggesting is necessary too: keeping track of the deadline, and raising if and only if the deadline is reached, otherwise continuing to loop and wait. Its implementation is a bit more elegant than my 3-2-stable version, setting a deadline up front rather than keeping track of elapsed time. The link above is to the line number of the relevant implementation.
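The deadline-up-front pattern looks roughly like this (a hypothetical sketch -- TinyPool and its inlined try-lease logic are made-up names, not the gem's or ActiveRecord's actual code): compute the deadline once before looping, wait with whatever time remains, and raise if and only if the deadline has passed while the pool is still full.

```ruby
# Hypothetical sketch of deadline-based waiting; the class, error, and
# 5-second default are stand-ins, not real ActiveRecord API.
class PoolFullError < StandardError; end

class TinyPool
  def initialize(timeout: 5)
    @timeout   = timeout
    @mutex     = Mutex.new
    @cond      = ConditionVariable.new
    @available = []
  end

  def checkout
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    @mutex.synchronize do
      loop do
        # "try_lease": take a free connection if there is one.
        return @available.pop unless @available.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        # Raise IFF the deadline has truly passed -- a spurious wakeup just
        # shrinks `remaining` instead of restarting the clock.
        raise PoolFullError, "no connection within #{@timeout}s" if remaining <= 0
        @cond.wait(@mutex, remaining)
      end
    end
  end

  def checkin(conn)
    @mutex.synchronize do
      @available.push(conn)
      @cond.broadcast
    end
  end
end
```

A checkin still wakes every waiter, as the Latch's broadcast does, but a waiter that loses the race to a faster thread simply goes back to sleep for the time it has left rather than restarting its clock.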
