Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Fair connection pool2 #6492

Merged
merged 2 commits into from Jun 11, 2012

Conversation

Projects
None yet
5 participants
Contributor

pmahoney commented May 25, 2012

Third attempt of #6488

I apologize for the mess.

@rafaelfranca this is probalby more what you had in mind (I squashed all my commits, and left @yahonda's separate).

@tenderlove could you review this one?

Oh, I upgraded to ruby 1.9.3 (was on 1.9.2), and I can run all the sqlite3 and postgresql tests with no errors:

postgresql
Finished tests in 231.744746s, 14.9475 tests/s, 45.5501 assertions/s.
3464 tests, 10556 assertions, 0 failures, 0 errors, 32 skips

sqlite3
Finished tests in 179.844340s, 18.5438 tests/s, 56.3932 assertions/s.
3335 tests, 10142 assertions, 0 failures, 0 errors, 12 skips

yahonda and others added some commits May 23, 2012

@yahonda @pmahoney yahonda Cache metadata in advance to avoid extra sql statements while testing.
Reason: If metadata is not cached extra sql statements
will be executed, which causes failures tests with assert_queries().
40cfcac
@pmahoney pmahoney Make connection pool fair with respect to waiting threads.
The core of this fix is a threadsafe, fair Queue class.  It is
very similar to Queue in stdlib except that it supports waiting
with a timeout.

The issue this solves is that if several threads are contending for
database connections, an unfair queue makes is possible that a thread
will timeout even while other threads successfully acquire and release
connections.  A fair queue means the thread that has been waiting the
longest will get the next available connection.

This includes a few test fixes to avoid test ordering issues that
cropped up during development of this patch.
02b2335
Owner

rafaelfranca commented May 25, 2012

Awesome job.

cc/ @tenderlove

Contributor

jrochkind commented May 25, 2012

Recommend getting rid of PoolFullError and making them all ConnectionTimeoutError.

I think PoolFullError is a relic of @tenderlove's initial inclination to get rid of signal/wait entirely, and at that point got rid of ConnectionTimeout error entirely in favor of PoolFull.

If it's back, and with an actual robust working implementaiton (awesome job @pmahoney), I think they should be consolidated back to ConnectionTimeoutError -- there's no reason I can see for the distinction, currently the different circumstances in which one will be called instead of the other are circumstances the actual client calling #checkout has no way or need of knowing about, they're purely internal.

Contributor

jrochkind commented May 29, 2012

I've tried to monkey-patch merge this into my Rails 3.2.3 app, to see if it gets rid of my connectionpool problems.

Bit tricky to merge it into Rails 3.2.3, things have changed so much even before this patch. But I think I've done it.

I think it is creating fairer access to the pool, with less 'stolen' connections -- my average times waiting for checkout are going down.

However, I think I've still got a connectionpool leak in my app -- only after it's been under a production load for 20 minutes or so, all the threads start getting exceptions unable to get connections -- my guess is this is leaked connections, rather than a fairness issue.

If so, hard to say if it's caused by this patch, by other changes to master vs rails 3, or by my own app code (but I've spent a heck of a lot of time with my app code trying to keep it from leaking connections). Or by improperly merging the patch into rails 3.2.3 in my app. I was not getting quite this behavior (after 20 minutes of production mode, hardly anyone can get a connection) prior to merging in this new implementation. But I have merged it in improperly too, alas.

This stuff is very hard to debug.

So, I'm not sure what info this gives, if any. I was hoping to be able to say this patch solved all my problems.

Contributor

jrochkind commented May 30, 2012

I'm not sure if this is helpful or not, but I added some logging to my patched 3.2.3 app based on your implementation.

In acquire_connection, if it gets to the else branch (could not immediately check out without waiting), I log the num_waiting and the num_available.

I'm sometimes getting log lines that says the num waiting is 0, and the num_available is, say, 8. Then how did it get to that branch? The whole thing is in a syncronized block, so it shouldn't have even been able to get to that else clause. I'm not really sure what's going on, could certainly be a bug or idiosyncracy of my own code, not yours. Just still trying to use your code in my app to resolve it's problems.

Contributor

jrochkind commented May 30, 2012

Better news. The production app I was testing this in was missing an rdbms index. (The app is in the middle of a migration from Rails 2.1 to modern 3.2).

That led to some slow db queries that:

  • Resulted in some slow db action (and thus more simultaneous connections required) that I mistakenly guessed was related to this patch.
  • May also have triggered some edge case that resulted in the weird polling even when num_connections > num_available.

At any rate, I am currently testing this code (or a backported version of it monkeypatched into 3.2.3) in production app under production load -- and everything is currently looking great, def better than any previous Rails 3.x ConnectionPool implementation, for my particular app.

My custom backport/monkey-patched code also still has the 3.2.3 clear_stale_cached_connections! implementation, using the expensive Thread.list call, and with the 3.2.3 log when leaked connections are reclaimed. That log line is not being output, meaning there is no reason to believe there are any leaked connections in this app.

So I think this is good code. My personal experience with production-level loads for my particular app leads me to endorse this code going into master, and possibly being backported to 3.2.3.

@tenderlove tenderlove was assigned May 30, 2012

Contributor

pmahoney commented May 31, 2012

@jrochkind That's good news.

As for that other thing, num_connections > num_available, I'm not really sure what could be going on.

If I understand correctly, you have a condition (@queue.size > @num_waiting) evaluate to false (and thus poll with no timeout fails, sending you into the else branch) but, while still holding the lock, your log message says the condition now evaluates to true?

Contributor

jrochkind commented May 31, 2012

If I understand correctly, you have a condition (@queue.size > @num_waiting) evaluate to false (and thus poll with no timeout fails, sending you into the else branch) but, while still holding the lock, your log message says the condition now evaluates to true?

Yes, exactly.

I have added a log line here , that logs @available.size (number of available connections) as well as @available.num_waiting (had to add a method to Queue to make that number available).

That line should not be reached unless @available.poll with no argument returns nil. @available.poll with no argument should only return nil if no_wait_poll returns nil, which should only happen if NOT @queue.size > @num_waiting, as per can_remove_no_wait

Yet, my debug line sometimes outputs a @queue.size that's greater than @num_waiting.

I don't know that this is really causing a significant problem -- it may be that when this happens, okay, the waiting thread goes to @available.poll(@checkout_timeout) (the next line), but then ends up immediately getting a connection without a wait anyhow. I am not sure.

But it seems mysteriously unexplainable, and when weird mysterious unexplainable things happen with this kind of concurrency stuff, I get worried.

Contributor

pmahoney commented May 31, 2012

Can you reproduce it using ar_conn_pool_test.rb?

I have this patch applied:

index 61ff603..4439bbe 100644
--- a/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb
+++ b/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb
@@ -425,6 +425,8 @@ module ActiveRecord
           checkout_new_connection
         else
           t0 = Time.now
+          ret = @available.send(:can_remove_no_wait?)
+          puts "can remove no wait, but waiting anyway! #{ret.inspect}" if ret
           begin
             @available.poll(@checkout_timeout)
           rescue ConnectionTimeoutError

And I got no messages when running this with ruby-1.9.3-p194

bundle exec ruby ar_conn_pool_test.rb --threads 50 -n 100000

(Ugh, just realized the ar_conn_pool_test.rb on my website is a bit out of date... will update it later tonight.)
Here: http://polycrystal.org/pages/ar_conn_pool_test.html

Contributor

pmahoney commented Jun 1, 2012

@jrochkind I can't reproduce this with ar_conn_pool_test. I've tried increasing the variability and length of the simulated database work. I've ran multiple runs of 100,000 connection check in/out by a group of 50 threads, and I never see the puts statement shown above output anything.

Can you paste a diff showing how you've added debugging output to connection_pool.rb?

Contributor

jrochkind commented Jun 1, 2012

i'm in mri rather than jruby. Next week i will try to run the test
myself, as well as share my code. It's certainly possibly an artifact
of my monkey patching the functionality into 3.2.3.

On Jun 1, 2012, at 5:15 PM, pmahoney wrote:

@jrochkind I can't reproduce this with ar_conn_pool_test. I've
tried increasing the variability and length of the simulated
database work. I've ran multiple runs of 100,000 connection check
in/out by a group of 50 threads, and I never see the puts
statement shown above output anything.

Can you paste a diff showing how you've added debugging output to
connection_pool.rb?


Reply to this email directly or view it on GitHub:
#6492 (comment)

Contributor

jrochkind commented Jun 5, 2012

Sorry, I don't really have time to get to the bottom of it. We can perhaps just chalk it up to my weird frankenstein monkey patch into a 3.2.x app approach and possibly bugs I introduced there.

I do endorse this pull request, it definitely seems to be working better for me than anything previously. Wonder what @tenderlove thinks and if this is going to be accepted. Since your last version was initially merged before being rolled back because of unrelated testing scaffolding problems, presumably it will be? But would be nice to hear from @tenderlove and see what's up.

If you're curious, here is my frankenstein 3.2.x monkey patch approach of your technique, with additional logging added. https://gist.github.com/2875834

@rafaelfranca rafaelfranca added a commit that referenced this pull request Jun 11, 2012

@rafaelfranca rafaelfranca Merge pull request #6492 from pmahoney/fair-connection-pool2
Fair connection pool2

Conflicts:
	activerecord/test/cases/associations/eager_test.rb
174f36a
Owner

rafaelfranca commented Jun 11, 2012

hey guys, I merged this.

Thank you

@rafaelfranca rafaelfranca merged commit 02b2335 into rails:master Jun 11, 2012

Contributor

pmahoney commented Jun 14, 2012

Great! Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment