Make connection pool fair with respect to waiting threads. #6416
Conversation
This is awesome. In general. Thanks for the detailed write-up. I'm sure @tenderlove and @jonleighton will want to check this out.
Awesome! It looks great! 👍
Make connection pool fair with respect to waiting threads.
@pmahoney this commit broke the build with PostgreSQL. Could you take a look?
…pool"

This reverts commit d2901f0, reversing changes made to 525839f.

Conflicts:
	activerecord/test/cases/connection_pool_test.rb

Reason: this change broke the build (http://travis-ci.org/#!/rails/rails/builds/1391490) and we don't yet have a solution. I asked the author to try to fix it and open a new pull request.
I reverted d2901f0 since it broke the build a day ago and we don't yet have a solution. Please try to fix it and open a new pull request.
That's... interesting. I can reproduce that locally, so that's good. I refactored the code a bit, still get the same errors, but now this happens: Run eager_test.rb alone, test passes:
Run with connection_pool.rb alone, test passes:
Run them together, tests pass:
But run all the tests with the rake command, and I get those same failures. Any ideas? I suppose I can try running larger and larger subsets of the tests to try to narrow it down... or better, find out what rake is doing that plain ruby isn't. (We've backported this patch to the 3.0 series, but this will be a blocker to us upgrading to 4.x, and I assume this will be true for many JRuby folks.)
I've gotten the failing case down to this:
This also fails with sqlite3. Going to poke around a bit more...
For what it's worth, that same command fails against master (cb92efb) with both sqlite3 and postgresql:
I think the problem is unrelated to my patch. It seems the tests are making assumptions about what sort of table metadata has already been cached (sorry, I'm not too familiar with the inner workings of ActiveRecord)? If I modify the test…
Thanks for the investigation. I agree that your patch is not the cause of these failures.
Thanks for looking into this! I was about to open an issue, but I'll hold off unless you think otherwise?
Hi, I think these 2 failures should be discussed in a separate issue. Please go ahead and open a new one.
yep, this patch is fixing something we fixed slightly differently in 3-2-stable already, but master had diverged at that point, and it's something I'm running into too. I don't completely understand the intended semantics of this patch, or exactly what it's intended to fix and how. But let me know if I can help; I'm becoming increasingly familiar with ConnectionPool's concurrency semantics (somewhat sadly :) ).

Have you really managed to enforce "fairness", so that when threads are waiting on a connection it's first-in/first-out? I can't quite figure out how your code does that, but I'm impressed if it does; I've been racking my brain and have been unable to come up with any way to do that with Ruby's own mutex semantics.

Please let me know if/how I can help get this resolved. (I'm interested in back-porting to 3.2.x, even if only in my own personal monkey patches, once it's in master. I am having trouble with these issues in 3.2.)
@jrochkind it's much easier to see what's going on in this refactoring: pmahoney@ff1adc0. This is what I'm going to resubmit after a few more cleanups. Take a look at the Queue class. Yes, it uses condition wait/signal.

I have found (and have tests to prove it) that if you have, say, 10 threads waiting on a condition var and then signal it once per checked-in connection, the threads are woken up in FIFO order. I have also found that if 5 threads start waiting, then later a second group of 5 start waiting, then signaling wakes the first group before any of the second. (For what it's worth, I did implement a strictly fair queue where each waiting thread put itself into an Array and, on wakeup, made sure it was first in line, or else re-signaled and began waiting again.)

If that's true, then the way to prevent a new thread (one that has not been waiting) from "stealing" a connection from a thread that has been waiting is to check whether any threads are currently waiting before handing a newly available connection to the new arrival.

Do those test results conflict with things you have seen?
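For illustration, the "strictly fair queue" described above might be sketched roughly like this (hypothetical code written for this discussion, not the code in the pull request): each waiter appends itself to an array, and on wakeup it proceeds only if it is first in line, otherwise it passes the signal along and waits again.

```ruby
require 'thread'

# Sketch of a strictly fair (FIFO) blocking queue. Names are illustrative.
class FairQueue
  def initialize
    @mutex   = Mutex.new
    @cond    = ConditionVariable.new
    @waiters = []  # threads in arrival order
    @items   = []
  end

  def num_waiting
    @mutex.synchronize { @waiters.size }
  end

  def push(item)
    @mutex.synchronize do
      @items << item
      @cond.signal
    end
  end

  def pop
    @mutex.synchronize do
      @waiters << Thread.current
      loop do
        if @waiters.first.equal?(Thread.current)
          unless @items.empty?
            @waiters.shift
            return @items.shift
          end
        else
          # Not our turn: pass the wakeup along and get back in line.
          @cond.signal
        end
        @cond.wait(@mutex)
      end
    end
  end
end
```

Trickling items in one at a time, waiters receive them in the order they began waiting, regardless of which thread the condition variable happens to wake first.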
@pmahoney Well, hm. Here's what I've seen (and even have a test I could dig up to demonstrate).
Now, the thread that 'stole' the mutex out of order may not have been a thread that was actually waiting. It may be that in between the signal (when thread1 becomes 'runnable') and when thread1 is actually scheduled, some completely different thread (one that was not previously waiting) manages to get scheduled, tries to do a checkout, finds that, hey, there's a connection available (because thread1 hasn't been scheduled yet and so hasn't had a chance to take it), and grabs it without waiting at all. I believe that is in fact what was happening.

If your code doesn't solve that issue (and I don't think it does) then I'm not sure I understand what issue your code does solve, vs. the state before your patch. Pre-your-patch, threads were already waiting on a ConditionVariable, and would presumably already get woken up in order, right? Although your test results certainly seem to show that things are improved, I just don't understand why/how. :)

(Update: can you confirm that under your new code, as demonstrated in the tests, no thread will ever wait (much) longer than its timeout? That's the most alarming behavior of master pre-patch, and I think we should make sure it never happens. Better to have a thread time out because it had a connection 'stolen' than to have a thread that can wait, without timing out, with no upper bound.)

I had a test that sort of demonstrated this, but it was kind of a crazy hacky test; given the sort of race condition involved, the test basically needed to run something like 100 threads, and maybe 5% of them would get their connection 'stolen'. The test demonstrated the problem under an old version of 3-2-stable: some threads wound up raising the TimeoutException long before the @timeout value, because they got woken up but then found there was still no connection available, and in the code at that time that would cause them to raise.
But that exact test wouldn't demonstrate the problem against current 3-2-stable, because it's been patched so that when this happens the thread just goes back to sleep again rather than raising. The underlying problem, of other threads coming in and stealing connections, is still there, though. The test would need to be rejiggered a bit to show it against current 3-2-stable or master (which works somewhat differently already); it's tricky to figure out how to demonstrate these kinds of race conditions, but I could try to work it out. What I couldn't figure out was any way to solve it.
@pmahoney Ah, wait, I think I see how you've solved it. In my case, the thread 'stealing' the connection (call it ThreadX) wasn't waiting at all. It managed to get scheduled, and do a checkout, in between the signal and when the first waiting thread was actually scheduled. In your new fix, that ThreadX might still get scheduled in between the signal and when thread1 actually gets scheduled to get the connection it's due, but it'll see that other threads are already waiting and will queue up behind them rather than grabbing the available connection. Genius!

I still don't understand how your original patch, the one actually in this pull request, managed to improve things (I believe your test results that it did, I just don't understand it!). But I understand how your new fix manages to improve things quite a bit with that check.
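To make the idea concrete, here is a minimal sketch of a pool whose checkout refuses to hand an idle connection to a newly arriving thread while older threads are still queued for one. This is hypothetical illustration code (class and method names invented for this sketch), not the actual ActiveRecord patch:

```ruby
require 'thread'

# Toy pool demonstrating the anti-stealing check discussed above.
class TinyPool
  class TimeoutError < StandardError; end

  def initialize(size)
    @mutex       = Mutex.new
    @cond        = ConditionVariable.new
    @available   = Array.new(size) { |i| "conn-#{i}" }  # stand-in connections
    @num_waiting = 0
  end

  attr_reader :num_waiting

  def checkout(timeout = 5)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @mutex.synchronize do
      # The key check: a newly arriving thread may take a free connection
      # only if nobody else is already queued up for one. Otherwise a
      # ThreadX scheduled between signal and wakeup could steal it.
      return @available.pop if @num_waiting.zero? && !@available.empty?

      @num_waiting += 1
      begin
        loop do
          remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
          raise TimeoutError, "could not obtain a connection" if remaining <= 0
          @cond.wait(@mutex, remaining)
          # Woken up (or timed out): take a connection if one is there.
          return @available.pop unless @available.empty?
        end
      ensure
        @num_waiting -= 1
      end
    end
  end

  def checkin(conn)
    @mutex.synchronize do
      @available.push(conn)
      @cond.signal
    end
  end
end
```

Note this only prevents non-waiters from stealing; ordering among the waiters themselves still depends on the condition variable waking threads in FIFO order.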
Resubmitted here: #6488 |
Introduction: I'm using JRuby in a Java servlet container to run an application that uses ActiveRecord. This is a fully multithreaded environment where a thread pool services http requests, possibly in parallel. I've been having problems with timeouts when obtaining a database connection with only moderate concurrent requests.
(Reducing the size of the thread pool to match that of the connection pool may be one option, but I am running multiple apps in the servlet container all sharing one thread pool... So for now, the thread pool is 32 and the connection pool is 5).
I'm putting this patch out there to solicit comments. I think it needs more testing. I may have a chance this week to test in a real application, but for now I only have my two benchmark scripts ar_test.rb (for JRuby) and ar_test_mri.rb. Notably, I have not tested the behavior when connections are removed explicitly or removed by the reaper.
I also do not understand the distinction between PoolFullError and ConnectionTimeoutError. I introduced an @available array of connections ready to be checked out. At the time this seemed like a reasonable idea, but I had to make some inelegant additions (essentially everywhere @connections is modified, @available must be modified as well, e.g. in #clear_reloadable_connections).
I apologize for the length of this report.
tl;dr the fairness patch reduces outliers in the time to acquire a connection, but it needs review and testing
The test suite passes, tested on Linux/Ruby-1.9.2-p290.
This patch makes the connection pool "fair" in the first-come, first-served sense. It also avoids using the Timeout class (which spawns a dedicated watchdog thread for each connection checkout attempt that needs to wait with a timeout).
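The reason Timeout is avoidable here: ConditionVariable#wait accepts a timeout argument, so a checkout can sleep with a deadline in its own thread, looping to guard against spurious wakeups, without any helper thread. A minimal sketch of that pattern (not the patch itself; the variable names are made up):

```ruby
require 'thread'

mutex = Mutex.new
cond  = ConditionVariable.new
ready = false  # stands in for "a connection is available"

waiter = Thread.new do
  mutex.synchronize do
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 2.0
    until ready
      remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
      raise "timed out waiting for a connection" if remaining <= 0
      cond.wait(mutex, remaining)  # sleeps in this thread; no watchdog thread
    end
  end
  :ok
end

sleep 0.1
mutex.synchronize { ready = true; cond.signal }
waiter.value  # joins the waiter; re-raises if it raised
```

Recomputing `remaining` on every loop iteration keeps the total wait bounded by the original deadline even if the thread is woken several times without the condition holding.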
I've added two tests that fail before applying this patch. The first ensures the connection pool is fair by queuing up several waiting threads and ensuring they acquire connections in order as connections are made available. The connections are trickled in one by one because, when two connections become available at the same time, we don't actually care that order is strictly preserved.
The second queues up two groups of waiting threads, then checks in enough connections for group 1 all at once. The test ensures that only the group 1 threads acquired connections.
A third test was the real "money" test, but it was removed because it was unreliable. It attempted to measure latency in a Java-servlet-like environment by setting up a thread pool and having each thread check connections in and out.
Instead, I used ar_test.rb to obtain histograms of the time per simulated request (checkout connection; sleep 0.01s; checkin connection).
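A rough re-creation of that kind of measurement (the real ar_test.rb is linked below; this sketch uses a plain Queue as a stand-in pool, a shortened 1 ms sleep instead of the 10 ms used in the real runs, and invented constants):

```ruby
require 'thread'

POOL_SIZE = 5
THREADS   = 32
REQUESTS  = 10  # per thread; the real runs were much longer

pool = Queue.new  # stand-in connection pool; Queue#pop blocks when empty
POOL_SIZE.times { |i| pool << "conn-#{i}" }

samples = Queue.new
workers = Array.new(THREADS) do
  Thread.new do
    REQUESTS.times do
      t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      conn = pool.pop                     # "checkout" (blocks if exhausted)
      sleep 0.001                         # simulated request work
      pool << conn                        # "checkin"
      samples << Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
    end
  end
end
workers.each(&:join)

sorted = Array.new(samples.size) { samples.pop }.sort
puts format("n=%d median=%.4fs max=%.4fs",
            sorted.size, sorted[sorted.size / 2], sorted.last)
```

With a fair pool, the gap between the median and the maximum sample is what should shrink; outliers in the max column are the threads that kept losing the race for a connection.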
Tests with ar_test.rb and JRuby 1.6.6
ActiveRecord 3.2.3, JRuby 1.6.6
ActiveRecord edge without fairness patch JRuby 1.6.6
ActiveRecord edge with fairness patch, JRuby 1.6.6
Tests with ar_test_mri.rb and Ruby 1.9.2-p290
ActiveRecord 3.2.3, Ruby 1.9.2-p290
ActiveRecord edge without fairness patch, Ruby 1.9.2-p290
ActiveRecord edge with fairness patch, Ruby 1.9.2-p290
So in both Ruby 1.9.2 and JRuby, ActiveRecord 3.2.3 has perhaps a better median, but more variation, with a few requests taking significantly longer than the others.
I'm only including ar_test_mri.rb here. The ar_test.rb script has only had cosmetic changes and is available in this other pull request: #6398