Skip to content
This repository

ConnectionPool.with_connection does not always properly release connection #5330

Closed
jrochkind opened this Issue March 07, 2012 · 34 comments

3 participants

Jonathan Rochkind Aaron Patterson Andy Goldstein
Jonathan Rochkind

Okay, I think I'm finally getting to the bottom of my ActiveRecord concurrency issues, think I've actually found and identified the bug. Rails 3.2.2

ConnectionPool.with_connection is supposed to check out a connection assigned to the thread if one isn't already checked out; then check it back in after the yield if a new one was checked out by with_connection.

I believe it is broken. I have a demonstration Rails app I can put somewhere if someone needs it. But I guess I should really come up with a failing test, and perhaps a fix. In 3.2 stable branch?

I believe the problem is right here ( in 3.2.2 tagged snapshot of connection pool):

https://github.com/rails/rails/blob/c606fe2c6f1112d13b5ff01f3204e29cbdb09649/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L117

with_connection decides if it's a "fresh_connection" situation by calling active_connection? . Then it just yields the block -- if a connection wasn't already checked out to the thread, any ActiveRecord activity will automatically check one out. Then after, yield, it calls release_connection only if there was fresh_connection.

The problem is that the current implementation of active_connection? does not check to see if the current thread had a checked out connection. It checks to see if ANY thread had a checked out connection. https://github.com/rails/rails/blob/c606fe2c6f1112d13b5ff01f3204e29cbdb09649/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L100

So if some thread somewhere has an active connection, and then another thread uses #with_connection -- a new connection will be checked out for that thread, but NOT checked in at the end of with_connection.

Bug.

I think this is what's going on. I think @tenderlove may be the (only?) active committer that actually understands this stuff at all. Would GREATLY appreciate some feedback.

I can try to create a failing test case and even a patch (will probably require the introduction of a new method, #current_thread_connection? or something, since #active_connection? has the semantics it does now (it may not always have had them, but who knows if changing them now would break something else)). But I've never contributed to rails before, not sure about the structure of the tests and getting a setup that allows you to run tests and such, appreciate any pointers. Presumably I'd work off of the rails 3.2 stable branch for a patch?

In a local copy, I've already implemented a suggested fix, and it does make it pass my previously failing local test case (in a local rails app). Just need to figure out how to add the test to AR proper and run all AR tests to make sure it doesn't break any existing ones.

Jonathan Rochkind

suggested fix (which makes my local app demonstration test pass, but haven't figured out how to run all AR tests against it yet) involves new method in ConnectionPool, and change to with_connection to use it:

  # Check to see if there is an active connection for
  # current_connection_id, that is by default the current
  #  thread.
  def current_connection?
    @reserved_connections.has_key?(current_connection_id)
  end



  # If a connection already exists yield it to the block. If no connection
  # exists checkout a connection, yield it to the block, and checkin the
  # connection when finished.
  def with_connection
    connection_id = current_connection_id
    fresh_connection = true unless current_connection?
    yield connection
  ensure
    release_connection(connection_id) if fresh_connection
  end
Aaron Patterson
Owner

Yes, this definitely seems like a bug.

Would you like to post the app you have now? I can help you extract a failing test case suitable for adding to AR. We may want to change the way active_connections works. I'm not sure that it's useful in it's current incarnation. I don't really see the point of having a method that returns true if any thread has a connection.

Jonathan Rochkind

Thanks a lot @tenderlove. My current app is a bit of a mess, full of puts statements and debuggers and such. Might be easier for me just to submit the pull request with new test -- does that work for you?

I'm personally scared to try redefining #active_connection? -- it can easily break too many things, which seem fairly under-tested. suspect that a previous redefinition of active_connection? is what caused this bug in the first place although haven't reviewed commit logs. There may very well be code that does need to check to see if there's any currently active connection (belonging to any thread) in the pool, i can't say.

If you want to work on more of a refactor, that's of course cool, but perhaps first accept a patch with test that fixes things with as little refactor as possible, and then work on refactor without making test fail? I'll try to write the test at a high enough level that it fails/passes regardless of internal architecture, so long as the public ConnectionPool api remains the same.

I would love to get this fixed before the next 3.2.x release, in the meantime I'll have to monkey-patch in my app.

Aaron Patterson
Owner

Ya, submitting a pull request with a new test is fine! If you send a test along, I'll make sure we get this fixed before the next 3.2 release.

Jonathan Rochkind

I'd note that the inline comments on #active_connection? do more or less correctly describe it's semantics, suggesting that someone wrote it that way on purpose, making me more worried about wantonly changing it. "Check to see if there is an active connection in this connection pool."

git blame on it is you though, ha. 4211866

Jonathan Rochkind

and here's code that needs to wants to check if there are ANY active connections in a pool, and uses active_connection? to do so, would need to be re-written to use something else if active_connection? semantics changed. which could certainly be done.

https://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L337

Jonathan Rochkind

sorry for being so wordy, i'll try to comment less. I have a test that I think demonstrates the problem, but having more trouble than I expected writing code to solve it, the test still fails despite the code i thought would fix it. Don't know if my test is invalid, there's yet another relevant bug hiding somewhere, or something else.

here's the test, not actually sure which test file it should go in, there's several related to connection_pool, but:

  def test_threaded_with_connection
    assert_equal 0, ActiveRecord::Base.connection_pool.connections.size

    main_thread_conn = ActiveRecord::Base.connection_pool.checkout

    assert_equal 1, ActiveRecord::Base.connection_pool.connections.size

    thread = Thread.new do
      ActiveRecord::Base.connection_pool.with_connection do
        Project.first # need to do something AR to trigger the checkout

        assert_equal 2, ActiveRecord::Base.connection_pool.connections.size
      end
      assert_equal 1, ActiveRecord::Base.connection_pool.connections.size
    end

    ActiveRecord::Base.connection_pool.checkin main_thread_conn

    assert_equal 0, ActiveRecord::Base.connection_pool.connections.size
  end

Hmm, looking at tests in pooled_connections_test.rb, it looks like my counts are all off by one, for some reason even before you've checked out any connections, AR always keeps one in 'reserve'? Okay, not sure what that's about, but I can rewrite the test to be agnostic about it.

(Also not sure if there's a way to get AR to run just one test file, running the whole sweet over and over gets tiresome)

Jonathan Rochkind jrochkind referenced this issue from a commit March 07, 2012
Commit has since been removed from the repository and is no longer available.
Jonathan Rochkind jrochkind referenced this issue from a commit March 07, 2012
Commit has since been removed from the repository and is no longer available.
Jonathan Rochkind

crap, i'm testing the wrong thing, not sure #size measures what I think. also forgot a thread.join. Getting confused. not quite sure how to write the test to check the right thing, but you see where it's going:

make sure that after with_connection is done, the connection really HAS been checked in. I know in unpatched rails master it has not, by actually checking @reserved_connections.has_key? current_connection_id which it should not, if the connections been checked in.

Okay, I've written a different approach to the test. The version I got working is hackier and more fragile, not really happy with it as a test, but it's the best I could come up with.

My new test DOES fail as expected in current code AND produce "Database connections will not be closed automatically" deprecation warnings in console (which makes sense; in fact, those deprecation warnings were really the only clue that led me to diagnose this bug in this first place, thanks.).

My new test does pass as expected, and without the deprecation warnings, with my suggested patch.

Rather than a pull request, i'll just show you the code/diff. This is based on 3-2-stable, it took me long enough to grok how this stuff works in 3-2-stable, I didn't even want to mess with figuring out if it's been refactored in master.

jrochkind@db16dda

What are the next steps, what do you think? (I still worry about changing the semantics of #active_connection? . At least not before rails4. It is a public method that plugins or user code may use and expect it to be how it is, in theory. I'd like to see a fix in the next rails 3.2.x)

Jonathan Rochkind jrochkind referenced this issue from a commit March 08, 2012
Commit has since been removed from the repository and is no longer available.
Jonathan Rochkind

ugh, got my git history all messed up. i'll start over and issue a proper pull request, at which point happy to take any feedback, thanks.

Aaron Patterson tenderlove closed this in cff19cf March 08, 2012
Aaron Patterson
Owner

I think we should probably fix the active_connection? method rather than introduce a new method. I think the method is just returning the wrong value.

It looks like the method is used in two places, in the active_connections? method and in the with_connection method. The active_connections? method is called by nothing except test code. I think it will be safe to just fix the method.

We need to slightly modify the check for an active connection. Closing a connection will just return it to the pool and mark it as "not in use". This means it's possible we have a connection in the reserved list, it just isn't being used.

I think we could add some clarity to the connection pool api by adding methods that let us answer these questions:

  • Does the pool have an open and in use connection for this thread? (currently active_connection?)
  • Does the pool have an open connection for this thread?

I think I've fixed this in cff19cf and d523504

Do you mind trying against edge 3-2?

Jonathan Rochkind

thanks! probably won't have a chance to try against edge 3-2 until Monday, will do so then.

Closing a connection will just return it to the pool and mark it as "not in use". This means it's possible we have a connection in the reserved list, it just isn't being used.

Huh, I've studied and played with this code for hours (days, weeks it seems like) and I missed this. I noticed #release_connection actually deleted the key from @reserved_connections, and so did #clear_stale_cached_connections! However, I never did get to the bottom of the indirection in #checkin to figure out what actually happened there, so I guess I missed something.

However, what you say is making me concerned about my assumptions about the ConnectionPool contract. I assumed that if I have a thread that does a #with_connection, at the end of the #with_connection the connection is released and available for other threads to use, even while the original Thread keeps going doing other non-AR stuff. Am I right about this or wrong about it? I count on it for efficient use of connections -- the thread may be doing other stuff, but the connection has been released and available for other threads to use too. But if the connection is still in @reserved_connections, this worries me that I may not be right, which has very negative implications for my performance characteristics. REALLY appreciate confirmation here. Might have to figure out how to write a test to be sure, but oh boy is it tricky to figure out how to write the right test here.

Aaron Patterson
Owner

No, with_connection will free it up for use by other threads. It deletes the connection from the reserved_connections list. I'm talking about a situation where the developer calls close on the connection, but does not explicitly free it up from the pool. Corresponding test is here.

I've added tests to prove that connections can move between threads here: 657095d and here: 94b2c8c

HTH!

Jonathan Rochkind
Rafael Mendonça França rafaelfranca referenced this issue from a commit in rafaelfranca/rails March 08, 2012
Aaron Patterson make active_connection? return true only if there is an open connecti…
…on in use for the current thread. fixes #5330
d523504
Aaron Patterson
Owner

Ya, connections currently means "all connections to the database, regardless of whether or not the connections are in use by some thread".

Thanks again for helping to debug this issue. I know it hasn't been easy! :(

Jonathan Rochkind

Ya, connections currently means "all connections to the database, regardless of whether or not the connections are in use by some thread".

Okay, so the reason this successfully gets the number of currently checked out connections is the in_use? check.

So connection.in_use? actually means if the connection is currently checked out to a client, not just if it's currently connected to the db. Can you explain where in_use? gets defined, and how it knows?

I was not expecting the actual connection to even know if it was currently checked out to a client -- I was expecting the ConnectionPool to keep track of this, but I was thinking seperation of concerns was such that the connection itself had no reason to know if it was currently checked out to a client, only if it was currently connected to actual database or not. Is there a good reason the connection (rather than the connection_pool) should have the responsibility of keeping track of whether it's currently checked out to a client or not, or is this just a legacy artifact of a less than ideal design?

It seems slightly worrying to me that both the ConnectionPool (currently via @reserved_connections) and the connection itself (via in_use?) need to keep track of whether a particular connection is currently checked out to a client -- seems like potential for them to get out of sync.

Aaron Patterson
Owner

So connection.in_use? actually means if the connection is currently checked out to a client, not just if it's currently connected to the db. Can you explain where in_use? gets defined, and how it knows?

When a connection gets checked out, the lease method is called on the connection here. The lease method sets the connection to be in_use and sets the time on the connection when it was leased.

The in_use and lease time attributes of the connection allow us to reap connections in case a user forgets to check one in. See the reap method here.

Without these attributes, if a thread dies (from an exception or something), the connection could possibly live on forever. I stole this model from connection pool designs I worked with in Java land. :-)

Jonathan Rochkind

Awesome, thanks. It still kinda seems to me that it's the ConnectionPool that oughta know about the 'lease', not the Connection, but cool. (the reaper is maybe new in 3.2, I don't recall it from pre-3.2 attempts to get an overview of what's up).

Is there a potential problem with in_use? and the ConnectionPool @reserved_connections getting out of sync?

If a client calls #close on a connection without telling the ConnectionPool, then it is no longer #in_use?/leased (I think), but it is still in @reserved_connections.

If that same thread does an AR operation later (and/or does a with_connection), ConnectionPool#connection will now decide that @reserved_connections[current_connection_id] already exists, and give it to the thread without calling a checkout on it.

But in the meantime, could have the ConnectionPool have already given that connection to a different thread, since it wasn't in_use? ?

Jonathan Rochkind

And now I'm concerned about this too -- we previously established that #connections is the list of currently checked out db connections, not the list of all connections open to the db, right? So why use #connections.size to decide if there's room to #checkout_new_connection, open a new connection to the db?

Let me know if my uninformed code review is annoying you and I should stop. This stuff is interesting to me, and after spending a lot of time debugging and writing exhaustive checks and tests for my own app, assuming the problem couldn't possibly be in the AR implementation itself (and then eventually realizing it was), I have a keen interest in understanding what's going on, but perhaps an over-willingness to think there's something going on wrong.

Aaron Patterson
Owner

Each connection has a reference back to the pool to which it belongs. Please see the source of Connection#close here.

As I said in this comment, the connections list is all connections the pool knows about, regardless of if the connection is in use by a thread or not. If the list contains a connection that is not in use, that connection will be leased and returned. The pool will only create a new connection if all connections in the pool are currently leased and the pool has room to expand. (Think of in_use? as a synonym for leased?. In fact, we may want to change the method name.)

EDIT: changed some wording for clarity.

Jonathan Rochkind

Aha. Okay, thanks a lot. I also realized I was getting confused between 3-2-stable and master, not sure exactly what if anything changed with regard to this.

Since connection.close calls connection_pool.checkin and connection_pool.checkin... I can't see how it does, but it I think it must remove the conn from @reserved_connections or there'd be other bugs, it seems like @reserved_connections must indeed remain in sync with #connections.find based approaches, they've got to return the same thing, I think.

Anyhow, okay, I'll give this is a rest for now anyway. I'll test 3-2-stable against my actual app on Monday or Tuesday when i'm back at my desk. Thanks very much.

Jonathan Rochkind

sorry, one more thing.

current active_connection? allows for the possibility that something may be in @reserved_connections[current_connection_id] but not be in_use?. If this can in fact happen, it seems disastrous -- because while #with_connection which uses active_connection? will decide not to use it (and to do a new checkout), the bare #connection checks @reserved_connections directly without an in_use? check and will use it.

Perhaps #connection should use active_connection? too rather than checking @reserved_connections directly, further reducing code that touches the somewhat weird @reserved_connections directly rather than via api?

This seems like a problem. If you know it's not, feel free to ignore, just wanted to at least point it out for your attention.

Aaron Patterson
Owner

I don't think much has changed between master and 3-2-stable except for the deprecated behavior being removed.

checkin on the connection pool doesn't removed it from the @reserved_connections hash. The connection will be recycled, but it's basically still reserved for use by the same thread. Honestly, I'm not sure why the ConnectionPool has this functionality. checkin has never removed the connection from the @reserved_connections hash, but maybe it should.

Yes, please let me know how your app works. I'll close this ticket once I get confirmation from you! :-)

Jonathan Rochkind

that further info possibly further concerns the above -- if checkin does not remove a connection from @reserved_connections, then that means after checkin, if that thread uses AR and #connection gets called, the checked-in connection will be returned without another checkout. Problem, yes? (EDIT: I will try to find time to write a test case seeing if this is a problem, unless you're sure it ain't)

I'm guessing that where you want to move to is no longer having #connection ever do an auto-checkout -- if you do AR functionality that calls #connection without having done a with_connection or checkout, you'd get an exception. This seems perhaps wise, and would remove some of this complexity.

EDIT: If you are interested in this as an option, I have my own hacky monkey patches that add that kind of safe checkout to AR on a thread-by-thread opt-in basis (using thread local storage to indicate opt-in). If a thread opts in, then any calls to #connection without an explicit #checkout or wrapped in #with_connection raise. I used that in my local app to try to identify any places my app might be violating the ConnectionPool contract and leaking connections.

Aaron Patterson
Owner

that further info possibly further concerns the above -- if checkin does not remove a connection from @reserved_connections, then that means after checkin, if that thread uses AR and #connection gets called, the checked-in connection will be returned without another checkout. Problem, yes? (EDIT: I will try to find time to write a test case seeing if this is a problem, unless you're sure it ain't)

Yes, this will currently happen, and it's bad news. We should fix it.

I'm guessing that where you want to move to is no longer having #connection ever do an auto-checkout -- if you do AR functionality that calls #connection without having done a with_connection or checkout, you'd get an exception. This seems perhaps wise, and would remove some of this complexity.

Correct. That is the direction I'd like to head. One problem with removing auto checkout is testing. We can ensure a connection is checked out for the thread handing the request in a middleware, but we don't have middleware like that in the test suites. And I'm not talking about Rails's on test suites (we can fix those), I mean user land test suites. Not everybody calls super inside their test setup method, and not everybody uses ActiveSupport::TestCase (they may use rspec or whatever). Ensuring a connection is available while testing will be a challenge.

Jonathan Rochkind

Awesome, I noticed one bad thing at least that was legit, heh. Not entirely clear to me how to fix it, but if I have time to figure out how to write a failing list, i'll file an issue with the failing test to at least note the problem.

Not sure I'm following on the test suites. If #connection raises if you use it without a checkout or with_connection and a user land test suite does not do a checkout or call 'super', then the tests will all fail due to raised exception, right? As long as they have an easy documented way to fix it (say, call 'super' inside your test 'setup' method, or do a checkout in setup and a checkin in teardown, or equivalent for other test frameworks like rspec), this doesn't seem disastrous.

Of course, that change will likely break some existing non-Rails AR-using apps for real (not just a false positive), that aren't properly checking out their connections. But that would be intentional, and ConnectionPool is hopefully getting solid and documented enough that it won't be hard to fix. checkout at beginning of thread and checkin in an ensure, or wrap in with_connection to have it do it for you.

Need to think through and possibly provide tests for what happens though when you have multiple databases with multiple ConnectionPools ---- not entirely sure there's a sensible way to handle this now in a multi-threaded environment, let alone with removing auto-checkout.

Hopefully your code doesn't need to know exactly how many connections there are and checkout each of them manually (bad seperation of concerns to have to know every time you use a model if it was in the same db/connection_pool as the other models or not), there possibly ought to be something on ConnectionHandler (perhaps via a method on ActiveRecord::Base, not needing to access the connection_handler directly) that will check all of em out for you.

Although then you'd be checking out a connection for every database attached to any model in your app, even if the particular with_connection block didn't touch a model that used that db. Hmm, I can think of no great solution - except going back to the current auto-checkout-on-use, and having with_connection do what I originally thought it did, not force a checkout but let the block trigger a checkout automatically if needed. I suppose things could be rigged up so you'd still get an exception on an automatic-checkout in #connection only if you weren't wrapped in a with_connection at the time.

Jonathan Rochkind

(Alternately, for test suite, is middleware neccesarily the right place to handle checkout/checkin? If it was actually done by ActionController::Base , it'd work the same way in tests as in live I think.

In general, while I see the advantages to re-usability (and composability with non-rails rack apps) with the current trend to move more and more of Rails functionality to rack middleware, it definitely makes things harder to manage, and harder for a new developer to figure out where the heck some logic is implemented, or understand what's going on. Perhaps sometimes too much abstraction is too much abstraction.

Anuj Dutta andhapp referenced this issue from a commit March 10, 2012
Commit has since been removed from the repository and is no longer available.
Anuj Dutta andhapp referenced this issue from a commit March 10, 2012
Commit has since been removed from the repository and is no longer available.
Jonathan Rochkind

having trouble figuring out how to run a rails app against the 3-2-stable checked out version of rails (or just activerecord?) instead of one from rubygems.

I'm familiar with bundler's :path (or :git), which I usually use for this, but the nature of rails as a bundle of several gems expressed through dependencies is confusing me, and perhaps is what's making my usual techniques result in weird bundler errors.

I'll continue to hack away at it, but if you have any hints @tenderlove would be appreciated.

Jonathan Rochkind

okay, figured it out.

Indeed, my local app passes all tests and does not produce deprecation warnings running tests with current 3-2-stable, while neither is true with 3.2.2 (without monkey patch). So I confirm test of this fix. Thanks @tenderlove!

Jonathan Rochkind

For googlers et al, this fix is included in 3.2.3. Thanks!

Andy Goldstein
ncdc commented April 04, 2012

We were running into what we thought was this issue with 3.2.1, but even after upgrading to 3.2.3, we still see the same stack trace we were getting before. It's entirely possible that it's a different issue, but I figured I'd start by posting what we're seeing here, and if you all think I should open a new bug, please let me know and I'm happy to do that.

We are deploying our app to TorqueBox/JRuby. We run some method calls in the background using a feature of TorqueBox. We've set the concurrency level of the background methods to 4 simultaneous executions, and the database pool size is 25. After our background processing has been running for a little while (the time varies), we start seeing stack traces like this in our TorqueBox log:

undefined method in_use?' for nil:NilClass
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:202:in
clear_stale_cached_connections!'
org/jruby/RubyArray.java:1615:in each'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:200:in
clear_stale_cached_connections!'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:264:in checkout'
org/jruby/RubyKernel.java:1410:in
loop'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:232:in checkout'
/opt/torquebox/jruby/lib/ruby/1.9/monitor.rb:201:in
mon_synchronize'
/opt/torquebox/jruby/lib/ruby/1.9/monitor.rb:200:in mon_synchronize'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:229:in
checkout'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:95:in connection'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_pool.rb:398:in
retrieve_connection'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_specification.rb:168:in retrieve_connection'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/connection_adapters/abstract/connection_specification.rb:142:in
connection'
org/jruby/RubyBasicObject.java:1698:in __send__'
/opt/grandcentral/shared/bundle/jruby/1.9/gems/activerecord-3.2.3/lib/active_record/relation/delegation.rb:7:in
connection'

From looking at the implementation of clear_stale_cached_connections!, is it possible there's a race condition? Maybe 1 thread removes the connection from @reserved_connections just before a 2nd thread calls conn = @reserved_connections[key]. (I'm somewhat unsure how this could happen since it appears to be called in a "synchronized" block, but I've only briefly looked at the code.)

Does anyone have any ideas on what we're encountering and how to fix it?

Thanks,
Andy

Andy Goldstein
ncdc commented April 04, 2012

I just found #5491 - maybe that's my problem?

Jonathan Rochkind

@ncdc this bug would have caused problems for you under 3.2.1 in that scenario, although I can't say for sure if it did cause the problem you saw in 3.2.1.

What did your stack traces look like under 3.2.1? The same or different?

The particular problem you are seeing under 3.2.3, is, I think, not this problem -- but it's hard to be sure from the info we have, debugging multi-threaded code is hard. It's possible the fix for this issue actually caused a regression; I thnk it's more likely we fixed one bug allowing an entirely separate bug that effects you to be revealed, that you weren't seeing before because the first bug kept you from getting there.

I agree that #5491 sounds like your problem. Note it was closed without a fix, because it no longer applies to Rails master, but it still applies to Rails 3.2.x. Can you reproduce the problem on demand in your system? You could try wrapping lines 263-265 in a synchronize block and see if that makes the problem unreproducible. And/or you could try wrapping the entire body of the clear_stale_active_connections! method in a synchronize block. I'll admit that this is just sort of debugging by flipping bits though, to really get to the bottom of what's going on would require some painful debugging.

You also might want to check to see if your application might be violating the ConnectionPool contract by checking out connections (explicitly or implicitly by use) and not checking them back in. Rails 3.2.x (as well as previous) is supposed to clean up after you if you accidentally leave connections not checked in -- but the stack trace you post leads right into the code that's supposed to do that. Now the code runs even if you haven't left any 'orphaned' connections around, but it will run/do less if you haven't -- if you are leaving orphaned connections around, it could be causing the bug you've found to manifest, or at least to manifest more often. So make sure you understand what that ConnectionPool documentation is saying about how you are responsible for checking connections back in under multi-threaded use, and your options for doing so (depending on how they got checked out), and at least review your code to see if it looks like it's doing right.

Hope this helps, sorry I can't say exactly what's going on, debugging multi-threaded stuff here is hard, spent many hours identifying the problem mentioned in this issue. (which would have been causing problems one way or another for you too, although fixing it sadly still leaves you with a problem)

Andy Goldstein
ncdc commented April 05, 2012

@jrochkind the stack traces were always the same for us - exactly what was reported for #5491.

We are able to reproduce this fairly easily, and the problem seems to go away if we explicitly close the connection in our backgrounded method. Given that our job and backgrounded method aren't running through a controller/ActionPack, does that mean we do need to explicitly close the connection when we're done with it? I'm having a bit of trouble figuring that answer out - if we call a finder on a model (which ends up grabbing a connection from the pool), is it up to me to call close() on the connection?

Jonathan Rochkind

@ncdc review the docs for ConnectionPool. You are responsible for checking back in a connection you use. That's not ordinarily by calling #close though.

The way ActiveRecord works right now, even if you don't explicitly check out a connection, one will be automatically checked out for you on first use. You're still responsible for checking it back in.

As per the documentation, if you just use AR without explicitly checking out a connection, the documented way to check in any auto-checked-out connections is by calling ActiveRecord::Base.clear_active_connections! when you are done with AR in that thread. Yes, something in controller/ActionPack ordinarily does that for you -- but not when it's in a thread you create yourself, or code that's not in the controller loop.

Another alternative, as mentioned in the documentation, is wrapping all your AR code in a ActiveRecord::Base.connection_pool.with_connection block. I personally prefer that for multi-threaded or non-controller-loop AR use.

You are seeing a bug of some kind in AR. It's possible however that if you properly check in your connections you won't trigger the bug anymore; your experience so far makes that seem promissing.

I think (but could be wrong) that calling #close will actually close the network connection to the database instead of leaving it open in the pool, so that's probably NOT what you want, although it's not neccesarily disastrous (a new connection should be automatically opened by AR if needed later), so if it seems to work, hey, take what works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.