Rails 4.0.0 and 4.0.1 PostgreSQLAdapter does not properly check whether a connection is active #12867

Closed
aetherknight opened this Issue Nov 12, 2013 · 103 comments

TL;DR: Rails 4's PostgreSQLAdapter does not properly check whether a connection has died, because connect_poll does not detect that the underlying TCP socket has closed.

Rails 4.0.0 and 4.0.1 do not properly detect whether a PostgreSQL connection is still alive before using it. A previously good connection that has been disconnected by the remote server or by an intermediate proxy will not be detected by Rails, which results in queries failing. An example error that we ran into:

    PG::UnableToSend: SSL SYSCALL error: EOF detected

    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql_adapter.rb:774:in `exec'
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql_adapter.rb:774:in `exec_no_cache'
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:138:in `block in exec_query'
    activerecord-4.0.1/lib/active_record/connection_adapters/abstract_adapter.rb:435:in `block in log'
    activesupport-4.0.1/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
    activerecord-4.0.1/lib/active_record/connection_adapters/abstract_adapter.rb:430:in `log'
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:137:in `exec_query'
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql_adapter.rb:921:in `column_definitions'
    activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/schema_statements.rb:174:in `columns'
    activerecord-4.0.1/lib/active_record/connection_adapters/schema_cache.rb:114:in `block in prepare_default_proc'
    activerecord-4.0.1/lib/active_record/connection_adapters/schema_cache.rb:56:in `yield'
    activerecord-4.0.1/lib/active_record/connection_adapters/schema_cache.rb:56:in `columns'
    activerecord-4.0.1/lib/active_record/model_schema.rb:208:in `columns'
    activerecord-4.0.1/lib/active_record/model_schema.rb:217:in `columns_hash'
    activerecord-4.0.1/lib/active_record/relation/delegation.rb:14:in `columns_hash'
    activerecord-4.0.1/lib/active_record/relation/finder_methods.rb:278:in `find_one'
    activerecord-4.0.1/lib/active_record/relation/finder_methods.rb:268:in `find_with_ids'
    activerecord-4.0.1/lib/active_record/relation/finder_methods.rb:35:in `find'
    activerecord-deprecated_finders-1.0.3/lib/active_record/deprecated_finders/relation.rb:122:in `find'
    activerecord-4.0.1/lib/active_record/relation.rb:334:in `update'
    activerecord-4.0.1/lib/active_record/querying.rb:7:in `update'
    (our controller method)
    ...

This error began occurring when we moved from Rails 3.2.15 to Rails 4.0.0 and then 4.0.1. Our database connections were being terminated by an HAProxy in our setup that terminates idle TCP connections. Rails 3.2 had no issue with this --- it would detect the dead connection and establish a new one without issue. Rails 4, on the other hand, does not. A subsequent request cycle caused a new connection to be created, which would then work as long as that connection was not killed as well.

It turns out, Rails already has a unit test to verify that a dead connection is properly detected, but it is "skipped" by default because it requires a human to manually restart the PostgreSQL database (although there are ways this could be automated). When ActiveRecord::PostgresqlConnectionTest#test_reconnection_after_actual_disconnection_with_verify is run in Rails 4.0.1:

cd activerecord
$EDITOR test/config.yml # Set with_manual_interventions to true instead of false
ARCONN=postgresql ruby -Itest test/cases/adapters/postgresql/connection_test.rb

It fails, along with the tests performed after it.

The above error, as well as the failed test, is caused by PostgreSQLAdapter not detecting when the underlying TCP connection has closed: libpq ran into an EOF because the underlying TCP socket was already closed.

When the connection pool determines whether a given database connection is still valid, it calls PostgreSQLAdapter#active?:

      # Is this connection alive and ready for queries?
      def active?
        @connection.connect_poll != PG::PGRES_POLLING_FAILED
      rescue PGError
        false
      end

Ruby-pg's PGConn#connect_poll is a thin wrapper around PQconnectPoll() from libpq. However, PQconnectPoll() is meant for setting up a Postgres client connection in a non-blocking manner after the connection's socket is ready for reading/writing (as determined by select() or poll()) --- it does not actually perform any health checks on the connection. In fact, if the last known state of the connection is good, it immediately returns:

(From src/interface/libpq/fe-connect.c in the postgresql source)

PostgresPollingStatusType
PQconnectPoll(PGconn *conn)
{
    PGresult   *res;
    char        sebuf[256];
    int         optval;

    if (conn == NULL)
        return PGRES_POLLING_FAILED;

    /* Get the new data */
    switch (conn->status)
    {
            /*
             * We really shouldn't have been polled in these two cases, but we
             * can handle it.
             */
        case CONNECTION_BAD:
            return PGRES_POLLING_FAILED;
        case CONNECTION_OK:
            return PGRES_POLLING_OK;

...

This means that connect_poll is not sufficient for detecting an inactive connection that has been closed by the other end. The Rails 3.2.15 version of active? successfully detects the lost TCP connection by actually performing a database query, which exercises the connection and would fail if the connection were terminated:

      # Is this connection alive and ready for queries?
      def active?
        @connection.query 'SELECT 1'
        true
      rescue PGError
        false
      end

This code change was introduced in 34c7e73.

To Fix

There is no easy way to determine if a TCP socket is dead other than by trying to use the connection (as was done in Rails 3.2.15 and earlier, and is also still used by the mysql adapters).

Traditionally, a remotely closed TCP connection can be detected by calling recv on the socket. If recv returns a string of length 0, then the remote side has closed the connection. If the socket is non-blocking but has not been closed, it will return an EAGAIN error. However, this approach does not work if there is data buffered on the TCP stream. I attempted to come up with a fix that uses recv_nonblock to peek ahead in the stream:

      def socket_alive?
        socket = Socket.for_fd(@connection.socket)
        socket.autoclose = false
        socket.recv_nonblock(1, Socket::MSG_PEEK) != ''
      rescue Errno::EAGAIN
        true
      end

      # Is this connection alive and ready for queries?
      def active?
        socket_alive? && @connection.connect_poll != PG::PGRES_POLLING_FAILED
      rescue PGError
        false
      end

However, if the server closes the connection, as it does during test_reconnection_after_actual_disconnection_with_verify, it sends a final message to the client that won't have been consumed yet when socket_alive? runs:

[3] pry(#<ActiveRecord::ConnectionAdapters::PostgreSQLAdapter>)> socket.recv_nonblock(1000, Socket::MSG_PEEK)
=> "E\x00\x00\x00mSFATAL\x00C57P01\x00Mterminating connection due to administrator command\x00Fpostgres.c\x00L2855\x00RProcessInterrupts\x00\x00"

There is actually a deeper problem here involving a time-of-check-to-time-of-use (TOCTTOU) weakness where the connection might die after checking the connection but before trying to use it. A better solution would be to detect the dead connection while trying to use it, and to then attempt to reestablish the connection when it dies. ActiveRecord should only let the error bubble up if the reconnect attempt fails.
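
To illustrate the reconnect-on-use idea, here is a minimal sketch against the raw pg gem (the helper name is hypothetical, and it assumes the statement is safe to retry once, which is not true for arbitrary writes):

    require "pg"

    # Hypothetical helper: detect the dead connection while using it, retry
    # once after re-establishing it, and let the error bubble up only if the
    # reconnect or the retry also fails.
    def exec_with_reconnect(conn, sql)
      conn.exec(sql)
    rescue PG::UnableToSend, PG::ConnectionBad
      conn.reset          # blocking re-establish of the same connection
      conn.exec(sql)
    end

    # conn = PG.connect(dbname: "mydb")   # placeholder connection settings
    # exec_with_reconnect(conn, "SELECT 1")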

Hi @tenderlove, your change to #active? in the mentioned commit (34c7e73) 8 months ago has been causing us some major pain with recovering from dead database connections. What sort of trouble did a SELECT 1 query cause during the connection pool's reaping?

Owner

tenderlove commented Nov 12, 2013

Reaping happens in a background thread. We can't send any data over the socket because there are no locks around the file descriptor (that I know of).


Member

jonleighton commented Nov 18, 2013

Awesome bug report @aetherknight. I'm also experiencing a variant of this problem in my application - the StatementPool#connection_active? method is also broken in a similar way.

Member

jonleighton commented Nov 18, 2013

Here's the related error which can occur when calling establish_connection:

PG::UnableToSend: SSL SYSCALL error: Connection timed out 
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `exec`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:517:in `dealloc`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:500:in `block in clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `each_value`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:499:in `clear`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:563:in `clear_cache!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract_adapter.rb:339:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/postgresql_adapter.rb:588:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:310:in `block (2 levels) in disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `each`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:308:in `block in disconnect!`
/usr/local/rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/monitor.rb:211:in `mon_synchronize`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:306:in `disconnect!`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_adapters/abstract/connection_pool.rb:564:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:88:in `remove_connection`
[GEM_ROOT]/gems/activerecord-4.0.1.rc2/lib/active_record/connection_handling.rb:45:in `establish_connection`

So I think there may be a few possible approaches to fixing this:

  1. Implement correct TCP connection handling. When a connection abnormally closes during use, AR should attempt to reconnect automatically. The goal here is to also eliminate the TOCTTOU issue that exists between checking whether a connection is still good and then trying to do work on it, and it would only cause an error on a closed connection if it can't reestablish the connection, or if there's an additional error during the reestablish attempt. This approach would make the #active? and #connection_active? methods not particularly meaningful anymore (since you discover a dead connection by attempting to do work on it), and reaping stale connections in the ConnectionManager would need to only be based on connections that weren't checked back in (the current logic only reaps a connection that is in use, is stale, and is no longer active).
  2. Wrap uses of the PG connection with a monitor/lock within the PostgreSQLAdapter to prevent 2+ threads (the worker/request handling thread and the reaper thread) from doing work on the PG connection object. #active? and #connection_active? then would probably need to handle timeouts on the lock, or their caller would need to handle that case.

For the latter approach, #active? and #connection_active? would then do work on the connection (e.g., go back to SELECT 1) within the synchronize/monitor/lock code; a rough sketch of what that might look like follows.
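
A minimal sketch of that locking (illustrative class and method names only, not the adapter's actual internals):

    require "monitor"
    require "pg"

    # Illustrative only: serialize all use of the raw PG connection so the
    # reaper thread and a worker thread can never touch the socket at once.
    class LockedConnection
      def initialize(pg_connection)
        @connection = pg_connection
        @lock = Monitor.new
      end

      def execute(sql)
        @lock.synchronize { @connection.exec(sql) }
      end

      # Safe to call from the reaper: while we hold the lock, no other thread
      # can be mid-query on the same file descriptor.
      def active?
        @lock.synchronize { @connection.query "SELECT 1" }
        true
      rescue PG::Error
        false
      end
    end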

Member

jonleighton commented Nov 18, 2013

Yep. The former definitely makes more sense to me.

Owner

tenderlove commented Nov 25, 2013

@aetherknight I agree, we should go with solution #1. Seems the mysql adapter has a setting to automatically do this, but the pg adapter does not. We'll have to rescue exceptions wherever we call methods on the connection, then call PGConn#reset. Though we should probably do a loop of PGConn#reset_start + PGConn#reset_poll so we don't block the process.
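
For reference, a non-blocking reset along those lines might look roughly like this (a sketch using ruby-pg's reset_start/reset_poll; the helper name and timeout handling are illustrative):

    require "pg"

    # Sketch: re-establish a dead connection without blocking the process,
    # by driving libpq's polling API ourselves.
    def nonblocking_reset(conn, timeout = 5)
      conn.reset_start
      deadline = Time.now + timeout
      status = PG::PGRES_POLLING_WRITING

      until status == PG::PGRES_POLLING_OK || status == PG::PGRES_POLLING_FAILED
        remaining = deadline - Time.now
        raise PG::ConnectionBad, "reset timed out" if remaining <= 0

        io = IO.for_fd(conn.socket)
        io.autoclose = false
        case status
        when PG::PGRES_POLLING_READING then IO.select([io], nil, nil, remaining)
        when PG::PGRES_POLLING_WRITING then IO.select(nil, [io], nil, remaining)
        end
        status = conn.reset_poll
      end

      raise PG::ConnectionBad, "reset failed" if status == PG::PGRES_POLLING_FAILED
      conn
    end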

Contributor

bradleypriest commented Nov 26, 2013

Will another stacktrace help out here?
I'm getting intermittent ruby segfaults with a Capybara test suite https://gist.github.com/bradleypriest/7660331 in the same place

Same issue, using sidekiq, but not sure that is the main cause of the problem.
Is any fix already pending? My app totally broke and gives ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within ... after about 1 or 2 page reloads.

Is there a temporary fix for this issue?
It's preventing my app from running at all, locally or in production.
Tried increasing the pg pool and timeout; it all fails and this problem comes back every time.
The connection pool is now at 100, which still doesn't help. There are only 10 active pg connections.
How is the rest of the world running their Rails apps on PG while this issue persists?

Stacktrace:
https://gist.github.com/rubytastic/7786482

Contributor

mperham commented Dec 4, 2013

@rubytastic You could determine why your db connections are dying so quickly. I don't have any pgsql experience but I know MySQL has a timeout setting where the server will kill any clients that haven't used the socket in that time period.

For the short term, we monkeypatched the PostgreSQLAdapter to use the older version of #active? that performed a SELECT 1 query, since we aren't using Ruby threading to handle multiple requests, nor are we using the reaper in its own thread.
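
The monkeypatch is essentially the Rails 3.2 method quoted earlier, reopened in an initializer. A minimal sketch (the file name is arbitrary, and this is only reasonable when a single thread uses each connection):

    # config/initializers/postgresql_adapter_active_patch.rb
    require "active_record/connection_adapters/postgresql_adapter"

    module ActiveRecord
      module ConnectionAdapters
        class PostgreSQLAdapter
          # Restore the Rails 3.2 behaviour: actually exercise the connection.
          def active?
            @connection.query "SELECT 1"
            true
          rescue PG::Error
            false
          end
        end
      end
    end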


Contributor

mperham commented Dec 4, 2013

@aetherknight Unfortunately that's not an option for Sidekiq users.

@mperham Ah, good point.

@aetherknight
Total PG newbie, just switched from MySQL.
Do I understand correctly that there is no solid solution, nor will one be implemented in the short term? Then this is a serious issue for the PG and Sidekiq(?) combination; I wonder why so few people have this issue.

So is switching back to MySQL the best option for now, or is there a monkey patch that could be applied to fix the problem temporarily?

Owner

matthewd commented Dec 14, 2013

@tenderlove by my reading, solution 2 is necessary.

If you can't tell whether a connection is #active?, you either never reap it, or you regularly reap (stale) still-active connections -- and so the server could end up with many more connections than your pool limit.

But then, maybe I'm just missing something. Because as far as I can see, neither of the mysql adapters is any safer from a reaper in a different thread than the SELECTive postgres version.

@matthewd I am beginning to think that both solutions are ultimately necessary:

#active? is fundamentally a bad way of determining if any network connection is still alive because there is still a race condition between when #active? is called and when the connection is then used (this does not apply to the reaper's use of #active? but to other uses such as in #verify!). This can only be solved by solution 1.

Also, if I understand the current reaper code correctly, it will only reap a connection if it is currently in use, it has been longer than dead_connection_timeout since it was checked out, and #active? reports that the connection is dead. That means that it will not reap (close and remove from the connection pool) a dangling connection from a dead thread until the OS, the network, or the remote server kills the connection --- which can take a minute or more of inactivity on the connection. However, this may be the only way to determine whether the connection is dangling without pulling a connection out from under a long-lived thread (as may be the case with sidekiq jobs). That suggests we also need solution 2, in order to allow the reaper to actually detect dead connections.

Contributor

mperham commented Dec 15, 2013

All this stems from the fact that the connection is not self-repairing, yes? Dalli and Redis client connections are pooled with my connection_pool gem all the time. connection_pool documents that all connections should be self-repairing so that the pool doesn't need to worry about reaping or repairing.


Owner

tenderlove commented Dec 15, 2013

@mperham not just self repairing. If a thread checks out a connection, then dies before checking the connection back in, what happens? This is why the reaper checks on "long lived" connections.

Contributor

mperham commented Dec 16, 2013

@tenderlove If a thread somehow does not check in connections, they'd quickly exhaust. What's wrong with that? connection_pool only documents block usage with_connection {|conn| ... } to ensure it's automatically checked back in. Which AR API can leak connections? You talking about conn = Model.connection?

Owner

tenderlove commented Dec 16, 2013

@mperham yes, that will leak a connection if done inside a Thread.new { } (and you don't specifically check it back in)

Contributor

mperham commented Dec 16, 2013

Yep, that form of API should be deprecated in preference for with_connection in Rails 5 IMO. I know that doesn't help today though. :-/
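
For anyone following along, the difference looks like this (Widget is a placeholder model; only the block form is guaranteed to return the connection to the pool):

    # Leaks the checked-out connection unless the thread explicitly checks it
    # back in before exiting:
    Thread.new do
      Widget.connection.execute("SELECT 1")
    end

    # Checked back in automatically when the block exits, even on error:
    Thread.new do
      ActiveRecord::Base.connection_pool.with_connection do |conn|
        conn.execute("SELECT 1")
      end
    end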

mrbrdo referenced this issue in jeremyevans/sequel Dec 16, 2013

Closed

PG::UnableToSend: SSL SYSCALL error: EOF detected #746

Owner

matthewd commented Dec 16, 2013

@aetherknight Yeah, I could definitely get behind an idea of arranging automatic reconnection iff you're performing a SELECT (or BEGIN), and there isn't [already] supposed to be a transaction open.

If you're trying to send an INSERT, say, you'd want to be very sure the exception meant "driver failed to send the query", and not "driver lost connection while waiting to hear what happened", before silently retrying.
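
A rough illustration of that caution (hypothetical helper; the regexp and the idle-transaction check are simplifications, not a complete safety argument):

    require "pg"

    RETRYABLE_SQL = /\A\s*(SELECT|BEGIN)\b/i

    def exec_with_cautious_retry(conn, sql)
      in_txn = conn.transaction_status != PG::PQTRANS_IDLE
      conn.exec(sql)
    rescue PG::UnableToSend
      # Only retry when no transaction was supposed to be open and the
      # statement is one we consider harmless to re-send; anything else
      # bubbles up, because the server may already have acted on it.
      raise if in_txn || sql !~ RETRYABLE_SQL
      conn.reset
      conn.exec(sql)
    end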

Owner

matthewd commented Dec 16, 2013

@tenderlove can/does the pool keep track of which thread checked out the connection? We can determine pretty easily that the thread's gone away... though whether you trust that it's safe to return the connection to the pool, vs just tell it to close, is another matter.

Or is there a worry about code that goes out of its way to check out a connection in one thread, hand it to another, then have the first thread die?

Owner

tenderlove commented Dec 17, 2013

@matthewd The pool does keep track of which thread checks out the connection, but you shouldn't rely on it (it uses the object id of the thread). We used to loop through every thread in the system, check its status, then check the pool, but I'd prefer a solution that doesn't depend on doing that. We could probably stuff the thread in a hash or something when it checks out, then just have the reaper check connections for those threads.

We shouldn't worry about one thread passing the connection to another. I don't think we support that (I think it's documented too).

Also, we should probably split this in to two issues. One for self-repairing connections, and one for dealing with long lived and possibly dead threads. I tried adding code for self-repairing connections, but TBH I can't repro this issue and I don't want to rescue from just any old PG exception.
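
A toy sketch of that bookkeeping (illustrative names, not the pool's real data structures):

    # Remember which thread checked out each connection, and let the reaper
    # reclaim connections whose owning thread has died, without ever having
    # to touch the socket to decide.
    class ThreadAwareRegistry
      def initialize
        @owners = {}           # connection => owning Thread
        @mutex  = Mutex.new
      end

      def checked_out(conn)
        @mutex.synchronize { @owners[conn] = Thread.current }
      end

      def checked_in(conn)
        @mutex.synchronize { @owners.delete(conn) }
      end

      # Connections whose owners are no longer alive; the pool can then
      # disconnect and remove them.
      def reapable
        @mutex.synchronize do
          dead = @owners.reject { |_conn, thread| thread.alive? }.keys
          dead.each { |conn| @owners.delete(conn) }
          dead
        end
      end
    end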

Owner

matthewd commented Dec 18, 2013

@tenderlove I did a thing: matthewd/rails@master...pg_verify

Does that look at all sane?

As far as I can see, those two tests both need to pass for the reaper and checkout_and_verify to both function correctly... and no less-invasive solution comes to mind.

Owner

tenderlove commented Dec 18, 2013

It looks sane, but I think you can make the locks more granular, and possibly remove some locks. For example, the locks should probably be inside the "log" calls.

The connection shouldn't be shared among threads (except the reaper). Does the reaper impact the statement cache? I'm not sure we need that lock.

Shouldn't there be a lock in "active?"

Looks good!


Owner

matthewd commented Dec 18, 2013

I wrapped the logging so we weren't claiming to be running a query when we were actually waiting, but that's easy to swap if you prefer.

For the connection management methods, I made the locks wider out of principle -- it's implausible to ever affect a SELECT 1, but for purity, nothing ought to see a half-configured connection.

Deleting from the statement cache sends a DEALLOCATE to the raw connection (as can writing, because eviction).

Yes, I managed to lose the most important lock while splitting the commits. How about I fix that? 😊

Anyway, if you're happy this is a reasonable option, I'll turn it into a pull request shortly, and we can continue discussion over there.

@tenderlove how have you attempted to repro this? I'd try setting up something that closes the connection on the DB side some time after a thread has established the connection. We're using PGBouncer between Rails 4 and PG, and are getting the PG::UnableToSend exceptions regularly in threaded workers that sit inactive for longer than PGBouncer's timeout. I'm sure the OP's HAProxy setup provides a similar situation.

Actually, there's an even easier way to reproduce in Rails 4.0.2:

2.0.0-p353> ActiveRecord::Base.connection.execute 'SELECT 1'
 => #<PG::Result ...

# kill and restart Postgres

2.0.0-p353> ActiveRecord::Base.connection.active?
 => true
2.0.0-p353> ActiveRecord::Base.connection.execute 'SELECT 1'
PG::UnableToSend: : SELECT 1

I'm kind of surprised at the lack of attention (from users) this issue has gotten. Seems like this could be a serious issue for a lot of users, no?

Contributor

mperham commented Jan 3, 2014

No, most people don't put a proxy between their app server and db server which kills idle connections.


But, if I understand correctly, this issue is also present when you restart your database server, for example. Right now with Rails 3.2 you don't need to restart your application (which is expensive); with this issue in Rails 4 you cannot do the same. This, to me, seems the most obvious case that most people should run into.

@mperham most people needing to scale ActiveRecord + Postgres do exactly that. Unlike MySQL, Postgres can't handle more than a few hundred simultaneous connections (process per connection). If you have any sort of traffic, a proxy to multiplex these connections is crucial and recommended by Postgres admins everywhere (including Heroku).

I imagine the reason this issue hasn't been complained about is because most of the teams using ActiveRecord + Postgres at scale haven't moved to 4.0 yet.

Contributor

mperham commented Jan 3, 2014

Ok, we run MySQL and haven't restarted our database in the last year so I'm
obviously pretty far from this issue. Thanks for the info, very
informative.


Owner

matthewd commented Jan 4, 2014

@tanelsuurhans Indeed; error fallout from a DB restart (for a minor version update) is precisely what brought me here.

Technically, you don't need to restart your application... you'll only get one error per newly-dead connection, so depending on your server setup, you could contrive a single destined-to-fail request to each worker process, or something.

Similarly, for most of my apps (largely because the DB restart was in the middle of the night), the errors were actually all 'absorbed' by our normal site monitoring, and everything had fixed itself before a real user came along.

Owner

matthewd commented Jan 4, 2014

@kainosnoema For a number of (fairly common) scaling profiles, a connection proxy is absolutely The Right Thing. But I'm not so sure that most people do the "which kills idle [client] connections" part.

Having the same issue with mysql and postgres since upgrading to rails 4.x (tested 4.0.1 and 4.0.2);
it always times out with:

could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)

Just plain OSX + rails + mysql or postgres, no fancy server setup. In development mode using Puma.
No complex queries; reloading any page a few times makes this occur.

@rubytastic It sounds like you need to increase your database connection pool size. Does not sound like it's related to this issue.

@tanelsuurhans
I have the pool set to 16; it makes no difference. 100, 16, or 5, it always times out within a few reloads.

bcarreno added a commit to bcarreno/abagnale that referenced this issue Jan 21, 2014

There's an issue with Rails 4 and Postgres Adapter:
'could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)'
rails/rails#12867

@rubytastic did you ever find a solution to your problem? I am having a very similar issue with Rails 4 and Postgresql

@BobaFaux
No. Although I haven't looked at this in the past month, the problem seems to persist, with both MySQL and Postgres. I find it strange that no one else has brought this issue to attention or had similar issues.

We are experiencing this issue frequently (multiple times per day, in a low-throughput staging environment) without any sort of proxy between the two. We use a dedicated postgresql server which services multiple ruby applications, and a separate application server running multiple ruby applications connecting to it. We only get exceptions for this on rails 4 (currently 4.0.2); our rails 3 applications do not appear to be affected. The affected app (and other apps too) runs under passenger (which forks), and also has background workers under sidekiq (multi-threaded).

The connection pool being more highly contended due to the number of applications may be exacerbating this issue for us, but the problem is that rails 3 is able to handle this situation gracefully while rails 4 is not.

pyro2927 commented Feb 5, 2014

Sadly I don't have any sort of solution, but I wanted to chime in that I'm also having this issue when using Sidekiq workers.

jalberto commented Feb 6, 2014

I'm having a similar problem with rufus-scheduler on passenger + rails 4.0.2.

Trying to use reaper_frequency: 10, but at the moment it only delays the problem.

This should seriously be fixed / looked at by one of the core members.
Can we raise some attention to this issue? @aetherknight @tenderlove

@pyro2927
I believe this issue is not caused by using sidekiq in your product. It happens every time

pyro2927 commented Feb 6, 2014

@rubytastic, I realize it's not caused by Sidekiq, but switching to using Resque worked around the issue.

Contributor

tonywok commented Feb 6, 2014

I'm running into this problem as well on rails 4.0.2 and pg 9.3

I have a long running request (rarely hit) that I think is causing other requests to queue, and eventually time out in some cases. I'm assuming those abrupt timeouts are leaving postgres connections open - and eventually leading to the issue.

I'd like to get a quick fix into my app while I take a stab at a real solution.

Sorry if this is a naive question, @mperham, but if monkey patching the old behavior isn't sufficient for sidekiq users, how did sidekiq not experience an issue in versions of rails 3 where this problem didn't exist? I was hoping to do that in the meantime. I only ask as it would be good information to know when trying to come up with a fix.

Thanks for all the excellent info on this issue :)

Contributor

mperham commented Feb 6, 2014

@tonywok I don't remember the reason I wrote that. Perhaps it will work, I don't recall.

As I recall, the concern with the Rails 3 behavior involves the ConnectionPool reaper running on a different thread, exercising the PG connection after a reaper timeout period:

      def reap
        synchronize do
          stale = Time.now - @dead_connection_timeout
          connections.dup.each do |conn|
            if conn.in_use? && stale > conn.last_use && !conn.active?
              remove conn
            end
          end
        end
      end

in_use is true when the connection has been checked out (leased), and last_use is the time when it was checked out. So conn.active? will be called by the reaper thread in sidekiq if a long-lived job takes longer than the dead_connection_timeout. Admittedly, the reaper will only reap the connection if the connection is also dead, in that either the OS, the network, or the remote host has killed the connection (provided that conn.active? correctly returns whether further work can be done on the connection).

Contributor

tonywok commented Feb 6, 2014

@aetherknight So as a stop gap, this sounds like it would be "safe" as long as your jobs don't exceed dead_connection_timeout?

@tonywok Probably. After I switched to using the Rails 3.x version of #active? I still ran into some issues with ConnectionPool's checkout process: if an existing connection fails to reconnect during checkout and throws an error (e.g. if the reconnect fails because the database is down), the connection is put into a limbo state where it has been checked out but is never checked back in or removed from the connection pool, which causes connection pool exhaustion (when there is no reaper). I had to add a rescue block to ConnectionPool#checkout to forcibly remove a connection from the ConnectionPool when an exception is raised during the checkout logic, in order to allow the ConnectionPool to create a new connection.

My current set of monkey patches address this as well as reverting #active? to its Rails 3 behavior: https://gist.github.com/aetherknight/3d0af086a92bac8f3ee9

The patches assume a single thread per process (until the pg connection is properly wrapped in synchronize to fix #active?), but as long as the reaper does not run and jobs always terminate, it is probably safe for sidekiq as well.
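
The shape of that checkout clean-up, roughly (an assumed sketch rather than the gist verbatim; checkout_and_verify is the private Rails 4.0 hook that runs the checkout callbacks and #verify! on a freshly acquired connection):

    # Assumed sketch: if verifying a freshly checked-out connection blows up,
    # remove that connection from the pool so its slot isn't leaked.
    module CheckoutVerifyCleanup
      def checkout_and_verify(conn)
        super
      rescue Exception
        remove conn
        conn.disconnect! rescue nil
        raise
      end
    end

    ActiveRecord::ConnectionAdapters::ConnectionPool.prepend(CheckoutVerifyCleanup)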

@aetherknight
Added your patches to initializers, although the problem persists. I'm not using haproxy; could you please elaborate a bit more on how to set it up?

Contributor

tonywok commented Feb 7, 2014

@aetherknight Thanks so much for the info. I was planning on making an attempt at a fix tonight. Not my area of expertise, but worth a shot I suppose. After using the monkey patch for a while, do you have any new thoughts on how to fix it that I should consider before digging in?

Contributor

mperham commented Feb 25, 2014

AFAIK it has nothing to do with sidekiq. Any multithreaded environment will trigger this.

On Feb 24, 2014, at 18:20, marcusbaguley notifications@github.com wrote:

Hi I am still getting this with a sidekiq worker with SHA ad84512 which includes the pull request 14036

gem 'rails', git: "https://github.com/rails/rails.git", branch: '4-0-stable'

Error Message:
PG::UnableToSend: SSL SYSCALL error: EOF detected

Where:
[GEM_ROOT]/bundler/gems/rails-ad84512f89b4/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb, line 525

Is this a particular issue with sidekiq and pg?



Owner

matthewd commented Feb 25, 2014

Well as I understand it, this issue (as opposed to the also-discussed pool starvation issue) actually has nothing to do with multithreading.

But I'm confused that you'd still be seeing that error since 9337f64 restored the previous behaviour of #active?.

@marcusbaguley you're seeing that when the database server has already gone away before the worker starts, right? It's expected that you'll see such an error if the server goes away after the connection gets checked out.

We're actually seeing the same problem after applying the patch as well. Only in long-lived threads. Not sure what's causing it, but the patch didn't exactly resolve things for us either.

Same issue here, the patch did not resolve the issue. The problem persists :(

Owner

matthewd commented Feb 25, 2014

@rubytastic you have a different error message, and a different issue: yours is a problem with pool starvation.

@kainosnoema "long-lived threads"? You mean if the connection goes away after it's already been checked out? That's not expected to work, and I'm not sure how it ever would've.

Contributor

pawel2105 commented Feb 25, 2014

Haven't experienced this issue in the last 5 days. The only factor then must have been the restart of my DB process.

@matthewd The database is on a different server, but the three rails3 apps we have running on the same servers do not report any database connection errors over the same time span.

lightswitch05 referenced this issue in rsim/oracle-enhanced Mar 3, 2014

Closed

Reaper not detecting closed connections #420

mjacobus commented Mar 5, 2014

Same problem here.

Rails 4.0.3 and MySQl

I "soved" this problem with two different approaches.

  1. By increasing pool from 5 to 7.
  2. By applying this monkey patch.
# config/initializers/ar.rb

module ActiveRecord
  module ConnectionAdapters
    class ConnectionPool
      class Queue
        def wait_poll(timeout)
        end
      end
    end
  end
end

Even with the above patch the problem persists.
Quite frustrating; I have suspended my project because of this issue for 2 months now.

My error:

could not obtain a database connection within 5.000 seconds (waited 5.001 seconds)
Application Frames All Frames
block in ActiveRecord::ConnectionAdapters::ConnectionPool::Queue#wait_poll
config/initializers/ar.rb, line 21
ActiveRecord::ConnectionAdapters::ConnectionPool::Queue#wait_poll
config/initializers/ar.rb, line 12

Full Trace:

https://gist.githubusercontent.com/anonymous/8985920d42e1df4db193/raw/fa5366d2a96138f4402392ccca4cb93af54da744/gistfile1.txt

Any ideas? Not sure it relates exactly to above.

mjacobus commented Mar 6, 2014

@rubytastic what rails version, pool size and database?

4.0.3, poolsize 10, postgres 9.3(osx app)

mjacobus commented Mar 6, 2014

@rubytastic Perhaps the postgres adapter deals differently with wait_poll, because if I raise an error inside this method my application still works.

def wait_poll(timeout)
  raise "shouldn't it crash now?"
end

@mjacobus
Correct, you're using MySQL; I overlooked that.
Still an unresolved issue though; does anyone have new suggestions?

Member

robin850 commented Mar 6, 2014

@rubytastic @mjacobus : The mentioned patch (i.e. 9337f64) hasn't been included in any release yet. You can look under the commit message to see the branch and the tags that include it ; this will be in the next 4.0.x non-security release. :-)

matthewd added a commit to matthewd/rails that referenced this issue Mar 7, 2014

Reap connections based on owning-thread death
.. not a general timeout.

Now, if a thread checks out a connection then dies, we can immediately
recover that connection and re-use it.

This should alleviate the pool exhaustion discussed in #12867. More
importantly, it entirely avoids the potential issues of the reaper
attempting to check whether connections are still active: as long as the
owning thread is alive, the connection is its business alone.

As a no-op reap is now trivial (only entails checking a thread status
per connection), we can also perform one in-line any time we decide to
sleep for a connection.


Thanks all for the efforts. So this will be in 4.0.4? Any ETA for that?
I'll try the monkey patch for now.

matthewd added a commit to matthewd/rails that referenced this issue Mar 12, 2014

Reap connections based on owning-thread death
Member

robin850 commented Mar 12, 2014

Thanks all for the efforts. So this will be in 4.0.4? Any ETA for that?

Yes, it will. Moreover, 4.0.4.rc1 was released yesterday. If you can, please give it a try! :-)

Tried tonight after upgrading one of my apps to the latest rails and ruby 2.1.1.
The problem still persists using the latest versions and a connection pool size of 10 :(
"could not obtain a database connection within 5.000 seconds (waited 5.001 seconds)"

Rails 4.0.4.rc1
Postgres app OSX Version 9.3.3.0 (21)
https://gist.github.com/anonymous/8de0b374a1a9588d21f9

development:
  adapter: postgresql
  encoding: utf8
  #reconnect: false or true or removed makes no difference
  database: books_development
  pool: 10
  timeout: 15000

I have the same issue in another application.

Owner

matthewd commented Mar 13, 2014

@rubytastic what's your puma thread pool size?

You're definitely not doing anything that would spawn additional threads?

JohnKellyFerguson referenced this issue in danbartlett/opensit Mar 15, 2014

Merged

Upgrade to Ruby 2.1.1 #73

matthewd added a commit to matthewd/rails that referenced this issue Mar 18, 2014

Reap connections based on owning-thread death

@matthewd

This is my puma.rb config (I tried higher values like 8 or 32 based on the puma documentation; the problem still persists). Very frustrating :(

rails_env = ENV['RAILS_ENV'] || 'development'
threads 4,4
bind        "unix:///srv/books/shared/puma.sock"
pidfile     "/srv/books/current/tmp/pids/puma.pid"
state_path  "/srv/books/current/tmp/puma/state"
activate_control_app

flujan commented Mar 18, 2014

Yes, same here. Still get this error. Using 4.2.0 alpha.

Owner

matthewd commented Mar 18, 2014

@rubytastic And none of your actions create extra threads?

Hmm... can you try latest Rails master? My change there should only improve things if you're spawning extra threads, but it's at least worth a try.

@flujan which error are you seeing? UnableToSend (believed fixed in 4.0.4), or ConnectionTimeoutError (possibly addressed in master ~12 hours ago)?

flujan commented Mar 18, 2014

@matthewd ConnectionTimeoutError.

I am getting these errors in an external script using active_record.

https://gist.github.com/flujan/9621099

I installed the master rails from github but don't know how to require the master "active_record" in it.

anazar commented Mar 18, 2014

We get this error using mysql on 4.1.0.rc1

Like the folks above, it still seems broken.
I have the same with Postgres and mysql; it's in the rails 4 active record layer. How can an issue like this persist so long (or get introduced) for 2 to 4 months now with so few people having this problem? This is a real rails killer for me, and sadly I don't have the core competence to fix it or supply thoughts on how to resolve it.

This problem started when upgrading from rails 3 to rails 4; after that it just timed out for me totally, and I have reverted back to PHP for several projects. Please, anyone, track this down; I really want my rails development env back! Is there anything we can do besides posting our configs and stack traces to get this problem fixed sooner?

Contributor

pawel2105 commented Mar 19, 2014

Interestingly this morning I was having asset issues.

I ran my app in production mode locally and discovered that, because my error controller wasn't handling requests for non-HTML formats, those assets would return a 500 instead of a 404. After refreshing the page 4 or 5 times while tailing my log, the page would stop loading and good old ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.001 seconds) reared its ugly head again.

This is on Rails 4.0.3 running on puma 2.8.0.

Don't know if this info will be useful to anyone.

@pawel2105
Your comment made me think about puma; I use it without much thought in every project. So I switched my server to thin in development mode.

Rails 4.0.4.rc1
thin 1.6.2 codename Doc Brown
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-darwin13.0]

Tried using Thin but couldn't validate whether the problem persists, since I have Rails 4 server-sent events in my project which are hard to remove. Perhaps it's an idea to test-drive whether the problem persists when using a non-puma server.

flujan commented Mar 19, 2014

I just took the tip from @matthewd and put a larger checkout timeout on my external script: a :checkout_timeout of 60 seconds. The error happens less frequently but still occurs.

@flujan
Tried that also, and all the bleeding-edge latest versions as of today, but the problem still persists.

ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within 5.000 seconds (waited 5.002 seconds):
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:190:in `block in wait_poll'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:181:in `wait_poll'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:136:in `block in poll'
  /Users/Pascal/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:146:in `synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:134:in `poll'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:423:in `acquire_connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:356:in `block in checkout'
  /Users/Pascal/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:355:in `checkout'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:265:in `block in connection'
  /Users/Pascal/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:264:in `connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:546:in `retrieve_connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_handling.rb:79:in `retrieve_connection'
  activerecord (4.0.4.rc1) lib/active_record/connection_handling.rb:53:in `connection'
  activerecord (4.0.4.rc1) lib/active_record/query_cache.rb:51:in `restore_query_cache_settings'
  activerecord (4.0.4.rc1) lib/active_record/query_cache.rb:43:in `rescue in call'
  activerecord (4.0.4.rc1) lib/active_record/query_cache.rb:32:in `call'
  activerecord (4.0.4.rc1) lib/active_record/connection_adapters/abstract/connection_pool.rb:626:in `call'
  activerecord (4.0.4.rc1) lib/active_record/migration.rb:373:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/callbacks.rb:29:in `block in call'
  activesupport (4.0.4.rc1) lib/active_support/callbacks.rb:373:in `_run__4419992542473468413__call__callbacks'
  activesupport (4.0.4.rc1) lib/active_support/callbacks.rb:80:in `run_callbacks'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/callbacks.rb:27:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/reloader.rb:64:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/remote_ip.rb:76:in `call'
  better_errors (1.1.0) lib/better_errors/middleware.rb:84:in `protected_app_call'
  better_errors (1.1.0) lib/better_errors/middleware.rb:79:in `better_errors_call'
  better_errors (1.1.0) lib/better_errors/middleware.rb:56:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/show_exceptions.rb:30:in `call'
  railties (4.0.4.rc1) lib/rails/rack/logger.rb:38:in `call_app'
  railties (4.0.4.rc1) lib/rails/rack/logger.rb:20:in `block in call'
  activesupport (4.0.4.rc1) lib/active_support/tagged_logging.rb:68:in `block in tagged'
  activesupport (4.0.4.rc1) lib/active_support/tagged_logging.rb:26:in `tagged'
  activesupport (4.0.4.rc1) lib/active_support/tagged_logging.rb:68:in `tagged'
  railties (4.0.4.rc1) lib/rails/rack/logger.rb:20:in `call'
  quiet_assets (1.0.2) lib/quiet_assets.rb:18:in `call_with_quiet_assets'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/request_id.rb:21:in `call'
  rack (1.5.2) lib/rack/methodoverride.rb:21:in `call'
  rack (1.5.2) lib/rack/runtime.rb:17:in `call'
  activesupport (4.0.4.rc1) lib/active_support/cache/strategy/local_cache.rb:83:in `call'
  rack-livereload (0.3.15) lib/rack/livereload.rb:23:in `_call'
  rack-livereload (0.3.15) lib/rack/livereload.rb:14:in `call'
  actionpack (4.0.4.rc1) lib/action_dispatch/middleware/static.rb:64:in `call'
  rack (1.5.2) lib/rack/sendfile.rb:112:in `call'
  railties (4.0.4.rc1) lib/rails/engine.rb:511:in `call'
  railties (4.0.4.rc1) lib/rails/application.rb:97:in `call'
  rack (1.5.2) lib/rack/content_length.rb:14:in `call'
  puma (2.8.1) lib/puma/server.rb:490:in `handle_request'
  puma (2.8.1) lib/puma/server.rb:361:in `process_client'
  puma (2.8.1) lib/puma/server.rb:254:in `block in run'
  puma (2.8.1) lib/puma/thread_pool.rb:92:in `block in spawn_thread'

flujan commented Mar 21, 2014

@rubytastic since we do not have enough knowledge to fix it and the issue is closed, we will need to wait for other bugs to happen and hope that the fix for one of them fixes this one too.

With ruby > 2.0 the use of threads will increase, and this kind of bug will surely happen more.

chris-maginatics added a commit to maginatics/rails that referenced this issue Apr 10, 2014

Reap connections based on owning-thread death

grahamjenson added a commit to LoyaltyNZ/rails that referenced this issue May 6, 2014

Reap connections based on owning-thread death

rivsc referenced this issue in activerecord-hackery/ransack May 6, 2014

Closed

Compatibility with Rails 4.1 and Arel 6 #368

barnett commented May 8, 2014

any update on possible solutions?

Owner

matthewd commented May 8, 2014

@blklane this thread ended up discussing two very different issues... which one are you experiencing, and on which Rails version?

barnett commented May 8, 2014

@matthewd 4.0.1 and ruby 2.0 with the following error

 Error during failsafe response: could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)
2014-05-08T14:08:49.009040+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:190:in `block in wait_poll'
2014-05-08T14:08:49.009042+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:181:in `loop'
2014-05-08T14:08:49.009044+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:181:in `wait_poll'
2014-05-08T14:08:49.009046+00:00 app[web.1]:   /app/vendor/bundle/ruby/2.0.0/gems/activerecord-4.0.2/lib/active_record/connection_adapters/abstract/connection_pool.rb:136:in `block in poll'
Owner

matthewd commented May 8, 2014

@blklane 4.1 revised the connection pool handling to address some sources of starvation (around thread death), so you could try that... but the most common cause I've seen has been people genuinely starving the pool: running more DB-touching threads than they have connections in their pool. And there's not currently much we can do about that. 😕
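
As a rule of thumb (illustrative numbers only): the pool value in database.yml needs to be at least as large as the maximum number of threads that can touch the database at once, e.g. with Puma:

    # config/puma.rb -- illustrative: if up to 16 threads can run queries,
    # database.yml needs pool: 16 (or more, to cover any background threads).
    threads 1, 16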

@blklane Rails 4.0.4+ made a change to the PostgreSQL connection adapter so that the normal request-response cycle correctly detects when a connection is no longer active, which should help the ConnectionPool self-heal after a DB connection dies. That might help you.

However, there are still some other DB connection issues in 4.0.x that could lead to the ConnectionPool exhausting its pool of possible connections. Part of a gist I posted further up attempts to address a scenario where, if an error occurs during checkout, the ConnectionPool cleans up the connection that failed to be fully checked out: https://gist.github.com/aetherknight/3d0af086a92bac8f3ee9#file-active_record_patches-rb-L76

I haven't looked into Rails 4.1's connection management yet.

I believe this is related to this issue:

I run into pool starvation when I perform any database operations while handling errors (404 not found) via config.exceptions_app. Refreshing a 404 page 4 or 5 times seems to starve my connection pool. I work around it by removing DB operations from the error pages. Posting here in case anyone else sees this.

Let me know if anyone wants more information on my setup for this.

I still have this issue with the latest rails, ruby, postgres, all stable versions.

rivsc commented Jun 6, 2014

Hi,

My error message :

INFO -- : Started POST "/my_ctrl/import" for X.X.X.X at 2014-06-06 15:38:52 +0200
INFO -- : Processing by MyController#import as HTML
INFO -- : Parameters: {...}
ERROR -- : PG::UnableToSend: : INSERT INTO "..."
ERROR -- : PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK
INFO -- : Completed 500 Internal Server Error in 5115ms
FATAL -- :
ActiveRecord::StatementInvalid (PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK):

With this infrastructure (also tested with ruby 1.9.3 and rails 4.0.4):

(infrastructure schema image)

@rubytastic: Is your postgresql on the same server as the rails app?

PS: I'm using Apartment (a gem for multi-tenant apps).

[Note : I work with @rivsc]

I added these lines at the end of /etc/sysctl.conf on machine pgpool2 (cf. the infrastructure schema from @rivsc comment above)

# KeepAlives tuning to detect dead PG connections faster
net.ipv4.tcp_keepalive_time = 3
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9

And I pushed config changes to procfs to avoid a reboot :
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 9 > /proc/sys/net/ipv4/tcp_keepalive_probes

According to @rivsc, it solved the problem. In my opinion, that tends to indicate that ActiveRecord and/or the PG gem don't close connections correctly.
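
For what it's worth, libpq also exposes keepalive tuning per connection, which avoids changing system-wide sysctls. Whether these keys can be passed through database.yml depends on the Rails version, but passing them straight to PG.connect works (illustrative values mirroring the sysctls above):

    require "pg"

    conn = PG.connect(
      dbname:              "mydb",   # placeholder
      keepalives:          1,
      keepalives_idle:     3,        # seconds of idleness before the first probe
      keepalives_interval: 60,       # seconds between probes
      keepalives_count:    9         # failed probes before the connection is declared dead
    )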

matthewd locked and limited conversation to collaborators Jun 21, 2014

Owner

matthewd commented Jun 21, 2014

I'm locking this issue, because it's in danger of becoming a catch-all for every possible issue involving database connections and PostgreSQL.

  • PG::UnableToSend: SSL SYSCALL error / PG::UnableToSend in general, should be fixed as of 4.0.4 / 4.1.0 (#13981)
  • ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within 5.000 seconds (pool exhaustion) should no longer occur in a properly behaving application as of master / future 4.2.0 (#14360). It's still quite possible to exhaust your pool if you have more threads than connections, and those threads are hanging on to the connections for too long.

If you're experiencing either of the above problems on a version where I've claimed it should be fixed, or some other problem not described above, please open a new issue, so it can be properly tracked and dealt with.

If you're having an issue with connection management and you're using a gem that deliberately overrides ActiveRecord's connection management, please attempt to reproduce the problem without that gem first.

jigish added a commit to jigish/rails that referenced this issue Aug 8, 2014

Reap connections based on owning-thread death

jigish added a commit to jigish/rails that referenced this issue Sep 9, 2014

Reap connections based on owning-thread death

jigish added a commit to jigish/rails that referenced this issue Sep 9, 2014

Reap connections based on owning-thread death

uar-david-litvak pushed a commit to FanaticalPlatform/rails that referenced this issue Sep 17, 2014

Reap connections based on owning-thread death

jigish added a commit to jigish/rails that referenced this issue Oct 13, 2014

Reap connections based on owning-thread death

babbtx pushed a commit to babbtx/rails that referenced this issue Oct 24, 2014

Reap connections based on owning-thread death
(cherry picked from commit 9e457a8)

Conflicts:
	activerecord/CHANGELOG.md
+ (babbtx) didn't take this change from 9e457a8)