
Connection pool unreliable? #196

Open
m-o-e opened this issue Nov 23, 2019 · 7 comments

Comments

m-o-e commented Nov 23, 2019

Currently DB::Database does not recover from temporary
connection failures. It does not seem to evict broken connections
from the pool but continues to attempt to use them.

This means just a single brief outage will
leave the pool in a permanently broken state.

The only way to recover from connection failures appears to
be to discard the entire DB::Database instance and create a new pool.

Is this a bug or am I using it wrong? 🤔


Test-case:

require "db"
require "pg"

initial_pool_size=5
max_pool_size=5
retry_attempts=99999

url = "postgres://localhost/test?initial_pool_size=#{initial_pool_size}&max_pool_size=#{max_pool_size}&max_idle_pool_size=#{max_pool_size}&checkout_timeout=5.0&retry_attempts=#{retry_attempts}&retry_delay=1"

puts "Trying: #{url}"
db = DB.open(url)
loop do
  begin
    db.query("select 42").close
    puts "."
  rescue e
    puts e
  end
  sleep 1
end

Just leave the above running and, while it runs, restart your local Postgres server.

Regardless of which settings you choose, you will see the following output:

.
.
.
Error reading socket: Connection reset by peer
Error writing to socket: Broken pipe
Error writing to socket: Broken pipe
Error writing to socket: Broken pipe
[...]
bcardiff (Collaborator)

As long as you are not checking out connections explicitly, the pool should be able to recover from temporary outages. There is a sample (using MySQL) at https://crystal-lang.org/reference/database/connection_pool.html

Maybe an exception of the wrong type is raised and the connection loss is not detected.

But your snippet looks good and I would expect it to be resilient.
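
Roughly, the distinction I mean is between these two patterns (just a sketch; URL, query and connection settings are placeholders):

db = DB.open("postgres://localhost/test?retry_attempts=3&retry_delay=1")

# Statements executed directly against the DB::Database go through the pool,
# which can replace a broken connection and honor the retry_* options.
db.scalar("select 1")

# An explicitly checked-out connection bypasses that logic: if this
# particular connection dies, the caller has to handle it.
cnn = db.checkout
begin
  cnn.scalar("select 1")
ensure
  cnn.release
end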

m-o-e (Author) commented Nov 23, 2019

Possibly it's postgres specific?

I couldn't find a way to make my above snippet resilient - it always breaks
when the server is restarted.

straight-shoota (Contributor)

Yeah, I've encountered similar errors with postgres as well, but never did any deeper investigation. So maybe it's an issue in crystal-pg.

stakach commented Feb 13, 2021

We're also finding this to be an issue working with GCP's hosted Postgres-compatible database, where new connections take upwards of 2 seconds to establish, so we really need to rely on the pool.
We have the pool configured with max_idle_pool_size == max_pool_size == 50, but if nothing is going on for a while the connections eventually time out and broken sockets are returned, causing 500 errors (DB::ConnectionLost) in our web app.

We're also seeing the pool not recovering from DB outages. We work around that using health checks, but it's not optimal.
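
The health check is essentially just a fiber that periodically exercises the pool, something along these lines (sketch only; the interval and error handling are arbitrary, and db is the app's DB::Database):

spawn do
  loop do
    begin
      # Touch the database so idle sockets get exercised and dead
      # connections surface here rather than in a user-facing request.
      db.scalar("select 1")
    rescue DB::Error
      # Broken connection; the next pooled query should get a fresh one.
    end
    sleep 30.seconds
  end
end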

akadusei commented Dec 1, 2023

As long as you are not checking out connections explicitly the pool should be able to recover from temporary outages.

From a cursory look at the code, the connection pool does not automatically retry a lost connection. To ensure a lost connection is retried, #retry(&block) must be called explicitly:

db.retry do
  db.query #...
end

Otherwise, all the retry_* options do nothing at all, it seems.

Perhaps PG::Statement should inherit from DB::PoolStatement instead of DB::Statement?

bcardiff (Collaborator) commented Dec 1, 2023

The db.retry is not needed. If you do db.query (over the Database) the retry should happen automatically. See https://crystal-lang.org/reference/1.10/database/connection_pool.html#sample

I haven't validated if something is off with pg in this regard though.

akadusei commented Dec 1, 2023

Oh, I must have missed that, then.

I get DB::ConnectionLost regularly with multiple apps in staging. This happens when those apps haven't seen any interaction in a while. I use a serverless database service provider that suspends the server after it's been idle for some time (it starts up after the first connection attempt).

Here's a trace, if that helps:

 (DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'exec_query'
  from /tmp/lucky/lib/avram/src/avram/queryable.cr:291:7 in 'results'
  from /usr/share/crystal/src/indexable.cr:749:5 in 'first?'
  from /tmp/lucky/lib/samba/src/presets/client.cr:7:3 in 'run_operation'
  from /tmp/lucky/src/actions/oauth/token/create.cr:4:3 in 'perform_action'
  from /tmp/lucky/lib/lucky/src/lucky/route_handler.cr:10:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/log_handler.cr:29:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/force_ssl_handler.cr:36:8 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/request_id_handler.cr:24:5 in 'call'
  from /usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'handle_client'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

I suspect the app attempts the first connection, fails, and raises the error without a retry. I have tried various connection params to no avail. Currently it's initial_pool_size=20&max_idle_pool_size=20&max_pool_size=20&retry_attempts=3&retry_delay=1. Restarting the apps seems to fix it, until it happens again.

They are Lucky apps, and it seems Avram does an explicit checkout here:

https://github.com/luckyframework/avram/blob/0b9797b918373615bafea54986849bb886ec239b/src/avram/database.cr#L188-L201

I don't see a retry there. Could that be it, or am I missing something?
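
For illustration, something like this (hypothetical, not Avram's actual code) is roughly what I would have expected around that checkout:

# Hypothetical: wrap the explicit checkout in the pool's retry, so a dead
# idle connection is replaced instead of handed straight to the caller.
db.retry do
  connection = db.checkout
  begin
    # ... run the statement on the checked-out connection ...
  ensure
    connection.release
  end
end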
