
Reconnecting after database server disconnect/reconnect #1833

Closed
ty10r opened this issue Dec 15, 2016 · 74 comments · Fixed by #2017

Comments

@ty10r

ty10r commented Dec 15, 2016

I searched through the issues to see whether others are having trouble attempting to reconnect.

It's been a very long time since these two:
#1341
#936

Most of the suggestions I have seen in the issues so far rely on pool2 and include configuration for requestTimeout.

I am on version 0.12.2 of knex, which no longer uses pool2. I also can no longer see an implementation of ping in src/client.js.

I'm running tests where I use docker to run these steps:

  1. Spin up postgres db in container.
  2. Start knex/connect to postgres.
  3. Run a test query, evaluate success.
  4. Stop postgres db in container.
  5. Run test query again, evaluate failure (with config `{pool: {acquireTimeoutMillis: XXXX}, acquireConnectionTimeout: XXXX}`; see the sketch after this list).
  6. Restart postgres db in container.
  7. Run test query again, evaluate success.
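
A minimal sketch of that setup, assuming a local postgres and placeholder connection/timeout values (this is not the exact test code):

    // Hypothetical test harness; values are illustrative.
    const knex = require('knex')({
      client: 'pg',
      connection: 'postgres://user:pass@localhost:5432/testdb', // placeholder
      acquireConnectionTimeout: 2000,       // step 5: fail fast while the db is down
      pool: { acquireTimeoutMillis: 2000 }, // step 5: pool-level acquire timeout
    });

    // steps 3, 5 and 7: the probe query used to evaluate success or failure
    const probe = () => knex.raw('SELECT 1');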

Step 7 is failing in the same way step 5 does.

I assumed knex would automatically attempt to reconnect. Is there a way I can cause a reconnection?

If you'd like, I can post this question at one of the past conversations on reconnection instead. I thought it was worth having a new conversation, since most of the other conversations only include answers from before the switch to generic-pool.

@catamphetamine

+1 to implementing an autoreconnect feature: it's really inconvenient not having it out of the box.
I didn't even imagine such a basic issue could exist in such a popular library; I assumed it handled everything by itself.

@ty10r
Author

ty10r commented Dec 19, 2016

I'm personally not convinced there isn't a way.

I have seen tests in external projects that suggest that 0.9.X was automatically reconnecting.

I am asking this question for these reasons:

  1. My test logic/timing may be flawed; if a contributor here told me that it should be reconnecting, and how, it would give everyone a better understanding of an undocumented behavior.

  2. There may be a workaround that the implementors intended when switching pooling (and the previous workarounds were lost).

  3. If there are absolutely no solutions, then yes, we can ask why reconnection does not exist and then find a good solution for it.

@catamphetamine

catamphetamine commented Dec 19, 2016

@ty10r Well, given there are absolutely zero results when searching for reconnect in this repo, I can state that the absence of built-in reconnection in this library is a proven fact.
https://github.com/tgriesser/knex/search?utf8=%E2%9C%93&q=reconnect

I wonder how it is even possible not to build in (and then document) such a crucial piece of functionality in a library that is advertised as being made for production use.

@elhigu
Member

elhigu commented Mar 27, 2017

Let's try to keep future discussion of knex reconnecting to the database here. Now that we are using generic-pool, it is fair to look again at how this works. It would be great if knex were able to reconnect to the DB after the server was stopped and restarted (or the network was broken for a while, or anything like that).

It would be awesome if we were able to add some automatic testing for this. Any ideas on how to do that would be valuable (or do we just have to stop/restart db servers in CI? A separate test script that is run after all the others could be a viable solution for that).

@smulesoft
Contributor

Hi @elhigu, I'll leave you some comments:

Reconnection

According to my tests, reconnection is almost working. I've only found one weird scenario (see this issue).
The problem is this: suppose you restart your database service, and after the restart there is still 1 connection to the old database in the pool. Two things could happen:

  • The idle connection is purged because it has stayed idle for idleTimeoutMillis (it is evicted). Then a new connection is created (to the 'new' database), and it can perform queries correctly.
  • A client performs a query, so the idle connection is used (instead of being evicted). If the testOnBorrow flag is set, a call to either config.pool.validate or config.pool.validateAsync is made. If the function returns true, the idle connection is accepted and will be used. And this is the weird scenario: an invalid connection is being used (invalid because it was created for the old database). In my test, when this happens, the query just hangs indefinitely.

If what I say is correct, maybe you can fix this case by correcting the implementation of config.pool.validate. Actually, I would implement validateAsync instead of validate in client.js, and there I would simply run a query like SELECT 1 as one (in addition to the check on connection.__knex__disposed).
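
A rough sketch of that idea (hedged: the exact hook name and signature depend on the pool version in use, and this is not the actual client.js code):

    // Illustrative pool hook; the signature is an assumption, not knex's API.
    validateAsync: (connection) => {
      // connection was already flagged as dead by an 'error'/'end' handler
      if (connection.__knex__disposed) {
        return Promise.resolve(false)
      }
      // a real round-trip catches connections pointing at a dead server
      // even when the socket still looks open
      return new Promise((resolve) => {
        connection.query('SELECT 1 as one', (err) => resolve(!err))
      })
    }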

Testing

In my tests I am using dockerode to start and stop postgres containers directly from node.js tests, and I am able to reproduce this bug.

Let me know what you think, and maybe I'll make some time to help you with a PR.

@elhigu
Member

elhigu commented Mar 30, 2017

@smulesoft if you are able to add a PR which can actually test this somehow on our CI, it would be very valuable.

Problem is that if there are no tests, it will break again in the future. I currently don't have the resources to write PRs for this stuff myself.

@damirm

damirm commented Apr 13, 2017

Have the same issue. The validate function still returns true even if the connection to the database is closed.
I found a fix: set __knex__disposed on the connection close event. (https://github.com/tgriesser/knex/blob/master/src/dialects/postgres/index.js#L111)
Like this:

connection.on('error', (err) => {
  // mark the connection as disposed so the pool's validate() rejects it
  connection.__knex__disposed = err
})
connection.on('end', () => {
  // pg's 'end' event passes no error argument, so set an explicit
  // truthy marker instead of the undefined `err` it would otherwise get
  connection.__knex__disposed = 'Connection ended unexpectedly'
})

Tested in the postgres dialect.
What are the cons of that solution?

@damirm

damirm commented Apr 14, 2017

Another solution is to check connection.connection.stream.readable in the validate function. This protects against using a closed stream. I think that is better than my previous comment.
In this case the validate function looks like this:

  validate: (connection) => {
    // connection was flagged as dead by an 'error'/'end' handler
    if (connection.__knex__disposed) {
      helpers.warn(`Connection Error: ${connection.__knex__disposed}`)
      return false
    }
    // the underlying socket is already closed; don't hand this connection out
    if (!connection.connection.stream.readable) {
      return false
    }
    return this.validateConnection(connection)
  }

@smulesoft
Contributor

smulesoft commented Apr 18, 2017

Alright, I have a fix inspired by @damirm's first comment, including a reproducing test case which depends on having Docker installed on the machine running the tests.
I created PR #2017; I invite you to follow the discussion over there.

smulesoft added a commit to smulesoft/knex that referenced this issue Apr 18, 2017
smulesoft added a commit to smulesoft/knex that referenced this issue May 15, 2017
elhigu pushed a commit to smulesoft/knex that referenced this issue Jun 1, 2017
elhigu pushed a commit that referenced this issue Jun 1, 2017
* fixes #1833 on postgres

* adds @elhigu requested changes

* adds test when database is down

* adds script for docker support

* removes old comment

* fixes syntax

* fixes more syntax errors

* fixes final syntax errors

* Docker testing enabled only on linux for now
@leebenson

As someone seeing this thread for the first time after running into rejected connection promises, could someone please do me a quick favour and summarise the status of the fix?

I'm seeing this in my error logs:

js/release/async.js:17:14) fatal: true }
ERRORS -> { Error: Can't add new command when connection is in closed state
at Connection._addCommandClosedState (/app/node_modules/mysql2/lib/connection.js:158:13)
at Connection.query (/app/node_modules/mysql2/lib/connection.js:621:15)
at /app/node_modules/knex/lib/dialects/mysql/index.js:156:18
at Promise._execute (/app/node_modules/bluebird/js/release/debuggability.js:303:9)
at Promise._resolveFromExecutor (/app/node_modules/bluebird/js/release/promise.js:483:18)
at new Promise (/app/node_modules/bluebird/js/release/promise.js:79:10)
at Client_MySQL2._query (/app/node_modules/knex/lib/dialects/mysql/index.js:150:12)
at Client_MySQL2.query (/app/node_modules/knex/lib/client.js:211:17)
at Runner.<anonymous> (/app/node_modules/knex/lib/runner.js:149:36)
at Runner.tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at Runner.query (/app/node_modules/bluebird/js/release/method.js:15:34)
at /app/node_modules/knex/lib/runner.js:61:21
at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at /app/node_modules/bluebird/js/release/using.js:185:26
at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/release/promise.js:512:31)
at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:569:18)
at Promise._settlePromise0 (/app/node_modules/bluebird/js/release/promise.js:614:10)
at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:693:18)
at Promise._fulfill (/app/node_modules/bluebird/js/release/promise.js:638:18)
at PromiseArray._resolve (/app/node_modules/bluebird/js/release/promise_array.js:126:19)
at PromiseArray._promiseFulfilled (/app/node_modules/bluebird/js/release/promise_array.js:144:14)
at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:574:26)
at Promise._settlePromise0 (/app/node_modules/bluebird/js/release/promise.js:614:10)
at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:693:18)
at Async._drainQueue (/app/node_modules/bluebird/js/release/async.js:133:16)
at Async._drainQueues (/app/node_modules/bluebird/js/release/async.js:143:10)

I assumed that specifying a pool option would add automatic handling of reconnections, but so far the server doesn't seem to have recovered.

Is there some extra configuration that is needed, or something else I'm missing?

Thanks in advance.

@elhigu
Member

elhigu commented Nov 13, 2017

@leebenson I believe there is a new bug in 0.14 that is causing that error. It may be related to #2320, and here is another issue about your problem: #2321

@vjpr
Contributor

vjpr commented Nov 26, 2017

Just as an FYI: I saw an infinite loop of reconnection attempts in 0.14 when the db didn't exist, but it is now fixed in 0.14.2.

@elhigu
Member

elhigu commented Dec 4, 2017

Removed unrelated question, which was added by accident to this issue.

@knex knex deleted a comment from shivamtandon Dec 5, 2017
@analytik

analytik commented Dec 12, 2017

This isn't resolved for me in 0.14.2. When I kill a database in between knex queries, everything works. But when I make queries, then scale the database to zero replicas and attempt more queries, I get the following error, and I keep getting the same error even after I scale the database back up and it's accepting connections again.

Unhandled rejection TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
    at /project/node_modules/knex/lib/client.js:351:13
    at runCallback (timers.js:789:20)
    at tryOnImmediate (timers.js:751:5)
    at processImmediate [as _immediateCallback] (timers.js:722:5)
From previous event:
    at Client_PG.acquireConnection (/project/node_modules/knex/lib/client.js:350:13)
    at Runner.ensureConnection (/project/node_modules/knex/lib/runner.js:215:24)
    at Runner.run (/project/node_modules/knex/lib/runner.js:47:42)
    at Raw.Target.then (/project/node_modules/knex/lib/interface.js:35:43)
    at serveSomeDataFromDatabase (/project/controllers/api.js:12:34)
<Express stuff>
    at /project/app.js:40:17
<more Express stuff>

So, steps to reproduce:

  • make a knex query and succeed
  • bring down database
  • make more queries while database is down, receive the above-mentioned error
  • bring database up
  • never recover; when running knex queries, receive the same error always, forever (EDIT 1: for a long time; explanation below)

My connection settings are { dialect: 'pg', debug: true, acquireConnectionTimeout: 4000, ... } and I got these ONLY the first time the error appeared, just before the above error/stack trace was thrown:

Knex:warning - Connection Error: Connection ended unexpectedly
Knex:warning - Connection Error: Connection ended unexpectedly

EDIT 1: I've tried setting pool.idleTimeoutMillis to 25s, pool.acquireTimeoutMillis to 4s, evictionRunIntervalMillis to 20s, and it still took around 90 seconds until the queries started succeeding. I will test this more.

EDIT 2: with timeouts set to 5 seconds and 10 connections in the pool, it took 40 seconds from the time the database could be queried until the first query succeeded. With 2 connections in the pool, it still took 40-60 seconds until the first query succeeded, from the time the database could be queried from another tool. Just to be clear, that's with these settings:

    return knex({
        dialect: 'pg',
        debug: true,
        acquireConnectionTimeout: 4 * 1000,
        pool: {
            min: 2,
            max: 2,
            idleTimeoutMillis: 5 * 1000,
            acquireTimeoutMillis: 3 * 1000,
            evictionRunIntervalMillis: 5 * 1000
        },
        connection: fooBarConnstr,
    });

@kibertoad
Collaborator

Default is 1000 though, which should be fine.

@jazoom

jazoom commented Oct 24, 2018

@kibertoad Maybe, but since the Knex docs didn't mention anything about tarn.js I didn't look at tarn.js. I looked at generic-pool, since that's what the Knex docs mention.

Yeah, I doubt it's the cause of this error.

@smulesoft
Contributor

@jazoom You should check tarn if you're using knex above version 0.14.0. You can see here that #2450 was merged in that version.

@jazoom

jazoom commented Oct 25, 2018

Yes I am very well aware now that Knex uses tarn. I knew as soon as @ChieveiT pointed it out.

I'm not convinced it's related to this issue, but others did have theirs fixed by downgrading, so that's interesting. I still haven't seen the issue again since I last reported it happened. I'm not even 100% convinced it's Knex's fault.

@smulesoft
Contributor

I tested this because we had the same problem in prod. In my case, I came up with an integration test that could reproduce the problem:

1. Do a batch of queries, they all work fine.
2. Shut down the database programmatically.
3. Start the database again.
4. Do a batch of queries, see what happens.

I did steps 2 and 3 using dockerode to create/stop a postgres docker container in #2017.
It would be awesome if you did the same, because you'd have a way to reproduce the issue and could add it to the knex test suite to make sure it doesn't happen again in the future.
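
Roughly, the dockerode part of such a test could look like the sketch below (image tag, port mapping, and helper name are illustrative assumptions, not the exact code from #2017):

    const Docker = require('dockerode')
    const docker = new Docker() // talks to the local Docker daemon

    async function withRestartedPostgres(runQueries) {
      const container = await docker.createContainer({
        Image: 'postgres:9.6', // illustrative image tag
        HostConfig: { PortBindings: { '5432/tcp': [{ HostPort: '5432' }] } },
      })
      await container.start()
      await runQueries()      // step 1: batch of queries succeeds
      await container.stop()  // step 2: shut the database down
      await container.start() // step 3: bring it back up
      await runQueries()      // step 4: see whether the pool recovers
      await container.remove({ force: true })
    }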

@kibertoad
Collaborator

@smulesoft Would you consider contributing a test script that would be doing that?

@smulesoft
Contributor

Yeah, the PR I mentioned, #2017.

@elhigu
Member

elhigu commented Oct 29, 2018

Yeah, AFAIK that shouldn't be the problem anymore.

@kibertoad
Collaborator

@elhigu Did #2017 land in one of the 0.15.x releases, or is it only in NEXT?

@elhigu
Member

elhigu commented Oct 29, 2018

@kibertoad I think it has been in since 0.14 or something like that. Long enough that it was already broken again at some point... I wouldn't implement it the way it was done anymore, and I haven't tried it recently.

@smulesoft
Contributor

smulesoft commented Oct 29, 2018

@kibertoad Adding this callback was the fix to my problem.
Actually, the fix was defaulting to a non-falsy value. You can track it down to find out which release it was merged in, but it was in 0.14 for sure.
Note that it was a fix for PSQL. I don't know who's having the problem now, but @jazoom mentioned CockroachDB, which may need to handle the problem in its own driver.

@kibertoad
Collaborator

@smulesoft CockroachDB is actually using the pg driver.

@smulesoft
Contributor

Thanks for the info. Then we should wait until we have a way to reproduce this consistently and then debug the problem...

@elhigu
Member

elhigu commented Nov 6, 2018

I think this issue already has too much clutter from ancient history and different issues, so it should be closed; please open new issues when something reproducible happens. Closing.

@elhigu elhigu closed this as completed Nov 6, 2018
@jazoom

jazoom commented Nov 6, 2018

As a bit of closure, yesterday I finally worked out what the problem was in my case. It was actually a problem with the database, which has now been resolved with the latest version of CockroachDB that was just released.

@kibertoad
Collaborator

@jazoom Glad to hear that! Would you consider providing a bit more detail (e.g. a link to the resolved issue in the CockroachDB tracker) for people with the same issue who stumble upon this ticket?

@jazoom

jazoom commented Nov 6, 2018

I actually don't know which issue it was fixed in, or even if it had an issue. I noticed sometimes the metrics would catch a spike in RAM and CPU usage and the CockroachDB logs said something about a heartbeat timeout.

I updated to version 2.1.0 (a major update with a great many fixes and changes) and the problem hasn't happened since. But, it's only been 1 day, so I can't be too sure.

@kibertoad
Collaborator

@jazoom I would appreciate it if you reported back after having it run for a while, whether or not the problem reappeared :)

@jazoom

jazoom commented Nov 6, 2018

No probs. Will do.

@jazoom

jazoom commented Nov 7, 2018

It's been 3 days now and I still haven't seen the problem. Hopefully that means it has been resolved, at least in my case.

@kibertoad
Collaborator

@jazoom Given that it has been a week by now, is it still going strong?

@jazoom

jazoom commented Nov 10, 2018

Actually, it happened once, but I'm pretty sure it was CockroachDB's fault, or perhaps a brief network outage. I don't think it had anything to do with Knex.

@necevil

necevil commented Nov 29, 2018

Hey guys,
Not sure if this one has been officially solved in the most recent versions or not, but on Azure AKS, when running Ghost with a remote database outside the cluster, it was happening due to the Azure-enforced 4-minute connection kill on connections to an external IP/port.

You can see some of my troubleshooting process over here:
https://stackoverflow.com/questions/50706483/what-azure-kubernetes-aks-time-out-happens-to-disconnect-connections-in-out

Previously (in the last iteration of my modified Ghost docker image) I was able to solve this by rebuilding / rolling back to Knex 0.12.6. Since Ghost uses MySQL (by default) but some of my services on the cluster also need to connect to Postgres (Rails in this case), as an added solution/redundancy I added a PgBouncer instance to separate any services connecting to external DBs from the connection pooling issue, by enforcing a DB connection drop on the PgBouncer instance (prior to Azure's 4-minute Load Balancer automatic connection severing).

If you are using Postgres and not MySQL, you can solve the problem by adding PgBouncer in the middle and telling it to cycle connections frequently.

I don't know if the issue is still occurring, but I am open to testing it with the latest version used by the official Ghost repo (Knex 0.14.6) if you guys have a test procedure!

@brandonros

I have Connection terminated unexpectedly errors using Postgres.

My pool options are:

{
    min: 2,
    max: 10,
    propagateCreateError: false
}

It seems that if the connections in the pool are constantly kept alive through querying activity, everything is fine.

However, if a connection sits idle... there is no kind of keepalive?

Any advice on how to solve this issue? A keepalive setting? A retry on Connection terminated unexpectedly?

@elhigu
Member

elhigu commented Dec 20, 2018

@brandonros if the pool is idle, then connections may be closed by the server or by the pool, and when a new query is run, a new connection is created for the pool if necessary. So there is no need for keepalive. You should never get a Connection terminated unexpectedly error while executing a query, unless the connection is terminated after it has been fetched out of the pool (the app has made a query and the connection terminated during it).
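
For that in-flight case (connection dies after it has been checked out), one common application-level workaround is a small retry wrapper; a minimal sketch, not a knex feature (the error-message match is an assumption):

    // Minimal retry sketch; not part of knex's API.
    async function queryWithRetry(runQuery, retries = 1) {
      try {
        return await runQuery()
      } catch (err) {
        const transient = /Connection terminated unexpectedly/i.test(err.message || '')
        if (transient && retries > 0) {
          // retrying acquires a fresh connection from the pool
          return queryWithRetry(runQuery, retries - 1)
        }
        throw err
      }
    }

    // usage: queryWithRetry(() => knex('users').select('id'))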

@jamesdixon

@brandonros did you ever solve the issue?


@mremick

mremick commented Aug 3, 2021

@brandonros what was your solution?
