
Reconnecting after database server disconnect/reconnect #1833

Closed
ty10r opened this issue Dec 15, 2016 · 74 comments · Fixed by #2017

Comments

@ty10r

ty10r commented Dec 15, 2016

I searched through the issues to see whether others are having trouble attempting to reconnect.

It's been a very long time since these two:
#1341
#936

Most of the suggestions I have seen in the issues so far rely on pool2 and include configuration for requestTimeout.

I am on version 0.12.2 of knex, which no longer uses pool2. I also can no longer see an implementation of ping in src/client.js.

I'm running tests where I use docker to run these steps:

  1. Spin up postgres db in container.
  2. Start knex/connect to postgres.
  3. Run a test query, evaluate success.
  4. Stop postgres db in container.
  5. Run test query again, evaluate failure (with config `{pool: {acquireTimeoutMillis: XXXX}, acquireConnectionTimeout: XXXX}`; see the sketch after this list).
  6. Restart postgres db in container.
  7. Run test query again, evaluate success.
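
A minimal sketch of that setup, assuming a local postgres and placeholder connection/timeout values (this is not the exact test code):

    // Hypothetical test harness; values are illustrative.
    const knex = require('knex')({
      client: 'pg',
      connection: 'postgres://user:pass@localhost:5432/testdb', // placeholder
      acquireConnectionTimeout: 2000,       // step 5: fail fast while the db is down
      pool: { acquireTimeoutMillis: 2000 }, // step 5: pool-level acquire timeout
    });

    // steps 3, 5 and 7: the probe query used to evaluate success or failure
    const probe = () => knex.raw('SELECT 1');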

Step 7 is failing in the same way step 5 does.

I assumed knex would automatically attempt to reconnect. Is there a way I can cause a reconnection?

If you'd like, I can post this question at one of the past conversations on reconnection instead. I thought it was worth having a new conversation, since most of the other conversations only include answers from before the switch to generic-pool.

@catamphetamine

+1 to implementing an autoreconnect feature: it's really inconvenient not having it out of the box.
I didn't even imagine such a basic issue could exist in such a popular library; I assumed it handled everything by itself.

@ty10r
Author

ty10r commented Dec 19, 2016

I'm personally not convinced there isn't a way.

I have seen tests in external projects that suggest that 0.9.X was automatically reconnecting.

I am asking this question for these reasons:

  1. My test logic/timing may be flawed; if a contributor here told me that it should be reconnecting, and how, it would give everyone a better understanding of an undocumented behavior.

  2. There may be a workaround that the implementors intended when switching pooling (and the previous workarounds were lost).

  3. If there are absolutely no solutions, then yes, we can ask why reconnection does not exist and then find a good solution for it.

@catamphetamine

catamphetamine commented Dec 19, 2016

@ty10r Well, given there are absolutely zero results when searching for reconnect in this repo, I can state that the absence of built-in reconnection in this library is a proven fact.
https://github.com/tgriesser/knex/search?utf8=%E2%9C%93&q=reconnect

I wonder how it is even possible not to build in (and then document) such a crucial piece of functionality in a library that is advertised as being made for production use.

@elhigu
Member

elhigu commented Mar 27, 2017

Let's try to keep future discussion of knex reconnecting to the database here. Now that we are using generic-pool, it is fair to look again at how this works. It would be great if knex were able to reconnect to the DB after the server was stopped and restarted (or the network was broken for a while, or anything like that).

It would be awesome if we were able to add some automatic testing for this. Any ideas on how to do that would be valuable (or do we just have to stop/restart db servers in CI? A separate test script that is run after all the others could be a viable solution for that).

@smulesoft
Contributor

Hi @elhigu, I'll leave you some comments:

Reconnection

According to my tests, reconnection is almost working. I've only found one weird scenario (see this issue).
The problem is this: suppose you restart your database service, and after the restart there is still 1 connection to the old database in the pool. Two things could happen:

  • The idle connection is purged because it has stayed idle for idleTimeoutMillis (it is evicted). Then a new connection is created (to the 'new' database), and it can perform queries correctly.
  • A client performs a query, so the idle connection is used (instead of being evicted). If the testOnBorrow flag is set, a call to either config.pool.validate or config.pool.validateAsync is made. If the function returns true, the idle connection is accepted and will be used. And this is the weird scenario: an invalid connection is being used (invalid because it was created for the old database). In my test, when this happens, the query just hangs indefinitely.

If what I say is correct, maybe you can fix this case by correcting the implementation of config.pool.validate. Actually, I would implement validateAsync instead of validate in client.js, and there I would simply run a query like SELECT 1 as one (in addition to the check on connection.__knex__disposed).
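
A rough sketch of that idea (hedged: the exact hook name and signature depend on the pool version in use, and this is not the actual client.js code):

    // Illustrative pool hook; the signature is an assumption, not knex's API.
    validateAsync: (connection) => {
      // connection was already flagged as dead by an 'error'/'end' handler
      if (connection.__knex__disposed) {
        return Promise.resolve(false)
      }
      // a real round-trip catches connections pointing at a dead server
      // even when the socket still looks open
      return new Promise((resolve) => {
        connection.query('SELECT 1 as one', (err) => resolve(!err))
      })
    }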

Testing

In my tests I am using dockerode to start and stop postgres containers directly from node.js tests, and I am able to reproduce this bug.

Let me know what you think, and maybe I'll make some time to help you with a PR.

@elhigu
Member

elhigu commented Mar 30, 2017

@smulesoft if you are able to add a PR which can actually test this somehow on our CI, it would be very valuable.

Problem is that if there are no tests, it will break again in the future. I currently don't have the resources to write PRs for this stuff myself.

@damirm

damirm commented Apr 13, 2017

Have the same issue. The validate function still returns true even if the connection to the database is closed.
I found a fix: set __knex__disposed on the connection close event. (https://github.com/tgriesser/knex/blob/master/src/dialects/postgres/index.js#L111)
Like this:

connection.on('error', (err) => {
  // mark the connection as disposed so the pool's validate() rejects it
  connection.__knex__disposed = err
})
connection.on('end', () => {
  // pg's 'end' event passes no error argument, so set an explicit
  // truthy marker instead of the undefined `err` it would otherwise get
  connection.__knex__disposed = 'Connection ended unexpectedly'
})

Tested in the postgres dialect.
What are the cons of that solution?

@damirm

damirm commented Apr 14, 2017

Another solution is to check connection.connection.stream.readable in the validate function. This protects against using a closed stream. I think that is better than my previous comment.
In this case the validate function looks like this:

  validate: (connection) => {
    // connection was flagged as dead by an 'error'/'end' handler
    if (connection.__knex__disposed) {
      helpers.warn(`Connection Error: ${connection.__knex__disposed}`)
      return false
    }
    // the underlying socket is already closed; don't hand this connection out
    if (!connection.connection.stream.readable) {
      return false
    }
    return this.validateConnection(connection)
  }

@smulesoft
Contributor

smulesoft commented Apr 18, 2017

Alright, I have a fix inspired by @damirm's first comment, including a reproducing test case which depends on having Docker installed on the machine running the tests.
I created PR #2017; I invite you to follow the discussion over there.

smulesoft added a commit to smulesoft/knex that referenced this issue Apr 18, 2017
smulesoft added a commit to smulesoft/knex that referenced this issue May 15, 2017
elhigu pushed a commit to smulesoft/knex that referenced this issue Jun 1, 2017
elhigu pushed a commit that referenced this issue Jun 1, 2017
* fixes #1833 on postgres

* adds @elhigu requested changes

* adds test when database is down

* adds script for docker support

* removes old comment

* fixes syntax

* fixes more syntax errors

* fixes final syntax errors

* Docker testing enabled only on linux for now
@leebenson

As someone seeing this thread for the first time after running into rejected connection promises, could someone please do me a quick favour and summarise the status of the fix?

I'm seeing this in my error logs:

js/release/async.js:17:14) fatal: true }
ERRORS -> { Error: Can't add new command when connection is in closed state
at Connection._addCommandClosedState (/app/node_modules/mysql2/lib/connection.js:158:13)
at Connection.query (/app/node_modules/mysql2/lib/connection.js:621:15)
at /app/node_modules/knex/lib/dialects/mysql/index.js:156:18
at Promise._execute (/app/node_modules/bluebird/js/release/debuggability.js:303:9)
at Promise._resolveFromExecutor (/app/node_modules/bluebird/js/release/promise.js:483:18)
at new Promise (/app/node_modules/bluebird/js/release/promise.js:79:10)
at Client_MySQL2._query (/app/node_modules/knex/lib/dialects/mysql/index.js:150:12)
at Client_MySQL2.query (/app/node_modules/knex/lib/client.js:211:17)
at Runner.<anonymous> (/app/node_modules/knex/lib/runner.js:149:36)
at Runner.tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at Runner.query (/app/node_modules/bluebird/js/release/method.js:15:34)
at /app/node_modules/knex/lib/runner.js:61:21
at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at /app/node_modules/bluebird/js/release/using.js:185:26
at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/release/promise.js:512:31)
at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:569:18)
at Promise._settlePromise0 (/app/node_modules/bluebird/js/release/promise.js:614:10)
at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:693:18)
at Promise._fulfill (/app/node_modules/bluebird/js/release/promise.js:638:18)
at PromiseArray._resolve (/app/node_modules/bluebird/js/release/promise_array.js:126:19)
at PromiseArray._promiseFulfilled (/app/node_modules/bluebird/js/release/promise_array.js:144:14)
at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:574:26)
at Promise._settlePromise0 (/app/node_modules/bluebird/js/release/promise.js:614:10)
at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:693:18)
at Async._drainQueue (/app/node_modules/bluebird/js/release/async.js:133:16)
at Async._drainQueues (/app/node_modules/bluebird/js/release/async.js:143:10)

I assumed that specifying a pool option would add automatic handling of reconnections, but so far the server doesn't seem to have recovered.

Is there some extra configuration that is needed, or something else I'm missing?

Thanks in advance.

@elhigu
Member

elhigu commented Nov 13, 2017

@leebenson I believe there is a new bug in 0.14 that is causing that error. It may be related to #2320, and here is another issue about your problem: #2321

@vjpr
Contributor

vjpr commented Nov 26, 2017

Just as an FYI: I saw an infinite loop of reconnection attempts in 0.14 when the db didn't exist, but it is now fixed in 0.14.2.

@elhigu
Member

elhigu commented Dec 4, 2017

Removed unrelated question, which was added by accident to this issue.

@knex knex deleted a comment from shivamtandon Dec 5, 2017
@analytik

analytik commented Dec 12, 2017

This isn't resolved for me in 0.14.2. When I kill a database in between knex queries, everything works. But when I make queries, then scale the database to zero replicas and attempt more queries, I get the following error, and I keep getting the same error even after I scale the database back up and it's accepting connections again.

Unhandled rejection TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
    at /project/node_modules/knex/lib/client.js:351:13
    at runCallback (timers.js:789:20)
    at tryOnImmediate (timers.js:751:5)
    at processImmediate [as _immediateCallback] (timers.js:722:5)
From previous event:
    at Client_PG.acquireConnection (/project/node_modules/knex/lib/client.js:350:13)
    at Runner.ensureConnection (/project/node_modules/knex/lib/runner.js:215:24)
    at Runner.run (/project/node_modules/knex/lib/runner.js:47:42)
    at Raw.Target.then (/project/node_modules/knex/lib/interface.js:35:43)
    at serveSomeDataFromDatabase (/project/controllers/api.js:12:34)
<Express stuff>
    at /project/app.js:40:17
<more Express stuff>

So, steps to reproduce:

  • make a knex query and succeed
  • bring down database
  • make more queries while database is down, receive the above-mentioned error
  • bring database up
  • never recover; when running knex queries, receive the same error always, forever (EDIT 1: for a long time; explanation below)

My connection settings are { dialect: 'pg', debug: true, acquireConnectionTimeout: 4000, ... } and I got these ONLY the first time the error appeared, just before the above error/stack trace was thrown:

Knex:warning - Connection Error: Connection ended unexpectedly
Knex:warning - Connection Error: Connection ended unexpectedly

EDIT 1: I've tried setting pool.idleTimeoutMillis to 25s, pool.acquireTimeoutMillis to 4s, evictionRunIntervalMillis to 20s, and it still took around 90 seconds until the queries started succeeding. I will test this more.

EDIT 2: with timeouts set to 5 seconds and 10 connections in the pool, it took 40 seconds from the time the database could be queried until the first query succeeded. With 2 connections in the pool, it still took 40-60 seconds until the first query succeeded, from the time the database could be queried from another tool. Just to be clear, that's with these settings:

    return knex({
        dialect: 'pg',
        debug: true,
        acquireConnectionTimeout: 4 * 1000,
        pool: {
            min: 2,
            max: 2,
            idleTimeoutMillis: 5 * 1000,
            acquireTimeoutMillis: 3 * 1000,
            evictionRunIntervalMillis: 5 * 1000
        },
        connection: fooBarConnstr,
    });

@kibertoad
Collaborator

Default is 1000 though, which should be fine.

@jazoom

jazoom commented Oct 24, 2018

@kibertoad Maybe, but since the Knex docs didn't mention anything about tarn.js I didn't look at tarn.js. I looked at generic-pool, since that's what the Knex docs mention.

Yeah, I doubt it's the cause of this error.

@smulesoft
Contributor

@jazoom You should check tarn if you're using knex above version 0.14.0. You can see here that #2450 was merged in that version.

@jazoom

jazoom commented Oct 25, 2018

Yes I am very well aware now that Knex uses tarn. I knew as soon as @ChieveiT pointed it out.

I'm not convinced it's related to this issue, but others did have theirs fixed by downgrading, so that's interesting. I still haven't seen the issue again since I last reported it happened. I'm not even 100% convinced it's Knex's fault.

@smulesoft
Contributor

I tested this because we had the same problem in prod. In my case, I came up with an integration test that could reproduce the problem:

1. Do a batch of queries, they all work fine.
2. Shut down the database programmatically.
3. Start the database again.
4. Do a batch of queries, see what happens.

I did steps 2 and 3 using dockerode to create/stop a postgres docker container in #2017.
It would be awesome if you did the same, because you'd have a way to reproduce the issue and could add it to the knex test suite to make sure it doesn't happen again in the future.
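
Roughly, the dockerode part of such a test could look like the sketch below (image tag, port mapping, and helper name are illustrative assumptions, not the exact code from #2017):

    const Docker = require('dockerode')
    const docker = new Docker() // talks to the local Docker daemon

    async function withRestartedPostgres(runQueries) {
      const container = await docker.createContainer({
        Image: 'postgres:9.6', // illustrative image tag
        HostConfig: { PortBindings: { '5432/tcp': [{ HostPort: '5432' }] } },
      })
      await container.start()
      await runQueries()      // step 1: batch of queries succeeds
      await container.stop()  // step 2: shut the database down
      await container.start() // step 3: bring it back up
      await runQueries()      // step 4: see whether the pool recovers
      await container.remove({ force: true })
    }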

@kibertoad
Collaborator

@smulesoft Would you consider contributing a test script that would be doing that?

@smulesoft
Contributor

Yeah, the PR I mentioned, #2017.

@elhigu
Member

elhigu commented Oct 29, 2018

Yeah, AFAIK that shouldn't be the problem anymore.

@kibertoad
Collaborator

@elhigu Did #2017 land in one of the 0.15.x releases, or is it only in NEXT?

@elhigu
Member

elhigu commented Oct 29, 2018

@kibertoad I think it has been in since 0.14 or something like that. Long enough that it was already broken again at some point... I wouldn't implement it the way it was done anymore, and I haven't tried it recently.

@smulesoft
Contributor

smulesoft commented Oct 29, 2018

@kibertoad Adding this callback was the fix to my problem.
Actually, the fix was defaulting to a non-falsy value. You can track it down to find out which release it was merged in, but it was in 0.14 for sure.
Note that it was a fix for PSQL. I don't know who's having the problem now, but @jazoom mentioned CockroachDB, which may need to handle the problem in its own driver.

@kibertoad
Collaborator

@smulesoft CockroachDB is actually using the pg driver.

@smulesoft
Contributor

Thanks for the info. Then we should wait until we have a way to reproduce this consistently and then debug the problem...

@elhigu
Member

elhigu commented Nov 6, 2018

I think this issue already has too much clutter from ancient history and different issues, so it should be closed; please open new issues when something reproducible happens. Closing.

@elhigu elhigu closed this as completed Nov 6, 2018
@jazoom

jazoom commented Nov 6, 2018

As a bit of closure, yesterday I finally worked out what the problem was in my case. It was actually a problem with the database, which has now been resolved with the latest version of CockroachDB that was just released.

@kibertoad
Collaborator

@jazoom Glad to hear that! Would you consider providing a bit more detail (e.g. a link to the resolved issue in the CockroachDB tracker) for people with the same issue who stumble upon this ticket?

@jazoom

jazoom commented Nov 6, 2018

I actually don't know which issue it was fixed in, or even if it had an issue. I noticed sometimes the metrics would catch a spike in RAM and CPU usage and the CockroachDB logs said something about a heartbeat timeout.

I updated to version 2.1.0 (a major update with a great many fixes and changes) and the problem hasn't happened since. But, it's only been 1 day, so I can't be too sure.

@kibertoad
Collaborator

@jazoom I would appreciate it if you reported back after having it run for a while, whether or not the problem reappeared :)

@jazoom

jazoom commented Nov 6, 2018

No probs. Will do.

@jazoom

jazoom commented Nov 7, 2018

It's been 3 days now and I still haven't seen the problem. Hopefully that means it has been resolved, at least in my case.

@kibertoad
Collaborator

@jazoom Given that it has been a week by now, is it still going strong?

@jazoom

jazoom commented Nov 10, 2018

Actually, it happened once, but I'm pretty sure it was CockroachDB's fault, or perhaps a brief network outage. I don't think it had anything to do with Knex.

@necevil

necevil commented Nov 29, 2018

Hey guys,
Not sure if this one has been officially solved in the most recent versions or not, but on Azure AKS, when running Ghost with a remote database outside the cluster, it was happening due to the Azure-enforced 4-minute connection kill on connections to an external IP/port.

You can see some of my troubleshooting process over here:
https://stackoverflow.com/questions/50706483/what-azure-kubernetes-aks-time-out-happens-to-disconnect-connections-in-out

Previously (in the last iteration of my modified Ghost docker image) I was able to solve this by rebuilding / rolling back to Knex 0.12.6. Since Ghost uses MySQL (by default) but some of my services on the cluster also need to connect to Postgres (Rails in this case), as an added solution/redundancy I added a PgBouncer instance to separate any services connecting to external DBs from the connection pooling issue, by enforcing a DB connection drop on the PgBouncer instance (prior to Azure's 4-minute Load Balancer automatic connection severing).

If you are using Postgres and not MySQL, you can solve the problem by adding PgBouncer in the middle and telling it to cycle connections frequently.

I don't know if the issue is still occurring, but I am open to testing it with the latest version used by the official Ghost repo (Knex 0.14.6) if you guys have a test procedure!

@brandonros

I have Connection terminated unexpectedly errors using Postgres.

My pool options are:

{
    min: 2,
    max: 10,
    propagateCreateError: false
}

It seems that if the connections in the pool are constantly kept alive through querying activity, everything is fine.

However, if a connection sits idle... there is no kind of keepalive?

Any advice on how to solve this issue? A keepalive setting? A retry on Connection terminated unexpectedly?

@elhigu
Member

elhigu commented Dec 20, 2018

@brandonros if the pool is idle, then connections may be closed by the server or by the pool, and when a new query is run, a new connection is created for the pool if necessary. So there is no need for keepalive. You should never get a Connection terminated unexpectedly error while executing a query, unless the connection is terminated after it has been fetched out of the pool (the app has made a query and the connection terminated during it).
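
For that in-flight case (connection dies after it has been checked out), one common application-level workaround is a small retry wrapper; a minimal sketch, not a knex feature (the error-message match is an assumption):

    // Minimal retry sketch; not part of knex's API.
    async function queryWithRetry(runQuery, retries = 1) {
      try {
        return await runQuery()
      } catch (err) {
        const transient = /Connection terminated unexpectedly/i.test(err.message || '')
        if (transient && retries > 0) {
          // retrying acquires a fresh connection from the pool
          return queryWithRetry(runQuery, retries - 1)
        }
        throw err
      }
    }

    // usage: queryWithRetry(() => knex('users').select('id'))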

@jamesdixon

@brandonros did you ever solve the issue?


@mremick

mremick commented Aug 3, 2021

@brandonros what was your solution?
