Pooled connections not releasing #1382

Closed
devinivy opened this Issue May 4, 2016 · 9 comments

devinivy (Contributor) commented May 4, 2016

I use knex 0.10.0 in my app with postgresql 9.4, and I've been having some trouble with pooling.

When my app gets busy and requires all pooled connections for an extended period of time, I find that connections are not always released back to the pool. Eventually every connection becomes stuck as active and the pool's queue grows indefinitely, causing every new query to eventually time out ("Knex: Timeout acquiring a connection").

Meanwhile, my PostgreSQL database sees those connections as completely idle (the connections persist, but no queries are being sent on them). Worth adding that I'm not explicitly using transaction() or transacting() anywhere, so I don't see how it could be the issue described in #1040 / #1177.
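
For anyone wanting to confirm the same server-side picture, a quick pg_stat_activity check from a separate connection (i.e. not through the stuck pool) shows it. A minimal sketch with placeholder connection settings:

```js
// Sketch only: a throwaway knex instance with its own tiny pool, used purely
// to look at what the server thinks the app's connections are doing.
var checker = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder
  pool: { min: 0, max: 1 }
});

checker.raw(
  "select state, count(*) from pg_stat_activity where datname = current_database() group by state"
).then(function (res) {
  // e.g. [ { state: 'idle', count: '20' } ] while knex's pool reports everything busy
  console.log(res.rows);
  return checker.destroy();
});
```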

I currently do not have a simple reproduction case– I just know that it can eventually happen when my app stays busy and queries start queueing on the pool.

Would love to know if anyone has any hunches or has experienced anything similar! I'm looking into potential issues with pool2 as well.

wubzz (Collaborator) commented May 4, 2016

@devinivy

Worth adding I'm not explicitly using transaction() or transacting()

If not using any transactions at all, getting this error is very strange since all the connections are technically just "shared" between users.

Are you using a single knex instance within your app, or multiple? The only thing I can think of, if you're not using transactions, is continuously hitting Postgres's connection limit for the duration of the acquireConnectionTimeout setting, but I find that very unlikely.

devinivy (Contributor) commented May 4, 2016

It is a single knex instance. I have verified that the issue is not related to transactions – DEBUG=knex:tx didn't turn up anything at all while the app ran and the issue occurred. Trying to come up with a knex-only minimal repro case to share.

devinivy (Contributor) commented May 4, 2016

I think the issue is that acquireConnectionTimeout (default 60000) is less than pool.requestTimeout (default Infinity). Knex's acquireConnection() fails due to a timeout, but pool.acquire() happily carries on and allocates a connection (to which knex is oblivious). This connection is never released.

This was very confusing because acquireConnectionTimeout and pool.acquireTimeout settings are not analogous, nor are pool2.acquire() and pool.acquireTimeout directly related. Turns out pool2.acquire() is primarily ruled by pool.requestTimeout!

@tjwebb you might be interested in this, since waterline-postgresql can fix this with its own defaults.

Here's a repro case. Uncomment the commented lines to see it fixed.
https://gist.github.com/devinivy/7bbbdab92d96d4c4b1bfd22fb7191668
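
In case the gist link goes stale, the shape of the repro is roughly the following. This is a sketch rather than the gist itself; the pool size, timings, and the commented-out requestTimeout line are illustrative:

```js
// Sketch of the repro shape (not the linked gist). One connection in the pool,
// a knex-level acquire timeout of 1s, and a query that holds the connection
// for 5s, so the second acquire times out at the knex level while its pool2
// request stays pending.
var knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder
  pool: { min: 1, max: 1 },
  // pool: { min: 1, max: 1, requestTimeout: 500 }, // "fixed" variant: pool2 times the request out itself
  acquireConnectionTimeout: 1000
});

// Occupy the single connection for longer than acquireConnectionTimeout.
knex.raw('select pg_sleep(5)').then(function () {
  console.log('slow query finished');
});

// Times out at the knex level after ~1s ("Knex: Timeout acquiring a
// connection"), but the underlying pool2 request lives on and eventually
// grabs the connection, which is then never released.
knex.raw('select 1').catch(function (err) {
  console.error(err.message);
});

// Long after the slow query has finished, the pool is still saturated by the
// leaked connection, so this also times out.
setTimeout(function () {
  knex.raw('select 1').catch(function (err) {
    console.error('still stuck:', err.message);
  });
}, 10000);
```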

wubzz (Collaborator) commented May 9, 2016

@devinivy acquireConnectionTimeout was added because connections would just hang indefinitely. Poor research on my end regarding requestTimeout; it seems that's why they were hanging, since the default value of requestTimeout in pool2 is Infinity (i.e. no timeout at all).

Your issue seems to be about connections not being released back to the pool. In my opinion, changing requestTimeout / acquireConnectionTimeout is not the correct solution; what needs to be fixed is where and why the resource is not being released back into the pool after it finishes its task.

This could be either within knex (possibly even dialect-specific as seen in mysql2 recently) or in pool2. I can't say for sure.

devinivy (Contributor) commented May 9, 2016

I understand why this happens– it's not dialect-specific.

  1. Knex asks to acquire a connection from pool2, which is currently saturated.
  2. The app is currently unhealthy or under intense load, so acquireConnectionTimeout time passes.
  3. User receives the knex error, "Knex: Timeout acquiring a connection".
  4. A connection is released back to the pool before requestTimeout.
  5. Pool2 doesn't know about the knex-level error due to acquireConnectionTimeout, so as far as it's concerned the original request for a connection is still valid. Pool2 acquires a connection for knex.
  6. Knex has forgotten that it asked for the connection, so the connection is acquired and then never released back to the pool.

TL;DR: the promise to acquire a connection times out (as far as knex is concerned) but carries on with its task behind the scenes. By the time the connection is acquired, knex has forgotten it ever asked.

Seems like pool2's requestTimeout should just be used instead of knex's acquireConnectionTimeout.
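
To make the race concrete, here is a simplified sketch of the mechanism. It is not knex's or pool2's actual code; it just assumes a pool with a node-callback-style acquire and a caller that races it against a timer, the way acquireConnectionTimeout effectively does:

```js
// Simplified illustration of the leak, not knex's implementation. The timer
// is raced against the pool acquire; when the timer wins, the caller gets an
// error and moves on, but the acquire keeps running and eventually hands out
// a connection that nobody is left to release.
function acquireWithTimeout(pool, ms) {
  return new Promise(function (resolve, reject) {
    var timedOut = false;

    var timer = setTimeout(function () {
      timedOut = true;
      reject(new Error('Knex: Timeout acquiring a connection.'));
    }, ms);

    pool.acquire(function (err, connection) {
      clearTimeout(timer);
      if (err) return reject(err);
      if (timedOut) {
        // The caller already saw the timeout error, so unless the connection
        // is explicitly released right here it stays checked out forever --
        // which is exactly steps 4-6 above.
        return;
      }
      resolve(connection);
    });
  });
}
```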

wubzz (Collaborator) commented May 9, 2016

@devinivy I'm probably being very slow right now but I still don't get how that will solve your issue. If we toss acquireConnectionTimeout aside for a second and pretend knex is using requestTimeout instead with the same default value of 60000ms, your issue should still persist, no? The fact that acquireConnectionTimeout was triggered at all should attest to that.

Anyway a PR is welcome. I'll probably understand better if I see the code. I agree acquireConnectionTimeout can be removed and replaced with pool2's requestTimeout, but the default value should remain or we'll be back to square one with weekly issues of hanging connections.

devinivy (Contributor) commented May 9, 2016

It's cool– it's hard to reason about, and it took me a lot of staring at the same few functions to understand how the issue occurred. If you rely on requestTimeout rather than acquireConnectionTimeout then pool2 will know that the request for a connection failed and deal with it accordingly (by re-allocating the connection 👍 ). In both cases you get an error, but when you rely on acquireConnectionTimeout the connection gets "stuck" as being allocated, since there's nobody around to tell pool2 to release it. Timing-wise, this issue occurs between the acquireConnectionTimeout and the requestTimeout. The reproduction case in the gist posted a few comments above might help illustrate.

I'll put up a non-breaking-change PR soon. For knex 0.11 though, it might make sense to drop acquireConnectionTimeout and set a good default for requestTimeout, like you said.

pasupulaphani commented May 16, 2016

Hey @devinivy, any news on getting this into a new release?

devinivy (Contributor) commented May 16, 2016

@pasupulaphani PR #1394 has my proposed fix– I assume it's taking a little while because it's a non-trivial change requiring close review. In the meantime you can fix it in your app with configuration (ensure acquireConnectionTimeout is much larger than pool.requestTimeout).
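
For anyone who needs the configuration workaround spelled out, a sketch is below. The values are illustrative; the only requirement is that pool.requestTimeout fires well before acquireConnectionTimeout, so pool2 times the pending request out itself instead of later handing a connection to a caller that has already given up:

```js
// Workaround sketch until the fix lands. Connection settings and numbers are
// placeholders; the point is only requestTimeout << acquireConnectionTimeout.
var knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder
  pool: {
    min: 2,
    max: 10,
    requestTimeout: 60000 // pool2-level timeout on pending acquires
  },
  acquireConnectionTimeout: 600000 // knex-level timeout, kept much larger
});
```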
