Inconsistent error reporting #75
The error message included along with a checkout timeout error is just a convenience, to hopefully give some context if the pool can no longer connect to the database. The only reason a checkout fails is a timeout, and the pool may have attempted to open zero, one, or many connections during that time, but none of those attempts was tied to that specific checkout request. The pool's set of connections is maintained entirely asynchronously to individual checkouts.
That makes sense, but it has the unfortunate effect that many kinds of issue when using r2d2 are quite hard to debug. Let me explain a problem and maybe you can think of a way to make these kinds of issues more obvious:
In response to this issue, we put in place some measures to prevent it from happening again:
One thing that exacerbated the issue was that when a connection is released to the pool, it seems like it will almost always be picked up by the thread that most recently tried to obtain a connection. I don't know if Condvar is supposed to give any fairness guarantees, but it appears to be very unfair. The result was that threads which failed to obtain a connection on the first attempt could easily be starved (in this case, that is why instances rarely recovered by themselves: the steady influx of new requests caused a self-perpetuating problem, as new requests would get the connections as they were released, so older requests would take the full 30s or so to time out).

Anyway, as far as what can actually be done within r2d2, adding logging and also finding a way to distribute connections more fairly are things I think would help. There are certain things about the state of the pool that are very strong indicators that something is wrong (e.g. long lease times, high contention for connections, etc.). In our case, the most important thing was when we added logging for the lease time on connections, as we knew our code should only ever lease connections for a few milliseconds.
I'd be happy to take a PR adding a bunch of debug and trace level logging! We could switch from the stdlib synchronization primitives to parking_lot, which does have a fair unlock API: https://docs.rs/lock_api/0.1.5/lock_api/struct.MutexGuard.html#method.unlock_fair. It would also be nice to report various metrics like checkout and lease times, but I'm not sure exactly how we'd want that API to look. Maybe something like this?

```rust
use std::time::Duration;

pub trait ConsumeMetric {
    fn checkout_time(&self, time: Duration) {}
    fn lease_time(&self, time: Duration) {}
    // ...
}
```

You could then hook that up to whatever metrics library your service is using and feed the times into a histogram or whatever.
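For what it's worth, a consumer of that hypothetical `ConsumeMetric` trait could be as small as the following sketch; the recorder type and its behaviour are made up here, it simply stashes lease times in memory so they can later be fed into a histogram:

```rust
use std::sync::Mutex;
use std::time::Duration;

// Hypothetical implementation of the `ConsumeMetric` trait sketched above:
// collect lease times in memory so they can be summarized or exported later.
#[derive(Debug, Default)]
struct LeaseTimes {
    samples: Mutex<Vec<Duration>>,
}

impl ConsumeMetric for LeaseTimes {
    fn lease_time(&self, time: Duration) {
        self.samples.lock().unwrap().push(time);
    }
}
```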
You can also install an error handler which can listen on every error produced from the connection manager, which should be more reliable than whatever happens to end up tacked on to the timeout error: https://docs.rs/r2d2/0.8.3/r2d2/struct.Builder.html#method.error_handler
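For illustration, a handler along these lines could be installed through that builder method; the handler type and what it logs are made up for this sketch, but `HandleError` and `Builder::error_handler` are the pieces the link above points at:

```rust
use r2d2::HandleError;

// Hypothetical handler: print every error the connection manager reports,
// independently of whichever checkout (if any) happened to be waiting.
#[derive(Debug)]
struct PrintErrors;

impl<E: std::error::Error> HandleError<E> for PrintErrors {
    fn handle_error(&self, error: E) {
        eprintln!("connection manager error: {}", error);
    }
}

// Installed via the builder, assuming `manager` is some ManageConnection:
// let pool = r2d2::Pool::builder()
//     .error_handler(Box::new(PrintErrors))
//     .build(manager)?;
```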
Cool, if I get time I'll try to put together a PR adding some logging. That metric API looks reasonable. I did have a vision for some kind of automatic monitor you could (optionally) wrap around a pool - you would configure expected lease / checkout times and it would log errors or warnings any time the behaviour was too unusual, possibly even reset the pool, but maybe that's over-engineering it... The benefit would be that you could worry less about performance impact because the monitor would be opt-in, whilst also providing a very resilient "default" for users of the library.
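As a rough illustration of that monitor idea, an opt-in wrapper built purely on the public `Pool::get` API might look like the sketch below; the type, field names, and warning threshold are all hypothetical:

```rust
use std::time::{Duration, Instant};

use r2d2::{ManageConnection, Pool, PooledConnection};

// Hypothetical opt-in monitor: time each checkout against a configured
// expectation and warn when it takes suspiciously long.
struct MonitoredPool<M: ManageConnection> {
    pool: Pool<M>,
    expected_checkout: Duration,
}

impl<M: ManageConnection> MonitoredPool<M> {
    fn get(&self) -> Result<PooledConnection<M>, r2d2::Error> {
        let start = Instant::now();
        let conn = self.pool.get();
        let elapsed = start.elapsed();
        if elapsed > self.expected_checkout {
            eprintln!(
                "pool checkout took {:?}, expected at most {:?}",
                elapsed, self.expected_checkout
            );
        }
        conn
    }
}
```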
This can be used by applications to better monitor their connection pool usage. For example, you could maintain a histogram of pool checkout times and alert if the p95 gets too high, indicating oversaturation. cc #75
Have there been any additional developments with this? I seem to have run into a similar issue where connections in a pool would die and not be recovered while using
You could register an event handler to track what's going on in the pool to try to diagnose.
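Something along these lines would be a starting point, sketched against the `event` module as it appears in the 0.8.x docs (double-check the exact method and accessor names against the r2d2 version in use; the threshold and messages here are made up):

```rust
use r2d2::event::{CheckoutEvent, HandleEvent, TimeoutEvent};

// Hypothetical diagnostic handler: log slow checkouts and every timeout so it
// becomes visible when the pool stops handing out connections.
#[derive(Debug)]
struct Diagnostics;

impl HandleEvent for Diagnostics {
    fn handle_checkout(&self, event: CheckoutEvent) {
        if event.duration() > std::time::Duration::from_millis(100) {
            eprintln!("slow pool checkout: {:?}", event.duration());
        }
    }

    fn handle_timeout(&self, _event: TimeoutEvent) {
        eprintln!("a pool checkout timed out");
    }
}

// let pool = r2d2::Pool::builder()
//     .event_handler(Box::new(Diagnostics))
//     .build(manager)?;
```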
The way errors are reported appears to be racy/inconsistent when a connection cannot be obtained from the pool. Sometimes you will just get `Error(None)`, whilst other times you will get an actual error message from the connection manager. I assume this is because another thread happens to take the error message before the original thread gets to it.

I can see why there might be reasons not to synchronously establish the connection from the thread that could not find an available connection in the pool (e.g. so that its request could be fulfilled sooner if another connection is returned to the pool more quickly), but I think when a connection cannot be established, the pool should behave as though it had synchronously attempted to establish the new connection (guaranteeing a 1:1 relationship between failed requests and errors).

It would also be useful to distinguish failures due to hitting the `max_size` on the pool from connection failures - at the moment you also (IIUC) get an `Error(None)` if the pool is maxed out.