Inconsistent error reporting #75

Closed
Diggsey opened this issue Mar 18, 2019 · 7 comments

Diggsey commented Mar 18, 2019

The way errors are reported appears to be racy/inconsistent when a connection cannot be obtained from the pool. Sometimes you will just get Error(None), whilst other times you will get an actual error message from the connection manager. I assume this is because another thread happens to take the error message before the original thread gets to it.

I can see why there might be reasons to not want to synchronously establish the connection from the thread that could not find an available connection in the pool (e.g. so that its request could be fulfilled sooner if another connection is returned to the pool more quickly), but I think when a connection cannot be established, it should behave as though it synchronously attempted to establish the new connection (guaranteeing a 1:1 relationship between failed requests and errors).

It would also be useful to distinguish failures due to hitting the pool's max_size from connection failures - at the moment (IIUC) you also get an Error(None) if the pool is maxed out.

sfackler (Owner)

The error message included along with a checkout timeout error is just a convenience to hopefully give some context if the pool can no longer connect to the database. The only reason a checkout fails is a timeout, and the pool may have attempted to open zero, one, or many connections during that time, but none of them was tied to that specific checkout request. The pool's set of connections is maintained entirely asynchronously from individual checkouts.

Diggsey (Author) commented Mar 20, 2019

That makes sense, but it has the unfortunate effect that many kinds of issues when using r2d2 are quite hard to debug.

Let me explain a problem and maybe you can think of a way to make these kinds of issues more obvious:

Yesterday I had to deal with a production issue where one or more instances of our application would get into a "bad state", and would start failing a certain percentage of requests. Sometimes these instances would recover, whilst other times they would get stuck and have to be manually restarted. The trigger seemed to be when a flurry of requests happened at the same time.

The only error being reported was that r2d2 was failing to obtain connections from the pool. Sometimes these errors would say "could not connect" - most of the time the error would just be empty.

I had a bunch of theories, from connections in the pool becoming "tainted" in some way, to the "database proxy" we use malfunctioning. In the end, it turned out to be a rather subtle deadlock: in a couple of places in the code, the same thread would try to lease a connection from the pool whilst a connection was already active further up the callstack.

When several requests came in at the exact same time and all obtained the "first" connection, they would then all deadlock trying to obtain the "second" connection, as that would exceed the pool limit, so no thread could make progress. As soon as one thread timed out, only one connection would be released, and (due to some apparent unfairness in the way threads are woken?) that connection would be taken by the next incoming request, which would continue the cycle indefinitely.

In response to this issue, we put in place some measures to prevent it from happening again:

  • TLS (thread-local storage) safety check: ensuring only one connection is obtained from a thread at a time (in our own code; see the sketch after this list)
  • Adding logging throughout our fork of r2d2, including the amount of time spent in various stages (how long to establish a connection, how long a connection is leased before being returned, how long it took to get a connection from the pool, etc.)
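
Roughly what the first bullet looks like in practice - a minimal sketch, where the with_connection wrapper and its names are hypothetical (our real code differs) and nothing here is part of r2d2's API:

use std::cell::Cell;

thread_local! {
    // Tracks whether the current thread already holds a pooled connection.
    static HOLDING_CONNECTION: Cell<bool> = Cell::new(false);
}

// Refuses a nested checkout: the same thread trying to lease a second
// connection while one is still live is exactly the pattern that caused
// the deadlock described above.
fn with_connection<M, T>(pool: &r2d2::Pool<M>, f: impl FnOnce(&mut M::Connection) -> T) -> T
where
    M: r2d2::ManageConnection,
{
    HOLDING_CONNECTION.with(|flag| {
        assert!(!flag.get(), "nested pool checkout on the same thread");
        flag.set(true);
        let mut conn = pool.get().expect("checkout failed");
        let result = f(&mut conn);
        drop(conn); // return the connection to the pool
        flag.set(false);
        result
    })
}

A real version would also reset the flag if f panics (e.g. via an RAII guard), but this shows the idea.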

One thing that exacerbated the issue was that when a connection is released to the pool, it seems like it will almost always be picked up by the thread that most recently tried to obtain a connection. I don't know if Condvar is supposed to give any fairness guarantees, but it appears to be very unfair. The result was that threads which failed to obtain a connection on the first attempt could easily be starved - in this case, it is the reason why instances rarely recovered by themselves: the steady influx of new requests caused a self-perpetuating problem, as new requests would grab the connections as they were released, so older requests would take the full 30s or so to time out.

Anyway, as far as what can actually be done within r2d2: I think adding logging and finding a way to distribute connections more fairly would both help.

There are certain things about the state of the pool that are very strong indicators that something is wrong (e.g. long lease times, high contention for connections, etc.). In our case, the most helpful change was adding logging for the lease time on connections, as we knew our code should only ever lease a connection for a few milliseconds.

sfackler (Owner)

I'd be happy to take a PR adding a bunch of debug and trace level logging!

We could switch from the stdlib synchronization primitives to parking_lot, which does have a fair unlock API: https://docs.rs/lock_api/0.1.5/lock_api/struct.MutexGuard.html#method.unlock_fair.
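
For illustration, a fair handoff with parking_lot looks roughly like this - a minimal sketch of the mechanism, not of r2d2's actual internals (release_slot is a made-up function):

use parking_lot::{Condvar, Mutex, MutexGuard};

// Release a slot and hand the mutex to the longest-waiting thread instead
// of letting the releasing thread (or a newly arriving one) barge back in
// and re-acquire it immediately.
fn release_slot(free_slots: &Mutex<usize>, cond: &Condvar) {
    let mut free = free_slots.lock();
    *free += 1;
    cond.notify_one();
    MutexGuard::unlock_fair(free);
}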

It would also be nice to report various metrics like checkout and lease times, but I'm not sure how exactly we'd want that API to look. Maybe something like this?

use std::time::Duration;

pub trait ConsumeMetric {
    fn checkout_time(&self, time: Duration) {}

    fn lease_time(&self, time: Duration) {}

    // ...
}
You could then hook that up to whatever metrics library your service is using and feed the times into a histogram or whatever.
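
For example, a consumer might implement it roughly like this - a sketch assuming the proposed ConsumeMetric trait above, with hdrhistogram as an arbitrary choice of sink:

use std::sync::Mutex;
use std::time::Duration;

use hdrhistogram::Histogram;

// Hypothetical consumer of the proposed trait: feeds checkout times into a
// histogram so percentiles (p95, p99, ...) can be watched.
struct HistogramMetrics {
    checkout_micros: Mutex<Histogram<u64>>,
}

impl HistogramMetrics {
    fn new() -> Self {
        HistogramMetrics {
            checkout_micros: Mutex::new(Histogram::new(3).unwrap()),
        }
    }
}

impl ConsumeMetric for HistogramMetrics {
    fn checkout_time(&self, time: Duration) {
        let mut hist = self.checkout_micros.lock().unwrap();
        // Values outside the histogram's configured range are simply dropped.
        let _ = hist.record(time.as_micros() as u64);
    }
}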

sfackler (Owner)

You can also install an error handler which can listen on every error produced from the connection manager, which should be more reliable than whatever happens to end up tacked on to the timeout error: https://docs.rs/r2d2/0.8.3/r2d2/struct.Builder.html#method.error_handler
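
Something like this, for example - a minimal sketch where LogConnectionErrors is a made-up handler (and it assumes the log crate for output):

use r2d2::{HandleError, ManageConnection, Pool};

// Hypothetical handler: sees every error the connection manager reports,
// rather than only whatever happens to be attached to a checkout timeout.
#[derive(Debug)]
struct LogConnectionErrors;

impl<E: std::error::Error> HandleError<E> for LogConnectionErrors {
    fn handle_error(&self, error: E) {
        log::error!("connection manager error: {}", error);
    }
}

fn build_pool<M: ManageConnection>(manager: M) -> Result<Pool<M>, r2d2::Error> {
    Pool::builder()
        .error_handler(Box::new(LogConnectionErrors))
        .build(manager)
}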

Diggsey (Author) commented Mar 21, 2019

Cool, if I get time I'll try to put together a PR adding some logging.

That metric API looks reasonable. I did have a vision for some kind of automatic monitor you could (optionally) wrap around a pool: you would configure expected lease/checkout times, and it would log errors or warnings any time the behaviour was too unusual, possibly even resetting the pool - but maybe that's over-engineering it... The benefit would be that you could worry less about performance impact, since the monitor would be opt-in, whilst also providing a very resilient "default" for users of the library.

sfackler added a commit that referenced this issue Mar 24, 2019
This can be used by applications to better monitor their connection
pool usage. For example, you could maintain a histogram of pool checkout
times and alert if the p95 gets too high, indicating oversaturation.

cc #75

repnop commented May 7, 2019

Have there been any additional developments with this? I seem to have run into a similar issue where connections in a pool would die and not be recovered while using diesel's pooled connections. This would happen overnight, and my service that depends on the connections would stop responding and error out with no discernible error message. The first one would be Err("The MySQL server has left"), then I'd just get Err(None) when the database is running just fine. In the meantime I have removed all of the connection pooling from the service, since it's not high-traffic anyway, but it was quite annoying to track down the issue.

sfackler (Owner) commented May 7, 2019

You could register an event handler to track what's going on in the pool and try to diagnose the problem.
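
For example (a sketch assuming the event-handler API added by the commit referenced above - the DiagnosticLogger type is made up, and it assumes the log crate for output):

use r2d2::event::{AcquireEvent, CheckoutEvent, ReleaseEvent, TimeoutEvent};
use r2d2::HandleEvent;

// Hypothetical handler that logs pool lifecycle events, which would show
// which connections die overnight and how long checkouts are taking.
#[derive(Debug)]
struct DiagnosticLogger;

impl HandleEvent for DiagnosticLogger {
    fn handle_acquire(&self, event: AcquireEvent) {
        log::debug!("connection {} opened", event.connection_id());
    }

    fn handle_release(&self, event: ReleaseEvent) {
        log::debug!("connection {} closed after {:?}", event.connection_id(), event.age());
    }

    fn handle_checkout(&self, event: CheckoutEvent) {
        log::debug!("connection {} checked out in {:?}", event.connection_id(), event.duration());
    }

    fn handle_timeout(&self, event: TimeoutEvent) {
        log::warn!("checkout timed out after {:?}", event.timeout());
    }
}

// Registered the same way as the error handler:
// let pool = Pool::builder()
//     .event_handler(Box::new(DiagnosticLogger))
//     .build(manager)?;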

Diggsey closed this as completed Jun 19, 2020