Skip to content

postmortem 2018 01 18

William Stein edited this page Jan 18, 2018 · 1 revision


We continued to have issues today (and last night) where the connections to the database mysteriously just stopped returning results indefinitely. I caught this quickly this morning, got logs of it, studied them, and constructed and deployed a fix.

Now, if the database does not respond to a query within a certain amount of time, the connection is automatically reset. Also, there's a periodic test that the database is working.

Relevant code... and

Basically this hardens all database clients, so even if the connection stops working (for any reason at all), then things are quickly fixed. This doesn't get at the root cause of why the much lower-level networking might sometime fail, but makes things robust against that sort of failure causing serious trouble in the future.

I also setup additional monitoring to detect this problem if it were to happen again.

Clone this wiki locally
You can’t perform that action at this time.