Postmortem 2018-01-18
We continued to have issues today (and last night): connections to the database would mysteriously stop returning results and hang indefinitely. I caught this quickly this morning, captured logs of it, studied them, and built and deployed a fix.
Now, if the database does not respond to a query within a certain amount of time, the connection is automatically reset. Also, there's a periodic test that the database is working.
Basically, this hardens all database clients: even if a connection stops working (for any reason at all), it is quickly repaired. This doesn't get at the root cause of why the much lower-level networking might sometimes fail, but it makes things robust against that sort of failure causing serious trouble in the future.
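To make the idea concrete, here is a rough sketch of a timeout-and-reset wrapper plus a periodic check. It is not the code that was deployed. The names `HardenedClient`, `start_periodic_check`, and the `connect` factory (returning an object with `query(sql)` and `close()`) are all made up for illustration, and the timeout and interval values are arbitrary.

```python
import concurrent.futures
import threading


class HardenedClient:
    """Wraps a database connection and resets it when a query hangs.

    `connect` is a hypothetical factory returning an object with
    .query(sql) and .close(); it is an assumption, not a real driver API.
    """

    def __init__(self, connect, timeout_s=5.0):
        self._connect = connect
        self._timeout_s = timeout_s
        self._conn = connect()
        self.resets = 0  # number of times the connection was replaced
        # Queries run in worker threads so the caller can stop waiting.
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def _reset(self):
        # Drop the old connection (ignoring close errors) and open a new one.
        try:
            self._conn.close()
        except Exception:
            pass
        self._conn = self._connect()
        self.resets += 1

    def query(self, sql):
        future = self._pool.submit(self._conn.query, sql)
        try:
            return future.result(timeout=self._timeout_s)
        except concurrent.futures.TimeoutError:
            # The worker thread may still be blocked on the dead connection;
            # we only stop waiting for it and switch to a fresh connection.
            self._reset()
            raise TimeoutError(
                f"no response within {self._timeout_s}s; connection reset"
            ) from None

    def health_check(self):
        # A trivial query; a hang here triggers the same reset path.
        try:
            self.query("SELECT 1")
            return True
        except Exception:
            return False


def start_periodic_check(client, interval_s=30.0):
    """Run client.health_check() every interval_s seconds in the background.

    Returns a threading.Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            client.health_check()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The point of the periodic check is that an idle client notices a dead connection on its own instead of waiting for the next real query to hang. Note that this sketch does not retry the failed query; the caller sees the timeout and decides what to do.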
I also set up additional monitoring to detect this problem if it happens again.