postmortem 2018 01 18

William Stein edited this page Jan 18, 2018 · 1 revision

2018-01-18

We continued to have issues last night and today in which connections to the database mysteriously stopped returning results and stayed stuck indefinitely. I caught this quickly this morning, captured logs of it, studied them, and then wrote and deployed a fix.

Now, if the database does not respond to a query within a certain amount of time, the connection is automatically reset. Also, there's a periodic test that the database is working.
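To make the idea concrete, here is a minimal sketch in TypeScript of a query timeout that resets the connection, plus a periodic check. This is not the actual CoCalc code: the `Connection` interface, `queryWithTimeout`, `startHealthCheck`, the `SELECT 1` probe, and all timing parameters are hypothetical names and values chosen only for illustration.

```typescript
// Hypothetical minimal connection interface; an assumption, not CoCalc's real API.
interface Connection {
  query(sql: string): Promise<unknown>;
  reset(): void; // drop the current connection and establish a new one
}

// Sentinel used to tell "the timer fired" apart from a real query result.
const TIMED_OUT = Symbol("timed out");

// Run a query; if no response arrives within timeoutMs, reset the
// connection and reject, so the caller is never left waiting forever.
async function queryWithTimeout(
  conn: Connection,
  sql: string,
  timeoutMs: number
): Promise<unknown> {
  let cancel: () => void = () => {};
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    const timer = setTimeout(() => resolve(TIMED_OUT), timeoutMs);
    cancel = () => clearTimeout(timer);
  });
  try {
    const result = await Promise.race([conn.query(sql), timeout]);
    if (result === TIMED_OUT) {
      conn.reset();
      throw new Error(`no response within ${timeoutMs} ms; connection reset`);
    }
    return result;
  } finally {
    cancel(); // do not leave the timer pending after the query settles
  }
}

// Periodically run a trivial query through the same timeout path, so a
// silently dead connection is noticed even when no client is querying.
// Returns a function that stops the check.
function startHealthCheck(
  conn: Connection,
  intervalMs: number,
  timeoutMs: number
): () => void {
  const handle = setInterval(() => {
    queryWithTimeout(conn, "SELECT 1", timeoutMs).catch(() => {
      // On timeout the connection has already been reset; nothing more to do.
    });
  }, intervalMs);
  return () => clearInterval(handle);
}
```

In this sketch, a query that fails with an ordinary database error is passed through unchanged; only a missing response triggers the reset. The stalled query itself is abandoned rather than cancelled.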

Relevant code: https://github.com/sagemathinc/cocalc/commit/56c5f8ef7f537359996244d426327913643c9c37 and https://github.com/sagemathinc/cocalc/commit/6c4140d4a4f82f8438093879220dfa3759bc83cb

Basically, this hardens all database clients: even if a connection stops working (for any reason at all), it is quickly repaired. This doesn't get at the root cause of why the much lower-level networking might sometimes fail, but it makes things robust against that sort of failure causing serious trouble in the future.

I also set up additional monitoring to detect this problem if it happens again.
