
Postmortem 2018-01-12


Details about the 2018-01-12 outage

  • by William Stein

At about 11am on 2018-01-12, CoCalc mysteriously stopped working and was down for about 30 minutes.

I noticed immediately and investigated. All backend monitoring looked OK, but sign-in was not working and most internal communication had stopped.

The logs showed that queries from the backend webservers to the database were not completing. I restarted those webservers. This did NOT fix the problem.

The only changes we had made recently were fixes for the Meltdown/Spectre vulnerabilities, so I first suspected those. The details of the rollout are too complicated to explain succinctly, but I switched the webservers to nodes without the fix. This did NOT fix the problem.

Further testing revealed that the database itself was working fine, but its responses to clients were not being sent back over the network. I don't know why; my only hypothesis is that Google Compute Engine's networking had a "glitch", since most Google incident reports seem to read "We are experiencing an issue with packet loss...". Also, we had not changed anything that could cause this, and it is a problem we had never seen before.
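For illustration, here is the kind of check that distinguishes "the database is down" from "query results are not coming back". This is a hypothetical sketch, assuming a PostgreSQL database on a host named db0; the hostname, port, and exact commands are placeholders, not from the actual incident.

```bash
# Hypothetical diagnostics, not the exact steps from the incident.
# Assumes a PostgreSQL database on host db0 (a placeholder name).

# 1. Is the database port even reachable over TCP?
nc -z -w 5 db0 5432 && echo "TCP connect OK"

# 2. Does a trivial query complete, i.e. do result packets come back?
#    `timeout` bounds the whole round trip, so a hang (our symptom:
#    queries accepted but results never returned) shows up as exit 124
#    rather than a connection error.
PGCONNECT_TIMEOUT=5 timeout 10 psql -h db0 -p 5432 -c 'SELECT 1' \
    || echo "query did not complete"
```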

So I destroyed the VM on which the database was running, created a new VM, and started the database there. Everything immediately started working again.
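For reference, the remediation amounted to roughly the following; the instance, zone, machine type, and disk names are hypothetical placeholders, not the actual CoCalc values. The key point is keeping the persistent data disk so the database's data survives the VM swap.

```bash
# Hypothetical reconstruction of the remediation; all names below
# are placeholders.

# Delete the VM but keep its persistent data disk, so the database's
# data is not lost with the instance.
gcloud compute instances delete db0 --zone us-central1-c --keep-disks data

# Create a fresh VM and re-attach the existing data disk.
gcloud compute instances create db0-new \
    --zone us-central1-c \
    --machine-type n1-highmem-8 \
    --disk name=db0-data,device-name=db0-data
```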

The best way to prevent this class of problems is probably to make CoCalc run in multiple Google availability zones. This would cost far more than we can afford with our current customer base, so we can't do it until we have far more paying customers.
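As a rough sketch of what that might look like (not a committed plan), GCE regional managed instance groups can spread identical webserver VMs across several zones; the template and group names here are invented for illustration.

```bash
# Hypothetical multi-zone webserver deployment: one managed instance
# group distributing instances across three zones in a region, so a
# single-zone networking glitch cannot take down all webservers.
gcloud compute instance-groups managed create web-mig \
    --region us-central1 \
    --zones us-central1-a,us-central1-b,us-central1-c \
    --template web-template \
    --size 3
```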

Followup
