[SEV] 41 minutes of downtime due to bad AMI rollout #89

raxod502 · 2021-07-27T03:24:12Z

See incident history on Statuspage.io:

The root cause appears to have been #86. I've isolated an affected node so that we can investigate the behavior later, but in the meantime I rolled back the autoscaling group to the old AMI and reverted the code change in #88.

Things that could have gone better generally:

- Mitigation would have been an order of magnitude faster if I had remembered that I could just scp a new supervisor binary to the running instance instead of spinning up a new one (twice!). We may want to prepare a runbook for future availability incidents.
- There were a lot of unnecessarily manual operations here, and at first I did screw up some of the configuration for the new node I added to the ASG. This screw-up led me to mistakenly think that my manual operations might have been responsible for the outage, rather than the recent AMI change. Adding tooling to roll the ASG with a new AMI (#87) would solve this problem.
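
As a possible starting point for that runbook, here is a minimal sketch of the "scp a new binary to the running instance" path. It only builds the command lines, so they can be reviewed before anything is run. The login user, the staging path, the install location, and the assumption that the supervisor runs as a systemd unit named `supervisor` are all hypothetical and would need to be checked against the real instance setup.

```python
import shlex


def hotfix_commands(host, local_binary, remote_path, service, user="ubuntu"):
    """Build argv lists that copy a binary to a host and restart a service.

    Everything here is illustrative: the user, the /tmp staging path, and
    the use of systemd are assumptions, not the project's actual setup.
    """
    target = f"{user}@{host}"
    staging = f"/tmp/{service}.new"  # hypothetical staging location
    # Install the staged file over the old binary, then restart the unit.
    remote_script = " && ".join([
        f"sudo install -m 0755 {shlex.quote(staging)} {shlex.quote(remote_path)}",
        f"sudo systemctl restart {shlex.quote(service)}",
    ])
    return [
        ["scp", local_binary, f"{target}:{staging}"],
        ["ssh", target, remote_script],
    ]


if __name__ == "__main__":
    # Placeholder host and paths; print the commands instead of running them.
    for argv in hotfix_commands(
        "203.0.113.10", "./supervisor", "/usr/local/bin/supervisor", "supervisor"
    ):
        print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the sketch safe to try. During an incident, each list could be passed to `subprocess.run(argv, check=True)` once the values have been confirmed.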

raxod502 · 2021-08-17T23:48:38Z

Manual operations are no longer needed to roll the autoscaling group due to my alternate resolution of #78 (place all ASG instances in a single AZ).

raxod502 added the sev label Jul 27, 2021

raxod502 mentioned this issue Jul 27, 2021

Create a (separated?) status page #85

Closed

raxod502 closed this as completed Aug 17, 2021
