The root cause appears to have been #86. I've isolated an affected node so that we can investigate the behavior later, but in the meantime I rolled the autoscaling group back to the old AMI and reverted the code change in #88.
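For reference, the rollback amounted to pinning the ASG back to the launch-template version that still referenced the old AMI, then cycling out the affected instances. A minimal sketch with boto3; the group name, template name, and version number are placeholders, not values from this incident:

```python
import boto3

asg = boto3.client("autoscaling")

# Pin the ASG to the launch-template version that still references the
# old AMI, so any newly launched instances use it. Names and the version
# number here are hypothetical.
asg.update_auto_scaling_group(
    AutoScalingGroupName="app-asg",
    LaunchTemplate={"LaunchTemplateName": "app-lt", "Version": "3"},
)

# Replace running capacity so instances on the bad AMI are cycled out.
asg.start_instance_refresh(
    AutoScalingGroupName="app-asg",
    Preferences={"MinHealthyPercentage": 90},
)
```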
More generally, things that could have gone better:
Mitigation would have been an order of magnitude faster if I had remembered that I could simply scp a new supervisor binary onto the running instance instead of spinning up a new instance (which I did twice!). We may want to prepare a runbook for future availability incidents; a sketch of that faster path follows below.
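A runbook entry for that path might look roughly like this, assuming the supervisor binary is managed as a systemd unit and the instance is reachable over SSH; the host, paths, and unit name are all hypothetical:

```python
import subprocess

HOST = "ec2-user@10.0.1.23"        # hypothetical affected instance
NEW_BINARY = "./build/supervisor"  # hypothetical path to the rebuilt binary

# Copy the replacement binary to the instance, install it over the old
# one, and restart the (assumed) systemd unit.
subprocess.run(["scp", NEW_BINARY, f"{HOST}:/tmp/supervisor.new"], check=True)
subprocess.run(
    ["ssh", HOST,
     "sudo install -m 755 /tmp/supervisor.new /usr/local/bin/supervisor"
     " && sudo systemctl restart supervisor"],
    check=True,
)
```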
There were a lot of unnecessarily manual operations here, and at first I screwed up some of the configuration for the new node I added to the ASG. That mistake led me to believe, wrongly, that my manual operations might have caused the outage rather than the recent AMI change. Add tooling to roll ASG with new AMI (#87) would solve this problem; see the sketch after this item.
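The tooling in #87 could be a thin wrapper around the same API calls: create a launch-template version for the new AMI, point the ASG at it, and kick off an instance refresh. A sketch under those assumptions, with all names hypothetical:

```python
import boto3

def roll_asg_to_ami(asg_name: str, lt_name: str, ami_id: str) -> None:
    """Roll an ASG onto a new AMI via a fresh launch-template version."""
    ec2 = boto3.client("ec2")
    asg = boto3.client("autoscaling")

    # New launch-template version that differs only in its AMI.
    version = ec2.create_launch_template_version(
        LaunchTemplateName=lt_name,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": ami_id},
    )["LaunchTemplateVersion"]["VersionNumber"]

    asg.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchTemplate={"LaunchTemplateName": lt_name, "Version": str(version)},
    )

    # Cycle instances onto the new version, keeping 90% of capacity healthy.
    asg.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={"MinHealthyPercentage": 90},
    )

roll_asg_to_ami("app-asg", "app-lt", "ami-0123456789abcdef0")
```

A single command per roll would remove the per-node manual steps that caused the misconfiguration above.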
See incident history on Statuspage.io: