Closed
Labels
ci: sev (critical failure affecting PyTorch CI)
ci: sev-mitigated (This label marks a sev as mitigated and suppresses "ci: sev")
Description
NOTE: Remember to label this issue with "ci: sev"
If you want autorevert to be disabled, keep the ci: disable-autorevert label
Current Status
Mitigated, queues are recovering.
AWS experienced a major outage (https://health.aws.amazon.com/health/status) this morning, which took down most of our GHA infrastructure.
We are still recovering and will post an update once our services are back up.
Error looks like
Provide some way users can tell that this SEV is causing their issue.
Incident timeline (all times pacific)
Include when the incident began, when it was detected, mitigated, root caused, and finally closed.
User impact
How does this affect users of PyTorch CI?
Root cause
What was the root cause of this issue?
Mitigation
How did we mitigate the issue?
Prevention/followups
How do we prevent issues like this in the future?
Metadata
Status
Done