AWS was down, GHA infrastructure affected / recovering #165909

@seemethere

Description

NOTE: Remember to label this issue with "ci: sev"
If you want autorevert to be disabled, keep the ci: disable-autorevert label

Current Status

Mitigated, queues are recovering.

AWS experienced a major outage (https://health.aws.amazon.com/health/status) this morning, which took most of our GHA infra down with it.

We are still recovering and will post updates as our services come back online.

Error looks like

Provide some way users can tell that this SEV is causing their issue.

Incident timeline (all times Pacific)

Include when the incident began, when it was detected, mitigated, root caused, and finally closed.

User impact

How does this affect users of PyTorch CI?

Root cause

What was the root cause of this issue?

Mitigation

How did we mitigate the issue?

Prevention/followups

How do we prevent issues like this in the future?

Metadata

Assignees

No one assigned

    Labels

    ci: sev (critical failure affecting PyTorch CI)
    ci: sev-mitigated (This label marks a sev as mitigated and suppress "ci: sev")

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
