Closed
Labels
- module: ci (Related to continuous integration)
- tracker (A tracking issue)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
This will be a meta-issue/tracker for directly actionable issues regarding stability and reliability in PyTorch CI.
Motivation
CI stability is crucial to developer velocity and user experience. When CI is unstable, developers may waste time debugging CI failures that are unrelated to their changes, and the on-call in charge of keeping trunk green has more issues to triage. When trunk is red, PyTorch users who build from source can be affected as well.
Immediately Actionable Issues
- Modify Dr. CI so it could detect runner disconnection failures #66902
- [CI] Flaky initialization of OMP #79055
- android-tests is often flaky #79785
- Bazel build is very flaky #78100
Windows Specific
none so far!
Rolled into bigger projects (please post relevant suggestions in their respective issues)
- Flaky Test Detection + Retry: Automatically Detect and Remove Flaky Tests #71005
- Removing Network Connectivity Risks In CI: CI: Removing Network Connectivity Risks #71003
- Retryable Steps: GHA: Use retryable steps #71563