Document CI retry rules #103482

@huydhn

With all the recent changes to retrying made to harden PyTorch CI, we need to create a wiki page documenting all of these mechanisms. The tentative list includes:

  • Individual test case retry (flaky bot)
  • Retry test file
  • Retry on workflow steps (using GHA)
  • Retry the job itself (retry bot)
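
To make the first level of the list concrete for the wiki page, here is a minimal sketch of what test-case-level retrying could look like. It is not the actual PyTorch CI implementation: `run_with_retries`, `max_attempts`, and the "flaky" status string are hypothetical names chosen for illustration. The idea is only that a test which fails and then passes on a rerun is reported as flaky rather than failed.

```python
from typing import Callable, Tuple


def run_with_retries(
    test_fn: Callable[[], None], max_attempts: int = 3
) -> Tuple[str, int]:
    """Run test_fn up to max_attempts times (hypothetical helper).

    Returns (status, attempts_used), where status is:
      "passed" - succeeded on the first attempt
      "flaky"  - failed at least once, then succeeded on a rerun
      "failed" - failed on every attempt
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            test_fn()
        except Exception:
            # Treat any exception as a test failure and try again.
            continue
        return ("passed" if attempt == 1 else "flaky", attempt)
    return ("failed", max_attempts)
```

The same wrapper shape applies at the other levels (test file, workflow step, job); only the unit being rerun and the cost of each rerun change.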

In addition, we also want to gather data points to answer the following questions:

  • How many resources do we spend on retrying these cases?
  • Roughly how often do people manually retry jobs on their PRs to get green signals or to debug flaky issues?
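
As a starting point for the first question, here is a rough sketch of how retry overhead could be computed once per-attempt job data is available. The `JobAttempt` record and its fields are assumptions made for illustration, not an existing schema; the real data source and field names would need to be filled in.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class JobAttempt:
    """One run of a CI job (hypothetical record shape)."""

    job_name: str
    attempt: int  # 1 = first run, greater than 1 = a retry
    duration_s: float


def retry_overhead(attempts: Iterable[JobAttempt]) -> float:
    """Return the fraction of total job time spent on retries."""
    total = 0.0
    retried = 0.0
    for a in attempts:
        total += a.duration_s
        if a.attempt > 1:
            retried += a.duration_s
    # Avoid dividing by zero when there is no data.
    return retried / total if total else 0.0
```

For example, with a 600 s build, a 300 s test run, and one 300 s retry of that test run, the function returns 0.25, meaning a quarter of the job time went to retries.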

cc @ZainRizvi @kit1980 @clee2000

Labels

  • module: devx (related to PyTorch contribution experience: HUD, pytorchbot)
  • triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

Status

Cold Storage
