
Conversation

@ZainRizvi
Contributor

@ZainRizvi ZainRizvi commented Dec 12, 2022

This PR introduces two changes:

  1. It pairs with Give linting steps a unique prefix pytorch#90705 to avoid retrying steps which do the actual linting or building, limiting retries to infra-level steps which are actually likely to be flaky.
  2. It extends the retry mechanism to all PRs, not just trunk.

If a retryable step fails before a nonretryable step, the behavior is determined by whether the conditions defined on the nonretryable step allow it to run. By default, a step is skipped if a previous step did not succeed, and its conclusion is reported as `skipped`, which retrybot treats the same way as a successful step. This is the case for all the nonretryable steps that depend on earlier infra steps passing. However, if a step is set to run on `failure()` or `always()`, then its status is populated appropriately and is taken into consideration by retrybot.

The net effect: if some infra step fails and a nonretryable step conditioned to run on always() also fails, then we trust that the nonretryable step hit a legitimate failure and we will not retry the workflow.
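The decision described above can be sketched as follows. This is an illustrative model, not the actual retrybot code: the `Step` interface and the `isRetryableFailure` helper name are hypothetical, though the conclusion values mirror the GitHub Actions API.

```typescript
// Illustrative model of the retry decision described above; names are hypothetical.
interface Step {
  name: string;
  // GitHub Actions conclusions include "success", "failure", "skipped",
  // or null while the step is still running.
  conclusion: string | null;
}

// Returns true when a failed workflow looks safe to retry: no step that does
// the actual work (linting/building) reported a failure of its own.
function isRetryableFailure(
  steps: Step[],
  isNonRetryableStep: (s: Step) => boolean
): boolean {
  for (const step of steps) {
    // Steps skipped after an earlier infra failure report "skipped", which is
    // treated the same way as success here, so they never block a retry.
    if (step.conclusion === "failure" && isNonRetryableStep(step)) {
      // A nonretryable step ran (e.g. conditioned on always()) and failed:
      // trust that the failure is legitimate and do not retry.
      return false;
    }
  }
  return true;
}
```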

@vercel

vercel bot commented Dec 12, 2022

The latest updates on your projects:

torchci — ✅ Ready — Dec 14, 2022 at 5:35PM (UTC)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 12, 2022
@ZainRizvi ZainRizvi requested review from a team and clee2000 December 12, 2022 18:11
@ZainRizvi ZainRizvi changed the title RetryBot enhancements: Linter & Extend retry mechanism to PRs RetryBot enhancements: Better retry logic & Extend retry mechanism to PRs Dec 12, 2022
return true;
}

// for builds, rerun if it didn't fail on the actual build step
Contributor

The same logic applies for build, lint, and test, so maybe it's easier to put them into some data structure?

Contributor Author


Extracted common logic to a helper function. Still wanted to leave step navigation flexible though.
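The reviewer's table-driven suggestion could look roughly like this. The lint prefix matches the one used in the code below; the `build` and `test` prefixes and the `failedOnRealWork` helper are purely illustrative, not the real retrybot values.

```typescript
// Table-driven variant of the per-workflow checks.
// Only the lint prefix is taken from the PR; the others are hypothetical.
const workStepPrefix: Record<string, string> = {
  lint: "run lint - ",
  build: "build",
  test: "test",
};

// True when the job failed on a step that does the actual work, i.e. a
// failure that should NOT be retried.
function failedOnRealWork(
  workflowName: string,
  steps: { name: string; conclusion: string | null }[]
): boolean {
  const prefix = workStepPrefix[workflowName];
  if (prefix === undefined) return false; // unknown workflows are not flagged
  return steps.some(
    (s) => s.conclusion === "failure" && s.name.toLowerCase().startsWith(prefix)
  );
}
```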

Contributor

@malfet malfet left a comment


Do we have a mechanism for measuring the potential overhead of such a change? Also, it looks like we are pursuing two orthogonal strategies: adding retries and then retrying entire jobs. I wonder why? Perhaps only incomplete jobs (i.e. the ones that lost connection to the server) should be retried?

}

// @ts-expect-error - we don't have types for these
let jobsDoesntFailStepsLike = (job, predicate: (step: any) => boolean) => {
Contributor


Variable name sounds confusing. It's common practice to start boolean predicate names with is/does (so it reads like a question: isEven, didFail, etc.).

Suggested change
let jobsDoesntFailStepsLike = (job, predicate: (step: any) => boolean) => {
let doesLookLikeInfraFailure = (job, predicate: (step: any) => boolean) => {

Contributor Author


Ooo, I like this name!


// rerun if the linter didn't fail on the actual linting steps
if (workflowName === "lint" &&
jobsDoesntFailStepsLike(job, step => step.name.toLowerCase().startsWith("run lint - "))){
Contributor


Is that why you need pytorch/pytorch#90705? Could it perhaps be done differently somehow, like querying some extra property?

Contributor Author


Yeah, this was the reason for that PR. There isn't any existing property that offers this information; the renaming of certain steps is what adds that data.

@ZainRizvi
Contributor Author

ZainRizvi commented Dec 12, 2022

Do we have a mechanism of measuring potential overhead of such change? Also, it looks like we are pursuing 2 orthogonal strategies: adding retry and then retrying the entire jobs. I wonder why? I.e. perhaps only incomplete jobs (i.e. the ones that lost connection to the server) should be retried?

We've been running this logic against master for a very long time without much overhead detected. The steps that might be retried are only steps that fail due to infra outages and should be retried anyway. Excluding the actual build/lint steps prevents bad PRs from triggering extra load.

We'll also monitor hud/metrics to check whether we start seeing excessive resource usage after this.

return job.steps?.filter(
// @ts-expect-error
(step) =>
step.conclusion !== null &&
Contributor


Q: If the step before the nonretryable one in the job has failed (with an infra failure), what would be the conclusion of the next step? If it is failure as well, then the logic would not work, would it? Can we add a unit test of sorts for this (using a simulated failure in canary or in one's personal repo)?

Contributor Author

@ZainRizvi ZainRizvi Dec 13, 2022


Edit: Its status is `skipped`, which is ignored

Contributor


Well, it would be good to have an example in the PR description where it works as expected; otherwise it's not really useful (i.e. if the API changes later, it would be good to have a reference to runs where it was working as expected).
And in general, testing code before landing is a good idea.
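The scenario in question can be simulated in a small test. The filter below mirrors the `step.conclusion !== null` check from the diff; the extra `skipped` handling and the step names are assumptions based on the author's reply, not the actual retrybot code.

```typescript
// Simulated job where an infra step fails and the build step is therefore skipped.
const simulatedSteps = [
  { name: "Setup SSH", conclusion: "failure" as string | null }, // infra step failed
  { name: "Build", conclusion: "skipped" as string | null },     // skipped because the infra step failed
];

// Steps that are skipped (or not yet concluded) are ignored, so the skipped
// nonretryable step does not block a retry: only the infra failure remains.
const considered = simulatedSteps.filter(
  (s) => s.conclusion !== null && s.conclusion !== "skipped"
);
```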

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Dec 14, 2022
Give a unique prefix to all steps in lint.yml which catch valid linter errors. This will let retrybot identify lint.yml steps which should not be retried.

This is a prelude to pytorch/test-infra#1275 which extends the retry-on-failure behavior to all PRs in addition to trunk.

This hadn't been an issue previously since we would only retry linter failures on `master`, where they were always safe to retry since legitimate linter failures there are virtually non-existent.
Pull Request resolved: #90705
Approved by: https://github.com/huydhn, https://github.com/malfet
@ZainRizvi ZainRizvi merged commit 10beb05 into main Dec 14, 2022
clee2000 added a commit that referenced this pull request Dec 14, 2022
clee2000 added a commit that referenced this pull request Dec 14, 2022
…… …anism to PRs (#1275)" (#1296)

can result in a weird race where someone pushes a commit -> triggers job 1 -> job 1 fails
they push another commit -> triggers job 2
job 1 gets rerun -> job 2 gets cancelled

seems to be a PR-exclusive problem b/c of the concurrency group
ZainRizvi added a commit that referenced this pull request Dec 21, 2022
… to PRs (#1325)

2nd attempt of #1275

This time we limit retries to cancelled jobs, and to workflows on the `master`
branch. This approach will result in false negatives, where we won't
retry some jobs on PRs which should be retried, but it's better than the
current situation of never retrying on pull requests, and we can iterate on it
going forward while still catching many kinds of infra flakes.
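The narrowed policy from #1325 can be sketched roughly like this. The `Job` field names follow the GitHub Actions API shape, but the exact predicate in the real change may differ; `mayRetry` is a hypothetical name.

```typescript
// Rough sketch of the narrowed retry policy described above (hypothetical names).
interface Job {
  conclusion: string;  // e.g. "cancelled", "failure", "success"
  head_branch: string;
}

function mayRetry(job: Job): boolean {
  // Cancelled jobs may be retried on any branch (PRs included).
  if (job.conclusion === "cancelled") {
    return true;
  }
  // Other retryable failures are retried only on the master branch.
  return job.head_branch === "master" && job.conclusion === "failure";
}
```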