RetryBot enhancements: Better retry logic & Extend retry mechanism to PRs #1275
Conversation
torchci/lib/bot/retryBot.ts
Outdated
    return true;
  }

  // for builds, rerun if it didn't fail on the actual build step
The same logic applies for build, lint, and test, so maybe it's easier to put them into some data structure?
Extracted common logic to a helper function. Still wanted to leave step navigation flexible though.
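To illustrate the reviewer's data-structure suggestion, here is a minimal sketch of how the shared retry check could be table-driven. All names here (`shouldRetry`, `realFailureStepPredicates`, and the workflow keys) are hypothetical illustrations, not the actual retryBot code:

```typescript
// Hypothetical sketch: map each workflow to a predicate identifying its
// "real" failure steps, so build/lint/test share one retry decision.
type Step = { name: string; conclusion: string | null };
type Job = { steps?: Step[] };

const realFailureStepPredicates: Record<string, (step: Step) => boolean> = {
  // Assumed example entries; the real step names may differ.
  lint: (step) => step.name.toLowerCase().startsWith("run lint - "),
  build: (step) => step.name.toLowerCase() === "build",
};

function shouldRetry(workflowName: string, job: Job): boolean {
  const isRealFailure = realFailureStepPredicates[workflowName];
  if (isRealFailure === undefined) {
    return false; // unknown workflow: be conservative, don't retry
  }
  const failedSteps = job.steps?.filter((s) => s.conclusion === "failure") ?? [];
  // Retry only when none of the failed steps are "real" build/lint steps,
  // i.e. the failure looks like an infra flake.
  return failedSteps.every((s) => !isRealFailure(s));
}
```

The upside of this shape is that adding a new workflow is a one-line map entry, though as noted in the reply, it trades away some flexibility in how each workflow navigates its steps.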
Do we have a mechanism for measuring the potential overhead of such a change? Also, it looks like we are pursuing 2 orthogonal strategies: adding retries and then retrying entire jobs. I wonder why? I.e., perhaps only incomplete jobs (i.e. the ones that lost connection to the server) should be retried?
torchci/lib/bot/retryBot.ts
Outdated
  }

  // @ts-expect-error - we don't have types for these
  let jobsDoesntFailStepsLike = (job, predicate: (step: any) => boolean) => {
Variable name sounds confusing. It's common practice to start boolean predicate names with is/does (so it reads like a question: isEven, didFail, etc.).
Suggested change:
- let jobsDoesntFailStepsLike = (job, predicate: (step: any) => boolean) => {
+ let doesLookLikeInfraFailure = (job, predicate: (step: any) => boolean) => {
Ooo, I like this name!
torchci/lib/bot/retryBot.ts
Outdated
  // rerun if the linter didn't fail on the actual linting steps
  if (workflowName === "lint" &&
      jobsDoesntFailStepsLike(job, step => step.name.toLowerCase().startsWith("run lint - "))) {
Is that why you need pytorch/pytorch#90705? Could it perhaps be done differently somehow, like querying some extra property?
Yeah, this was the reason for that PR. There isn't any existing property that offers this information; renaming certain steps is what adds that data.
We've been running this logic against master for a very long time without much overhead detected. The steps that might be retried are only steps that fail due to infra outages, which should be retried anyway. Excluding the actual build/lint steps prevents bad PRs from triggering extra load. We'll also watch hud/metrics to check whether we start seeing excessive resource usage after this.
  return job.steps?.filter(
    // @ts-expect-error
    (step) =>
      step.conclusion !== null &&
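The diff fragment above is truncated, so here is a hedged, self-contained reconstruction of what a helper like this plausibly looks like; the types and the exact filter conditions beyond what the fragment shows are assumptions, not the actual retryBot source:

```typescript
// Assumed shapes for illustration; the real code uses untyped GitHub API
// payloads (hence the @ts-expect-error in the original).
type Step = { name: string; conclusion: string | null };
type Job = { steps?: Step[] };

// Returns true when none of the job's failed steps match the workflow's
// "real" (non-infra) step predicate, i.e. the failure looks like infra flake.
const doesLookLikeInfraFailure = (
  job: Job,
  isRealStep: (step: Step) => boolean
): boolean => {
  const failedRealSteps = job.steps?.filter(
    (step) =>
      step.conclusion !== null &&
      step.conclusion === "failure" &&
      isRealStep(step)
  );
  // Zero matching failures (or no step data at all): safe to retry.
  return (failedRealSteps?.length ?? 0) === 0;
};
```

A caller would pass the workflow-specific predicate, e.g. `step => step.name.toLowerCase().startsWith("run lint - ")` for lint.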
Q: If the step before a non-retriable one in the job has failed (with an infra failure), what would be the conclusion of the next step? If it will be a failure as well, then the logic would not work, would it? Can we add a unit test of sorts for this one (using a simulated failure in canary or in one's personal repo)?
Edit: Its status is "skipped", which is ignored.
Well, it would be good to have an example in the PR description where it works as expected; otherwise it's not really useful (i.e., if the API later changes, it would be good to have a reference to runs when it was working as expected).
And in general, testing code before landing is a good idea.
Give a unique prefix to all steps in lint.yml which catch valid linter errors. This will let retrybot identify lint.yml steps which should not be retried. This is a prelude to pytorch/test-infra#1275, which extends the retry-on-failure behavior to all PRs in addition to trunk. This hadn't been an issue previously since we would only retry linter failures on `master`, where linter failures were always safe to retry since legitimate linter failures there are virtually non-existent. Pull Request resolved: #90705. Approved by: https://github.com/huydhn, https://github.com/malfet
… to PRs (#1325) 2nd attempt of #1275. This time we limit retries to cancelled jobs and to workflows on the `master` branch. This approach will result in false negatives, where we won't retry some jobs on PRs which should be retried, but it's better than the current situation of never retrying on pulls, and we can iterate on it going forward while still catching many kinds of infra flakes.
This PR introduces two changes:
If a retryable step fails before a non-retryable step, the behavior is determined by whether the conditions defined on that non-retryable step allow it to run. By default, steps get skipped if a previous step is skipped (and their conclusion will be "skipped", which retrybot treats the same way as successful steps). This is the case for all the non-retryable steps that depend on previous infra steps passing. However, if a step is set to run on failure() or always(), then it will have its status populated appropriately and is taken into consideration by retrybot. The net effect: if some infra step fails and a non-retryable step conditioned to run on always() also fails, then we will trust that the non-retryable step was a legitimate failure and will not retry the workflow.
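The conclusion-handling rule described above can be sketched in a few lines. This is an illustrative model, not the actual retryBot source; `blocksRetry` and its parameters are hypothetical names:

```typescript
// A step only blocks a retry when it genuinely ran and failed. Steps with a
// "skipped" conclusion (the GitHub Actions default when an earlier step
// failed) are treated like successes, matching the behavior described above.
type Step = { name: string; conclusion: string | null };

function blocksRetry(
  step: Step,
  isNonRetryable: (s: Step) => boolean
): boolean {
  if (step.conclusion === null || step.conclusion === "skipped") {
    // Still running, or skipped because a prior (infra) step failed: ignore.
    return false;
  }
  // A non-retryable step that ran (e.g. via failure()/always()) and failed
  // is trusted as a legitimate failure, so the workflow is not retried.
  return step.conclusion === "failure" && isNonRetryable(step);
}
```

Under this model, an infra failure followed by skipped lint steps is retried, while a lint step forced to run via always() that then fails suppresses the retry.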