RetryBot enhancements: Better retry logic & Extend retry mechanism to PRs #1275
Conversation
torchci/lib/bot/retryBot.ts
Outdated
    return true;
  }

  // for builds, rerun if it didn't fail on the actual build step
The same logic applies for build, lint, and test, so maybe it's easier to put them into some data structure?
Extracted common logic to a helper function. Still wanted to leave step navigation flexible though.
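To illustrate the reviewer's data-structure suggestion, here is a minimal sketch of how the shared retry check could be table-driven. All names here (`shouldRetry`, `realFailureStepPredicates`, and the workflow keys) are hypothetical illustrations, not the actual retryBot code:

```typescript
// Hypothetical sketch: map each workflow to a predicate identifying its
// "real" failure steps, so build/lint/test share one retry decision.
type Step = { name: string; conclusion: string | null };
type Job = { steps?: Step[] };

const realFailureStepPredicates: Record<string, (step: Step) => boolean> = {
  // Assumed example entries; the real step names may differ.
  lint: (step) => step.name.toLowerCase().startsWith("run lint - "),
  build: (step) => step.name.toLowerCase() === "build",
};

function shouldRetry(workflowName: string, job: Job): boolean {
  const isRealFailure = realFailureStepPredicates[workflowName];
  if (isRealFailure === undefined) {
    return false; // unknown workflow: be conservative, don't retry
  }
  const failedSteps = job.steps?.filter((s) => s.conclusion === "failure") ?? [];
  // Retry only when none of the failed steps are "real" build/lint steps,
  // i.e. the failure looks like an infra flake.
  return failedSteps.every((s) => !isRealFailure(s));
}
```

The upside of this shape is that adding a new workflow is a one-line map entry, though as noted in the reply, it trades away some flexibility in how each workflow navigates its steps.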
Do we have a mechanism for measuring the potential overhead of such a change? Also, it looks like we are pursuing 2 orthogonal strategies: adding retries and then retrying entire jobs. I wonder why? I.e., perhaps only incomplete jobs (i.e. the ones that lost connection to the server) should be retried?
torchci/lib/bot/retryBot.ts
Outdated
  }

  // @ts-expect-error - we don't have types for these
  let jobsDoesntFailStepsLike = (job, predicate: (step: any) => boolean) => {
Variable name sounds confusing. It's common practice to start boolean predicate names with is/does (so it reads like a question: isEven, didFail, etc.).
Suggested change:
- let jobsDoesntFailStepsLike = (job, predicate: (step: any) => boolean) => {
+ let doesLookLikeInfraFailure = (job, predicate: (step: any) => boolean) => {
Ooo, I like this name!
torchci/lib/bot/retryBot.ts
Outdated
  // rerun if the linter didn't fail on the actual linting steps
  if (workflowName === "lint" &&
      jobsDoesntFailStepsLike(job, step => step.name.toLowerCase().startsWith("run lint - "))) {
Is that why you need pytorch/pytorch#90705? Could it perhaps be done differently somehow, like querying some extra property?
Yeah, this was the reason for that PR. There isn't any existing property that offers this information; renaming certain steps is what adds that data.
We've been running this logic against master for a very long time without much overhead detected. The steps that might be retried are only steps that fail due to infra outages, which should be retried anyway. Excluding the actual build/lint steps prevents bad PRs from triggering extra load. We'll also watch hud/metrics to check whether we start seeing excessive resource usage after this.
  return job.steps?.filter(
    // @ts-expect-error
    (step) =>
      step.conclusion !== null &&
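The diff fragment above is truncated, so here is a hedged, self-contained reconstruction of what a helper like this plausibly looks like; the types and the exact filter conditions beyond what the fragment shows are assumptions, not the actual retryBot source:

```typescript
// Assumed shapes for illustration; the real code uses untyped GitHub API
// payloads (hence the @ts-expect-error in the original).
type Step = { name: string; conclusion: string | null };
type Job = { steps?: Step[] };

// Returns true when none of the job's failed steps match the workflow's
// "real" (non-infra) step predicate, i.e. the failure looks like infra flake.
const doesLookLikeInfraFailure = (
  job: Job,
  isRealStep: (step: Step) => boolean
): boolean => {
  const failedRealSteps = job.steps?.filter(
    (step) =>
      step.conclusion !== null &&
      step.conclusion === "failure" &&
      isRealStep(step)
  );
  // Zero matching failures (or no step data at all): safe to retry.
  return (failedRealSteps?.length ?? 0) === 0;
};
```

A caller would pass the workflow-specific predicate, e.g. `step => step.name.toLowerCase().startsWith("run lint - ")` for lint.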
Q: If the step before a non-retriable one in the job has failed (with an infra failure), what would be the conclusion of the next step? If it will be a failure as well, then the logic would not work, would it? Can we add a unit test of sorts for this one (using a simulated failure in canary or in one's personal repo)?
Edit: Its status is "skipped", which is ignored.
Well, it would be good to have an example in the PR description where it works as expected; otherwise it's not really useful (i.e., if the API later changes, it would be good to have a reference to runs when it was working as expected).
And in general, testing code before landing is a good idea.
Give a unique prefix to all steps in lint.yml which catch valid linter errors. This will let retrybot identify lint.yml steps which should not be retried. This is a prelude to pytorch/test-infra#1275, which extends the retry-on-failure behavior to all PRs in addition to trunk. This hadn't been an issue previously since we would only retry linter failures on `master`, where linter failures were always safe to retry since legitimate linter failures there are virtually non-existent. Pull Request resolved: #90705. Approved by: https://github.com/huydhn, https://github.com/malfet
… to PRs (#1325) 2nd attempt of #1275. This time we limit retries to cancelled jobs and to workflows on the `master` branch. This approach will result in false negatives, where we won't retry some jobs on PRs which should be retried, but it's better than the current situation of never retrying on pulls, and we can iterate on it going forward while still catching many kinds of infra flakes.
This PR introduces two changes:
If a retryable step fails before a non-retryable step, the behavior is determined by whether the conditions defined on that non-retryable step allow it to run. By default, steps get skipped if a previous step is skipped (and their conclusion will be "skipped", which retrybot treats the same way as successful steps). This is the case for all the non-retryable steps that depend on previous infra steps passing. However, if a step is set to run on failure() or always(), then it will have its status populated appropriately and is taken into consideration by retrybot. The net effect: if some infra step fails and a non-retryable step conditioned to run on always() also fails, then we will trust that the non-retryable step was a legitimate failure and will not retry the workflow.
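The conclusion-handling rule described above can be sketched in a few lines. This is an illustrative model, not the actual retryBot source; `blocksRetry` and its parameters are hypothetical names:

```typescript
// A step only blocks a retry when it genuinely ran and failed. Steps with a
// "skipped" conclusion (the GitHub Actions default when an earlier step
// failed) are treated like successes, matching the behavior described above.
type Step = { name: string; conclusion: string | null };

function blocksRetry(
  step: Step,
  isNonRetryable: (s: Step) => boolean
): boolean {
  if (step.conclusion === null || step.conclusion === "skipped") {
    // Still running, or skipped because a prior (infra) step failed: ignore.
    return false;
  }
  // A non-retryable step that ran (e.g. via failure()/always()) and failed
  // is trusted as a legitimate failure, so the workflow is not retried.
  return step.conclusion === "failure" && isNonRetryable(step);
}
```

Under this model, an infra failure followed by skipped lint steps is retried, while a lint step forced to run via always() that then fails suppresses the retry.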