
[Pytorch AutoRevert] - Improves autorevert check heuristics #6853


Merged
jeanschmidt merged 4 commits into main from jeanschmidt/autorevert_improvements on Jun 30, 2025

Conversation

jeanschmidt
Contributor

@jeanschmidt jeanschmidt commented Jun 27, 2025

Makes some improvements to the back-analysis for the revert logic, with the goal of improving precision and recall and validating autorevert as a viable strategy.

Checked against the workflows: pull, trunk, inductor, linux-binary-manywheel

Old code:

Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that don't match any auto revert pattern detected: 91

Newer code:

Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that don't match any auto revert pattern detected: 67 of 115 (58.3%)
Per workflow precision:
  pull: 45 reverts out of 411 patterns (10.9%)
  trunk: 1 revert out of 8 patterns (12.5%)
  inductor: 2 reverts out of 20 patterns (10.0%)
  linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
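
For reference, a quick sketch (not from the PR) showing how the headline percentages follow from the raw counts above:

```python
# Reported counts for the newer code over the 720-hour window.
patterns_detected = 442
reverts_inside_patterns = 48
total_reverts = 115
unmatched_reverts = 67

precision = reverts_inside_patterns / patterns_detected  # 0.1086 -> 10.9%
unmatched_share = unmatched_reverts / total_reverts      # 0.5826 -> 58.3%
print(f"precision {precision:.1%}, unmatched reverts {unmatched_share:.1%}")
```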

Critical implemented changes:

  • Look both forward and backward for the first commit that actually ran the failed job, instead of assuming it is always the one immediately before or after.
  • Job names have parts we don't care about, such as shard indices. Since a failure can happen in any shard, we want to match the same failure across all shards (see the sketch below).
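
A minimal sketch of both heuristics, assuming hypothetical helpers (a commits list ordered by push time and a ran_job predicate); the split-on-( normalization matches what is discussed in the review below:

```python
def normalize_job_name(job_name: str) -> str:
    # Drop everything from the first '(' onward, removing shard indices
    # (and, as the review notes, the test config as well), e.g.
    # "linux-jammy / test (default, 2, 5)" -> "linux-jammy / test".
    return job_name.split("(")[0].strip()


def nearest_commit_that_ran(commits, start_idx, job_name, ran_job, step):
    # Walk forward (step=+1) or backward (step=-1) from start_idx until
    # we find a commit that actually ran the job, instead of trusting
    # that the immediate neighbor did.
    idx = start_idx + step
    while 0 <= idx < len(commits):
        if ran_job(commits[idx], normalize_job_name(job_name)):
            return commits[idx]
        idx += step
    return None
```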

Things I tried that did not lead to great results:

  • ignoring error classification: precision too low, with no significant increase in recall
  • not requiring error repetition: precision too low, with no significant increase in recall

My take:
With a precision of 10%, the cost of re-running jobs to confirm redness status is justified. Even though this is not possible to test directly, I suspect that requiring the same failure output twice for all 3 signals should raise precision to a very high standard. Unfortunately, the only way to verify is to run this in shadow mode.

With a recall of roughly 55%, this points to being able to capture most of the introduced trunk redness errors. Many reverts may not be caused by CI redness at all, especially not in the workflows we are analyzing (they could be due to performance degradation, ghfirst/internal reasons, and many others). This number seems comfortable enough to provide a substantial gain in CI quality.

@pytorch-bot pytorch-bot bot added the ci-no-td label Jun 27, 2025

vercel bot commented Jun 27, 2025

The latest updates on your projects:

1 Skipped Deployment
  torchci: ⬜️ Ignored, updated Jun 30, 2025 4:08pm (UTC)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 27, 2025
@jeanschmidt jeanschmidt changed the title Jeanschmidt/autorevert improvements [Pytorch AutoRevert] - Improves autorevert check heuristics Jun 27, 2025
Contributor

@clee2000 clee2000 left a comment


These comments aren't aimed at the content of this PR specifically, but I'm curious about things you've tried for autorevert:

  • it seems like you're separating by (, which cuts off shard information, but also the test config (like default, dynamo, etc); if you add that back in, does that improve anything?
  • I only see classification_rule used; have you considered using the line instead?

I'm also curious whether you know what the other ~50% not caught are. Could it be that most are ghfirst or lint?
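
For illustration, a hedged sketch of the reviewer's suggestion (hypothetical helper, not code from this PR): keep the test config while still dropping shard indices:

```python
import re

def normalize_keep_config(job_name: str) -> str:
    # Keep the first parenthesized argument (the test config, e.g.
    # "default" or "dynamo") but drop the shard indices that follow,
    # e.g. "pull / test (dynamo, 2, 5)" -> "pull / test (dynamo)".
    m = re.match(r"(?P<base>[^(]+)\((?P<args>[^)]*)\)", job_name)
    if not m:
        return job_name.strip()
    config = m.group("args").split(",")[0].strip()
    return f"{m.group('base').strip()} ({config})"
```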

@jeanschmidt
Contributor Author

@clee2000

it seems like you're separating by (, which cuts off shard information, but also the test config (like default, dynamo, etc); if you add that back in, does that improve anything?

I did not test with matching test configurations, but this might be something to look into. Still, right now I am more interested in improving recall than precision, given that fully retrying the same job for the 3 commits should (in my opinion) be a robust enough signal to clear up any false positives.

I only see classification_rule used; have you considered using the line instead?

Yes, and it is not great: precision doesn't increase substantially, and recall falls to very low numbers.
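
For context, a hypothetical illustration of the two grouping keys being compared here (field names assumed from the discussion, not taken from this PR's code):

```python
def failure_key(job: dict, use_line: bool = False) -> str:
    # Group failures either by the classifier's rule name (coarser; what
    # the PR uses) or by the exact captured log line (finer; per the
    # discussion, precision barely moves and recall drops sharply).
    return job["line"] if use_line else job["classification_rule"]
```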

@jeanschmidt jeanschmidt merged commit 5f86d76 into main Jun 30, 2025
6 checks passed
@jeanschmidt jeanschmidt deleted the jeanschmidt/autorevert_improvements branch June 30, 2025 18:33