Investigate/fix DrCI flaky build classification #5063
cc @ZainRizvi
---

To document what I have found: this is a clear example of the relationship between the accuracy of the Dr. CI classification and the log classifier. The build failure, as captured by the log classifier, was a generic error:
Given only this information, it was indistinguishable from an actual flaky build failure from https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783, which the bot retried successfully. The flaky failure captured by the log classifier was:
The two look exactly the same.

Some thoughts:
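To make the risk concrete, here is a minimal sketch (in TypeScript, with hypothetical names; not the actual Dr. CI matching code) of why a generic capture defeats a similarity check based on captured failure lines:

```ts
// If similarity between failures is judged by comparing the lines the log
// classifier captured, a generic capture matches unrelated failures.
function capturesLookTheSame(
  a: string[] | null,
  b: string[] | null
): boolean {
  if (a === null || b === null) {
    return false; // nothing captured, nothing to compare
  }
  return a.length === b.length && a.every((line, i) => line === b[i]);
}

// A real build failure caused by the PR and a genuinely flaky failure can
// both reduce to the same generic line, so they become indistinguishable:
capturesLookTheSame(
  ["Process completed with exit code 1."],
  ["Process completed with exit code 1."]
); // => true
```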
---
This is to address a common source of wrong classifications, as shown in #5063, where the merge base commit is too old. An old merge base increases the chance of marking actual failures as flaky because the search window could be big. We can relax this once we achieve a higher accuracy with the log classifier, as explained in #5063 (comment).

### Testing

pytorch/pytorch#123482 has a flaky failure and its merge base is a month old. After this change, the failure won't be marked as flaky anymore:

<!-- drci-comment-start -->

## 🔗 Helpful Links

### 🧪 See artifacts and rendered test results at [hud.pytorch.org/pr/123482](https://hud.pytorch.org/pr/123482)

* 📄 Preview [Python docs built from this PR](https://docs-preview.pytorch.org/pytorch/pytorch/123482/index.html)
* 📄 Preview [C++ docs built from this PR](https://docs-preview.pytorch.org/pytorch/pytorch/123482/cppdocs/index.html)
* ❓ Need help or want to give feedback on the CI? Visit the [bot commands wiki](https://github.com/pytorch/pytorch/wiki/Bot-commands) or our [office hours](https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours)

Note: Links to docs will display an error until the docs builds have been completed.

## ❌ 2 New Failures

As of commit 1ab823a683847d8c01b35bda74143876f922586b with merge base 86a2d67bb9db7dae8ff4589930dd505a6c5b4ec6 (<sub><sub><img alt="image" width=70 src="https://img.shields.io/date/1710217340?label=&color=FFFFFF&style=flat-square"></sub></sub>):

<details open><summary><b>NEW FAILURES</b> - The following jobs have failed:</summary><p>

* [pull / linux-focal-py3.11-clang10 / test (default, 1, 3, linux.2xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/123482#23507840916) ([gh](https://github.com/pytorch/pytorch/actions/runs/8576493039/job/23507840916)) `Process completed with exit code 1.`
* [pull / linux-jammy-py3.10-clang15-asan / test (default, 6, 6, linux.4xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/123482#23507963375) ([gh](https://github.com/pytorch/pytorch/actions/runs/8576493039/job/23507963375)) `Process completed with exit code 1.`

</p></details>

This comment was automatically generated by Dr. CI and updates every 15 minutes.

<!-- drci-comment-end -->
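For illustration, the merge-base age guard this PR describes could look roughly like this (a sketch with hypothetical names and an assumed 30-day threshold, not the actual Dr. CI code):

```ts
// Skip the flaky classification entirely when the merge base is too old,
// because the window of commits to search for similar failures gets large
// and the chance of a spurious match grows with it.
const MAX_MERGE_BASE_AGE_MS = 30 * 24 * 60 * 60 * 1000; // assumed threshold

function mergeBaseFreshEnoughForFlakyCheck(
  mergeBaseDate: Date,
  now: Date = new Date()
): boolean {
  return now.getTime() - mergeBaseDate.getTime() <= MAX_MERGE_BASE_AGE_MS;
}
```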
---

@malfet: we should never treat build failures as flaky; flaky build failures should never be allowed to merge without `-f`. For example, an infra failure means do not merge, regardless of whether the failure is due to the PR or not.

AI: Add builds to the list of jobs that are never marked as flaky.
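That AI could be a simple exclusion list consulted before any flaky classification runs; a minimal sketch, assuming a keyword match on the job name (hypothetical names):

```ts
// Jobs that must never be classified as flaky, per the action item above.
const NEVER_FLAKY_JOB_KEYWORDS = ["build"];

function mayBeMarkedFlaky(jobName: string): boolean {
  const name = jobName.toLowerCase();
  return !NEVER_FLAKY_JOB_KEYWORDS.some((kw) => name.includes(kw));
}
```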
---

The first and second AI make sense to me and could be implemented easily. The last one, however, needs more clarification, I think. Devs would likely force merge the PR when a failure is "not your fault but you can't ignore this". We have some mechanisms to help with this AFAICT:
---
The intention of the third point is that devs should explicitly force merge the PR if they don't want to figure out some way to rerun the job.
---

Do legitimate flaky failures tend to have generic error captures like that? Using a generic output like that to determine flakiness seems really risky, even if we implement the other mitigations mentioned here.
---

Yup, that one. Atm, there are two heuristics in place to reduce the chance of having false positives.
We could also mitigate this by making it easier to add a new heuristic rule for the log classifier. The PR route is a bit unyielding IMO, so we don't create new rules often, while I think we should.
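One way to lower that barrier is to make each rule a small declarative entry (pattern plus priority) rather than a code change. A sketch of what such an entry might look like, assuming the real rule format differs:

```ts
// Hypothetical rule shape: the classifier scans the log and keeps the capture
// from the highest-priority rule that matches.
interface ClassifierRule {
  name: string;
  pattern: RegExp;
  priority: number;
}

const rules: ClassifierRule[] = [
  {
    // Keep the catch-all generic capture at the lowest priority so a more
    // specific rule always wins when one matches.
    name: "Generic exit code",
    pattern: /Process completed with exit code \d+\./,
    priority: 1,
  },
];
```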
---

+1, same thing for the backward compat job!
---

I have a fix ready here: #5106. It's weird that I don't see @ZainRizvi's comment about lint on the issue, although I think it's there.
---

This is to address some new comments on #5063.

1. Exclude `backwards_compat` from the Dr. CI flaky check, as this job is deemed stable enough most of the time.
2. Lint jobs have already been excluded, but there is a bug where they would still be treated as flaky when there is no associated log on S3. This case surfaces in pytorch/pytorch#124321, where there is a mix of signals from `pytorch` and `pytorch-canary` for the same commit. For example, there is no https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/pytorch-canary/24002161738 for the lint job there:

```
{
  workflowId: 8745622961,
  workflowUniqueId: 13175283,
  id: 24002161738,
  runnerName: 'i-0cdd9180550988c58',
  authorEmail: 'ZainR@meta.com',
  name: 'Lint / workflow-checks / linux-job',
  jobName: 'workflow-checks / linux-job',
  conclusion: 'failure',
  completed_at: '2024-04-18T23:42:46Z',
  html_url: 'https://github.com/pytorch/pytorch-canary/actions/runs/8745622961/job/24002161738',
  head_branch: 'zainr/arn-fix',
  pr_number: 124321,
  head_sha: 'fa03725fa8512af211691a5164073ab7e2c4ee10',
  failure_captures: null,
  failure_lines: null,
  failure_context: null,
  time: '2024-04-18T23:42:51.131823Z'
}
```

The latter is probably a canary-testing one-off, so I didn't follow up further. But it's good to exclude lint properly anyway.
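A minimal sketch of the combined behavior described in points 1 and 2 (hypothetical names; not the actual #5106 diff):

```ts
interface JobSignal {
  jobName: string;
  failure_captures: string[] | null;
}

function shouldRunFlakyCheck(job: JobSignal): boolean {
  // Point 1: some jobs are excluded from the flaky check outright.
  const excluded = ["lint", "backwards_compat"];
  if (excluded.some((kw) => job.jobName.toLowerCase().includes(kw))) {
    return false;
  }
  // Point 2: with no log on S3 there are no failure captures to compare, so
  // the job must not fall through to the flaky path.
  return job.failure_captures !== null;
}
```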
---

Starting from pytorch/pytorch#122350, where two build failures clearly caused by the PR were marked as flaky.