Investigate/fix DrCI flaky build classification #5063

malfet opened this issue Apr 5, 2024 · 9 comments

malfet commented Apr 5, 2024

Starting with pytorch/pytorch#122350, where two build failures clearly caused by the PR were marked as flaky.

atalman commented Apr 9, 2024

cc @ZainRizvi

huydhn self-assigned this Apr 9, 2024

huydhn commented Apr 9, 2024

To document what I have found: this is a clear example of the relationship between the accuracy of Dr.CI classification and the log classifier.

The build failure as captured by the log classifier was a generic error:

```
{
  workflowId: 8601773675,
  workflowUniqueId: 16535519,
  id: 23578796771,
  runnerName: 'i-058fb9e0227cfbbbc',
  authorEmail: 'jcjessecai@gmail.com',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-04-08T19:09:52Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8601773675/job/23578796771',
  head_branch: 'ciflow/trunk/122350',
  pr_number: 122350,
  head_sha: '247c64f9c271d4abd5f1d7e30f5deb59cc0ea979',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  time: '2024-04-08T19:09:57.620867Z'
}
```

Given this information, this was exactly the same as an actual flaky build failure from https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783, which the bot retried successfully. The flaky failure captured by the log classifier was:

```
{
  workflowId: 8473868798,
  id: 23219108783,
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-03-28T21:38:47Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783',
  head_sha: '53c6a0301c4d68d5afcdccd746ce4d1f667f71e9',
  head_branch: 'ciflow/trunk/122331',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  authorEmail: ''
}
```

The two look exactly the same, including the failure_context guardrail. The merge base of pytorch/pytorch#122350 was also 11 days old, which increases the chance of getting false positives (the older the base commit, the higher the chance).
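
To illustrate why these are hard to tell apart, here is a minimal sketch of the kind of signature comparison involved, using the field names from the job records above (the matching logic is a simplification, not the actual Dr.CI code):

```ts
// Hypothetical simplification of why the two jobs above are indistinguishable:
// when the captured lines and surrounding context are identical, any matching
// based on the log classifier output will treat them as the same failure.
interface JobFailure {
  jobName: string;
  failure_captures: string[] | null;
  failure_lines: string[] | null;
  failure_context: string[] | null;
}

function sameSignature(a: JobFailure, b: JobFailure): boolean {
  const eq = (x: string[] | null, y: string[] | null) =>
    JSON.stringify(x ?? []) === JSON.stringify(y ?? []);
  return (
    a.jobName === b.jobName &&
    eq(a.failure_captures, b.failure_captures) &&
    eq(a.failure_lines, b.failure_lines) &&
    eq(a.failure_context, b.failure_context)
  );
}
```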

Some thoughts:

  1. As a mitigation, I could add a check to skip or tighten flaky classification when the base commit is older than a certain threshold, maybe 3 days (see the sketch after this list).
  2. In the long run, we plan to improve the log classifier's accuracy. These two failures were completely different, but the log classifier treated them the same.
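
A minimal sketch of that first mitigation, assuming the merge base commit date is already known (the 3-day threshold and the function name are placeholders, not the actual Dr.CI implementation):

```ts
// Hypothetical guard: skip flaky classification entirely when the merge base is
// too old, because a wider search window makes false positives more likely.
const MAX_MERGE_BASE_AGE_DAYS = 3; // assumed threshold from the comment above

function isEligibleForFlakyCheck(mergeBaseDate: Date, now: Date = new Date()): boolean {
  const ageInDays = (now.getTime() - mergeBaseDate.getTime()) / (24 * 60 * 60 * 1000);
  return ageInDays <= MAX_MERGE_BASE_AGE_DAYS;
}
```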

huydhn added a commit that referenced this issue Apr 9, 2024
This is to address a common source of wrong classifications, as shown in
#5063, where the merge base commit is too old. An old merge base increases
the chance of marking actual failures as flaky because the search window
can be large. We can relax this once we achieve higher accuracy with the
log classifier, as explained in #5063 (comment)

### Testing

pytorch/pytorch#123482 has a flaky failure and its merge base is a month
old. After this change, that failure won't be marked as flaky anymore:

<!-- drci-comment-start -->

## 🔗 Helpful Links
### 🧪 See artifacts and rendered test results at
[hud.pytorch.org/pr/123482](https://hud.pytorch.org/pr/123482)
* 📄 Preview [Python docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/123482/index.html)
* 📄 Preview [C++ docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/123482/cppdocs/index.html)
* ❓ Need help or want to give feedback on the CI? Visit the
[bot commands
wiki](https://github.com/pytorch/pytorch/wiki/Bot-commands) or our
[office
hours](https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours)

Note: Links to docs will display an error until the docs builds have
been completed.


## ❌ 2 New Failures
As of commit 1ab823a683847d8c01b35bda74143876f922586b with merge base
86a2d67bb9db7dae8ff4589930dd505a6c5b4ec6 (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1710217340?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details open><summary><b>NEW FAILURES</b> - The following jobs have
failed:</summary><p>

* [pull / linux-focal-py3.11-clang10 / test (default, 1, 3,
linux.2xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/123482#23507840916)
([gh](https://github.com/pytorch/pytorch/actions/runs/8576493039/job/23507840916))
    `Process completed with exit code 1.`
* [pull / linux-jammy-py3.10-clang15-asan / test (default, 6, 6,
linux.4xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/123482#23507963375)
([gh](https://github.com/pytorch/pytorch/actions/runs/8576493039/job/23507963375))
    `Process completed with exit code 1.`
</p></details>


This comment was automatically generated by Dr. CI and updates every 15
minutes.
<!-- drci-comment-end -->
@clee2000

@malfet: we should never treat build failures as flaky, and flaky build failures should never be allowed to merge without -f. For example, an infra failure should mean "do not merge", regardless of whether the failure is due to the PR or not.

AI: Add builds to the list of jobs that are never marked as flaky
AI: Do not mark a failure as flaky when the log classifier result is really generic (a sketch of these first two items follows below)
AI: Add a category for "not your fault, but you can't ignore this"
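
A minimal sketch of the first two action items, assuming a job-name exclusion list and a list of capture lines considered too generic to trust (both lists are illustrative, not the actual Dr.CI configuration):

```ts
// Hypothetical rules: never mark build jobs as flaky, and never mark a failure as
// flaky when the only thing the log classifier captured is a generic line.
const NEVER_FLAKY_JOB_PATTERNS = [/ \/ build$/]; // e.g. "linux-focal-rocm6.0-py3.8 / build"
const GENERIC_CAPTURES = new Set(['Process completed with exit code 1.']);

function canBeMarkedFlaky(jobName: string, failureCaptures: string[] | null): boolean {
  if (NEVER_FLAKY_JOB_PATTERNS.some((pattern) => pattern.test(jobName))) {
    return false;
  }
  const captures = failureCaptures ?? [];
  // If nothing was captured, or every captured line is generic, the signal is too
  // weak to conclude that the failure is flaky.
  if (captures.length === 0 || captures.every((c) => GENERIC_CAPTURES.has(c))) {
    return false;
  }
  return true;
}
```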


huydhn commented Apr 17, 2024

The first and second AI make sense to me and could be implemented easily. The last one, however, needs more clarification, I think. Devs would likely force merge the PR when a failure is "not your fault but you can't ignore this". We have some mechanisms to help with this:

  1. To retry infra failures on PR
  2. To limit the number of failures devs can bypass on their PR to the current value of 10 (configurable)

@clee2000

The intention of the third point is that devs should explicitly force merge the PR if they don't want to figure out some way to rerun the job.

@ZainRizvi

Do legitimate flaky failures tend to have generic error captures like `##[error]Process completed with exit code 1.`?

Using a generic output like that to determine flakiness seems really risky, even if we implement the other mitigations mentioned here.


huydhn commented Apr 17, 2024

> Do legitimate flaky failures tend to have generic error captures like `##[error]Process completed with exit code 1.`?

Yup, that one and `Command docker exec -t ...` from Nova, e.g. https://hud.pytorch.org/pytorch/executorch/commit/b3ac5332e72a909ca900d6dc6feb21bedce753e4

At the moment, there are two heuristics in place to reduce the chance of false positives.

We could also mitigate this by making it easier to add a new heuristic rule to the log classifier. The PR route is a bit unwieldy IMO, so we don't create new rules often, even though I think we should.
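
To make that concrete, here is a hypothetical example of a more specific rule that would separate the ROCm build failure above from a plain exit-code-1 capture (the rule shape and the regex are illustrative; this is not the actual log classifier rule format):

```ts
// Hypothetical rule shape: a named regex that prefers a concrete compiler or build
// error line over the generic shell epilogue, so the two builds above would no
// longer share the same "Process completed with exit code 1." signature.
interface ClassifierRule {
  name: string;
  pattern: RegExp;
}

const compilerErrorRule: ClassifierRule = {
  name: 'Compiler error during build',
  pattern: /\.(cpp|cc|cu|h|hpp):\d+:\d+: (fatal )?error: .*/,
};

function classify(logLines: string[], rules: ClassifierRule[]): string | null {
  for (const rule of rules) {
    const match = logLines.find((line) => rule.pattern.test(line));
    if (match !== undefined) {
      return `${rule.name}: ${match}`;
    }
  }
  return null; // fall back to the generic capture
}
```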


albanD commented Apr 19, 2024

+1 same thing for backward compat job!


huydhn commented Apr 19, 2024

> +1 same thing for backward compat job!

I have a fix ready in #5106. It's weird that I don't see @ZainRizvi's comment about lint on this issue, although I think it's there.

huydhn added a commit that referenced this issue Apr 22, 2024
This is to address some new comments on
#5063.

1. Exclude `backwards_compat` from the Dr.CI flaky check, as this job is
deemed stable enough most of the time.
2. Lint jobs have already been excluded, but there is a bug where they
would still be treated as flaky when there is no associated log on S3
(sketched below). This case surfaced in pytorch/pytorch#124321,
where there is a mix of signals from `pytorch` and `pytorch-canary` for
the same commit. For example, there is no
https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/pytorch-canary/24002161738
for the lint job there.

```
{
  workflowId: 8745622961,
  workflowUniqueId: 13175283,
  id: 24002161738,
  runnerName: 'i-0cdd9180550988c58',
  authorEmail: 'ZainR@meta.com',
  name: 'Lint / workflow-checks / linux-job',
  jobName: 'workflow-checks / linux-job',
  conclusion: 'failure',
  completed_at: '2024-04-18T23:42:46Z',
  html_url: 'https://github.com/pytorch/pytorch-canary/actions/runs/8745622961/job/24002161738',
  head_branch: 'zainr/arn-fix',
  pr_number: 124321,
  head_sha: 'fa03725fa8512af211691a5164073ab7e2c4ee10',
  failure_captures: null,
  failure_lines: null,
  failure_context: null,
  time: '2024-04-18T23:42:51.131823Z'
}
```

The latter is probably a canary testing one-off thing, so I didn't
follow up further. But it's good to exclude lint properly anyway.
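
For reference, a rough sketch of how the exclusion plus the missing-log guard could look, using the job `name` field from the record above (the list of excluded names and the function itself are simplifications, not the actual Dr.CI code):

```ts
// Hypothetical guard combining the two fixes: exclude certain jobs by name, and never
// treat a job as flaky when there is no log on S3 (and hence no failure signature).
const EXCLUDED_FROM_FLAKY_CHECK = ['backwards_compat', 'Lint /'];

function isFlakyCandidate(job: { name: string; failure_captures: string[] | null }): boolean {
  if (EXCLUDED_FROM_FLAKY_CHECK.some((excluded) => job.name.includes(excluded))) {
    return false;
  }
  // A missing log leaves failure_captures as null; with no signature there is
  // nothing to match against known flaky failures, so don't guess.
  if (job.failure_captures === null || job.failure_captures.length === 0) {
    return false;
  }
  return true;
}
```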