Investigate/fix DrCI flaky build classification #5063

malfet opened this issue Apr 5, 2024 · 9 comments

malfet commented Apr 5, 2024

Starting with pytorch/pytorch#122350, where two build failures clearly caused by the PR were marked as flaky.

atalman commented Apr 9, 2024

cc @ZainRizvi

huydhn self-assigned this Apr 9, 2024

huydhn commented Apr 9, 2024

To document what I have found: this is a clear example of the relationship between the accuracy of Dr.CI classification and the log classifier.

The build failure as captured by the log classifier was a generic error:

```
{
  workflowId: 8601773675,
  workflowUniqueId: 16535519,
  id: 23578796771,
  runnerName: 'i-058fb9e0227cfbbbc',
  authorEmail: 'jcjessecai@gmail.com',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-04-08T19:09:52Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8601773675/job/23578796771',
  head_branch: 'ciflow/trunk/122350',
  pr_number: 122350,
  head_sha: '247c64f9c271d4abd5f1d7e30f5deb59cc0ea979',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  time: '2024-04-08T19:09:57.620867Z'
}
```

Given this information, this was exactly the same as an actual flaky build failure from https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783, which the bot retried successfully. The flaky failure captured by the log classifier was:

```
{
  workflowId: 8473868798,
  id: 23219108783,
  jobName: 'linux-focal-rocm6.0-py3.8 / build',
  name: 'trunk / linux-focal-rocm6.0-py3.8 / build',
  conclusion: 'failure',
  completed_at: '2024-03-28T21:38:47Z',
  html_url: 'https://github.com/pytorch/pytorch/actions/runs/8473868798/job/23219108783',
  head_sha: '53c6a0301c4d68d5afcdccd746ce4d1f667f71e9',
  head_branch: 'ciflow/trunk/122331',
  failure_captures: [ 'Process completed with exit code 1.' ],
  failure_lines: [ '##[error]Process completed with exit code 1.' ],
  failure_context: [
    '+ echo ::endgroup::',
    '+ sccache --stop-server',
    '+ sccache --show-stats',
    "+ echo '::group::Sccache Compilation Log'",
    '+ sccache_epilogue',
    '+ python setup.py bdist_wheel',
    '+ [[ linux-focal-rocm6.0-py3.8 != *rocm* ]]',
    '+ [[ linux-focal-rocm6.0-py3.8 != *libtorch* ]]',
    '+ return 1',
    '+ set -e',
    '+ retcode=1',
    '+ python setup.py clean bad_argument'
  ],
  authorEmail: ''
}
```

The two look exactly the same, including the failure_context guardrail. The merge base of pytorch/pytorch#122350 was also 11 days old, which increases the chance of getting false positives (the older the base commit, the higher the chance).
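
To illustrate why these are hard to tell apart, here is a minimal sketch of the kind of signature comparison involved, using the field names from the job records above (the matching logic is a simplification, not the actual Dr.CI code):

```ts
// Hypothetical simplification of why the two jobs above are indistinguishable:
// when the captured lines and surrounding context are identical, any matching
// based on the log classifier output will treat them as the same failure.
interface JobFailure {
  jobName: string;
  failure_captures: string[] | null;
  failure_lines: string[] | null;
  failure_context: string[] | null;
}

function sameSignature(a: JobFailure, b: JobFailure): boolean {
  const eq = (x: string[] | null, y: string[] | null) =>
    JSON.stringify(x ?? []) === JSON.stringify(y ?? []);
  return (
    a.jobName === b.jobName &&
    eq(a.failure_captures, b.failure_captures) &&
    eq(a.failure_lines, b.failure_lines) &&
    eq(a.failure_context, b.failure_context)
  );
}
```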

Some thoughts:

  1. As a mitigation, I could add a check to skip or tighten flaky classification when the base commit is older than a certain threshold, maybe 3 days (see the sketch after this list).
  2. In the long run, we plan to improve the log classifier's accuracy. These two failures were completely different, but the log classifier treated them the same.
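
A minimal sketch of that first mitigation, assuming the merge base commit date is already known (the 3-day threshold and the function name are placeholders, not the actual Dr.CI implementation):

```ts
// Hypothetical guard: skip flaky classification entirely when the merge base is
// too old, because a wider search window makes false positives more likely.
const MAX_MERGE_BASE_AGE_DAYS = 3; // assumed threshold from the comment above

function isEligibleForFlakyCheck(mergeBaseDate: Date, now: Date = new Date()): boolean {
  const ageInDays = (now.getTime() - mergeBaseDate.getTime()) / (24 * 60 * 60 * 1000);
  return ageInDays <= MAX_MERGE_BASE_AGE_DAYS;
}
```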

huydhn added a commit that referenced this issue Apr 9, 2024
This is to address a common source of wrong classifications, as shown in
#5063, where the merge base commit is too old. An old merge base increases
the chance of marking actual failures as flaky because the search window
can be large. We can relax this once we achieve higher accuracy with the
log classifier, as explained in #5063 (comment)

### Testing

pytorch/pytorch#123482 has a flaky failure and its merge base is a month
old. After this change, that failure won't be marked as flaky anymore:

<!-- drci-comment-start -->

## 🔗 Helpful Links
### 🧪 See artifacts and rendered test results at
[hud.pytorch.org/pr/123482](https://hud.pytorch.org/pr/123482)
* 📄 Preview [Python docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/123482/index.html)
* 📄 Preview [C++ docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/123482/cppdocs/index.html)
* ❓ Need help or want to give feedback on the CI? Visit the
[bot commands
wiki](https://github.com/pytorch/pytorch/wiki/Bot-commands) or our
[office
hours](https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours)

Note: Links to docs will display an error until the docs builds have
been completed.


## ❌ 2 New Failures
As of commit 1ab823a683847d8c01b35bda74143876f922586b with merge base
86a2d67bb9db7dae8ff4589930dd505a6c5b4ec6 (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1710217340?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details open><summary><b>NEW FAILURES</b> - The following jobs have
failed:</summary><p>

* [pull / linux-focal-py3.11-clang10 / test (default, 1, 3,
linux.2xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/123482#23507840916)
([gh](https://github.com/pytorch/pytorch/actions/runs/8576493039/job/23507840916))
    `Process completed with exit code 1.`
* [pull / linux-jammy-py3.10-clang15-asan / test (default, 6, 6,
linux.4xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/123482#23507963375)
([gh](https://github.com/pytorch/pytorch/actions/runs/8576493039/job/23507963375))
    `Process completed with exit code 1.`
</p></details>


This comment was automatically generated by Dr. CI and updates every 15
minutes.
<!-- drci-comment-end -->
@clee2000

@malfet: we should never treat build failures as flaky, and flaky build failures should never be allowed to merge without -f. For example, an infra failure should mean "do not merge", regardless of whether the failure is due to the PR or not.

AI: Add builds to the list of jobs that are never marked as flaky
AI: Do not mark a failure as flaky when the log classifier result is really generic (a sketch of these first two items follows below)
AI: Add a category for "not your fault, but you can't ignore this"
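
A minimal sketch of the first two action items, assuming a job-name exclusion list and a list of capture lines considered too generic to trust (both lists are illustrative, not the actual Dr.CI configuration):

```ts
// Hypothetical rules: never mark build jobs as flaky, and never mark a failure as
// flaky when the only thing the log classifier captured is a generic line.
const NEVER_FLAKY_JOB_PATTERNS = [/ \/ build$/]; // e.g. "linux-focal-rocm6.0-py3.8 / build"
const GENERIC_CAPTURES = new Set(['Process completed with exit code 1.']);

function canBeMarkedFlaky(jobName: string, failureCaptures: string[] | null): boolean {
  if (NEVER_FLAKY_JOB_PATTERNS.some((pattern) => pattern.test(jobName))) {
    return false;
  }
  const captures = failureCaptures ?? [];
  // If nothing was captured, or every captured line is generic, the signal is too
  // weak to conclude that the failure is flaky.
  if (captures.length === 0 || captures.every((c) => GENERIC_CAPTURES.has(c))) {
    return false;
  }
  return true;
}
```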


huydhn commented Apr 17, 2024

The first and second AI make sense to me and could be implemented easily. The last one, however, needs more clarification, I think. Devs would likely force merge the PR when a failure is "not your fault but you can't ignore this". We have some mechanisms to help with this:

  1. To retry infra failures on PR
  2. To limit the number of failures devs can bypass on their PR to the current value of 10 (configurable)

@clee2000

The intention of the third point is that devs should explicitly force merge the PR if they don't want to figure out some way to rerun the job.

@ZainRizvi

Do legitimate flaky failures tend to have generic error captures like `##[error]Process completed with exit code 1.`?

Using a generic output like that to determine flakiness seems really risky, even if we implement the other mitigations mentioned here.


huydhn commented Apr 17, 2024

> Do legitimate flaky failures tend to have generic error captures like `##[error]Process completed with exit code 1.`?

Yup, that one and `Command docker exec -t ...` from Nova, e.g. https://hud.pytorch.org/pytorch/executorch/commit/b3ac5332e72a909ca900d6dc6feb21bedce753e4

At the moment, there are two heuristics in place to reduce the chance of false positives.

We could also mitigate this by making it easier to add a new heuristic rule to the log classifier. The PR route is a bit unwieldy IMO, so we don't create new rules often, even though I think we should.
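
To make that concrete, here is a hypothetical example of a more specific rule that would separate the ROCm build failure above from a plain exit-code-1 capture (the rule shape and the regex are illustrative; this is not the actual log classifier rule format):

```ts
// Hypothetical rule shape: a named regex that prefers a concrete compiler or build
// error line over the generic shell epilogue, so the two builds above would no
// longer share the same "Process completed with exit code 1." signature.
interface ClassifierRule {
  name: string;
  pattern: RegExp;
}

const compilerErrorRule: ClassifierRule = {
  name: 'Compiler error during build',
  pattern: /\.(cpp|cc|cu|h|hpp):\d+:\d+: (fatal )?error: .*/,
};

function classify(logLines: string[], rules: ClassifierRule[]): string | null {
  for (const rule of rules) {
    const match = logLines.find((line) => rule.pattern.test(line));
    if (match !== undefined) {
      return `${rule.name}: ${match}`;
    }
  }
  return null; // fall back to the generic capture
}
```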


albanD commented Apr 19, 2024

+1 same thing for backward compat job!


huydhn commented Apr 19, 2024

> +1 same thing for backward compat job!

I have a fix ready in #5106. It's weird that I don't see @ZainRizvi's comment about lint on this issue, although I think it's there.

huydhn added a commit that referenced this issue Apr 22, 2024
This is to address some new comments on
#5063.

1. Exclude `backwards_compat` from the Dr.CI flaky check, as this job is
deemed stable enough most of the time.
2. Lint jobs have already been excluded, but there is a bug where they
would still be treated as flaky when there is no associated log on S3
(sketched below). This case surfaced in pytorch/pytorch#124321,
where there is a mix of signals from `pytorch` and `pytorch-canary` for
the same commit. For example, there is no
https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/pytorch-canary/24002161738
for the lint job there.

```
{
  workflowId: 8745622961,
  workflowUniqueId: 13175283,
  id: 24002161738,
  runnerName: 'i-0cdd9180550988c58',
  authorEmail: 'ZainR@meta.com',
  name: 'Lint / workflow-checks / linux-job',
  jobName: 'workflow-checks / linux-job',
  conclusion: 'failure',
  completed_at: '2024-04-18T23:42:46Z',
  html_url: 'https://github.com/pytorch/pytorch-canary/actions/runs/8745622961/job/24002161738',
  head_branch: 'zainr/arn-fix',
  pr_number: 124321,
  head_sha: 'fa03725fa8512af211691a5164073ab7e2c4ee10',
  failure_captures: null,
  failure_lines: null,
  failure_context: null,
  time: '2024-04-18T23:42:51.131823Z'
}
```

The latter is probably a canary testing one-off thing, so I didn't
follow up further. But it's good to exclude lint properly anyway.
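
For reference, a rough sketch of how the exclusion plus the missing-log guard could look, using the job `name` field from the record above (the list of excluded names and the function itself are simplifications, not the actual Dr.CI code):

```ts
// Hypothetical guard combining the two fixes: exclude certain jobs by name, and never
// treat a job as flaky when there is no log on S3 (and hence no failure signature).
const EXCLUDED_FROM_FLAKY_CHECK = ['backwards_compat', 'Lint /'];

function isFlakyCandidate(job: { name: string; failure_captures: string[] | null }): boolean {
  if (EXCLUDED_FROM_FLAKY_CHECK.some((excluded) => job.name.includes(excluded))) {
    return false;
  }
  // A missing log leaves failure_captures as null; with no signature there is
  // nothing to match against known flaky failures, so don't guess.
  if (job.failure_captures === null || job.failure_captures.length === 0) {
    return false;
  }
  return true;
}
```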