More cases of Dr.CI wrong flaky detection #4741

Closed
huydhn opened this issue Nov 16, 2023 · 0 comments · Fixed by #4776
huydhn commented Nov 16, 2023

This issue tracks some recent cases of wrong flaky detection, most likely caused by wrong or overly generic failures extracted by log classification. They are:

@huydhn huydhn added the bug Something isn't working label Nov 16, 2023
@huydhn huydhn self-assigned this Nov 16, 2023
huydhn added a commit that referenced this issue Dec 5, 2023
Fixes #4741

This is to strengthen Dr.CI flaky classification in the case of the
generic GHA `Process completed with exit code 1` failure by comparing
the failure context of the last command executed in addition to the
failure itself. The error itself doesn't mean anything in this case.

The failure context has been gathered for a while and stored in Rockset
under `job.torchci_classification.context`. Now is the time to start
using it. The context is a list of the last N commands executed, traced
backward from where the failure occurs, for example:
```
[
  "+ python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 1 5 --verbose",
  "+ [[ -z 5 ]]",
  "+ test_python_shard 1",
  "+ '[' -n '' ']'",
  "+ pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/vision.git@893b4abdc0c9df36c241c58769810f69e35dab48",
  "+ pip_install --no-use-pep517 --user git+https://github.com/pytorch/vision.git@893b4abdc0c9df36c241c58769810f69e35dab48",
  "+ '[' -n '' ']'",
  "+ orig_preload=",
  "+ commit=893b4abdc0c9df36c241c58769810f69e35dab48",
  "++ cat .github/ci_commit_pins/vision.txt",
  "++ get_pinned_commit vision",
  "+ local commit",
]
```

This change extracts and compares the last command, i.e. `+ python
test/run_test.py --exclude-jit-executor --exclude-distributed-tests
--shard 1 5 --verbose`, in addition to job name and the failure string.
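
As a rough illustration of the comparison described above, here is a minimal Python sketch. It is not the actual Dr.CI code: the `JobFailure` record, its field names, and the helper functions are hypothetical, and it assumes the context list is ordered most recent command first, as in the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The generic GHA failure string named in the commit message.
GENERIC_FAILURE = "Process completed with exit code 1"


@dataclass
class JobFailure:
    # Hypothetical record; field names are illustrative, not Dr.CI's schema.
    name: str
    failure_line: str
    # Last N commands, most recent first (assumed ordering).
    context: List[str] = field(default_factory=list)


def last_command(job: JobFailure) -> Optional[str]:
    """Return the most recent command in the failure context, if any."""
    return job.context[0] if job.context else None


def is_same_failure(a: JobFailure, b: JobFailure) -> bool:
    """Treat two failures as the same only if the job name, the failure
    string, and the last executed command all match."""
    if a.name != b.name or a.failure_line != b.failure_line:
        return False
    return last_command(a) == last_command(b)
```

With this check, two jobs that both end in the generic exit-code failure are no longer matched when one failed while running `run_test.py` and the other while running `pip install`. In this sketch, two failures with empty contexts still compare equal; whether that is the right fallback is a design choice left open here.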

### Testing

Try this out on pytorch/pytorch#112504, a PR with failures:

```
curl --request POST \
--url "http://localhost:3000/api/drci/drci?prNumber=112504" \
--header "Authorization: TOKEN" \
--data 'repo=pytorch'
```