More cases of Dr.CI wrong flaky detection #4741

huydhn · 2023-11-16T18:30:50Z

This issue keeps track of some recent cases of wrong flaky detection. Most likely they are linked to wrong or too generic failures extracted by log classification. They are:

Fixes #4741

This is to strengthen Dr.CI flaky classification in the case of the generic GHA `Process completed with exit code 1` failure by comparing the failure context of the last command executed in addition to the failure itself. The error itself doesn't mean anything in this case.

The failure context has been gathered for a while and stored in Rockset under `job.torchci_classification.context`. Now it's time to start utilizing it. The context is a list of the last N commands executed, traced backward from where the failure occurs, for example:

```
[
  "+ python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 1 5 --verbose",
  "+ [[ -z 5 ]]",
  "+ test_python_shard 1",
  "+ '[' -n '' ']'",
  "+ pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/vision.git@893b4abdc0c9df36c241c58769810f69e35dab48",
  "+ pip_install --no-use-pep517 --user git+https://github.com/pytorch/vision.git@893b4abdc0c9df36c241c58769810f69e35dab48",
  "+ '[' -n '' ']'",
  "+ orig_preload=",
  "+ commit=893b4abdc0c9df36c241c58769810f69e35dab48",
  "++ cat .github/ci_commit_pins/vision.txt",
  "++ get_pinned_commit vision",
  "+ local commit"
]
```

This change extracts and compares the last command, i.e. `+ python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 1 5 --verbose`, in addition to the job name and the failure string.

### Testing

Try this out on pytorch/pytorch#112504, a PR with failures:

```
curl --request POST \
  --url "http://localhost:3000/api/drci/drci?prNumber=112504" \
  --header "Authorization: TOKEN" \
  --data 'repo=pytorch'
```
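To make the comparison concrete, here is a minimal sketch of the idea in Python. It is not the actual Dr.CI code. The `Failure` record, the `last_command` and `is_same_failure` helpers, and the job names are all hypothetical. Treating a missing context as "not the same failure" is also an assumption, chosen here to stay conservative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The generic GHA error that carries no useful information on its own.
GENERIC_FAILURE = "Process completed with exit code 1"


@dataclass
class Failure:
    # Hypothetical record; field names are illustrative, not the Rockset schema.
    job_name: str
    failure_line: str
    context: List[str] = field(default_factory=list)


def last_command(context: List[str]) -> Optional[str]:
    # The context is traced backward from the failure, so the first
    # entry is the last command that was executed.
    return context[0].strip() if context else None


def is_same_failure(a: Failure, b: Failure) -> bool:
    # Job name and failure string must always match.
    if a.job_name != b.job_name or a.failure_line != b.failure_line:
        return False
    # For the generic failure, also require the same last command.
    if GENERIC_FAILURE in a.failure_line:
        cmd_a, cmd_b = last_command(a.context), last_command(b.context)
        # Assumption: without context on either side, do not claim a match.
        if cmd_a is None or cmd_b is None:
            return False
        return cmd_a == cmd_b
    return True


# Hypothetical example: same job and same generic error, different last commands.
test_failed = Failure(
    "linux-test (default, 1, 5)",
    GENERIC_FAILURE,
    ["+ python test/run_test.py --shard 1 5 --verbose", "+ [[ -z 5 ]]"],
)
pip_failed = Failure(
    "linux-test (default, 1, 5)",
    GENERIC_FAILURE,
    ["+ pip install --progress-bar off --user some-package", "+ '[' -n '' ']'"],
)
same_job_rerun = Failure(
    "linux-test (default, 1, 5)",
    GENERIC_FAILURE,
    ["+ python test/run_test.py --shard 1 5 --verbose"],
)
```

With these records, `is_same_failure(test_failed, same_job_rerun)` holds because the last command matches, while `is_same_failure(test_failed, pip_failed)` does not, even though the job name and the error string are identical.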

huydhn added the bug Something isn't working label Nov 16, 2023

huydhn self-assigned this Nov 16, 2023

huydhn mentioned this issue Dec 2, 2023

Compare failure context in addition to the failure itself #4776

Merged

huydhn closed this as completed in #4776 Dec 5, 2023
