Apply the same fix to cleanup process on Windows CPU build job #101460
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101460
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 6ccf7c4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge -f 'Windows CI build job has passed. Merge to fix trunk'

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This goes together with pytorch/test-infra#4169. To be replaced by the main branch once pytorch/test-infra#4169 merges.

Pull Request resolved: #101460
Approved by: https://github.com/clee2000, https://github.com/PaliC
Windows flakiness strikes again. A new flaky issue has started appearing on HUD in which tearing down the Windows workspace fails with a `Device or resource busy` error when trying to `rm -rf ./*` the workspace, for example https://github.com/pytorch/pytorch/actions/runs/5051845102/jobs/9064107717. It happens on both build and test jobs. I have looked into all commits since last weekend, but nothing stands out as Windows-related.

The error means that a process still holds the directory, but it's unclear which one, as all CI processes should have been stopped by then (#101460), with the only exception of the runner daemon itself. On the other hand, the issue is flaky: the next job running on the same failed runner can clean up the workspace fine when checking out PyTorch (https://github.com/pytorch/pytorch/blob/main/.github/actions/checkout-pytorch/action.yml#L21-L35). For example, `i-0ec1767a38ec93b4e` failed at https://github.com/pytorch/pytorch/actions/runs/5051845102/jobs/9064107717 and its immediate next job succeeded at https://github.com/pytorch/pytorch/actions/runs/5052147504/jobs/9064717085. So, I think that adding retries should help mitigate this.

Related to pytorch/test-infra#4206 (not the same root cause; I figured out pytorch/test-infra#4206 while working on this PR).

Pull Request resolved: #102051
Approved by: https://github.com/kit1980
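A minimal sketch of the retry idea described above, assuming a plain bash loop; the actual fix lives in the reusable CI setup/teardown steps and the pytorch/test-infra retry action, so the function name, attempt count, and delay here are illustrative assumptions only:

```bash
#!/usr/bin/env bash
# Illustrative retry loop for cleaning up a CI workspace on Windows runners.
# The real cleanup lives in the CI teardown scripts; names and limits here are assumptions.
set -uo pipefail

retry_cleanup() {
  local attempts=5
  local delay=30
  for i in $(seq 1 "${attempts}"); do
    # rm -rf can fail with "Device or resource busy" if a process still holds the directory.
    if rm -rf ./* 2>/dev/null; then
      echo "Workspace cleaned up on attempt ${i}"
      return 0
    fi
    echo "Cleanup attempt ${i} failed, retrying in ${delay}s..."
    sleep "${delay}"
  done
  echo "Failed to clean up workspace after ${attempts} attempts" >&2
  return 1
}

retry_cleanup
```

Since the failure is transient (the very next job on the same runner can usually delete the directory), a short backoff between attempts is typically enough to let whatever process holds the handle exit before the next `rm -rf`.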