Conversation

@huydhn (Contributor) commented May 15, 2023

This goes together with pytorch/test-infra#4169. To be replaced by the main branch once pytorch/test-infra#4169 merges.

@huydhn added the ciflow/trunk (Trigger trunk jobs on your pull request) and test-config/default labels May 15, 2023
@huydhn requested a review from clee2000 May 15, 2023 22:13
@huydhn requested a review from a team as a code owner May 15, 2023 22:13
@pytorch-bot bot commented May 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101460

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6ccf7c4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the topic: not user facing (topic category) label May 15, 2023
@huydhn (Contributor, Author) commented May 15, 2023

@pytorchbot merge -f 'Windows CI build job has passed. Merge to fix trunk'

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

jcaip pushed a commit that referenced this pull request May 23, 2023
pytorchmergebot pushed a commit that referenced this pull request May 23, 2023:
Windows flakiness strikes again. A new flaky issue has started appearing on HUD in which tearing down the Windows workspace fails with a `Device or resource busy` error when trying to `rm -rf ./*` the workspace, for example https://github.com/pytorch/pytorch/actions/runs/5051845102/jobs/9064107717. It happens on both build and test jobs. I have looked into all commits since last weekend, but nothing stands out as Windows-related.

The error means that a process still holds the directory, but it's unclear which one, as all CI processes should have been stopped by then (#101460), with the only exception being the runner daemon itself. On the other hand, the issue is flaky: the next job running on the same failed runner can clean up the workspace fine when checking out PyTorch (https://github.com/pytorch/pytorch/blob/main/.github/actions/checkout-pytorch/action.yml#L21-L35).

For example, runner `i-0ec1767a38ec93b4e` failed at https://github.com/pytorch/pytorch/actions/runs/5051845102/jobs/9064107717, and its immediate next job succeeded at https://github.com/pytorch/pytorch/actions/runs/5052147504/jobs/9064717085. So I think that adding retries should help mitigate this.

Related to pytorch/test-infra#4206 (not the same root cause; I figured out pytorch/test-infra#4206 while working on this PR).

Pull Request resolved: #102051
Approved by: https://github.com/kit1980
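
For reference, here is a minimal sketch of the retry approach described in the commit message above, written as a standalone bash cleanup step. The attempt count, sleep interval, and script structure are illustrative assumptions, not the exact change made in #102051:

```bash
#!/usr/bin/env bash
# Retry the workspace cleanup a few times, sleeping between attempts,
# to ride out transient "Device or resource busy" errors on Windows runners.
# Hypothetical values below; the actual PR may use different ones.
MAX_ATTEMPTS=3
SLEEP_SECONDS=30

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
  # Assumes a non-empty workspace; an empty glob would make rm fail spuriously.
  if rm -rf ./*; then
    echo "Workspace cleaned up on attempt ${attempt}"
    exit 0
  fi
  echo "Cleanup attempt ${attempt} failed, retrying in ${SLEEP_SECONDS}s..."
  sleep "$SLEEP_SECONDS"
done

echo "Failed to clean up the workspace after ${MAX_ATTEMPTS} attempts" >&2
exit 1
```

Retrying with a short sleep gives whatever straggler process is holding the directory time to exit, which matches the observation above that the next job on the same runner cleans up the workspace without trouble.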