Windows tests frequently timeout #73489

Closed
malfet opened this issue Feb 28, 2022 · 3 comments
Labels
high priority · module: ci (Related to continuous integration) · module: regression (It used to work, and now it doesn't) · module: windows (Windows support for PyTorch) · triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)


@malfet
Contributor

malfet commented Feb 28, 2022

@malfet malfet added high priority module: windows Windows support for PyTorch module: ci Related to continuous integration module: regression It used to work, and now it doesn't labels Feb 28, 2022
malfet added a commit to malfet/pytorch that referenced this issue Feb 28, 2022
@janeyx99
Contributor

This should be fixed after #73293 and #73467. The first PR allowed the stats to be uploaded correctly, but because the jobs were still timing out, the stats for shard 1 were mostly empty. Extending the timeout in the second PR allowed accurate stats to be collected.

Looking at the latest trunk commit (https://hud2.pytorch.org/pytorch/pytorch/commit/b213041df304c05980b3209dddf1b595699c8b74), sharding for Windows is back to normal, though a combined test time of 5 hours is a lot and something we should address separately.

@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 1, 2022
facebook-github-bot pushed a commit that referenced this issue Mar 1, 2022
Summary: See #73489

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/94501ff91e

Reviewed By: malfet

Differential Revision: D34527930

Pulled By: malfet

fbshipit-source-id: 85fe3860ff14f2d7b02f3823519b2d140cdd3889
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 3, 2022
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 3, 2022
@Blackhex
Collaborator

It does not seem that the https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=win-vs2019-cuda11 pipelines are suffering from this issue anymore.

@malfet
Contributor Author

malfet commented Nov 21, 2022

I think multiple issues contributed to this:

  • CUDA-11.3 has a performance bug in certain optimizations
  • We've since implemented dynamic sharding, and none of the shards currently seems to exceed 100 min, see https://hud.pytorch.org/metrics

I guess a good follow-up would be to create a PR that reduces the test timeout to 120 min.
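
For readers unfamiliar with the idea, here is a rough sketch of what duration-based sharding can look like. It is not PyTorch's actual implementation: the thread does not describe that code, and the function name `shard_by_duration`, the test names, and the durations below are all made up for illustration. It assumes that per-test durations from earlier runs are available (the stats mentioned earlier in the thread) and assigns the longest tests first to whichever shard is currently the least loaded.

```python
import heapq


def shard_by_duration(durations, num_shards):
    """Greedy longest-first split of tests across shards.

    `durations` maps a test name to its previously recorded run time.
    This is a hypothetical helper, not PyTorch's sharding code.
    """
    # Min-heap of (accumulated time, shard index): the least-loaded
    # shard is always popped first.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]

    # Place the slowest tests first; ties are broken by name so the
    # result is deterministic.
    for name, cost in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + cost, idx))

    totals = [sum(durations[n] for n in shard) for shard in shards]
    return shards, totals


# Made-up durations in minutes, for illustration only.
example = {"test_a": 90, "test_b": 60, "test_c": 45, "test_d": 30, "test_e": 15}
shards, totals = shard_by_duration(example, num_shards=2)
print(shards, totals)
```

With a split like this, a check such as `max(totals) <= 120` is one way to judge whether a proposed timeout leaves enough headroom, provided the recorded durations are complete.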

kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022
This PR decreases the Windows test pipelines' timeout to 120 mins, per the discussion at pytorch#73489 (comment)

Closes pytorch#73489.
Pull Request resolved: pytorch#89694
Approved by: https://github.com/kit1980
Projects
Status: Done

5 participants