
Conversation

huydhn
Contributor

@huydhn huydhn commented Apr 11, 2023

distributed/_tensor/test_dtensor_ops is still flaky in trunk with a curious timeout issue, for example https://hud.pytorch.org/pytorch/pytorch/commit/ce4df4cc596aa10534ac6d54912f960238264dfd. It seems that the test just hangs without any failure. The root cause is unclear. On the other hand, #98816 might offer a solution for this. Anyway, I'm disabling the test on CPU for now while the investigation is being done.

The test is still run on CUDA-available runners because it's not flaky there.
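
For reference, a minimal sketch of the guard this PR adds, assuming it sits in the test file's __main__ block; the imports and exact placement here are illustrative, not copied verbatim from test/distributed/_tensor/test_dtensor_ops.py:

# Sketch only: skip the dtensor ops suite on CPU-only runners until the
# timeouts in https://github.com/pytorch/pytorch/issues/98816 are understood.
import torch
from torch.testing._internal.common_utils import run_tests

if __name__ == "__main__":
    # Run the suite only when a CUDA device is available; CPU-only runners
    # skip it entirely for now.
    if torch.cuda.is_available():
        run_tests()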

@pytorch-bot

pytorch-bot bot commented Apr 11, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98868

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Failures

As of commit 7ef4187:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base def50d2:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label (topic category) Apr 11, 2023
@huydhn huydhn marked this pull request as ready for review April 12, 2023 00:02
# NB: CPU dtensor ops test frequently timeout https://github.com/pytorch/pytorch/issues/98816
# so running it only on CUDA
if torch.cuda.is_available():
    run_tests()
Collaborator

hmmm I don't think test_dtensor_ops runs on CUDA, so this essentially means it would only run on CUDA-available machines, still as CPU tests: https://github.com/pytorch/pytorch/blob/master/test/distributed/_tensor/test_dtensor_ops.py#L559

Collaborator

which I think might be fine for the short term while we are investigating the timeouts. I just want to make sure there's at least something running in CI so that we can capture errors when submitting PRs :)

Contributor Author

@huydhn huydhn Apr 12, 2023

Yeah, the test would still be run on machines with CUDA, which curiously never time out. I will update the PR title and description accordingly.

Collaborator

Got it, thanks!

@huydhn huydhn changed the title from "Skip dtensor ops CPU test due to flaky timeout" to "Skip dtensor ops on CPU-only runner due to flaky timeout" Apr 12, 2023
@huydhn huydhn added the ciflow/trunk label (Trigger trunk jobs on your pull request) Apr 12, 2023
@huydhn
Contributor Author

huydhn commented Apr 12, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

@huydhn
Contributor Author

huydhn commented Apr 12, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here
