fix the device type for with_comms decorator #125798

wanchaol · 2024-05-08T22:22:54Z

Stack from ghstack (oldest at bottom):

-> fix the device type for with_comms decorator #125798

Found by @yifuwang: it looks like we are wrongly using
self.device_type="cuda" for the gloo backend, which is triggering some
flakiness (see #125366).

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot · 2024-05-08T22:22:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125798

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d9621b6 with merge base bd3cbdb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 · 2024-05-09T06:38:10Z

torch/testing/_internal/distributed/_tensor/common_dtensor.py

+        # if enough GPU we can use GPU, otherwise we fallback to CPU
+        if torch.cuda.is_available() and torch.cuda.device_count() < self.world_size:
            self.device_type = "cpu"


I understand your intention, but the "and" here seems a bit weird.
What about:

if not torch.cuda.is_available() or torch.cuda.device_count() < self.world_size:

Or maybe I missed your other intention.

wanchaol · 2024-05-13

Sure, that works too. Let me change it to that.

wanchaol · 2024-05-13

This might also fix the current CI issues; let's see.
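
To make the difference between the two conditions discussed above concrete, here is a minimal sketch in plain Python. The helper names `pick_device_type` and `pick_device_type_with_and` are hypothetical and not part of `common_dtensor.py`; the booleans and counts stand in for `torch.cuda.is_available()`, `torch.cuda.device_count()`, and `self.world_size`. It also assumes the device type defaults to "cuda" when the fallback branch is not taken, which the diff excerpt does not show.

```python
def pick_device_type(cuda_available: bool, device_count: int, world_size: int) -> str:
    """Suggested condition: fall back to CPU when CUDA is missing
    or when there are fewer GPUs than ranks."""
    if not cuda_available or device_count < world_size:
        return "cpu"
    return "cuda"  # assumed default, not shown in the diff excerpt


def pick_device_type_with_and(cuda_available: bool, device_count: int, world_size: int) -> str:
    """Originally proposed condition, kept only for comparison."""
    if cuda_available and device_count < world_size:
        return "cpu"
    return "cuda"  # assumed default, not shown in the diff excerpt


if __name__ == "__main__":
    # (cuda_available, device_count, world_size)
    cases = [(False, 0, 4), (True, 2, 4), (True, 4, 4)]
    for case in cases:
        print(case, pick_device_type(*case), pick_device_type_with_and(*case))
```

The two versions only disagree when CUDA is unavailable: the `and` form never takes the fallback branch in that case, so under the stated assumption the device type would stay "cuda" on a machine without GPUs.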

yifuwang

LG if CI passes!

wanchaol · 2024-05-16T00:18:45Z

@pytorchbot merge

pytorchmergebot · 2024-05-16T00:21:14Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.

Found by @yifuwang: it looks like we are wrongly using self.device_type="cuda" for the gloo backend, which is triggering some flakiness (see pytorch#125366).

Pull Request resolved: pytorch#125798
Approved by: https://github.com/yifuwang

fix the device type for with_comms decorator

4067c30

pytorch-bot bot added the oncall: distributed label May 8, 2024

pytorch-bot bot added the ci-td-distributed label May 8, 2024

wanchaol added the ciflow/trunk and topic: not user facing labels May 8, 2024

kwen2501 reviewed May 9, 2024

yifuwang approved these changes May 14, 2024

pytorchmergebot added the merging label May 16, 2024

pytorchmergebot added the Merged label May 16, 2024

pytorchmergebot closed this in d0dfcd2 May 16, 2024

pytorchmergebot removed the merging label May 16, 2024

github-actions bot deleted the gh/wanchaol/466/head branch June 16, 2024 01:59
