fix the device type for with_comms decorator #125798

Closed
wanchaol wants to merge 6 commits

@wanchaol wanchaol commented May 8, 2024

Stack from ghstack (oldest at bottom):

Found by @yifuwang: it looks like we are wrongly using
self.device_type="cuda" for the gloo backend, which is triggering some
flakiness; see #125366.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k
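
For readers skimming the thread, here is a minimal sketch of the kind of backend/device consistency the description is about. It is not the code from this PR: `pick_device_type` and its parameters are hypothetical names, and mapping gloo to "cpu" is an assumption made for illustration, based only on the description above.

```python
def pick_device_type(backend: str, cuda_available: bool, gpu_count: int, world_size: int) -> str:
    """Hypothetical helper: choose the device type a distributed test should use.

    Assumption for this sketch: gloo-backed tests run on CPU tensors, and
    nccl-backed tests need one CUDA device per rank.
    """
    if backend == "nccl":
        if not cuda_available or gpu_count < world_size:
            raise RuntimeError("nccl needs one CUDA device per rank")
        return "cuda"
    # Any other backend (gloo in particular) is treated as CPU-only here.
    return "cpu"


# A gloo run stays on CPU even when GPUs are present.
print(pick_device_type("gloo", cuda_available=True, gpu_count=8, world_size=4))
# An nccl run with enough GPUs uses CUDA.
print(pick_device_type("nccl", cuda_available=True, gpu_count=4, world_size=4))
```

Taking the backend as an explicit argument keeps the device choice from silently defaulting to "cuda" just because GPUs happen to be visible.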


pytorch-bot bot commented May 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125798

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d9621b6 with merge base bd3cbdb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "oncall: distributed" label May 8, 2024
wanchaol added a commit that referenced this pull request May 8, 2024
Found by yifuwang: it looks like we are wrongly using
self.device_type="cuda" for the gloo backend, which is triggering some
flakiness; see #125366.

ghstack-source-id: 7363bd0828ffd4086c329e9f60eea8de4272d2bb
Pull Request resolved: #125798
@wanchaol wanchaol added the "ciflow/trunk" and "topic: not user facing" labels May 8, 2024
Comment on lines 358 to 360
# if enough GPU we can use GPU, otherwise we fallback to CPU
if torch.cuda.is_available() and torch.cuda.device_count() < self.world_size:
    self.device_type = "cpu"
Contributor

I understand your intention, but the "and" here seems a bit off: when CUDA is unavailable the condition is false, so the CPU fallback is skipped.
What about:

if not torch.cuda.is_available() or torch.cuda.device_count() < self.world_size:

Or maybe I missed some other intention of yours.
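
To make the difference between the two conditions concrete, here is a small sketch that evaluates both as plain boolean functions. The helper names and the explicit `cuda_available` / `gpu_count` parameters are stand-ins for `torch.cuda.is_available()` and `torch.cuda.device_count()`, introduced only for illustration so the snippet runs without PyTorch.

```python
def fallback_with_and(cuda_available: bool, gpu_count: int, world_size: int) -> bool:
    """Condition as quoted in the diff: fall back only if CUDA is present but short on GPUs."""
    return cuda_available and gpu_count < world_size


def fallback_with_or(cuda_available: bool, gpu_count: int, world_size: int) -> bool:
    """Suggested condition: fall back if CUDA is missing or there are too few GPUs."""
    return (not cuda_available) or gpu_count < world_size


# Each case is (cuda_available, gpu_count, world_size).
cases = [
    (False, 0, 4),  # no CUDA at all
    (True, 2, 4),   # CUDA present, too few GPUs
    (True, 4, 4),   # CUDA present, enough GPUs
]

# Pair up the result of each condition for every case.
results = [(fallback_with_and(*c), fallback_with_or(*c)) for c in cases]
for case, (with_and, with_or) in zip(cases, results):
    print(case, "and:", with_and, "or:", with_or)
```

The two conditions disagree only in the first case, a machine without CUDA, where the "and" form skips the fallback.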

Contributor Author

Sure, that works too. Let me change it to that.

Contributor Author

This might also fix the current CI issues; let's see.

wanchaol added a commit that referenced this pull request May 13, 2024

ghstack-source-id: 0054e7bf268eabefd3e927f7fe48cc0468f481c4
Pull Request resolved: #125798
wanchaol added a commit that referenced this pull request May 14, 2024

ghstack-source-id: a845ed283106e463ef87e4576e67573f514bb487
Pull Request resolved: #125798
Contributor

@yifuwang yifuwang left a comment

LG if CI passes!

wanchaol added a commit that referenced this pull request May 15, 2024

ghstack-source-id: 3d7c3b2a34e7c2f6d1b90d861deac037d4555470
Pull Request resolved: #125798
wanchaol added a commit that referenced this pull request May 15, 2024

ghstack-source-id: 4c42eafe6b5a4fc3e1c2e67a5fcde0a6702aa71a
Pull Request resolved: #125798
wanchaol added a commit that referenced this pull request May 15, 2024

ghstack-source-id: dad362e216edf17b294a611b9adb46febabe70b4
Pull Request resolved: #125798
@wanchaol
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Found by @yifuwang: it looks like we are wrongly using
self.device_type="cuda" for the gloo backend, which is triggering some
flakiness; see pytorch#125366.

Pull Request resolved: pytorch#125798
Approved by: https://github.com/yifuwang
@github-actions github-actions bot deleted the gh/wanchaol/466/head branch June 16, 2024 01:59
Labels
ci-td-distributed, ciflow/trunk, Merged, oncall: distributed, topic: not user facing

4 participants