Multi-node + multi-GPU papers100m + GCN example #8070
Conversation
Wondering if it is possible to merge this with the initial multi-GPU example, or whether there is a strong reason not to do this.
The setup is fairly different. The old example uses mp.spawn for single-node multi-GPU; for multi-node multi-GPU we are using torch.distributed with the NCCL backend, so I think for now we should keep them separate. Additionally, the single-node multi-GPU example is very simple to run, just |
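For context, a minimal sketch of the distinction described above (hypothetical code, not the actual example in this PR): the single-node example launches its own workers via mp.spawn, while the multi-node setup expects one process per GPU to be launched externally (e.g. via torchrun or srun) and initializes torch.distributed with the NCCL backend from environment variables.

```python
# Hypothetical sketch of the two launch styles; not the code from this PR.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def single_node_worker(rank: int, world_size: int):
    # mp.spawn style: the parent process creates one worker per local GPU.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... training loop ...
    dist.destroy_process_group()


def multi_node_setup():
    # torchrun/srun style: every process is launched externally, so rank,
    # world size, and local rank are read from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank


if __name__ == "__main__":
    # Single-node case: run this script directly and spawn one worker per GPU.
    # In the multi-node case, each externally launched process would instead
    # call multi_node_setup() at the start of training.
    world_size = torch.cuda.device_count()
    mp.spawn(single_node_worker, args=(world_size,), nprocs=world_size)
```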
This will be a very nice addition! 🚀
Haven't tried running this example in a multi-node multi-GPU environment myself, but the code looks good to me. Just like Matthias minimised the other example in #7954, I think we should keep this example minimal, too, where it makes sense.
@akihironitta @rusty1s Let me know if anything else is needed to merge.
Thanks @puririshi98. I cleaned it up a bit. I really like the example. One thing we should definitely improve is to allow distributed evaluation as well, but that is not blocking this PR.
Working with the NVIDIA PyG container.