
Multi-node+multi-GPU papers100m+GCN example #8070

Merged: 44 commits merged into master from papers100m-multinode on Oct 24, 2023
Conversation

@puririshi98 (Contributor) commented on Sep 22, 2023

Working with the NVIDIA PyG container.

@puririshi98 self-assigned this on Sep 22, 2023
@rusty1s (Member) left a comment

Wondering if it is possible to merge this with the initial multi-GPU example, or whether there is a strong reason not to.

@rusty1s changed the title from Multinode-multigpu Papers100m GCN example to Multi-node+multi-GPU papers100m+GCN example on Sep 23, 2023
@puririshi98 (Contributor, Author) commented on Sep 25, 2023

> Wondering if it is possible to merge this with the initial multi-GPU example, or whether there is a strong reason not to.

The setup is relatively different. The old one uses `mp.spawn` for single-node multi-GPU; for multi-node multi-GPU we are using torch.distributed with the NCCL backend, so I think we should keep them separate for now. Additionally, the single-node multi-GPU example is very simple to run: just `python3 scriptname.py` and it handles the rest, whereas for multi-node you need to run several SLURM commands to get it working (see the comments in the file). Given the extra complexity, I think a separate multi-node multi-GPU example and tutorial (#8071) is more appropriate.
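A minimal sketch of the multi-node initialization described above, assuming a SLURM launcher that starts one process per GPU (the environment-variable names follow standard SLURM behavior; `init_multinode` is a hypothetical helper, not the code in this PR):

```python
import os

import torch
import torch.distributed as dist


def init_multinode():
    # SLURM assigns every launched task a global rank and a world size:
    rank = int(os.environ['SLURM_PROCID'])
    world_size = int(os.environ['SLURM_NTASKS'])
    # MASTER_ADDR/MASTER_PORT must be exported in the job script and point
    # at the first node so all ranks can rendezvous:
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # Pin each process to one GPU on its node:
    local_rank = rank % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```

Such a script would be launched from an `sbatch` job via `srun python3 scriptname.py`, in contrast to the single-node example, which spawns its own workers with `mp.spawn` from a plain `python3 scriptname.py` invocation.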

@akihironitta (Member) left a comment

This will be a very nice addition! 🚀

I haven't tried running this example in a multi-node multi-GPU environment myself, but the code looks good to me. Just like Matthias minimised the other example in #7954, I think we should keep this example minimal, too, where it makes sense.

Review thread on examples/multi_gpu/multigpu_papers100m_gcn.py (outdated, resolved)
@puririshi98 (Contributor, Author) commented

@akihironitta @rusty1s let me know if anything else is needed to merge.

@puririshi98 enabled auto-merge (squash) on October 23, 2023 at 20:53
@rusty1s (Member) left a comment

Thanks @puririshi98. I cleaned it up a bit. I really like the example. One thing we should definitely improve is to allow distributed evaluation as well, but that is not blocking this PR.
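As a rough idea of what distributed evaluation could look like here (a hypothetical sketch, not the merged code: it assumes `NeighborLoader`-style batches carrying a `batch_size` attribute for the seed nodes, and a model called as `model(x, edge_index)`):

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def distributed_accuracy(model, loader, device):
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    for batch in loader:  # each rank iterates only over its own shard
        batch = batch.to(device)
        out = model(batch.x, batch.edge_index)[:batch.batch_size]
        pred = out.argmax(dim=-1)
        correct += (pred == batch.y[:batch.batch_size].view(-1)).sum()
        total += batch.batch_size
    dist.all_reduce(correct)  # defaults to SUM across all ranks
    dist.all_reduce(total)
    return float(correct / total)
```

With the counts summed across ranks, every process ends up with the same global accuracy, so only rank 0 needs to log it.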

@rusty1s merged commit 3854bcf into master on Oct 24, 2023 (14 checks passed) and deleted the papers100m-multinode branch on October 24, 2023 at 01:16