Improve multi-node multi-gpu tutorial #8353

Merged · 21 commits · Nov 16, 2023

Conversation

@flxmr (Contributor) commented Nov 9, 2023

So, in response to my remarks in #8071, I have now prepared this PR updating the multi-node documentation (sadly no student contribution; they can prepare multi-GPU metrics).

Reasoning:

  • if PyG has a tutorial on DDP, it is for people who essentially grow within PyG into this use case. The tutorial should reflect that by telling them more about it and being explicit about what is done (like torch.multiprocessing injecting the rank; it seems very random otherwise). See the sketch after this list.
  • in the same vein, the multi-node tutorial should evolve from the single-node one, imho.
  • embedding the pyxis container as the only way to do things is good for NVIDIA, but bad for people who arrive on a system without it.
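
A minimal sketch of that point, not the tutorial code itself: torch.multiprocessing.spawn passes the rank as an implicit first positional argument to the worker function, which is what can look random if it isn't spelled out (the address/port 127.0.0.1:12355 here is an arbitrary choice):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int):
    # `rank` is injected by mp.spawn as the first positional argument;
    # it does not appear in `args=` below, which is easy to miss.
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:12355',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model on GPU `rank`, wrap it in DistributedDataParallel ...
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    # Spawns `world_size` processes; each calls run(rank, world_size)
    # with rank = 0 .. world_size - 1.
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```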

Things I skipped:

  • setting up the worker count. I would say people should just figure this out on their own. os.sched_getaffinity is cool, but this is a very general problem too, and I essentially do it manually now.
  • anything GRES-related. I doubt that anyone growing into this by installing Slurm on their own computers will fail to grok it. Every multi-user, public research system has GRES, I'd say (and everyone has their internal docs).

  - now describes usage with an sbatch file
  - alternatively, still describes using the pyxis container
  - builds upon the single-node multi-GPU example
@akihironitta (Member) left a comment
I'm not a slurm expert, but LGTM. Just pushed a few commits, but feel free to revert them if you think otherwise.

@akihironitta changed the title from "Fix multigpu docs" to "Improve multi-node multi-gpu tutorial" on Nov 13, 2023
@flxmr (Contributor, Author) commented Nov 13, 2023

So, now I tried the pyxis example on a cluster I got access to (also to check whether this is worth implementing for our own cluster)... and it doesn't work (I built my own container, but I don't know where the master address would come from, even in the early-access NGC one...). I will wrap this into an sbatch file too and then do a final version. Maybe @puririshi98 can tell me how this is supposed to work (I installed PyG into the container).

@flxmr (Contributor, Author) commented Nov 13, 2023

I didn't figure out how to make the srun-only example work with the containers. The current example, which I just copied, doesn't work: there are no environment variables in the PyTorch-only NGC container, and I wonder how they would be injected anyway. Also, the mounting doesn't make sense. So now it is sbatch only, but verified to work on our local HPC (which also has a tutorial on modifying the enroot container).

In addition, I noticed while trying this that having multiple processes download the data and then try to unzip it does not work well (it worked previously because the data was already downloaded). Fixed that too.
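
A minimal sketch of the kind of fix meant here, assuming the process group is already initialized and using Reddit purely as an illustrative dataset: only global rank 0 downloads and extracts the data, and the other ranks wait at a barrier before loading it from disk.

```python
import torch.distributed as dist
from torch_geometric.datasets import Reddit


def load_dataset(root: str, rank: int) -> Reddit:
    if rank == 0:
        dataset = Reddit(root)  # triggers download + extraction if missing
    dist.barrier()              # everyone else waits until the data exists
    if rank != 0:
        dataset = Reddit(root)  # data is already on disk, just load it
    return dataset
```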

@puririshi98 (Contributor) left a comment
LGTM, thanks for the improvements :)

@puririshi98 (Contributor) left a comment
Although it looks good to me, we should definitely make sure the containers work, as my original PR was tested and working on a pyxis-enabled NVIDIA cluster using our NVIDIA early-access PyG container.

@flxmr (Contributor, Author) commented Nov 14, 2023

So, I checked again, and it seems pyxis/enroot is indeed injecting LOCAL_RANK, RANK and MASTER_ADDR into the container: https://github.com/NVIDIA/enroot/blob/master/conf/hooks/extra/50-slurm-pytorch.sh#L32-L37
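
For reference, a minimal sketch of how a training script can consume those variables inside the container, assuming the hook also exports WORLD_SIZE and MASTER_PORT alongside them:

```python
import os

import torch
import torch.distributed as dist

# Assumed to be set by the 50-slurm-pytorch.sh enroot hook from SLURM_* variables.
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

torch.cuda.set_device(local_rank)
# init_method='env://' picks up MASTER_ADDR and MASTER_PORT from the environment.
dist.init_process_group('nccl', init_method='env://',
                        rank=rank, world_size=world_size)
```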

It seems to be quite a mess, though (user→slurm→pyxis→enroot): NVIDIA/pyxis#46 (comment)

Maybe this is site-configuration specific: they have it, but our HPC center's DGX doesn't? It would be nice if you could check this; then you can revert the doc rewrite (it still needs a single process doing the download, though!).

@puririshi98 (Contributor)

@flxmr I will investigate and get back to you as soon as I can

@puririshi98 (Contributor)

@flxmr I asked around internally with our enroot team:

PyTorch is supported through this enroot hook: https://github.com/NVIDIA/enroot/blob/master/conf/hooks/extra/50-slurm-pytorch.sh
It is shipped by default and just needs to be enabled by the administrator.

Does this help? I can follow up again if not.

@flxmr (Contributor, Author) commented Nov 16, 2023

So, I hope this works now for everyone. I added a link to this issue because I suppose that if our HPC didn't like the hook, others might not like it either.

@puririshi98 (Contributor) left a comment
LGTM now, thanks @flxmr for this great PR

@puririshi98 enabled auto-merge (squash) on November 16, 2023, 18:26
@puririshi98 merged commit 3af88bd into pyg-team:master on Nov 16, 2023
13 of 14 checks passed