Improve multi-node multi-gpu tutorial #8353

Merged · 21 commits · Nov 16, 2023

Conversation

@flxmr (Contributor) commented Nov 9, 2023

So, in response to my remarks in #8071, I have now prepared this PR updating the multi-node documentation (sadly no student contribution; they can prepare multi-GPU metrics).

Reasoning:

  • if PyG has a tutorial on DDP, it is for people who essentially grow within PyG into this use case. The tutorial should reflect that by telling them more about it and being explicit about what is done (like torch.multiprocessing injecting the rank; it seems very random otherwise). See the sketch after this list.
  • in the same vein, the multi-node tutorial should evolve from the single-node one, imho.
  • embedding the pyxis container as the only way to do things is good for NVIDIA, but bad for people who arrive on a system without it.
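
A minimal sketch of that point, not the tutorial code itself: torch.multiprocessing.spawn passes the rank as an implicit first positional argument to the worker function, which is what can look random if it isn't spelled out (the address/port 127.0.0.1:12355 here is an arbitrary choice):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int):
    # `rank` is injected by mp.spawn as the first positional argument;
    # it does not appear in `args=` below, which is easy to miss.
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:12355',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model on GPU `rank`, wrap it in DistributedDataParallel ...
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    # Spawns `world_size` processes; each calls run(rank, world_size)
    # with rank = 0 .. world_size - 1.
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```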

Things I skipped:

  • setting up the worker count. I would say people should just figure this out on their own. os.sched_getaffinity is cool, but this is a very general problem too, and I essentially do it manually now.
  • anything GRES-related. I doubt that anyone growing into this by installing Slurm on their own computers will fail to grok it. Every multi-user, public research system has GRES, I'd say (and everyone has their internal docs).

  - now describes usage with an sbatch file
  - alternatively, still describes using the pyxis container
  - builds upon the single-node multi-GPU example
@akihironitta (Member) left a comment
I'm not a slurm expert, but LGTM. Just pushed a few commits, but feel free to revert them if you think otherwise.

@akihironitta changed the title from "Fix multigpu docs" to "Improve multi-node multi-gpu tutorial" on Nov 13, 2023
@flxmr (Contributor, Author) commented Nov 13, 2023

So, now I tried the pyxis example on a cluster I got access to (also to check whether this is worth implementing for our own cluster)... and it doesn't work (I built my own container, but I don't know where the master address would come from, even in the early-access NGC one...). I will wrap this into an sbatch file too and then do a final version. Maybe @puririshi98 can tell me how this is supposed to work (I installed PyG into the container).

@flxmr (Contributor, Author) commented Nov 13, 2023

I didn't figure out how to make the srun-only example work with the containers. The current example, which I just copied, doesn't work: there are no environment variables in the PyTorch-only NGC container, and I wonder how they would be injected anyway. Also, the mounting doesn't make sense. So now it is sbatch only, but verified to work on our local HPC (which also has a tutorial on modifying the enroot container).

In addition, I noticed while trying this that having multiple processes download the data and then try to unzip it does not work well (it worked previously because the data was already downloaded). Fixed that too.
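
A minimal sketch of the kind of fix meant here, assuming the process group is already initialized and using Reddit purely as an illustrative dataset: only global rank 0 downloads and extracts the data, and the other ranks wait at a barrier before loading it from disk.

```python
import torch.distributed as dist
from torch_geometric.datasets import Reddit


def load_dataset(root: str, rank: int) -> Reddit:
    if rank == 0:
        dataset = Reddit(root)  # triggers download + extraction if missing
    dist.barrier()              # everyone else waits until the data exists
    if rank != 0:
        dataset = Reddit(root)  # data is already on disk, just load it
    return dataset
```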

@puririshi98 (Contributor) left a comment
LGTM, thanks for the improvements :)

@puririshi98 (Contributor) left a comment
Although it looks good to me, we should definitely make sure the containers work, as my original PR was tested and working on a pyxis-enabled NVIDIA cluster using our NVIDIA early-access PyG container.

@flxmr (Contributor, Author) commented Nov 14, 2023

So, I checked again, and it seems pyxis/enroot is indeed injecting LOCAL_RANK, RANK and MASTER_ADDR into the container: https://github.com/NVIDIA/enroot/blob/master/conf/hooks/extra/50-slurm-pytorch.sh#L32-L37
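
For reference, a minimal sketch of how a training script can consume those variables inside the container, assuming the hook also exports WORLD_SIZE and MASTER_PORT alongside them:

```python
import os

import torch
import torch.distributed as dist

# Assumed to be set by the 50-slurm-pytorch.sh enroot hook from SLURM_* variables.
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

torch.cuda.set_device(local_rank)
# init_method='env://' picks up MASTER_ADDR and MASTER_PORT from the environment.
dist.init_process_group('nccl', init_method='env://',
                        rank=rank, world_size=world_size)
```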

It seems to be quite a mess, though (user→slurm→pyxis→enroot): NVIDIA/pyxis#46 (comment)

Maybe this is site-configuration specific: they have it, but our HPC center's DGX doesn't? It would be nice if you could check this; then you can revert the doc rewrite (it still needs a single process doing the download, though!).

@puririshi98 (Contributor)

@flxmr I will investigate and get back to you as soon as I can

@puririshi98 (Contributor)

@flxmr I asked around internally with our enroot team:

PyTorch is supported through this enroot hook: https://github.com/NVIDIA/enroot/blob/master/conf/hooks/extra/50-slurm-pytorch.sh
It is shipped by default and just needs to be enabled by the administrator.

Does this help? I can follow up again if not.

@flxmr (Contributor, Author) commented Nov 16, 2023

So, I hope this works now for everyone. I added a link to this issue because I suppose that if our HPC didn't like the hook, others might not like it either.

@puririshi98 (Contributor) left a comment
LGTM now, thanks @flxmr for this great PR

@puririshi98 enabled auto-merge (squash) on November 16, 2023, 18:26
@puririshi98 merged commit 3af88bd into pyg-team:master on Nov 16, 2023
13 of 14 checks passed