Fix multi-node DDP training #2101

Merged: 1 commit merged into develop on Aug 18, 2023

Conversation

lucadellalib (Collaborator) commented:

NOTE: requires further testing.

speechbrain.utils.distributed.if_main_process defines the main process as the one whose global rank (the RANK environment variable) equals 0. This works for DDP on a single node, because each process's global rank coincides with its local rank within the node. On multiple nodes, however, it breaks: I/O operations such as data preparation and fitting the SentencePiece tokenizer run only on the master node (where the process with global rank 0 lives), and never on the worker nodes (where every process has global rank > 0). Intermediate artifacts such as the data manifest files and the SentencePiece checkpoint are therefore created only on the master node, and the processes on the worker nodes fail (e.g. with a FileNotFoundError). Checking against the local rank (the LOCAL_RANK environment variable) instead should fix the issue, since the I/O operations would then run on the main process of each node (see the sketch below).
Reference: pytorch-ddp (YunchaoYang/Blogs#3)
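
For reference, a minimal sketch of the two checks, assuming the standard RANK and LOCAL_RANK environment variables set by torchrun-style launchers; the helper names are hypothetical and are not the actual speechbrain.utils.distributed API:

```python
import os

def is_global_main_process():
    # RANK is the global rank across all nodes; exactly one process in the
    # whole job (on the master node) has RANK == 0.
    return int(os.environ.get("RANK", "0")) == 0

def is_local_main_process():
    # LOCAL_RANK is the rank within a single node; every node has a process
    # with LOCAL_RANK == 0, so node-local I/O (data preparation, tokenizer
    # fitting, manifest writing) runs once per node instead of once per job.
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```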

@mravanelli mravanelli added the bug Something isn't working label Aug 7, 2023
@mravanelli (Collaborator) commented:

@pplantinga, could you please take a look at this?

@pplantinga (Collaborator) left a comment:

I agree with this change. Although there may be some cases where you want code to run only on the master node (e.g. saving a checkpoint), in the majority of cases the preferred behavior is to run on every node, and the cost of running on all nodes in the few cases where you might not want to is small (see the sketch below).
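
As a hedged illustration of that split, the sketch below pairs each kind of work with the corresponding rank check; prepare_data and save_checkpoint are hypothetical placeholders rather than SpeechBrain APIs, and the environment variables are the ones torchrun-style DDP launchers set:

```python
import os
import torch.distributed as dist

def prepare_data():
    ...  # placeholder: node-local I/O (data manifests, SentencePiece tokenizer)

def save_checkpoint():
    ...  # placeholder: job-wide work, e.g. a checkpoint on shared storage

# Once per node: each node needs its own copy of the intermediate artifacts.
if int(os.environ.get("LOCAL_RANK", "0")) == 0:
    prepare_data()

# Once per job: only the process with global rank 0 writes the checkpoint.
if int(os.environ.get("RANK", "0")) == 0:
    save_checkpoint()

# All processes wait here so nobody reads artifacts before they exist.
if dist.is_initialized():
    dist.barrier()
```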

@mravanelli merged commit 0e8b81e into develop on Aug 18, 2023
5 checks passed
@mravanelli deleted the lucadellalib-distributed-training branch on August 18, 2023 at 22:46