Fix multi-node DDP training #2101

Merged: 1 commit merged into develop on Aug 18, 2023

Conversation

lucadellalib (Collaborator) commented:

NOTE: requires further testing.

speechbrain.utils.distributed.if_main_process defines the main process as the one whose global rank (the RANK environment variable) equals 0. This works for DDP on a single node, because each process's global rank coincides with its local rank within the node. On multiple nodes, however, it breaks: I/O operations such as data preparation and fitting the SentencePiece tokenizer run only on the master node (where the process with global rank 0 lives), and never on the worker nodes (where every process has global rank > 0). Intermediate artifacts such as the data manifest files and the SentencePiece checkpoint are therefore created only on the master node, and the processes on the worker nodes fail (e.g. with a FileNotFoundError). Checking against the local rank (the LOCAL_RANK environment variable) instead should fix the issue, since the I/O operations would then run on the main process of each node (see the sketch below).
Reference: pytorch-ddp (YunchaoYang/Blogs#3)
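
For reference, a minimal sketch of the two checks, assuming the standard RANK and LOCAL_RANK environment variables set by torchrun-style launchers; the helper names are hypothetical and are not the actual speechbrain.utils.distributed API:

```python
import os

def is_global_main_process():
    # RANK is the global rank across all nodes; exactly one process in the
    # whole job (on the master node) has RANK == 0.
    return int(os.environ.get("RANK", "0")) == 0

def is_local_main_process():
    # LOCAL_RANK is the rank within a single node; every node has a process
    # with LOCAL_RANK == 0, so node-local I/O (data preparation, tokenizer
    # fitting, manifest writing) runs once per node instead of once per job.
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```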

@mravanelli mravanelli added the bug Something isn't working label Aug 7, 2023
@mravanelli (Collaborator) commented:

@pplantinga, could you please take a look at this?

@pplantinga (Collaborator) left a comment:

I agree with this change. Although there may be some cases where you want code to run only on the master node (e.g. saving a checkpoint), in the majority of cases the preferred behavior is to run on every node, and the cost of running on all nodes in the few cases where you might not want to is small (see the sketch below).
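
As a hedged illustration of that split, the sketch below pairs each kind of work with the corresponding rank check; prepare_data and save_checkpoint are hypothetical placeholders rather than SpeechBrain APIs, and the environment variables are the ones torchrun-style DDP launchers set:

```python
import os
import torch.distributed as dist

def prepare_data():
    ...  # placeholder: node-local I/O (data manifests, SentencePiece tokenizer)

def save_checkpoint():
    ...  # placeholder: job-wide work, e.g. a checkpoint on shared storage

# Once per node: each node needs its own copy of the intermediate artifacts.
if int(os.environ.get("LOCAL_RANK", "0")) == 0:
    prepare_data()

# Once per job: only the process with global rank 0 writes the checkpoint.
if int(os.environ.get("RANK", "0")) == 0:
    save_checkpoint()

# All processes wait here so nobody reads artifacts before they exist.
if dist.is_initialized():
    dist.barrier()
```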

@mravanelli merged commit 0e8b81e into develop on Aug 18, 2023
5 checks passed
@mravanelli deleted the lucadellalib-distributed-training branch on August 18, 2023 at 22:46