[fix] fix rank setup, solve the problem that multiple processes cannot be started, test multi-node training #131
Compare the logs between multi-node and single-node training (single-node at the top):
![image](https://user-images.githubusercontent.com/29651739/218915415-e956a81c-f7ba-4a36-8ef9-3c4571a72338.png)
To train on multiple nodes, change the line

```shell
torchrun --standalone --nnodes=1 --nproc_per_node=$num_gpus
```

in run.sh to

```shell
torchrun --nnodes $node_num --nproc_per_node $num_process_per_node --rdzv_backend=c10d --rdzv_endpoint=ip_address_of_one_machine:port
```
- `node_num`: the number of nodes (machines)
- `num_process_per_node`: the number of processes to run on each machine
- `ip_address_of_one_machine`: the IP address of one machine, chosen arbitrarily, to serve as the rendezvous endpoint
- `port`: the port used for communication

The above command should be run individually on each machine (node).
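As a concrete illustration, suppose there are two machines with 8 GPUs each; the IP address `192.168.1.10`, port `29500`, and script name `train.py` below are placeholders, not values from this repo. The same command is run on both nodes:

```shell
# Run this identical command on each of the two nodes.
# The c10d rendezvous backend coordinates the nodes and assigns
# ranks automatically once all of them have joined.
torchrun --nnodes 2 \
         --nproc_per_node 8 \
         --rdzv_backend=c10d \
         --rdzv_endpoint=192.168.1.10:29500 \
         train.py
```

Note that `192.168.1.10` must be reachable from every node, and the chosen port must be open on that machine.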
For more detailed information, refer to MULTINODE TRAINING.
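Since this PR is about rank setup: each process spawned by torchrun receives its rank assignment through environment variables, which training code can read directly. A minimal sketch (the helper name `get_dist_info` is illustrative, not from this repo):

```python
import os

def get_dist_info():
    """Read the rank assignment that torchrun exports to each worker."""
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    # of every worker process it spawns, across all nodes.
    rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size
```

`LOCAL_RANK` is typically used to select the GPU on the current node, while the global `RANK` distinguishes processes across the whole job.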