[fix] fix rank setup, solve the problem that multiple processes cannot be started, test multi-node training #131
Compare the logs between multi-node and single-node training (single-node at the top):
![image](https://user-images.githubusercontent.com/29651739/218915415-e956a81c-f7ba-4a36-8ef9-3c4571a72338.png)
To train on multiple nodes, change the line

```shell
torchrun --standalone --nnodes=1 --nproc_per_node=$num_gpus
```

in run.sh to

```shell
torchrun --nnodes $node_num --nproc_per_node $num_process_per_node --rdzv_backend=c10d --rdzv_endpoint=ip_address_of_one_machine:port
```
- `node_num`: the number of nodes (machines)
- `num_process_per_node`: the number of processes to run on each machine
- `ip_address_of_one_machine`: the IP address of one machine, chosen arbitrarily, to serve as the rendezvous endpoint
- `port`: the port used for communication

The above command should be run individually on each machine (node).
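As a concrete illustration, suppose there are two machines with 8 GPUs each; the IP address `192.168.1.10`, port `29500`, and script name `train.py` below are placeholders, not values from this repo. The same command is run on both nodes:

```shell
# Run this identical command on each of the two nodes.
# The c10d rendezvous backend coordinates the nodes and assigns
# ranks automatically once all of them have joined.
torchrun --nnodes 2 \
         --nproc_per_node 8 \
         --rdzv_backend=c10d \
         --rdzv_endpoint=192.168.1.10:29500 \
         train.py
```

Note that `192.168.1.10` must be reachable from every node, and the chosen port must be open on that machine.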
For more detailed information, refer to MULTINODE TRAINING.
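Since this PR is about rank setup: each process spawned by torchrun receives its rank assignment through environment variables, which training code can read directly. A minimal sketch (the helper name `get_dist_info` is illustrative, not from this repo):

```python
import os

def get_dist_info():
    """Read the rank assignment that torchrun exports to each worker."""
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    # of every worker process it spawns, across all nodes.
    rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size
```

`LOCAL_RANK` is typically used to select the GPU on the current node, while the global `RANK` distinguishes processes across the whole job.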