
[fix] Fix rank setup, solve the problem that multiple processes cannot be started, and test multi-node training #131

Merged (1 commit, Feb 15, 2023)

Conversation

@czy97 (Collaborator) commented on Feb 15, 2023

Comparing the training logs between multi-node and single-node training (single-node shown on top):

[screenshot: side-by-side training logs for the single-node and multi-node runs]
To train on multiple nodes, change the line
torchrun --standalone --nnodes=1 --nproc_per_node=$num_gpus
in run.sh to
torchrun --nnodes $node_num --nproc_per_node $num_process_per_node --rdzv_backend=c10d --rdzv_endpoint=ip_address_of_one_machine:port

  • node_num: the number of nodes (machines)
  • num_process_per_node: the number of processes (typically one per GPU) to run on each node
  • ip_address_of_one_machine: the IP address of one machine, chosen arbitrarily, that hosts the rendezvous
  • port: the port used for rendezvous communication

The above command must be run separately on each machine (node).
For more detailed information, refer to MULTINODE TRAINING.
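
As an illustration, here is a minimal sketch of how the modified launch could look for a hypothetical two-node, four-GPU-per-node setup. The endpoint 10.0.0.1:29400 and the trailing script name are placeholders, not values from this PR:

```bash
#!/bin/bash
# Hypothetical values: 2 machines with 4 GPUs each.
node_num=2
num_process_per_node=4

# Pick one of the machines and a free port as the rendezvous endpoint
# (placeholder address below); use the same value on every node.
rdzv_endpoint=10.0.0.1:29400

# Run this exact command on EACH node; the c10d rendezvous backend
# gathers all 8 processes into one distributed training job.
torchrun --nnodes $node_num \
         --nproc_per_node $num_process_per_node \
         --rdzv_backend=c10d \
         --rdzv_endpoint=$rdzv_endpoint \
         train.py  # placeholder for the actual entry point invoked in run.sh
```

torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each spawned process, so the training script can derive its rank from them.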

@JiJiJiang (Collaborator) commented

Well done!
