Distributed training in stage 3.3 keeps hanging #11

srzer · 2024-06-11T07:41:19Z

In stage 3.3, the code runs fine when I set distributed_type to NO. When I set distributed_type to DEEPSPEED or MULTI_GPU, it hangs at training_args = TrainingArguments(. With DEEPSPEED, the terminal hangs after printing:

[2024-06-11 00:23:36,254] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-11 00:23:36,254] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-11 00:23:36,296] [INFO] [comm.py:637:init_distributed] cdb=None

Do you have any idea what might cause this? My CUDA version is 12.4.


WeiXiongUST · 2024-06-11T18:27:24Z

I ran into this issue once when other jobs that use deepspeed or accelerate were running at the same time, but I am not sure whether this applies to your situation.

You could look into this potential solution:
microsoft/DeepSpeed#3416

srzer · 2024-06-12T21:56:38Z

Thank you for your suggestions. I finally found that the issue was caused by a line I had added to the zerox.yaml configs: 'main_process_port: 0'.
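
For anyone who hits the same hang: a minimal sketch of one way to avoid passing a literal 0 as main_process_port. It assumes the config has already been loaded into a plain dict; the helper names find_free_port and resolve_main_process_port are hypothetical and not part of accelerate or DeepSpeed. The thread does not establish why port 0 hangs, so treat this as a workaround idea rather than a confirmed fix.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the OS pick a free port for this socket.
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def resolve_main_process_port(config: dict) -> dict:
    """Return a copy of config in which a main_process_port of 0 is
    replaced by a concrete free port (hypothetical helper).

    Other values, including a missing key, are left untouched.
    """
    fixed = dict(config)
    if fixed.get("main_process_port") == 0:
        fixed["main_process_port"] = find_free_port()
    return fixed


if __name__ == "__main__":
    # Stand-in for the parsed contents of a config such as zerox.yaml.
    cfg = {"distributed_type": "DEEPSPEED", "main_process_port": 0}
    print(resolve_main_process_port(cfg))
```

Note that the port is only known to be free at the moment it is probed, so another process could claim it before training starts. Simply deleting the line from the config, as I did, is the simpler option.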

srzer closed this as completed Jun 12, 2024
