❓ Questions and Help
Dear PyTorch team,

I've been reading the documentation you provide on distributed training. I tried starting training with both `mp.spawn` and `torch.distributed.launch`, and found that `mp.spawn` is slower, mainly during data loading at the start of each epoch. For example, with `torch.distributed.launch` an epoch takes only 8 seconds to train, whereas with `mp.spawn` it takes 17 seconds, the first 9 of which are spent waiting (GPU utilization is 0%).

I also noticed that with `torch.distributed.launch` I can see multiple processes via `ps -ef | grep train_multi`, but with `mp.spawn` I can only see one process.

I don't know whether I'm using it incorrectly and would appreciate your advice. Looking forward to your reply.
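In case it helps, here is a minimal sketch of how I understand the two launch modes are meant to be wired together; the script name `minimal_ddp.py`, the toy `nn.Linear` model, and the master address/port are placeholders, not my actual `train_multi` code:

```python
# minimal_ddp.py -- a rough sketch only; names below are placeholders.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    # torch.distributed.launch exports MASTER_ADDR/MASTER_PORT for its workers;
    # processes created by mp.spawn do not get them, so set defaults here.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    # ... build a DataLoader with DistributedSampler and run the epoch loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    if "LOCAL_RANK" in os.environ:
        # Launched with:
        #   python -m torch.distributed.launch --nproc_per_node=N --use_env minimal_ddp.py
        # The launcher starts one OS process per GPU up front, which is why
        # `ps -ef | grep <script name>` shows several matching processes.
        rank = int(os.environ["LOCAL_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        train(rank, world_size)
    else:
        # Launched with:
        #   python minimal_ddp.py
        # mp.spawn creates the workers itself; the children go through the
        # multiprocessing "spawn" bootstrap, so grepping for the script name
        # typically only matches the parent process.
        world_size = torch.cuda.device_count()
        mp.spawn(train, args=(world_size,), nprocs=world_size)
```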
Environment:
OS: CentOS 7
Python: 3.6
PyTorch: 1.7 (GPU)
CUDA: 10.1
GPU: Tesla V100
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd