Single Node distributed multiprocessing training is not working #447

@kuanhan1

Hi, I am trying to train a model on ImageNet with single-node distributed multiprocessing. Here is the error log:

Start training
Start training
Start training
Start training
Traceback (most recent call last):
  File "main_multiprocessing_distributed.py", line 483, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 141, in spawn
    while not spawn_context.join():
  File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 91, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 11, in _wrap
    fn(i, *args)
  File "/home/han424/projects/PCN/PCN_imagenet/main_multiprocessing_distributed.py", line 254, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/han424/projects/PCN/PCN_imagenet/main_multiprocessing_distributed.py", line 289, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in __init__
    w.start()
  File "/home/libi/anaconda3/lib/python3.7/multiprocessing/process.py", line 110, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children

I would be thankful for any advice on it.
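
For reference, the final assertion in the log comes from Python's own multiprocessing module, not from PyTorch: a process marked as daemonic may not start child processes. The traceback shows the DataLoader calling w.start() for a worker from inside a process launched by mp.spawn, which suggests that process was daemonic. Below is a minimal sketch that reproduces the same assertion with the standard library only. The names _noop, _parent and try_nested are hypothetical stand-ins (for a DataLoader worker, for main_worker and for the launcher), and the sketch assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp

# Assumption: POSIX system with the "fork" start method available.
ctx = mp.get_context("fork")


def _noop():
    """Hypothetical stand-in for a DataLoader worker process."""


def _parent(result_queue):
    """Hypothetical stand-in for main_worker: tries to start its own child."""
    try:
        child = ctx.Process(target=_noop)
        child.start()  # raises AssertionError if the current process is daemonic
        child.join()
        result_queue.put("child started")
    except AssertionError as exc:
        result_queue.put(f"AssertionError: {exc}")


def try_nested(daemon):
    """Run _parent as a daemonic or non-daemonic process and report the outcome."""
    result_queue = ctx.Queue()
    parent = ctx.Process(target=_parent, args=(result_queue,), daemon=daemon)
    parent.start()
    outcome = result_queue.get(timeout=30)
    parent.join()
    return outcome


if __name__ == "__main__":
    print(try_nested(daemon=True))   # the nested start is rejected
    print(try_nested(daemon=False))  # the nested start succeeds
```

If this reading is right, two things may be worth checking, neither of which has been tested against this script: whether the installed PyTorch version launches the mp.spawn processes as daemons (recent releases document a daemon argument to torch.multiprocessing.spawn that defaults to False), and whether constructing the DataLoader with num_workers=0, which loads data in the calling process and starts no children, avoids the error at the cost of slower loading.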
