Hi, I am trying to train a model on ImageNet using single-node distributed multiprocessing. Here is the error log:
Start training
Start training
Start training
Start training
Traceback (most recent call last):
File "main_multiprocessing_distributed.py", line 483, in <module>
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 141, in spawn
while not spawn_context.join():
File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 91, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 11, in _wrap
fn(i, *args)
File "/home/han424/projects/PCN/PCN_imagenet/main_multiprocessing_distributed.py", line 254, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/han424/projects/PCN/PCN_imagenet/main_multiprocessing_distributed.py", line 289, in train
for i, (input, target) in enumerate(train_loader):
File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
return _DataLoaderIter(self)
File "/home/libi/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in __init__
w.start()
File "/home/libi/anaconda3/lib/python3.7/multiprocessing/process.py", line 110, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
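For reference, the final assertion comes from Python's standard `multiprocessing` module rather than from PyTorch itself: a process marked as daemonic may not start children, and the traceback shows the `DataLoader` starting a worker process (`w.start()`) from inside a spawned training process. The following is a minimal sketch that reproduces the same assertion without PyTorch. It assumes a POSIX system where the "fork" start method is available, and the names `_noop`, `try_start_child`, and `run_parent` are hypothetical helpers made up for illustration; they are not part of the training script.

```python
import multiprocessing as mp

# Assumption: POSIX with the "fork" start method (not available on Windows).
ctx = mp.get_context("fork")


def _noop():
    """Hypothetical stand-in for a DataLoader worker; does nothing."""


def try_start_child(queue):
    """Start a child process from the current process and report the outcome."""
    child = ctx.Process(target=_noop)
    try:
        child.start()  # raises AssertionError when the current process is daemonic
        child.join()
        queue.put("ok")
    except AssertionError as exc:
        queue.put(f"AssertionError: {exc}")


def run_parent(daemon):
    """Run try_start_child in a process created with the given daemon flag."""
    queue = ctx.Queue()
    parent = ctx.Process(target=try_start_child, args=(queue,), daemon=daemon)
    parent.start()
    result = queue.get(timeout=30)  # read the result before joining
    parent.join()
    return result


if __name__ == "__main__":
    print("daemon=True :", run_parent(daemon=True))
    print("daemon=False:", run_parent(daemon=False))
```

If this is indeed what happens here, two things may be worth checking, neither of which is confirmed for this setup: whether the training processes end up daemonic (`torch.multiprocessing.spawn` accepts a `daemon` argument), and whether the error disappears when the `DataLoader` is created with `num_workers=0`, which avoids starting worker processes at all. Note that the check is an `assert`, so it is skipped when Python runs with `-O`.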
I would be grateful for any advice on this.