https://github.com/pytorch/examples/blob/main/imagenet/main.py
RuntimeError: Tensors must be CUDA and dense
python main.py --dist-url 'tcp://xxx.xxx.xxx.xx:23456' --multiprocessing-distributed --world-size 1 --rank 0
...
Traceback (most recent call last):
  File "main_spawn.py", line 490, in <module>
    main()
  File "main_spawn.py", line 119, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 265, in main_worker
    acc1 = validate(val_loader, model, criterion, args)
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 376, in validate
    top1.all_reduce()
  File "/home/chris/codes/speed_check_ddp/main_spawn.py", line 425, in all_reduce
    dist.all_reduce(total, dist.ReduceOp.SUM, async_op=True)
  File "/home/chris/anaconda3/envs/plts/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1169, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
Context
Your Environment
LINK
Expected Behavior
The script should run to completion, including the validation step, without raising an error.
Current Behavior
It fails during validation with
RuntimeError: Tensors must be CUDA and dense
when it computes
https://github.com/pytorch/examples/blob/main/imagenet/main.py#L419
Possible Solution
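One possible fix, offered as a sketch rather than a confirmed patch: the error suggests that the tensor passed to dist.all_reduce lives on the CPU while the process group uses the NCCL backend, which only reduces CUDA tensors. Creating the tensor on the worker's GPU should avoid that. The helper names pick_reduce_device and all_reduce_sum_count below are hypothetical and do not exist in main.py. The sketch also assumes the meter only needs its sum and count reduced, and it uses a synchronous all_reduce so the result is ready before it is read.

```python
import torch
import torch.distributed as dist


def pick_reduce_device(backend, gpu_index=None):
    """Hypothetical helper: choose where the reduced tensor must live."""
    if backend == "nccl":
        # NCCL only reduces CUDA tensors, so use this worker's GPU.
        index = torch.cuda.current_device() if gpu_index is None else gpu_index
        return torch.device("cuda", index)
    # Other backends such as gloo can reduce CPU tensors.
    return torch.device("cpu")


def all_reduce_sum_count(total_sum, count, device):
    """Hypothetical helper: sum (total_sum, count) across all ranks."""
    # Build the tensor directly on the target device instead of on the CPU.
    total = torch.tensor([total_sum, count], dtype=torch.float32, device=device)
    if dist.is_available() and dist.is_initialized():
        # Synchronous call: `total` holds the reduced values on return.
        dist.all_reduce(total, op=dist.ReduceOp.SUM)
    new_sum, new_count = total.tolist()
    return new_sum, new_count
```

Inside the meter's all_reduce method this would be used roughly as self.sum, self.count = all_reduce_sum_count(self.sum, self.count, device), followed by recomputing self.avg. When no process group is initialized the helper returns its inputs unchanged, so single-process runs behave as before.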
Steps to Reproduce
...
Failure Logs [if any]