You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am using https://github.com/facebookresearch/maskrcnn-benchmark for object detection, I want to use box convolutions, when I add a box convolution after some layer, training with 1 GPU is OK, while training with multiple GPUs in distributed mode failed, the error is very similar to this issue, I do not know how to fix, have some ideas? @shrubb
2019-02-18 16:09:15,187 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
File "tools/train_net.py", line 172, in <module>
main()
File "tools/train_net.py", line 165, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 74, in train
arguments,
File "/srv/data0/hzxubinbin/projects/maskrcnn_benchmark/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 79, in do_train
losses.backward()
File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
self._queue_reduction(bucket_idx)
File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f0d95248148>, [[tensor([[[[0.]],
1 GPU is too slow, I want to use multiple GPUs
The text was updated successfully, but these errors were encountered:
@jiaozizhao nothing special, just add box convolution as a normal layer. And I get no improvement, the loss gets larger and larger, after try several times, I finally give up by using box convs.
I am using https://github.com/facebookresearch/maskrcnn-benchmark for object detection, I want to use box convolutions, when I add a box convolution after some layer, training with 1 GPU is OK, while training with multiple GPUs in distributed mode failed, the error is very similar to this issue, I do not know how to fix, have some ideas? @shrubb
1 GPU is too slow, I want to use multiple GPUs
The text was updated successfully, but these errors were encountered: