
Multi-GPU Training: distributed error encountered #5

Closed
freesouls opened this issue Feb 18, 2019 · 4 comments

Comments

freesouls commented Feb 18, 2019

I am using https://github.com/facebookresearch/maskrcnn-benchmark for object detection and want to use box convolutions. When I add a box convolution after some layer, training with 1 GPU works fine, but training with multiple GPUs in distributed mode fails. The error is very similar to this issue; I don't know how to fix it. Any ideas? @shrubb

2019-02-18 16:09:15,187 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "tools/train_net.py", line 172, in <module>
    main()
  File "tools/train_net.py", line 165, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 74, in train
    arguments,
  File "/srv/data0/hzxubinbin/projects/maskrcnn_benchmark/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 79, in do_train
    losses.backward()
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f0d95248148>, [[tensor([[[[0.]],
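The `TypeError` above is raised inside the reducer of `torch.nn.parallel.DistributedDataParallel` during `backward()`. For context, this is a minimal sketch of how a model is normally wrapped in DDP; it uses a single process on CPU with the gloo backend purely for illustration, whereas maskrcnn-benchmark launches one process per GPU with the NCCL backend, which is the code path that fails here:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process "distributed" setup on CPU (gloo backend), just to show
# the wrapping; real multi-GPU runs use one process per GPU with NCCL,
# which is where the _queue_reduction path in the traceback is hit.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
ddp_model = nn.parallel.DistributedDataParallel(model)

loss = ddp_model(torch.randn(2, 3, 16, 16)).sum()
loss.backward()  # gradient buckets are all-reduced across processes here
print(all(p.grad is not None for p in ddp_model.parameters()))  # True

dist.destroy_process_group()
```

The traceback shows the reduction machinery being invoked with gradients it cannot batch; nothing in this sketch reproduces the bug, it only locates where in the training loop the failure occurs.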

One GPU is too slow; I want to use multiple GPUs.

@freesouls (Author)

Found another way to use box_convolution properly; closing this issue!

@jiaozizhao

Hi @freesouls, how did you fix the problem? Thanks.

@jiaozizhao

And did you get good performance by using a box convolution layer for object detection?

@freesouls (Author)

@jiaozizhao Nothing special, I just added box convolution as a normal layer. And I got no improvement: the loss grew larger and larger, and after trying several times I finally gave up on using box convs.
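"Adding it as a normal layer" can be sketched roughly like this. A plain `Conv2d` stands in for the box_convolution package's `BoxConv2d` so the sketch runs without the CUDA extension; the stand-in block and the toy backbone are assumptions for illustration, not the setup freesouls actually used:

```python
import torch
import torch.nn as nn

class ExtraBlock(nn.Module):
    """Placeholder for a box-convolution block inserted into a backbone.

    In the real setup the Conv2d below would be replaced by a BoxConv2d
    from the box_convolution package; a plain conv keeps this runnable.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

# Toy backbone with the extra block dropped in "as a normal layer".
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    ExtraBlock(16),              # the inserted layer
    nn.AdaptiveAvgPool2d(1),
)

out = backbone(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 16, 1, 1])
```

Since the block preserves the channel count and spatial size, it slots into an existing architecture without touching the surrounding layers, which matches the "nothing special" description above.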
