
Multi-GPU Training: distributed error encountered #5

Closed
freesouls opened this issue Feb 18, 2019 · 4 comments

Comments

freesouls commented Feb 18, 2019

I am using https://github.com/facebookresearch/maskrcnn-benchmark for object detection and want to use box convolutions. When I add a box convolution after some layer, training with 1 GPU works fine, but training with multiple GPUs in distributed mode fails. The error is very similar to this issue; I don't know how to fix it. Any ideas? @shrubb

2019-02-18 16:09:15,187 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "tools/train_net.py", line 172, in <module>
    main()
  File "tools/train_net.py", line 165, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 74, in train
    arguments,
  File "/srv/data0/hzxubinbin/projects/maskrcnn_benchmark/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 79, in do_train
    losses.backward()
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/hzxubinbin/anaconda3.1812/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f0d95248148>, [[tensor([[[[0.]],
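The `TypeError` above is raised inside the reducer of `torch.nn.parallel.DistributedDataParallel` during `backward()`. For context, this is a minimal sketch of how a model is normally wrapped in DDP; it uses a single process on CPU with the gloo backend purely for illustration, whereas maskrcnn-benchmark launches one process per GPU with the NCCL backend, which is the code path that fails here:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process "distributed" setup on CPU (gloo backend), just to show
# the wrapping; real multi-GPU runs use one process per GPU with NCCL,
# which is where the _queue_reduction path in the traceback is hit.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
ddp_model = nn.parallel.DistributedDataParallel(model)

loss = ddp_model(torch.randn(2, 3, 16, 16)).sum()
loss.backward()  # gradient buckets are all-reduced across processes here
print(all(p.grad is not None for p in ddp_model.parameters()))  # True

dist.destroy_process_group()
```

The traceback shows the reduction machinery being invoked with gradients it cannot batch; nothing in this sketch reproduces the bug, it only locates where in the training loop the failure occurs.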

One GPU is too slow; I want to use multiple GPUs.

@freesouls (Author)

Found another way to use box_convolution properly; closing this issue!

@jiaozizhao

Hi @freesouls, how did you fix the problem? Thanks.

@jiaozizhao

And did you get good performance by using a box convolution layer for object detection?

@freesouls (Author)

@jiaozizhao Nothing special, I just added box convolution as a normal layer. And I got no improvement: the loss grew larger and larger, and after trying several times I finally gave up on using box convs.
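"Adding it as a normal layer" can be sketched roughly like this. A plain `Conv2d` stands in for the box_convolution package's `BoxConv2d` so the sketch runs without the CUDA extension; the stand-in block and the toy backbone are assumptions for illustration, not the setup freesouls actually used:

```python
import torch
import torch.nn as nn

class ExtraBlock(nn.Module):
    """Placeholder for a box-convolution block inserted into a backbone.

    In the real setup the Conv2d below would be replaced by a BoxConv2d
    from the box_convolution package; a plain conv keeps this runnable.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

# Toy backbone with the extra block dropped in "as a normal layer".
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    ExtraBlock(16),              # the inserted layer
    nn.AdaptiveAvgPool2d(1),
)

out = backbone(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 16, 1, 1])
```

Since the block preserves the channel count and spatial size, it slots into an existing architecture without touching the surrounding layers, which matches the "nothing special" description above.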
