
Broadcasting does not work for Quantization aware training with multiple GPUs #37270

Closed
raghuramank100 opened this issue Apr 25, 2020 · 2 comments
Assignees
Labels
oncall: quantization Quantization support in PyTorch triaged This issue has been looked at by a team member and triaged and prioritized into an appropriate module

Comments

raghuramank100 commented Apr 25, 2020

Repro code and error info are at:
https://discuss.pytorch.org/t/quantization-awareness-training-multi-gpu-suport/66106

Snippet of the error:
Traceback (most recent call last):
  File "train_quantization.py", line 258, in <module>
    main(args)
  File "train_quantization.py", line 77, in main
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
  File "xxx/.conda/envs/pytorch1.3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
  File "xxx/.conda/envs/pytorch1.3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
TypeError: _broadcast_coalesced(): incompatible function arguments. The following argument types are supported:

  1. (process_group: torch.distributed.ProcessGroup, tensors: List[at::Tensor], buffer_size: int) -> None

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f943f78dd18>, [tensor([[[[ 1.3185e-02, -4.3213e-03, 1.4823e-02],

Note that the problem is also present in PyTorch 1.5.
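For context, the failing broadcast happens because preparing a model for QAT attaches extra buffers (fake-quantize scale/zero_point and observer state) that DDP tries to broadcast along with the usual parameters and BatchNorm stats. A minimal sketch of inspecting that extra state, without requiring a distributed setup (the model and qconfig here are illustrative, not the original repro):

```python
import torch
import torch.nn as nn

# A toy model standing in for the training script's network.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())

# Attach a QAT qconfig and insert fake-quantize modules in place.
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

# prepare_qat adds buffers such as <module>.weight_fake_quant.scale and
# <module>.weight_fake_quant.zero_point; DDP's constructor gathers all of
# these into the coalesced broadcast that raised the TypeError above.
for name, buf in model.named_buffers():
    print(name, buf.dtype)
```

Running this lists the observer/fake-quantize buffers alongside any ordinary ones, which is the set of tensors DDP hands to `_broadcast_coalesced` when wrapping the model.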

cc @jerryzh168 @jianyuh @dzhulgakov @raghuramank100 @jamesr66a

@raghuramank100 raghuramank100 added oncall: quantization Quantization support in PyTorch triaged This issue has been looked at by a team member and triaged and prioritized into an appropriate module labels Apr 25, 2020

vkuzo commented May 7, 2020

pytorch/vision#2191 updates the tutorial to work better with QAT+DDP. There is still work to do in verifying BN correctness, which will be in a separate PR.


vkuzo commented Jul 8, 2020

#38587, #39031, #38368, #38478 fixed this issue.

@vkuzo vkuzo closed this as completed Jul 8, 2020