
RuntimeError: Unconvertible NCCL type Short when sending torch.cuda.ShortTensor. #74734

Open
HaoKang-Timmy opened this issue Mar 25, 2022 · 8 comments
Labels: module: c10d (Issues/PRs related to collective communications and process groups), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

HaoKang-Timmy commented Mar 25, 2022

🐛 Describe the bug

The bug happens when I try to use dist.send to send a torch.cuda.ShortTensor.
The code is:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def testfunc(rank, nothing):
    print(rank)
    # Two-rank NCCL process group; rank 0 sends, rank 1 receives.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:1214",
        world_size=2,
        rank=rank,
        group_name="test",
    )
    if rank == 0:
        # something = torch.rand([1, 2]).to(0)  # a float tensor works fine
        something = torch.rand([1, 2]).type(torch.cuda.ShortTensor).to(0)
        dist.send(something, 1)
        print(something)
    if rank == 1:
        # something = torch.rand([1, 2]).to(3)
        something = torch.rand([1, 2]).type(torch.cuda.ShortTensor).to(3)
        dist.recv(something, 0)


def main():
    torch.multiprocessing.set_start_method("spawn")
    processes = []
    for i in range(2):
        p = mp.Process(target=testfunc, args=(i, 1))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()


if __name__ == "__main__":
    main()

The error is:

0
1
Process Process-2:
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haokang/distributed/test.py", line 22, in testfunc
    dist.recv(something,0)
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1002, in recv
    pg.recv([tensor], src, tag).wait()
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
RuntimeError: Unconvertible NCCL type Short
  File "/home/haokang/distributed/test.py", line 16, in testfunc
    dist.send(something,1)
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 959, in send
    default_pg.send([tensor], dst, tag).wait()
RuntimeError: Unconvertible NCCL type Short

Versions

Python: 3.8
PyTorch: 1.12.0 (nightly)

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

VitalyFedyunin added the oncall: distributed label on Mar 25, 2022
fduwjj (Contributor) commented Mar 25, 2022

I can repro this issue, and it looks like short tensors are not supported. Not sure if this is the expected behavior. cc: @cbalioglu

fduwjj added the module: c10d label on Mar 25, 2022
fduwjj (Contributor) commented Mar 25, 2022

New feature request is here: #74528

fduwjj (Contributor) commented Mar 31, 2022

cc: @kwen2501, do you mind checking with the NCCL folks on this? From today's oncall triage meeting, it looks like NCCL does not support 16-bit integers? Correct me if I am missing anything. Thanks!

kwen2501 (Contributor) commented Apr 1, 2022

torch.cuda.ShortTensor refers to a 16-bit integer type, which NCCL does not support.
NCCL supports 8-, 32-, and 64-bit signed/unsigned integers instead.

@timmywanttolearn Just curious -- is there a specific use case that asks for 16-bit integer support?

Cc @sjeaugey for visibility.

HaoKang-Timmy (Author) commented
Sure, I use uniform quantization, and I need to send some 16-bit int tensors.
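
For context, a minimal sketch of the kind of uniform quantization I mean (illustrative only, not my actual code; the names quantize_int16/dequantize_int16 and the scale handling are simplified assumptions):

import torch

def quantize_int16(x, scale):
    # Hypothetical uniform quantization: map float values onto the int16 grid
    # and clamp to the representable range before casting.
    q = torch.clamp(torch.round(x / scale), -32768, 32767)
    return q.to(torch.int16)

def dequantize_int16(q, scale):
    # Reverse mapping back to float on the receiving side.
    return q.to(torch.float32) * scale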


sjeaugey (Contributor) commented Apr 5, 2022

Indeed, NCCL does not support 16-bit integers at the moment, but if the goal is to do send/recv, there is no real need to wait for specific support. PyTorch can simply implement it using uint8 and doubling the count. We do not implement type-specific NCCL kernels except for reductions; they all map to int8 in the end, simply multiplying the count by the datatype size.
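
Following that suggestion, a user-side workaround should already be possible by reinterpreting the int16 tensor as uint8 before sending and viewing it back after receiving. A minimal sketch, assuming both ranks agree on the tensor's shape and device and the tensor is contiguous (the helper names send_int16/recv_int16 are illustrative, not a PyTorch API):

import torch
import torch.distributed as dist

def send_int16(t, dst):
    # Reinterpret the contiguous int16 buffer as uint8 (same memory, twice
    # the element count), which NCCL can send directly.
    dist.send(t.view(torch.uint8), dst)

def recv_int16(shape, device, src):
    # Allocate an int16 buffer and receive into its uint8 view.
    out = torch.empty(shape, dtype=torch.int16, device=device)
    dist.recv(out.view(torch.uint8), src)
    return out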

HaoKang-Timmy (Author) commented
I got it. Thank you.
