
RuntimeError: Unconvertible NCCL type Short when sending torch.cuda.ShortTensor. #74734

Open
HaoKang-Timmy opened this issue Mar 25, 2022 · 8 comments
Labels: module: c10d (Issues/PRs related to collective communications and process groups), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

HaoKang-Timmy commented Mar 25, 2022

🐛 Describe the bug

The bug happens when I try to use dist.send to send a torch.cuda.ShortTensor.
The code is:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def testfunc(rank, nothing):
    print(rank)
    # Two-rank NCCL process group; rank 0 sends, rank 1 receives.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:1214",
        world_size=2,
        rank=rank,
        group_name="test",
    )
    if rank == 0:
        # something = torch.rand([1, 2]).to(0)  # a float tensor works fine
        something = torch.rand([1, 2]).type(torch.cuda.ShortTensor).to(0)
        dist.send(something, 1)
        print(something)
    if rank == 1:
        # something = torch.rand([1, 2]).to(3)
        something = torch.rand([1, 2]).type(torch.cuda.ShortTensor).to(3)
        dist.recv(something, 0)


def main():
    torch.multiprocessing.set_start_method("spawn")
    processes = []
    for i in range(2):
        p = mp.Process(target=testfunc, args=(i, 1))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()


if __name__ == "__main__":
    main()

The error is:

0
1
Process Process-2:
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haokang/distributed/test.py", line 22, in testfunc
    dist.recv(something,0)
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1002, in recv
    pg.recv([tensor], src, tag).wait()
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
RuntimeError: Unconvertible NCCL type Short
  File "/home/haokang/distributed/test.py", line 16, in testfunc
    dist.send(something,1)
  File "/home/haokang/anaconda3/envs/kh3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 959, in send
    default_pg.send([tensor], dst, tag).wait()
RuntimeError: Unconvertible NCCL type Short

Versions

Python: 3.8
PyTorch: 1.12.0 (nightly)

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

VitalyFedyunin added the oncall: distributed label on Mar 25, 2022
fduwjj (Contributor) commented Mar 25, 2022

I can repro this issue, and it looks like short tensors are not supported. Not sure if this is the expected behavior. cc: @cbalioglu

fduwjj added the module: c10d label on Mar 25, 2022
fduwjj (Contributor) commented Mar 25, 2022

New feature request is here: #74528

fduwjj (Contributor) commented Mar 31, 2022

cc: @kwen2501, do you mind checking with the NCCL folks on this? From today's oncall triage meeting, it looks like NCCL does not support 16-bit integers? Correct me if I am missing anything. Thanks!

kwen2501 (Contributor) commented Apr 1, 2022

torch.cuda.ShortTensor refers to a 16-bit integer type, which NCCL does not support.
NCCL supports 8-, 32-, and 64-bit signed/unsigned integers instead.

@timmywanttolearn Just curious -- is there a specific use case that asks for 16-bit integer support?

Cc @sjeaugey for visibility.

HaoKang-Timmy (Author) commented
Sure, I use uniform quantization, and I need to send some 16-bit int tensors.
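
For context, a minimal sketch of the kind of uniform quantization I mean (illustrative only, not my actual code; the names quantize_int16/dequantize_int16 and the scale handling are simplified assumptions):

import torch

def quantize_int16(x, scale):
    # Hypothetical uniform quantization: map float values onto the int16 grid
    # and clamp to the representable range before casting.
    q = torch.clamp(torch.round(x / scale), -32768, 32767)
    return q.to(torch.int16)

def dequantize_int16(q, scale):
    # Reverse mapping back to float on the receiving side.
    return q.to(torch.float32) * scale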


sjeaugey (Contributor) commented Apr 5, 2022

Indeed, NCCL does not support 16-bit integers at the moment, but if the goal is to do send/recv, there is no real need to wait for specific support. PyTorch can simply implement it using uint8 and doubling the count. We do not implement type-specific NCCL kernels except for reductions; they all map to int8 in the end, simply multiplying the count by the datatype size.
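
Following that suggestion, a user-side workaround should already be possible by reinterpreting the int16 tensor as uint8 before sending and viewing it back after receiving. A minimal sketch, assuming both ranks agree on the tensor's shape and device and the tensor is contiguous (the helper names send_int16/recv_int16 are illustrative, not a PyTorch API):

import torch
import torch.distributed as dist

def send_int16(t, dst):
    # Reinterpret the contiguous int16 buffer as uint8 (same memory, twice
    # the element count), which NCCL can send directly.
    dist.send(t.view(torch.uint8), dst)

def recv_int16(shape, device, src):
    # Allocate an int16 buffer and receive into its uint8 view.
    out = torch.empty(shape, dtype=torch.int16, device=device)
    dist.recv(out.view(torch.uint8), src)
    return out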

HaoKang-Timmy (Author) commented
I got it. Thank you.
