Cannot share cuda.Tensor on GPU 1 among processes. #1707

Closed
yuandong-tian opened this issue Jun 2, 2017 · 1 comment

yuandong-tian commented Jun 2, 2017

Here is a minimal example. Is there any solution?

THCudaCheck FAIL file=/py/conda-bld/pytorch_1493670682084/work/torch/csrc/generic/StorageSharing.cpp line=248 error=11 : invalid argument
Traceback (most recent call last):
File "/home/yuandong/anaconda3/lib/python3.5/multiprocessing/queues.py", line 241, in _feed
obj = ForkingPickler.dumps(obj)
File "/home/yuandong/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
File "/home/yuandong/anaconda3/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 104, in reduce_storage
metadata = storage._share_cuda_()
RuntimeError: cuda runtime error (11) : invalid argument at /py/conda-bld/pytorch_1493670682084/work/torch/csrc/generic/StorageSharing.cpp:248

import torch
import torch.multiprocessing as _mp

# 'spawn' is required for CUDA tensors: a forked child cannot reuse the
# parent's CUDA context.
mp = _mp.get_context('spawn')

def process_main(idx, q, b):
    m = q.get()          # receive the dict of shared tensors once
    b.wait()
    for i in range(10):
        q.get()          # wait for the parent's "go" token
        print("[%d] %f, %f" % (idx, m["a"][0, 0], m["b"][2, 3]))
        b.wait()

if __name__ == "__main__":
    # "a" lives on GPU 1 (the failing case); "b" is a CPU tensor.
    m = dict(a=torch.FloatTensor(2, 3).cuda(1), b=torch.FloatTensor(3, 4))
    total_process = 3

    q = mp.Queue()
    b = mp.Barrier(total_process)

    for i in range(total_process - 1):
        proc = mp.Process(target=process_main, args=(i, q, b))
        proc.start()

    # Pickling the dict for the queue is what triggers the error above.
    for i in range(total_process - 1):
        q.put(m)
    b.wait()

    for i in range(10):
        m["a"][0, 0] = i
        m["b"][2, 3] = 2 * i
        for j in range(total_process - 1):
            q.put(1)
        b.wait()
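
As a point of reference, the failure can be reproduced below the Python layer. The following is a minimal sketch in plain CUDA C (hypothetical, assuming two GPUs) of the call pattern the reducer performs: on the CUDA runtime from this report, requesting an IPC handle for device 1 memory while device 0 is current fails with error 11, and switching to the owning device first makes it succeed. Newer drivers may behave differently.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  void *p;
  cudaIpcMemHandle_t h;

  cudaSetDevice(1);
  cudaMalloc(&p, 24);   /* allocation owned by device 1 */

  cudaSetDevice(0);     /* wrong device current, as in the bug */
  cudaError_t err = cudaIpcGetMemHandle(&h, p);
  printf("device 0 current: %s\n", cudaGetErrorString(err));

  cudaSetDevice(1);     /* owning device current: succeeds */
  err = cudaIpcGetMemHandle(&h, p);
  printf("device 1 current: %s\n", cudaGetErrorString(err));

  cudaFree(p);
  return 0;
}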
ngimel (Collaborator) commented Jun 2, 2017

That happens because the context is incorrectly set to the 0-th device when the storage is shared.
Replacing this https://github.com/pytorch/pytorch/blob/master/torch/multiprocessing/reductions.py#L104-L106
with

        with torch.cuda.device(storage.get_device()):
            metadata = storage._share_cuda_()
            cache_key = metadata[1]
            rebuild = rebuild_storage_cuda
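
(For readers following along: torch.cuda.device(...) is a context manager that makes the given device current inside the block and restores the previous device on exit, so the IPC handle created by _share_cuda_() is opened on the device that actually owns the storage.)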

gets you further, but there is still an error when the shared tensor gets freed (also because of an incorrectly set device):

terminate called after throwing an instance of 'THException'
  what():  cuda runtime error (11) : invalid argument at /tmp/pip-95zy2atz-build/torch/lib/THC/generic/THCStorage.c:182
THCudaCheck FAIL file=/tmp/pip-95zy2atz-build/torch/lib/THC/generic/THCStorage.c line=182 error=11 : invalid argument
terminate called after throwing an instance of 'THException'
  what():  cuda runtime error (11) : invalid argument at /tmp/pip-95zy2atz-build/torch/lib/THC/generic/THCStorage.c:182

colesbury self-assigned this Jun 5, 2017
colesbury added a commit to colesbury/pytorch that referenced this issue Jun 5, 2017
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocator's
context, which was previously always NULL.

Fixes pytorch#1707
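
To make the fix concrete, here is a minimal sketch (hypothetical names; not the actual THC source) of the idea described in the commit message: record the owning device in the allocator context, then switch to it around cudaIpcCloseMemHandle on the free path.

#include <cuda_runtime.h>

/* Hypothetical allocator context: the device is recorded when the IPC
 * allocation is opened, instead of leaving the context NULL. */
typedef struct {
  int device;
} IpcAllocCtx;

/* Free path: cudaIpcCloseMemHandle must run with the owning device
 * current, otherwise it can fail with error 11 (invalid argument). */
static cudaError_t ipc_free(IpcAllocCtx *ctx, void *devPtr) {
  int prev;
  cudaError_t err = cudaGetDevice(&prev);
  if (err != cudaSuccess) return err;
  err = cudaSetDevice(ctx->device);
  if (err != cudaSuccess) return err;
  err = cudaIpcCloseMemHandle(devPtr);
  cudaSetDevice(prev);  /* restore the caller's device */
  return err;
}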
soumith closed this as completed in 85a95d8 Jun 5, 2017
houseroad added a commit to houseroad/pytorch that referenced this issue Jan 4, 2019
…b18ba1 (pytorch#15739)

Summary:
Pull Request resolved: pytorch#15739

Previous import was 765f5ee823a67a866f4bd28a9860e81f3c811ce8

Included changes:
- **[8384c78](onnx/onnx@8384c78)**: add constantofshape (pytorch#1582) <Rui Zhu>
- **[9afc06c](onnx/onnx@9afc06c)**: Set symbol visibility to hidden for non-Windows (pytorch#1707) <Paul Jesse Hellemn>
- **[6f8a9f0](onnx/onnx@6f8a9f0)**: Revert "Add NonMaxSupression operator (pytorch#1695)" (pytorch#1702) <Lu Fang>
- **[8b89544](onnx/onnx@8b89544)**: Add NonMaxSupression operator (pytorch#1695) <Hector Li>
- **[0a7cc48](onnx/onnx@0a7cc48)**: Add bfloat16 support. (pytorch#1699) <Dmitri Smirnov>
- **[da7c50c](onnx/onnx@da7c50c)**: ONNX does not maintain versions for experimental ops (pytorch#1696) <Ke Zhang>
- **[0c8d857](onnx/onnx@0c8d857)**: Correct type of value_info in Graph (pytorch#1694) <Maik Riechert>
- **[f612532](onnx/onnx@f612532)**: Fix typos (pytorch#1686) <Eundoo Song>

Reviewed By: zrphercule

Differential Revision: D13581674

fbshipit-source-id: a961667184b09d2822815ba5d3fa4198a4c57e88
facebook-github-bot pushed a commit that referenced this issue Jan 4, 2019
mrshenli pushed a commit to mrshenli/pytorch that referenced this issue Jan 6, 2019
jjsjann123 pushed a commit to jjsjann123/pytorch that referenced this issue May 17, 2022