Cannot share cuda.Tensor on GPU 1 among processes. #1707

Closed
yuandong-tian opened this issue Jun 2, 2017 · 1 comment

yuandong-tian commented Jun 2, 2017

Here is a minimal example. Is there any solution?

THCudaCheck FAIL file=/py/conda-bld/pytorch_1493670682084/work/torch/csrc/generic/StorageSharing.cpp line=248 error=11 : invalid argument
Traceback (most recent call last):
File "/home/yuandong/anaconda3/lib/python3.5/multiprocessing/queues.py", line 241, in _feed
obj = ForkingPickler.dumps(obj)
File "/home/yuandong/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
File "/home/yuandong/anaconda3/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 104, in reduce_storage
metadata = storage._share_cuda_()
RuntimeError: cuda runtime error (11) : invalid argument at /py/conda-bld/pytorch_1493670682084/work/torch/csrc/generic/StorageSharing.cpp:248

import torch
import torch.multiprocessing as _mp

# 'spawn' is required for CUDA tensors: a forked child cannot reuse the
# parent's CUDA context.
mp = _mp.get_context('spawn')

def process_main(idx, q, b):
    m = q.get()          # receive the dict of shared tensors once
    b.wait()
    for i in range(10):
        q.get()          # wait for the parent's "go" token
        print("[%d] %f, %f" % (idx, m["a"][0, 0], m["b"][2, 3]))
        b.wait()

if __name__ == "__main__":
    # "a" lives on GPU 1 (the failing case); "b" is a CPU tensor.
    m = dict(a=torch.FloatTensor(2, 3).cuda(1), b=torch.FloatTensor(3, 4))
    total_process = 3

    q = mp.Queue()
    b = mp.Barrier(total_process)

    for i in range(total_process - 1):
        proc = mp.Process(target=process_main, args=(i, q, b))
        proc.start()

    # Pickling the dict for the queue is what triggers the error above.
    for i in range(total_process - 1):
        q.put(m)
    b.wait()

    for i in range(10):
        m["a"][0, 0] = i
        m["b"][2, 3] = 2 * i
        for j in range(total_process - 1):
            q.put(1)
        b.wait()
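
As a point of reference, the failure can be reproduced below the Python layer. The following is a minimal sketch in plain CUDA C (hypothetical, assuming two GPUs) of the call pattern the reducer performs: on the CUDA runtime from this report, requesting an IPC handle for device 1 memory while device 0 is current fails with error 11, and switching to the owning device first makes it succeed. Newer drivers may behave differently.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  void *p;
  cudaIpcMemHandle_t h;

  cudaSetDevice(1);
  cudaMalloc(&p, 24);   /* allocation owned by device 1 */

  cudaSetDevice(0);     /* wrong device current, as in the bug */
  cudaError_t err = cudaIpcGetMemHandle(&h, p);
  printf("device 0 current: %s\n", cudaGetErrorString(err));

  cudaSetDevice(1);     /* owning device current: succeeds */
  err = cudaIpcGetMemHandle(&h, p);
  printf("device 1 current: %s\n", cudaGetErrorString(err));

  cudaFree(p);
  return 0;
}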
ngimel (Collaborator) commented Jun 2, 2017

That happens because the context is incorrectly set to the 0-th device when the storage is shared.
Replacing this https://github.com/pytorch/pytorch/blob/master/torch/multiprocessing/reductions.py#L104-L106
with

        with torch.cuda.device(storage.get_device()):
            metadata = storage._share_cuda_()
            cache_key = metadata[1]
            rebuild = rebuild_storage_cuda
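
(For readers following along: torch.cuda.device(...) is a context manager that makes the given device current inside the block and restores the previous device on exit, so the IPC handle created by _share_cuda_() is opened on the device that actually owns the storage.)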

gets you further, but there is still an error when the shared tensor gets freed (also because of an incorrectly set device):

terminate called after throwing an instance of 'THException'
  what():  cuda runtime error (11) : invalid argument at /tmp/pip-95zy2atz-build/torch/lib/THC/generic/THCStorage.c:182
THCudaCheck FAIL file=/tmp/pip-95zy2atz-build/torch/lib/THC/generic/THCStorage.c line=182 error=11 : invalid argument
terminate called after throwing an instance of 'THException'
  what():  cuda runtime error (11) : invalid argument at /tmp/pip-95zy2atz-build/torch/lib/THC/generic/THCStorage.c:182

colesbury self-assigned this Jun 5, 2017
colesbury added a commit to colesbury/pytorch that referenced this issue Jun 5, 2017
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocator's
context, which was previously always NULL.

Fixes pytorch#1707
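
To make the fix concrete, here is a minimal sketch (hypothetical names; not the actual THC source) of the idea described in the commit message: record the owning device in the allocator context, then switch to it around cudaIpcCloseMemHandle on the free path.

#include <cuda_runtime.h>

/* Hypothetical allocator context: the device is recorded when the IPC
 * allocation is opened, instead of leaving the context NULL. */
typedef struct {
  int device;
} IpcAllocCtx;

/* Free path: cudaIpcCloseMemHandle must run with the owning device
 * current, otherwise it can fail with error 11 (invalid argument). */
static cudaError_t ipc_free(IpcAllocCtx *ctx, void *devPtr) {
  int prev;
  cudaError_t err = cudaGetDevice(&prev);
  if (err != cudaSuccess) return err;
  err = cudaSetDevice(ctx->device);
  if (err != cudaSuccess) return err;
  err = cudaIpcCloseMemHandle(devPtr);
  cudaSetDevice(prev);  /* restore the caller's device */
  return err;
}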
soumith closed this as completed in 85a95d8 Jun 5, 2017
houseroad added a commit to houseroad/pytorch that referenced this issue Jan 4, 2019
…b18ba1 (pytorch#15739)

Summary:
Pull Request resolved: pytorch#15739

Previous import was 765f5ee823a67a866f4bd28a9860e81f3c811ce8

Included changes:
- **[8384c78](onnx/onnx@8384c78)**: add constantofshape (pytorch#1582) <Rui Zhu>
- **[9afc06c](onnx/onnx@9afc06c)**: Set symbol visibility to hidden for non-Windows (pytorch#1707) <Paul Jesse Hellemn>
- **[6f8a9f0](onnx/onnx@6f8a9f0)**: Revert "Add NonMaxSupression operator (pytorch#1695)" (pytorch#1702) <Lu Fang>
- **[8b89544](onnx/onnx@8b89544)**: Add NonMaxSupression operator (pytorch#1695) <Hector Li>
- **[0a7cc48](onnx/onnx@0a7cc48)**: Add bfloat16 support. (pytorch#1699) <Dmitri Smirnov>
- **[da7c50c](onnx/onnx@da7c50c)**: ONNX does not maintain versions for experimental ops (pytorch#1696) <Ke Zhang>
- **[0c8d857](onnx/onnx@0c8d857)**: Correct type of value_info in Graph (pytorch#1694) <Maik Riechert>
- **[f612532](onnx/onnx@f612532)**: Fix typos (pytorch#1686) <Eundoo Song>

Reviewed By: zrphercule

Differential Revision: D13581674

fbshipit-source-id: a961667184b09d2822815ba5d3fa4198a4c57e88
facebook-github-bot pushed a commit that referenced this issue Jan 4, 2019
mrshenli pushed a commit to mrshenli/pytorch that referenced this issue Jan 6, 2019
jjsjann123 pushed a commit to jjsjann123/pytorch that referenced this issue May 17, 2022