
Problem with multiprocessing, custom __getstate__ with Tensors and forkserver #32351

Open
fmassa opened this issue Jan 17, 2020 · 4 comments

Labels
module: multiprocessing - Related to torch.multiprocessing
module: serialization - Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@fmassa (Member) commented Jan 17, 2020

🐛 Bug

TL;DR: multiprocessing with forkserver / spawn + a custom class whose __getstate__ returns tensors gives

RuntimeError: unable to resize file <filename not specified> to the right size

Context

Suppose the user created a custom class which holds a (large) list of torch tensors inside.
In order to handle the large list of tensors as a single tensor (and thereby overcome the system limit on the number of open file descriptors, since each shared tensor consumes one), the user implemented a custom __getstate__ and __setstate__.
Pretty smart: this avoids the file-descriptor limitation without having to change the internal representation of the class!

Everything works fine until the multiprocessing start method is changed from fork to either forkserver or spawn, in which case we get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/home/fmassa/.conda/envs/classyvision/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/private/home/fmassa/.conda/envs/classyvision/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/private/home/fmassa/.conda/envs/classyvision/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 299, in rebuild_storage_fd
    storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size

To Reproduce

The following code reproduces the problem. I illustrate the issue in two cases (which are actually the same underneath): one using the DataLoader, and another just creating a new process with forkserver.

import torch
import torch.utils.data


class D:

    def __init__(self):
        # the size of the tensor doesn't matter here
        self.x = torch.rand(2).unbind(0)

    def __getitem__(self, idx):
        return self.x[idx]

    def __len__(self):
        return len(self.x)

    # let's handle our (possibly large) list of tensors as a single tensor
    def __getstate__(self):
        x = torch.stack(self.x)
        return x

    def __setstate__(self, d):
        self.x = d.unbind(0)


def loop(ds):
    for i in range(2):
        print(ds[i])


if __name__ == "__main__":
    torch.multiprocessing.set_start_method('forkserver')
    d = D()
    # two cases
    if False:
        # users normally hit this error in the dataloader
        # but it is not specific to it
        dl = torch.utils.data.DataLoader(d, batch_size=4, num_workers=1)
        it = iter(dl)
        print(next(it))
    else:
        # as we can see, it also happens when just
        # creating a process directly
        import multiprocessing
        w = multiprocessing.Process(
            target=loop,
            args=(d,))
        w.start()
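
As a side note, one possible workaround (not discussed in this issue; just a hedged sketch) is to have __getstate__ return a plain numpy array instead of a tensor, so the ForkingPickler serializes the state by value rather than through shared-memory file descriptors, at the cost of an extra copy. The class name DByValue below is made up for illustration.

import numpy as np
import torch


class DByValue:

    def __init__(self):
        self.x = torch.rand(2).unbind(0)

    def __getitem__(self, idx):
        return self.x[idx]

    def __len__(self):
        return len(self.x)

    # returning a numpy array sidesteps torch's shared-memory reducer,
    # so the data is copied into the pickle stream by value
    def __getstate__(self):
        return torch.stack(self.x).numpy()

    def __setstate__(self, state):
        self.x = torch.from_numpy(state).unbind(0)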

Expected behavior

No error? :-)

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.88
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 418.116.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] torch==1.4.0
[pip] torchvision==0.5.0a0+07cbb46
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda9.2.148_cudnn7.6.3_0    pytorch-nightly
[conda] torchvision               0.5.0a0+07cbb46           <pip>

Additional context

This has been reported in the past (see #20409), but without a repro, so it was hard to act on.

cc @ssnl

@fmassa added the module: multiprocessing, module: serialization, and module: dataloader labels Jan 17, 2020
@colesbury added the triaged label Jan 17, 2020
@ssnl removed the module: dataloader label Jan 17, 2020
@fmassa (Member, Author) commented Jan 17, 2020

@ssnl sorry about adding the module: dataloader tag, it was just that there seemed to be no one responsible for both the multiprocessing and serialization modules :-)

@ssnl (Collaborator) commented Jan 17, 2020

@fmassa no worries at all :) thanks for explaining!

@dzhulgakov (Collaborator) commented

It's probably some weird interaction between the forking pickler and PyTorch tensor serialization. @driazati might have some idea.

@VoVAllen commented
I found that if the tensor created in __getstate__(self) is kept as an instance attribute, it works.

import torch as th


class B:
    def __getstate__(self):  # this works
        self.a = th.tensor([1, 2, 3])
        return self.a

    # def __getstate__(self):  # this doesn't work
    #     a = th.tensor([1, 2, 3])
    #     return a

    def __setstate__(self, state):
        print("lalalal")

I guess the problem is an object reference-counting issue: the temporary tensor may already be freed before the other process gets to read it.
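
A minimal sketch of applying that observation to the class D from the repro above: keep a reference to the stacked tensor on self so it stays alive until the child process has received it. This is only an illustration of the comment above (with a made-up name DKeepAlive), not a verified fix.

import torch


class DKeepAlive:

    def __init__(self):
        self.x = torch.rand(2).unbind(0)

    # keeping the stacked tensor as an instance attribute means it is not
    # garbage-collected immediately after __getstate__ returns
    def __getstate__(self):
        self._stacked = torch.stack(self.x)
        return self._stacked

    def __setstate__(self, d):
        self.x = d.unbind(0)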

stephenyan1231 added a commit to stephenyan1231/vision that referenced this issue Dec 3, 2020
Summary:
There are issues with multiprocessing, custom __getstate__ with Tensors and forkserver.

See details in PR below

- pytorch/pytorch#32351

To temporarily mitigate it, expose the `_multiprocessing_context` argument.

Differential Revision: D24644136

fbshipit-source-id: ae786f363f5406d268770aa83cd9d86395194ee6
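
For reference, the mitigation the commit points at presumably amounts to pinning the DataLoader workers to the fork start method via the public multiprocessing_context argument, even when the global start method is forkserver or spawn. A hedged sketch, reusing the D class from the repro above:

import torch
import torch.utils.data

if __name__ == "__main__":
    torch.multiprocessing.set_start_method('forkserver')
    d = D()
    # forcing 'fork' for the worker processes means the dataset is inherited
    # rather than pickled, avoiding the "unable to resize file" error
    dl = torch.utils.data.DataLoader(
        d, batch_size=4, num_workers=1, multiprocessing_context='fork')
    print(next(iter(dl)))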