
Problem with multiprocessing, custom __getstate__ with Tensors and forkserver #32351

Open
fmassa opened this issue Jan 17, 2020 · 4 comments

Labels
module: multiprocessing - Related to torch.multiprocessing
module: serialization - Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@fmassa (Member) commented Jan 17, 2020

🐛 Bug

TL;DR: multiprocessing with forkserver / spawn + a custom class whose __getstate__ returns tensors gives

RuntimeError: unable to resize file <filename not specified> to the right size

Context

Suppose the user created a custom class which holds a (large) list of torch tensors inside.
In order to handle the large list of tensors as a single tensor (and thereby overcome the system limit on the number of open file descriptors, since each shared tensor consumes one), the user implemented a custom __getstate__ and __setstate__.
Pretty smart: this avoids the file-descriptor limitation without having to change the internal representation of the class!

Everything works fine until the multiprocessing start method is changed from fork to either forkserver or spawn, in which case we get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/home/fmassa/.conda/envs/classyvision/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/private/home/fmassa/.conda/envs/classyvision/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/private/home/fmassa/.conda/envs/classyvision/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 299, in rebuild_storage_fd
    storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to resize file <filename not specified> to the right size

To Reproduce

The following code reproduces the problem. I illustrate the issue in two cases (which are actually the same underneath): one using the DataLoader, and another just creating a new process with forkserver.

import torch
import torch.utils.data


class D:

    def __init__(self):
        # the size of the tensor doesn't matter here
        self.x = torch.rand(2).unbind(0)

    def __getitem__(self, idx):
        return self.x[idx]

    def __len__(self):
        return len(self.x)

    # let's handle our (possibly large) list of tensors as a single tensor
    def __getstate__(self):
        x = torch.stack(self.x)
        return x

    def __setstate__(self, d):
        self.x = d.unbind(0)


def loop(ds):
    for i in range(2):
        print(ds[i])


if __name__ == "__main__":
    torch.multiprocessing.set_start_method('forkserver')
    d = D()
    # two cases
    if False:
        # users normally hit this error in the dataloader
        # but it is not specific to it
        dl = torch.utils.data.DataLoader(d, batch_size=4, num_workers=1)
        it = iter(dl)
        print(next(it))
    else:
        # as we can see, it also happens when just
        # creating a process directly
        import multiprocessing
        w = multiprocessing.Process(
            target=loop,
            args=(d,))
        w.start()
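
As a side note, one possible workaround (not discussed in this issue; just a hedged sketch) is to have __getstate__ return a plain numpy array instead of a tensor, so the ForkingPickler serializes the state by value rather than through shared-memory file descriptors, at the cost of an extra copy. The class name DByValue below is made up for illustration.

import numpy as np
import torch


class DByValue:

    def __init__(self):
        self.x = torch.rand(2).unbind(0)

    def __getitem__(self, idx):
        return self.x[idx]

    def __len__(self):
        return len(self.x)

    # returning a numpy array sidesteps torch's shared-memory reducer,
    # so the data is copied into the pickle stream by value
    def __getstate__(self):
        return torch.stack(self.x).numpy()

    def __setstate__(self, state):
        self.x = torch.from_numpy(state).unbind(0)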

Expected behavior

No error? :-)

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.88
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 418.116.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] torch==1.4.0
[pip] torchvision==0.5.0a0+07cbb46
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda9.2.148_cudnn7.6.3_0    pytorch-nightly
[conda] torchvision               0.5.0a0+07cbb46           <pip>

Additional context

This has been reported in the past (see #20409), but without a repro, so it was hard to act on.

cc @ssnl

@fmassa added the module: multiprocessing, module: serialization, and module: dataloader labels Jan 17, 2020
@colesbury added the triaged label Jan 17, 2020
@ssnl removed the module: dataloader label Jan 17, 2020
@fmassa (Member, Author) commented Jan 17, 2020

@ssnl sorry about adding the module: dataloader tag, it was just that there seemed to be no one responsible for both the multiprocessing and serialization modules :-)

@ssnl (Collaborator) commented Jan 17, 2020

@fmassa no worries at all :) thanks for explaining!

@dzhulgakov (Collaborator) commented

It's probably some weird interaction between the forking pickler and PyTorch tensor serialization. @driazati might have some idea.

@VoVAllen commented
I found that if the tensor created in __getstate__(self) is kept as an instance attribute, it works.

import torch as th


class B:
    def __getstate__(self):  # this works
        self.a = th.tensor([1, 2, 3])
        return self.a

    # def __getstate__(self):  # this doesn't work
    #     a = th.tensor([1, 2, 3])
    #     return a

    def __setstate__(self, state):
        print("lalalal")

I guess the problem is an object reference-counting issue: the temporary tensor may already be freed before the other process gets to read it.
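
A minimal sketch of applying that observation to the class D from the repro above: keep a reference to the stacked tensor on self so it stays alive until the child process has received it. This is only an illustration of the comment above (with a made-up name DKeepAlive), not a verified fix.

import torch


class DKeepAlive:

    def __init__(self):
        self.x = torch.rand(2).unbind(0)

    # keeping the stacked tensor as an instance attribute means it is not
    # garbage-collected immediately after __getstate__ returns
    def __getstate__(self):
        self._stacked = torch.stack(self.x)
        return self._stacked

    def __setstate__(self, d):
        self.x = d.unbind(0)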

stephenyan1231 added a commit to stephenyan1231/vision that referenced this issue Dec 3, 2020
Summary:
There are issues with multiprocessing, custom __getstate__ with Tensors and forkserver.

See details in PR below

- pytorch/pytorch#32351

To temporarily mitigate it, expose the `_multiprocessing_context` argument.

Differential Revision: D24644136

fbshipit-source-id: ae786f363f5406d268770aa83cd9d86395194ee6
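
For reference, the mitigation the commit points at presumably amounts to pinning the DataLoader workers to the fork start method via the public multiprocessing_context argument, even when the global start method is forkserver or spawn. A hedged sketch, reusing the D class from the repro above:

import torch
import torch.utils.data

if __name__ == "__main__":
    torch.multiprocessing.set_start_method('forkserver')
    d = D()
    # forcing 'fork' for the worker processes means the dataset is inherited
    # rather than pickled, avoiding the "unable to resize file" error
    dl = torch.utils.data.DataLoader(
        d, batch_size=4, num_workers=1, multiprocessing_context='fork')
    print(next(iter(dl)))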