DataLoader.__del__ bug with mpi4py #5355

@stsievert

I'm creating a torch.utils.data.DataLoader in my script (train.py), and I get an error just as the script is about to exit. I am using MPI through mpi4py, not the MPI backend of PyTorch distributed. The error only occurs when the script runs under mpirun (mpirun python train.py), not under plain python train.py, which completes successfully.

Nothing in the traceback suggests this is an issue with my code, and I haven't modified the DataLoader at all. I've been running this code for a couple of months in the same environment, and I haven't seen this bug pop up before.

That is, this command does not raise an error:

$ python train.py --epochs=0
# works fine; no error

However, this command does raise an error:

$ mpirun python train.py --epochs=0
# normal output, almost finished running script
# produces traceback below

Traceback:

Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fd7da6d1668>>
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 333, in __del__
    self._shutdown_workers()
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 319, in _shutdown_workers
    self.data_queue.get()
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/queues.py", line 345, in get
    return _ForkingPickler.loads(res)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory

  • PyTorch version: 0.3.1.post2
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.6.1
  • OS/CUDA/cuDNN version: A modification of the Deep Learning AMI
  • GPU models and configuration: Amazon p2.xlarge instance
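
A note on the "Exception ignored in:" header in the traceback above: Python does not propagate exceptions raised inside `__del__`. It reports them on stderr and keeps going. The following is a minimal, stdlib-only sketch of that behavior. `FakeIter` is a hypothetical stand-in for `DataLoaderIter`, and the sketch does not reproduce the mpirun-specific failure; it only illustrates why the message shows up at teardown without a normal exception being raised.

```python
import subprocess
import sys

# Child script: FakeIter is a hypothetical stand-in for DataLoaderIter.
CHILD_SOURCE = """
class FakeIter:
    def __del__(self):
        # Simulate cleanup that fails, like the s.connect(address) call
        # at the bottom of the traceback.
        raise FileNotFoundError(2, "No such file or directory")

it = FakeIter()
del it  # refcount drops to zero, so __del__ runs here and raises
print("script finished")
"""


def run_child():
    """Run the child script in a fresh interpreter and collect its results."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


returncode, stdout_text, stderr_text = run_child()
print("exit code:", returncode)
print("stdout:", stdout_text)
print("stderr:", stderr_text)
```

In this sketch the child still reaches its final print and exits normally; the failure inside `__del__` only appears as a report on stderr. Whether the real mpirun run behaves the same way in every respect is an assumption, not something the traceback confirms.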
