Closed
Description
I'm creating a torch.utils.data.DataLoader in my script (train.py), and I get an error when the script is close to finishing. I am using MPI through mpi4py, not PyTorch's MPI distributed backend. The error only happens when the script runs under mpirun python train.py, not under plain python train.py (which completes successfully).
Nothing in the traceback suggests this is an issue with my code, and I haven't touched the DataLoader at all. I've been running this code for a couple of months in the same environment, and I haven't seen this bug come up before.
That is, this command does not raise an error:
$ python train.py --epochs=0
# works fine; no error
However, this command does raise an error:
$ mpirun python train.py --epochs=0
# normal output, almost finished running script
# produces traceback below
Traceback:
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x7fd7da6d1668>>
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 333, in __del__
    self._shutdown_workers()
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 319, in _shutdown_workers
    self.data_queue.get()
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/queues.py", line 345, in get
    return _ForkingPickler.loads(res)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/home/ec2-user/anaconda3/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
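For reference, the last frame of the traceback (s.connect(address) raising FileNotFoundError) looks like what happens when a client connects to a Unix domain socket whose file no longer exists. The sketch below reproduces only that final error in isolation with the standard library. It assumes a Linux host where multiprocessing uses AF_UNIX sockets, and the socket path name is made up for illustration. It is not a diagnosis of why the path is missing under mpirun.

```python
import os
import socket
import tempfile


def connect_to_missing_socket():
    """Connect to a Unix socket path that does not exist and return the exception."""
    # Hypothetical path: a fresh temporary directory guarantees the file is absent,
    # standing in for a listener socket that has already been cleaned up.
    with tempfile.TemporaryDirectory() as tmpdir:
        address = os.path.join(tmpdir, "listener-gone")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            try:
                s.connect(address)
            except FileNotFoundError as exc:
                return exc
    return None


err = connect_to_missing_socket()
print(type(err).__name__, err.errno)
```

If this matches what is happening here, the worker-side listener that shares the tensor's file descriptor may already be gone by the time DataLoaderIter.__del__ drains the queue, but that is a guess based on the traceback alone.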
- PyTorch version: 0.3.1.post2
- How you installed PyTorch (conda, pip, source): conda
- Python version: 3.6.1
- OS/CUDA/cuDNN version: A modification of the Deep Learning AMI
- GPU models and configuration: Amazon p2.xlarge instance