
Multithreaded DataLoader sometimes hits "Connection reset by peer" #1551

Closed
vadimkantorov opened this issue May 14, 2017 · 26 comments
Labels
module: dataloader Related to torch.utils.data.DataLoader and Sampler · needs reproduction Someone else needs to try reproducing the issue given the instructions; no action needed from user · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vadimkantorov
Contributor

vadimkantorov commented May 14, 2017

I am using torch.utils.data.DataLoader with num_workers = 4 and sometimes get this exception (in single-threaded mode it works fine). I will try to get a minimal repro.

I suppose it may be related to https://discuss.pytorch.org/t/using-torch-tensor-over-multiprocessing-queue-process-fails/2847/3, but in this case there is no custom Queue handling.

Traceback (most recent call last):
........
  File ".../torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File ".../torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File ".../torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
IOError: [Errno 104] Connection reset by peer

cc @ssnl @VitalyFedyunin @ejguan

@apaszke apaszke added the needs reproduction (Someone else needs to try reproducing the issue given the instructions; no action needed from user) label May 14, 2017
@mdering
Contributor

mdering commented May 15, 2017

I also had this issue, and turned off multithreading because it seemed to have trouble with deadlocking

@soumith soumith added this to Uncategorized in Issue Status Aug 23, 2017
@soumith soumith added this to Crashes / Segfaults / Errors in Issue Categories Aug 31, 2017
@vadimkantorov
Contributor Author

I think this may have been related to hung workers due to OpenCV not being fork-safe. Anyway, I don't have a repro anymore.

@YuJiang01

I also get this error. Is there a way to avoid it?

@vrt1shjwlkr

Has anyone found the reason for this? #9127 doesn't solve it for me.

@andrei-pokrovsky

andrei-pokrovsky commented Apr 1, 2021

I think this is a simple repro:

import multiprocessing as mp
import torch
import time

q = mp.Queue()

def sender():
    global q
    q.put(torch.zeros(20, 20, 20))
    # time.sleep(20.0)


pp = mp.Process(target=sender, args=())
pp.start()
res = q.get()
print("Got tensor with mean", res.mean())

If the time.sleep is enabled, then there's no crash.
It seems to be specific to torch.Tensor, because if I send "asdf" * 100000 instead, it doesn't crash.

@vadimkantorov
Contributor Author

@andrei-pokrovsky Could you please also paste your stack-trace if you have it?

@vadimkantorov vadimkantorov reopened this Apr 1, 2021
@andrei-pokrovsky

Traceback (most recent call last):
  File "testmp.py", line 15, in <module>
    res = q.get()
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 498, in Client
    answer_challenge(c, authkey)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 741, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@andrei-pokrovsky

andrei-pokrovsky commented Apr 1, 2021

Interestingly, if I put tensor.numpy() on the queue and later get() it, there's no error, and the torch.Tensor can then be recreated with torch.from_numpy.
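A minimal sketch of this workaround, adapted from the repro above (assumptions: Linux with the default fork start method and a small CPU tensor):

import multiprocessing as mp
import torch

def sender(q):
    # The ndarray is pickled by value, so no file-descriptor sharing is involved
    # and unpickling never connects back to the sender process.
    q.put(torch.zeros(20, 20, 20).numpy())

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=sender, args=(q,))
    p.start()
    res = torch.from_numpy(q.get())  # rebuild the tensor in the main process
    p.join()
    print("Got tensor with mean", res.mean())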

@andrei-pokrovsky

Another thing that may be related is the persistent_workers parameter of the DataLoader class. It seems that setting it to True can potentially avoid this error.

@gchanan gchanan added the module: dataloader (Related to torch.utils.data.DataLoader and Sampler) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Apr 1, 2021
@joker512-tmp

I have the same issue, and the persistent_workers=True option doesn't help me. The reproducing code above also triggers the error for me, and it happens only for num_workers > 0. Has anybody found a working fix for it?

@ejguan
Contributor

ejguan commented Sep 29, 2021

@joker512-tmp
If you are talking about the code above, the sender process needs to stay alive while the main process gets the data, so the sender needs to wait until the data has been read by the main process.

Also, this should not happen with DataLoader.
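To illustrate the point, here is a minimal sketch that modifies the repro above (the mp.Event is an assumption of this sketch; any synchronization that keeps the sender alive until the receive finishes would do). The fd-sharing connection made while unpickling in q.get() then still has a live peer.

import multiprocessing as mp
import torch

def sender(q, done):
    q.put(torch.zeros(20, 20, 20))
    done.wait()  # keep the sender alive until the tensor has been received

if __name__ == "__main__":
    q = mp.Queue()
    done = mp.Event()
    p = mp.Process(target=sender, args=(q, done))
    p.start()
    res = q.get()  # unpickling connects back to the sender to fetch the shared storage
    done.set()     # now the sender may exit
    p.join()
    print("Got tensor with mean", res.mean())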

@ravi-mosaicml

I have experienced this issue as well, where the dataloader exits with a ConnectionResetError: [Errno 104] Connection reset by peer error. I observed that this error goes away with either a) adding a sleep, or b) using larger batch sizes. I suspect there is a race condition that is triggered if the dataloader completes very quickly. I am running PyTorch 1.10.

@ejguan
Contributor

ejguan commented Jan 18, 2022

@joker512-tmp @ravi-mosaicml Could you please send us a minimal code example so we can reproduce the error?

@ravi-mosaicml

ravi-mosaicml commented Jan 18, 2022

Hi @ejguan, here is an example that generates the error for me.

import torch
import torch.utils.data
import torchvision
import time

torch.cuda.set_device(0)

dataloader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(
            "~/datasets",
            train=True,
            download=True,
            transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
    ),
    pin_memory=True,
    batch_size=1024,
    persistent_workers=True,
    num_workers=8,
)

for _ in dataloader:
    break

# print("Sleeping for 5 seconds"); time.sleep(5) # Uncomment this line to make the error disappear

print("Finished!")

Here is the error:

(composer) ravi@ravi-g2-c30:~/composer$ python3 example.py
Finished!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 514, in Client
    deliver_challenge(c, authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@ejguan
Contributor

ejguan commented Jan 19, 2022

Hi @ejguan, here is an example that generates the error for me.

import torch
import torch.utils.data
import torchvision
import time

torch.cuda.set_device(0)

dataloader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(
            "~/datasets",
            train=True,
            download=True,
            transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
    ),
    pin_memory=True,
    batch_size=1024,
    persistent_workers=True,
    num_workers=8,
)

for _ in dataloader:
    break

# print("Sleeping for 5 seconds"); time.sleep(5) # Uncomment this line to make the error disappear

print("Finished!")

Here is the error:

(composer) ravi@ravi-g2-c30:~/composer$ python3 example.py
Finished!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 514, in Client
    deliver_challenge(c, authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Thanks for posting the script. I can reproduce the issue. It may be related to the worker process (data queue) being closed before the pin_memory_thread. I will take a deeper look tomorrow.

@ejguan
Contributor

ejguan commented Jan 20, 2022

I believe the issue is only triggered when both persistent_workers and pin_memory are turned on and iteration is terminated while a worker is still sending data to the queue.
First, a persistent worker keeps the iterator and its worker processes running without the usual cleanup (which normally happens via __del__ in _MultiProcessingDataLoaderIter). Then, if any background worker (a daemon process) is terminated while it is sending data to the _worker_result_queue, this error is triggered when the pin_memory_thread tries to get that data from the queue.

I can send a PR.
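Purely as an illustration of the cleanup ordering described above, here is a schematic sketch (not PyTorch's actual implementation; the class and method names are made up): an atexit hook holds a reference to the iterator so its persistent workers cannot be torn down before the hook runs, and the hook is a no-op if the iterator was already shut down explicitly.

import atexit

class _PersistentIter:
    """Stand-in for a DataLoader iterator with persistent workers (hypothetical)."""

    def __init__(self):
        self._workers_done = False
        # Registering a bound method keeps a reference to this iterator until
        # interpreter exit, so its workers are not reaped prematurely.
        atexit.register(self._shutdown_workers)

    def _shutdown_workers(self):
        if self._workers_done:
            return  # already cleaned up explicitly -> the atexit call is a no-op
        # Real code would first stop the pin_memory thread, then signal and
        # join the persistent worker processes.
        self._workers_done = True

    def __del__(self):
        self._shutdown_workers()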

ejguan added commits that referenced this issue Jan 20, 2022
Fixes #1551

As the comment in the code says, this adds a function to manually delete the iterator when persistent workers are used. It invokes `__del__` on the iterator object and makes sure pin_memory_thread exits before the worker processes.

I chose `atexit` rather than adding `__del__` to DataLoader because the destructor may not be invoked when the Python interpreter exits.
ejguan added further commits that referenced this issue Jan 26 through Jan 31, 2022
Fixes #1551

As the comment in the code says, register a function to terminate persistent workers.
By keeping a reference to these workers in `atexit`, the Python interpreter is prevented from killing the persistent worker processes before `pin_memory_thread` exits.
If the user explicitly shuts down the DataLoader iterator, the function registered in `atexit` is a no-op.

Differential Revision: [D33896537](https://our.internmc.facebook.com/intern/diff/D33896537)
facebook-github-bot and pytorchmergebot pushed a commit that referenced this issue Feb 1, 2022
Summary:
Pull Request resolved: #71579

Fixes #1551

As the comment in the code says, register a function to terminate persistent workers.
By keeping a reference to these workers in `atexit`, the Python interpreter is prevented from killing the persistent worker processes before `pin_memory_thread` exits.
If the user explicitly shuts down the DataLoader iterator, the function registered in `atexit` is a no-op.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33896537

Pulled By: ejguan

fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae
(cherry picked from commit 05add2a)
@darkdevahm

[quoted @andrei-pokrovsky's traceback from the earlier comment]

Hello, I have the same error as you. Could you please let me know if you fixed it?

@ejguan
Contributor

ejguan commented Dec 2, 2022

Hello, I have the same error as you. Could you please let me know if you fixed it?

@darkdevahm Could you please share your env?

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
python collect_env.py

@darkdevahm

darkdevahm commented Dec 2, 2022

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py

Kindly find my env below.
PS: I'm not using the torch installed via conda; I'm using the one installed with pip3.

Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.3 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10

Python version: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-48-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 515.65.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.13.0
[pip3] torchvision==0.14.0
[conda] numpy 1.18.5 py37h8960a57_0 conda-forge
[conda] torch 1.13.0 pypi_0 pypi
[conda] torchvision 0.14.0 pypi_0 pypi

@ejguan
Contributor

ejguan commented Dec 2, 2022

@darkdevahm Could you please also share a minimal reproducible script for me to investigate? Otherwise, it's hard to identify the culprit, since the original issue was fixed in torch 1.11, so 1.13 should be fine.

@darkdevahm

darkdevahm commented Dec 2, 2022

Uhh, that's a bit difficult, since the error only happens mid-way through the training process of this repo, Coperception. It starts the first epoch fine, then stops in the middle (after a few iterations into the first epoch), throwing this error:

    train_codet(args)
  File "train_codet.py", line 486, in main
    for sample in t:
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/tmp/coperception/coperception/datasets/V2XSimDet.py", line 406, in __getitem__
    res.append(self.pick_single_agent(i, idx))
  File "/tmp/coperception/coperception/datasets/V2XSimDet.py", line 258, in pick_single_agent
    if len(self.cache[agent_id]) < self.cache_size:
  File "<string>", line 2, in __len__
  File "/opt/conda/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@ejguan
Contributor

ejguan commented Dec 2, 2022

@darkdevahm
Since I am not able to access your linked repo and I don't know exactly how V2XSimDet is implemented, from looking at your stack trace I can only encourage you to try using "spawn" as the start method for the DataLoader.
I assume you are not using an IPython interactive shell.
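A hedged sketch of that suggestion (the dataset below is a stand-in, since V2XSimDet is not available): DataLoader accepts a multiprocessing_context argument, so the workers can be started with "spawn" instead of the default "fork" on Linux. Note that "spawn" re-imports the main module in each worker, so the entry point must be guarded.

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(100, 3))  # stand-in for the real dataset
    loader = DataLoader(
        dataset,
        batch_size=4,
        num_workers=2,
        multiprocessing_context="spawn",  # workers start as fresh interpreters
    )
    for (batch,) in loader:
        pass  # training step would go here

if __name__ == "__main__":  # required with "spawn"
    main()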

@vadimkantorov
Contributor Author

Also, this often happens because the tmpfs /tmp filesystem is full or the limit on open file descriptors is exceeded, and then some worker dies. But it would be much better to have better diagnostics for all these cases.
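A sketch of what such a check/workaround could look like (this is not a fix from this thread; raising the soft limit to the hard limit is just an assumption): inspect the open-file-descriptor limit and, if shared tensors exhaust it, switch PyTorch's sharing strategy from file descriptors to the file system.

import resource
import torch.multiprocessing as mp

# Inspect and (optionally) raise the soft limit on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Share tensors through the file system instead of file descriptors; make sure the
# filesystem backing shared memory (/dev/shm or /tmp) actually has free space.
if "file_system" in mp.get_all_sharing_strategies():
    mp.set_sharing_strategy("file_system")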

@vadimkantorov
Contributor Author

vadimkantorov commented Dec 2, 2022

It might be a good idea to somehow get a stack trace (including the native C stack trace, not just the Python stack trace) for all threads when PyTorch crashes with an exception like this, or to have a recipe for doing this in Python (it will probably require ptrace permissions).
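One possible recipe using only the standard library (a sketch, not something PyTorch does automatically; it covers Python frames only, so native C stacks still need an external tool such as gdb or py-spy, which typically requires ptrace permission):

import faulthandler
import signal
import sys

# Dump the Python stack of every thread to stderr on fatal errors (segfault, abort, ...).
faulthandler.enable(file=sys.stderr, all_threads=True)

# Also allow an on-demand dump from another shell:  kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)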

@LYL534

LYL534 commented Dec 26, 2023

[quoted @ravi-mosaicml's earlier comment about the error going away with a sleep or larger batch sizes]

Where should the time.sleep() be added?

@troymyname

troymyname commented Jan 25, 2024

Hi all, I am working on a key-point detection problem where I am also running into a similar issue. When I changed the device to "cpu" to investigate further (because with device = torch.device("cuda") the traceback doesn't help at all), I get the following traceback.

Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tempest5/vision_system/env/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/home/tempest5/vision_system/env/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tempest5/vision_system/env/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 524, in Client
    answer_challenge(c, authkey)
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 773, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 395, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

I get more stuff above this traceback, but I am just posting this much for now. Does anyone have a solution to this problem yet, or a temporary workaround? I tried setting num_workers = 0 (single-process loading), but that didn't help me.
