
Multithreaded DataLoader sometimes hits "Connection reset by peer" #1551

Closed
vadimkantorov opened this issue May 14, 2017 · 26 comments
Labels
module: dataloader Related to torch.utils.data.DataLoader and Sampler · needs reproduction Someone else needs to try reproducing the issue given the instructions; no action needed from user · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vadimkantorov
Contributor

vadimkantorov commented May 14, 2017

I am using torch.utils.data.DataLoader with num_workers = 4 and sometimes get this exception (in single-threaded mode it works fine). I will try to get a minimal repro.

I suppose it may be related to https://discuss.pytorch.org/t/using-torch-tensor-over-multiprocessing-queue-process-fails/2847/3, but in this case there is no custom Queue handling.

Traceback (most recent call last):
........
  File ".../torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File ".../torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File ".../torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
IOError: [Errno 104] Connection reset by peer

cc @ssnl @VitalyFedyunin @ejguan

@apaszke apaszke added the needs reproduction (Someone else needs to try reproducing the issue given the instructions; no action needed from user) label May 14, 2017
@mdering
Contributor

mdering commented May 15, 2017

I also had this issue, and turned off multithreading because it seemed to have trouble with deadlocking

@soumith soumith added this to Uncategorized in Issue Status Aug 23, 2017
@soumith soumith added this to Crashes / Segfaults / Errors in Issue Categories Aug 31, 2017
@vadimkantorov
Contributor Author

I think this may have been related to hung workers due to OpenCV not being fork-safe. Anyway, I don't have a repro anymore.

@YuJiang01

I also get this error. Is there a way to avoid it?

@vrt1shjwlkr

Has anyone found the reason for this? #9127 doesn't solve it for me.

@andrei-pokrovsky

andrei-pokrovsky commented Apr 1, 2021

I think this is a simple repro:

import multiprocessing as mp
import torch
import time

q = mp.Queue()

def sender():
    global q
    q.put(torch.zeros(20, 20, 20))
    # time.sleep(20.0)


pp = mp.Process(target=sender, args=())
pp.start()
res = q.get()
print("Got tensor with mean", res.mean())

If the time.sleep is enabled, then there's no crash.
It seems to be specific to torch.Tensor, because if I send "asdf" * 100000 instead, it doesn't crash.

@vadimkantorov
Contributor Author

@andrei-pokrovsky Could you please also paste your stack-trace if you have it?

@vadimkantorov vadimkantorov reopened this Apr 1, 2021
@andrei-pokrovsky

Traceback (most recent call last):
  File "testmp.py", line 15, in <module>
    res = q.get()
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 498, in Client
    answer_challenge(c, authkey)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 741, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/aapokrovsky/anaconda3/envs/pytorch-build/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@andrei-pokrovsky

andrei-pokrovsky commented Apr 1, 2021

Interestingly, if I put tensor.numpy() on the queue and later get() it, there's no error, and the torch.Tensor can then be recreated with torch.from_numpy.
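A minimal sketch of this workaround, adapted from the repro above (assumptions: Linux with the default fork start method and a small CPU tensor):

import multiprocessing as mp
import torch

def sender(q):
    # The ndarray is pickled by value, so no file-descriptor sharing is involved
    # and unpickling never connects back to the sender process.
    q.put(torch.zeros(20, 20, 20).numpy())

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=sender, args=(q,))
    p.start()
    res = torch.from_numpy(q.get())  # rebuild the tensor in the main process
    p.join()
    print("Got tensor with mean", res.mean())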

@andrei-pokrovsky

Another thing that may be related is the persistent_workers parameter of the DataLoader class. It seems that setting it to True can potentially avoid this error.

@gchanan gchanan added the module: dataloader (Related to torch.utils.data.DataLoader and Sampler) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Apr 1, 2021
@joker512-tmp

I have the same issue, and the persistent_workers=True option doesn't help me. The reproducing code above also triggers the error for me, and it happens only for num_workers > 0. Has anybody found a working fix for it?

@ejguan
Contributor

ejguan commented Sep 29, 2021

@joker512-tmp
If you are talking about the code above, the sender process needs to stay alive while the main process gets the data, so the sender needs to wait until the data has been read by the main process.

Also, this should not happen with DataLoader.
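To illustrate the point, here is a minimal sketch that modifies the repro above (the mp.Event is an assumption of this sketch; any synchronization that keeps the sender alive until the receive finishes would do). The fd-sharing connection made while unpickling in q.get() then still has a live peer.

import multiprocessing as mp
import torch

def sender(q, done):
    q.put(torch.zeros(20, 20, 20))
    done.wait()  # keep the sender alive until the tensor has been received

if __name__ == "__main__":
    q = mp.Queue()
    done = mp.Event()
    p = mp.Process(target=sender, args=(q, done))
    p.start()
    res = q.get()  # unpickling connects back to the sender to fetch the shared storage
    done.set()     # now the sender may exit
    p.join()
    print("Got tensor with mean", res.mean())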

@ravi-mosaicml

I have experienced this issue as well, where the dataloader exits with a ConnectionResetError: [Errno 104] Connection reset by peer error. I observed that this error goes away with either a) adding a sleep, or b) using larger batch sizes. I suspect there is a race condition that is triggered if the dataloader completes very quickly. I am running PyTorch 1.10.

@ejguan
Contributor

ejguan commented Jan 18, 2022

@joker512-tmp @ravi-mosaicml Could you please send us a minimal code example so we can reproduce the error?

@ravi-mosaicml

ravi-mosaicml commented Jan 18, 2022

Hi @ejguan, here is an example that generates the error for me.

import torch
import torch.utils.data
import torchvision
import time

torch.cuda.set_device(0)

dataloader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(
            "~/datasets",
            train=True,
            download=True,
            transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
    ),
    pin_memory=True,
    batch_size=1024,
    persistent_workers=True,
    num_workers=8,
)

for _ in dataloader:
    break

# print("Sleeping for 5 seconds"); time.sleep(5) # Uncomment this line to make the error disappear

print("Finished!")

Here is the error:

(composer) ravi@ravi-g2-c30:~/composer$ python3 example.py
Finished!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 514, in Client
    deliver_challenge(c, authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@ejguan
Contributor

ejguan commented Jan 19, 2022

Hi @ejguan, here is an example that generates the error for me.

import torch
import torch.utils.data
import torchvision
import time

torch.cuda.set_device(0)

dataloader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST(
            "~/datasets",
            train=True,
            download=True,
            transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
    ),
    pin_memory=True,
    batch_size=1024,
    persistent_workers=True,
    num_workers=8,
)

for _ in dataloader:
    break

# print("Sleeping for 5 seconds"); time.sleep(5) # Uncomment this line to make the error disappear

print("Finished!")

Here is the error:

(composer) ravi@ravi-g2-c30:~/composer$ python3 example.py
Finished!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 514, in Client
    deliver_challenge(c, authkey)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 745, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Thanks for posting the script. I can reproduce the issue. It may be related to the worker process (data queue) being closed before the pin_memory_thread. I will take a deeper look tomorrow.

@ejguan
Contributor

ejguan commented Jan 20, 2022

I believe the issue is only triggered when both persistent_workers and pin_memory are turned on and iteration is terminated while a worker is still sending data to the queue.
First, a persistent worker keeps the iterator and its worker processes running without the usual cleanup (which normally happens via __del__ in _MultiProcessingDataLoaderIter). Then, if any background worker (a daemon process) is terminated while it is sending data to the _worker_result_queue, this error is triggered when the pin_memory_thread tries to get that data from the queue.

I can send a PR.
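Purely as an illustration of the cleanup ordering described above, here is a schematic sketch (not PyTorch's actual implementation; the class and method names are made up): an atexit hook holds a reference to the iterator so its persistent workers cannot be torn down before the hook runs, and the hook is a no-op if the iterator was already shut down explicitly.

import atexit

class _PersistentIter:
    """Stand-in for a DataLoader iterator with persistent workers (hypothetical)."""

    def __init__(self):
        self._workers_done = False
        # Registering a bound method keeps a reference to this iterator until
        # interpreter exit, so its workers are not reaped prematurely.
        atexit.register(self._shutdown_workers)

    def _shutdown_workers(self):
        if self._workers_done:
            return  # already cleaned up explicitly -> the atexit call is a no-op
        # Real code would first stop the pin_memory thread, then signal and
        # join the persistent worker processes.
        self._workers_done = True

    def __del__(self):
        self._shutdown_workers()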

ejguan added commits that referenced this issue Jan 20, 2022
Fixes #1551

As the comment in the code says, this adds a function to manually delete the iterator when persistent workers are used. It invokes `__del__` on the iterator object and makes sure pin_memory_thread exits before the worker processes.

I chose `atexit` rather than adding `__del__` to DataLoader because the destructor may not be invoked when the Python interpreter exits.
ejguan added further commits that referenced this issue Jan 26 through Jan 31, 2022
Fixes #1551

As the comment in the code says, register a function to terminate persistent workers.
By keeping a reference to these workers in `atexit`, the Python interpreter is prevented from killing the persistent worker processes before `pin_memory_thread` exits.
If the user explicitly shuts down the DataLoader iterator, the function registered in `atexit` is a no-op.

Differential Revision: [D33896537](https://our.internmc.facebook.com/intern/diff/D33896537)
facebook-github-bot and pytorchmergebot pushed a commit that referenced this issue Feb 1, 2022
Summary:
Pull Request resolved: #71579

Fixes #1551

As the comment in the code says, register a function to terminate persistent workers.
By keeping a reference to these workers in `atexit`, the Python interpreter is prevented from killing the persistent worker processes before `pin_memory_thread` exits.
If the user explicitly shuts down the DataLoader iterator, the function registered in `atexit` is a no-op.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33896537

Pulled By: ejguan

fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae
(cherry picked from commit 05add2a)
@darkdevahm

[quoted @andrei-pokrovsky's traceback from the earlier comment]

Hello, I have the same error as you. Could you please let me know if you fixed it?

@ejguan
Contributor

ejguan commented Dec 2, 2022

Hello, I have the same error as you. Could you please let me know if you fixed it?

@darkdevahm Could you please share your env?

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
python collect_env.py

@darkdevahm

darkdevahm commented Dec 2, 2022

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py

Kindly find my env below.
PS: I'm not using the torch installed via conda; I'm using the one installed with pip3.

Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.3 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10

Python version: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-48-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 515.65.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.13.0
[pip3] torchvision==0.14.0
[conda] numpy 1.18.5 py37h8960a57_0 conda-forge
[conda] torch 1.13.0 pypi_0 pypi
[conda] torchvision 0.14.0 pypi_0 pypi

@ejguan
Contributor

ejguan commented Dec 2, 2022

@darkdevahm Could you please also share a minimal reproducible script for me to investigate? Otherwise, it's hard to identify the culprit, since the original issue was fixed in torch 1.11, so 1.13 should be fine.

@darkdevahm

darkdevahm commented Dec 2, 2022

Uhh, that's a bit difficult, since the error only happens mid-way through the training process of this repo, Coperception. It starts the first epoch fine, then stops in the middle (after a few iterations into the first epoch), throwing this error:

    train_codet(args)
  File "train_codet.py", line 486, in main
    for sample in t:
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/tmp/coperception/coperception/datasets/V2XSimDet.py", line 406, in __getitem__
    res.append(self.pick_single_agent(i, idx))
  File "/tmp/coperception/coperception/datasets/V2XSimDet.py", line 258, in pick_single_agent
    if len(self.cache[agent_id]) < self.cache_size:
  File "<string>", line 2, in __len__
  File "/opt/conda/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@ejguan
Contributor

ejguan commented Dec 2, 2022

@darkdevahm
Since I am not able to access your linked repo and I don't know exactly how V2XSimDet is implemented, from looking at your stack trace I can only encourage you to try using "spawn" as the start method for the DataLoader.
I assume you are not using an IPython interactive shell.
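A hedged sketch of that suggestion (the dataset below is a stand-in, since V2XSimDet is not available): DataLoader accepts a multiprocessing_context argument, so the workers can be started with "spawn" instead of the default "fork" on Linux. Note that "spawn" re-imports the main module in each worker, so the entry point must be guarded.

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(100, 3))  # stand-in for the real dataset
    loader = DataLoader(
        dataset,
        batch_size=4,
        num_workers=2,
        multiprocessing_context="spawn",  # workers start as fresh interpreters
    )
    for (batch,) in loader:
        pass  # training step would go here

if __name__ == "__main__":  # required with "spawn"
    main()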

@vadimkantorov
Contributor Author

Also, this often happens because the tmpfs /tmp filesystem is full or the limit on open file descriptors is exceeded, and then some worker dies. But it would be much better to have better diagnostics for all these cases.
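A sketch of what such a check/workaround could look like (this is not a fix from this thread; raising the soft limit to the hard limit is just an assumption): inspect the open-file-descriptor limit and, if shared tensors exhaust it, switch PyTorch's sharing strategy from file descriptors to the file system.

import resource
import torch.multiprocessing as mp

# Inspect and (optionally) raise the soft limit on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Share tensors through the file system instead of file descriptors; make sure the
# filesystem backing shared memory (/dev/shm or /tmp) actually has free space.
if "file_system" in mp.get_all_sharing_strategies():
    mp.set_sharing_strategy("file_system")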

@vadimkantorov
Contributor Author

vadimkantorov commented Dec 2, 2022

It might be a good idea to somehow get a stack trace (including the native C stack trace, not just the Python stack trace) for all threads when PyTorch crashes with an exception like this, or to have a recipe for doing this in Python (it will probably require ptrace permissions).
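One possible recipe using only the standard library (a sketch, not something PyTorch does automatically; it covers Python frames only, so native C stacks still need an external tool such as gdb or py-spy, which typically requires ptrace permission):

import faulthandler
import signal
import sys

# Dump the Python stack of every thread to stderr on fatal errors (segfault, abort, ...).
faulthandler.enable(file=sys.stderr, all_threads=True)

# Also allow an on-demand dump from another shell:  kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)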

@LYL534

LYL534 commented Dec 26, 2023

[quoted @ravi-mosaicml's earlier comment about the error going away with a sleep or larger batch sizes]

Where should the time.sleep() be added?

@troymyname

troymyname commented Jan 25, 2024

Hi all, I am working on a key-point detection problem where I am also running into a similar issue. When I changed the device to "cpu" to investigate further (because with device = torch.device("cuda") the traceback doesn't help at all), I get the following traceback.

Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tempest5/vision_system/env/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/home/tempest5/vision_system/env/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tempest5/vision_system/env/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 524, in Client
    answer_challenge(c, authkey)
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 773, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 395, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

I get more stuff above this traceback, but I am just posting this much for now. Does anyone have a solution to this problem yet, or a temporary workaround? I tried setting num_workers = 0 (single-process loading), but that didn't help me.
