Multithreaded DataLoader sometimes hits "Connection reset by peer" #1551
Comments
I also had this issue, and turned off multithreading because it seemed to have trouble with deadlocking.
I think this may have been related to hung workers due to OpenCV not being fork-safe. Anyway, I don't have a repro anymore.
I also get this error. Is there a way to avoid it?
Has anyone found the reason for this? #9127 doesn't solve it for me.
I think this is a simple repro
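The repro script itself is not shown in this thread; below is only a hedged sketch of the kind of minimal case described here and in the forum post linked at the bottom of the issue: a child process puts a `torch.Tensor` on a multiprocessing queue and exits before the parent has fetched it.

```python
# Hedged reconstruction, not the original script: putting a torch.Tensor on a
# queue hands it over via shared memory and file-descriptor passing, so the
# producer exiting too early can leave the consumer with a dead connection.
import time
import torch
import torch.multiprocessing as mp

def producer(queue):
    queue.put(torch.randn(100, 100))
    # time.sleep(5)  # keeping the producer alive a bit reportedly avoids the crash

if __name__ == "__main__":
    queue = mp.Queue()
    p = mp.Process(target=producer, args=(queue,))
    p.start()
    p.join()         # the producer may already be gone before the tensor is consumed
    t = queue.get()  # can fail with "Connection reset by peer" on affected versions
    print(t.shape)
```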
If the time.sleep call is enabled, then there's no crash.
@andrei-pokrovsky Could you please also paste your stack trace if you have it?
Interestingly, if I put tensor.numpy() on the queue and later get() it, there's no error, and the torch.Tensor can then be recreated with torch.from_numpy.
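For reference, a minimal sketch of that workaround, using the same hypothetical producer/consumer setup as above: send a NumPy array through the queue and rebuild the tensor on the other side.

```python
# Hedged sketch of the numpy() workaround: the array is pickled by value
# instead of being handed over via a shared-memory file descriptor, so the
# producer exiting early no longer matters.
import torch
import torch.multiprocessing as mp

def producer(queue):
    queue.put(torch.randn(100, 100).numpy())   # send a plain ndarray

if __name__ == "__main__":
    queue = mp.Queue()
    p = mp.Process(target=producer, args=(queue,))
    p.start()
    p.join()
    t = torch.from_numpy(queue.get())           # recreate the tensor
    print(t.shape)
```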
Another thing that may be related is the persistent_workers parameter of the DataLoader class. It seems that if it's set to True, this error can potentially be avoided (a minimal sketch is below).
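A minimal sketch of what enabling that option looks like (the toy dataset is a stand-in, not from this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 10))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    persistent_workers=True,  # keep worker processes alive between epochs
)

for epoch in range(2):
    for (batch,) in loader:   # workers are reused across epochs instead of respawned
        pass
```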
I have the same issue, and the `persistent_workers=True` option doesn't help me. The reproducing code works for me as well. It also happens only for num_workers > 0. Has anybody found a working fix for it?
@joker512-tmp And it should not happen for DataLoader.
I have experienced this issue as well, where the DataLoader exits with a "Connection reset by peer" error.
@joker512-tmp @ravi-mosaicml Could you please send us a minimal script so we can reproduce the error?
Hi @ejguan, here is an example that generates the error for me.
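The posted script did not survive in this thread; the following is only a hedged reconstruction based on the diagnosis further down (the error appears when both `pin_memory=True` and `persistent_workers=True` are used and the loader is still alive at interpreter exit):

```python
# Hedged reconstruction, not the original script.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(256, 8))
    loader = DataLoader(
        dataset,
        batch_size=16,
        num_workers=2,
        pin_memory=True,
        persistent_workers=True,
    )
    for _ in loader:   # iterate once so the persistent workers and the
        pass           # pin_memory thread are actually started
    # No explicit cleanup: on affected torch versions the worker processes can
    # be torn down before the pin_memory thread at interpreter shutdown, which
    # intermittently surfaces as "Connection reset by peer".

if __name__ == "__main__":
    main()
```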
Here is the error: a `ConnectionResetError` ("Connection reset by peer") raised during DataLoader shutdown.
Thanks for posting the script. I can reproduce this issue. It may be related to the worker process (data queue) being closed before the pin_memory thread exits.
I believe the issue is only triggered when both `pin_memory=True` and `persistent_workers=True` are used. I can send a PR.
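Until the fix landed, a workaround consistent with that diagnosis is to tear the iterator down explicitly instead of relying on interpreter-shutdown ordering; a hedged sketch (not taken from this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 4))
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    pin_memory=True, persistent_workers=True)

it = iter(loader)
_ = next(it)

# Drop the references explicitly so the iterator's cleanup runs while the
# interpreter is still fully alive, stopping the pin_memory thread before the
# worker processes go away.
del it
del loader
```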
Fixes #1551. As the comment in the code explains, this adds a function to manually delete the iterator when persistent workers are used. It invokes `__del__` on the iterator object and makes sure pin_memory_thread exits before the worker processes. I chose `atexit` rather than adding `__del__` to DataLoader because that is not safe: the destructor may not be invoked when the Python interpreter exits.
…ory thread" Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
…ory thread" Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
…ory thread" Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
…ory thread" Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
…ory thread" Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Differential Revision: [D33896537](https://our.internmc.facebook.com/intern/diff/D33896537) [ghstack-poisoned]
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Differential Revision: [D33896537](https://our.internmc.facebook.com/intern/diff/D33896537) [ghstack-poisoned]
Summary: Pull Request resolved: #71579. Fixes #1551. As the comment in the code explains, register a function to terminate persistent workers. By keeping a reference to these workers in `atexit`, it prevents the Python interpreter from killing the persistent worker processes before `pin_memory_thread` exits. And if the user explicitly shuts down the DataLoader iterator, the function registered in `atexit` becomes a no-op.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D33896537
Pulled By: ejguan
fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae (cherry picked from commit 05add2a)
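This is not the actual PyTorch implementation; the snippet below only illustrates the `atexit` pattern the fix description refers to, with hypothetical names (`PersistentIterator`, `shutdown`).

```python
import atexit

class PersistentIterator:
    """Hypothetical stand-in for a DataLoader iterator with persistent workers."""
    def __init__(self):
        self.alive = True

    def shutdown(self):
        # In the real code this is where helper threads (e.g. the pin-memory
        # thread) are stopped before the worker processes are joined.
        self.alive = False

_registered = []

def _cleanup_at_exit():
    # Holding references here keeps the iterators (and their workers) alive
    # until this callback runs at interpreter exit; iterators that were shut
    # down explicitly are skipped, making the callback a no-op for them.
    for it in _registered:
        if it.alive:
            it.shutdown()

atexit.register(_cleanup_at_exit)

it = PersistentIterator()
_registered.append(it)
```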
Hello, I have the same error as you. Could you please let me know if you fixed it?
@darkdevahm Could you please share your env?
Kindly find my env below:

```
Collecting environment information...
OS: Ubuntu 18.04.3 LTS (x86_64)
Python version: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] (64-bit runtime)
Versions of relevant libraries:
```
@darkdevahm Could you please also share a minimal reproducible script for me to investigate? Otherwise, it's hard to identify the culprit, as the original issue was fixed in torch 1.11, so 1.13 should be fine.
Uhh, that's a bit difficult, since the error only happens midway through the training process of this repo, Coperception. It starts the first epoch fine; then it stops in the middle (after a few iterations of the first epoch), throwing this error: `train_codet(args)`
@darkdevahm
Also, this often happens because the `/tmp` filesystem (often a tmpfs) is full, or the limit on open file descriptors is exceeded, and then some worker dies. But it would be much better to have proper diagnostics for all these cases.
It might be a good idea to somehow get a stack trace (including the native C stack trace, not just the Python one) for all threads when PyTorch crashes with an exception like this. Or to have a recipe for doing this in Python (it will probably require ptrace permissions).
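For the Python side of that, the standard-library `faulthandler` module already gets part of the way there (it cannot show native C frames; tools such as gdb or py-spy are needed for those). A small sketch, assuming a Unix-like system:

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(all_threads=True)

# Also allow an on-demand dump while the process is hung or about to crash:
#   kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)
```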
Where should the time.sleep() be added?
Hi all, I am working on a key-point detection problem where I am also running into a similar issue. When I changed the device to "cpu" to further investigate the problem (because with `device = torch.device("cuda")` the traceback doesn't help at all), I get the following traceback.
I get more output above this traceback, but I am just posting this much for now. Does anyone have a solution to this problem yet, or a temporary workaround? I tried setting `num_workers = 0` (single-process loading), but that didn't help me.
I am using `torch.utils.data.DataLoader` with `num_workers = 4` and sometimes getting this exception (in single-threaded mode it works fine). I will try to get a minimal repro. I suppose it may be related to https://discuss.pytorch.org/t/using-torch-tensor-over-multiprocessing-queue-process-fails/2847/3, but in this case there is no custom Queue handling.
cc @ssnl @VitalyFedyunin @ejguan