Dataloader killed by signal at the end of training #43455
Comments
So far I have only found two ways to get it to work:
Well, actually, now that I think about it, the patch you linked should not matter, per pytorch/torch/csrc/DataLoader.cpp lines 66 to 87 in 7024ce8.
@ssnl
@xkszltl It means that if the worker is killed by the parent, it won't report an error, so the patch you linked should be irrelevant. Additionally, the killing added in that patch should generally not happen if all processes communicate properly, and it should really be viewed as a last resort.
I don't know why the handler does not intercept SIGTERM in this case. To be clear, the behavior can be reproduced reliably on our side, and the only difference between the working and non-working code is the change to MP_STATUS_CHECK_INTERVAL. So it is not a case of our code crashing or being killed by accident.
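For reference, a minimal sketch of the interval change described above; the module paths are an assumption based on recent PyTorch layouts and may differ across versions:

```python
# Hedged sketch: raise the dataloader status-check interval from 5 s to 5 min.
# Module paths assume a recent PyTorch checkout and may differ by version.
import torch.utils.data._utils as _utils
import torch.utils.data._utils.worker as _worker

# worker.py binds its own copy of the constant at import time, so patch
# both modules *before* any DataLoader is constructed. (With the `spawn`
# start method this patch may not propagate into worker processes.)
_utils.MP_STATUS_CHECK_INTERVAL = 300.0   # 5 min instead of the default 5 s
_worker.MP_STATUS_CHECK_INTERVAL = 300.0
```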
A lot of places depend on that constant; are you sure it is caused by that part? I also do not think that waiting five minutes per worker is reasonable. Five seconds should be fine unless something has gone wrong with interprocess communication.
That's exactly what I'm concerned about.
My understanding is that for a dataloader worker you're not just waiting on "overhead" but on actual data-processing code. The 5-second timeout is reasonable for IPC overhead plus a small amount of work, so it's fine for other places. However, the dataloader takes `sampler`, `batch_sampler`, and `collate_fn` in its constructor. These are user-provided functions containing batching logic and some data processing, which can easily run for more than 5 seconds.
In our case we have inputs of different sizes, and we use bins to dynamically batch inputs of similar size (a sketch of this pattern follows below). This means there is a "warm-up" period while we gradually fill up the bins, which may make these functions even slower.
I'm not sure if this PR is the root cause, but given that:
1. we were able to run with pytorch master ~2 months ago, and
2. changing the timeout from 5 seconds to 5 minutes helps,
it's likely to be something related to this timeout constant, and we found this PR to be the most suspicious one.
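To make the binning pattern concrete, here is a hedged sketch; `BinnedBatchSampler`, the bin edges, and the `sizes` input are hypothetical names for illustration, not the actual code from this report:

```python
# Hypothetical sketch of size-binned dynamic batching, as described above.
import bisect
from collections import defaultdict
from typing import Iterator, List, Sequence


class BinnedBatchSampler:
    """Yield batches of indices whose inputs fall in the same size bin.

    During warm-up the bins fill gradually, so early iteration steps can
    spend a long time inside user code before the first batch is emitted,
    which is exactly the window where a short worker timeout bites.
    """

    def __init__(self, sizes: Sequence[int], bin_edges: List[int], batch_size: int):
        self.sizes = sizes              # per-sample input sizes
        self.bin_edges = bin_edges      # e.g. [64, 128, 256, 512]
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[int]]:
        bins = defaultdict(list)
        for idx, size in enumerate(self.sizes):
            b = bisect.bisect_left(self.bin_edges, size)
            bins[b].append(idx)
            if len(bins[b]) == self.batch_size:  # bin full: emit a batch
                yield bins[b]
                bins[b] = []
        for batch in bins.values():              # flush partial bins at the end
            if batch:
                yield batch
```

Something like this would be passed as `batch_sampler=` to the `DataLoader`; note that until a bin fills, `__iter__` keeps consuming indices without yielding anything, which is the warm-up period mentioned above.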
That is a good point. But still, a SIGTERM from the parent process shouldn't trigger the error, so it's unclear to me what causes the problem. Maybe you can try the linked patch and see if that solves the issue?
2 questions about that fix:
…torch#43462)
Summary: Fixes pytorch#43455
Pull Request resolved: pytorch#43462
Reviewed By: bdhirsh
Differential Revision: D25277759
Pulled By: VitalyFedyunin
fbshipit-source-id: 0bb0d87374c0403853d71aac2c242374bfc7acf2
🐛 Bug
Recently we hit a new issue after updating PyTorch:
Here's the stack trace:
I think this PR is probably the reason: #39869
It may be too aggressive.
Since the issue is limited to very small datasets (< 100 images), maybe it is still busy warming up and doesn't really want to quit that fast?
And unlike other places using `MP_STATUS_CHECK_INTERVAL`, the dataloader has custom callbacks (e.g. the batcher). It doesn't make sense to have a simple 5-second timeout.
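To illustrate the concern, here is a rough sketch of the shutdown pattern at issue; this is an illustrative approximation, not PyTorch's actual source:

```python
# Rough sketch (not PyTorch's actual code) of why a fixed 5 s status-check
# interval interacts badly with slow user callbacks at shutdown: the parent
# waits only MP_STATUS_CHECK_INTERVAL per worker before force-killing
# stragglers, so a worker still inside a long-running sampler/collate_fn
# can be terminated even though it is merely busy, not hung.
MP_STATUS_CHECK_INTERVAL = 5.0  # seconds

def shutdown(workers, done_event):
    done_event.set()  # ask workers to finish up and exit
    for w in workers:
        w.join(timeout=MP_STATUS_CHECK_INTERVAL)
        if w.is_alive():
            # Did not exit within the interval; assumed unhealthy and killed.
            w.terminate()
```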
Environment
How you installed PyTorch (conda, pip, source): source
cc @ssnl @VitalyFedyunin