-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid hanging forever after a parallel job was killed #7834
Conversation
…tor to avoid deadlocks. In a multiprocessing.pool, if a process terminates in a non-clean fashion (for example, due to OOM or a segmentation fault), the pool will silently replace said process, but the work that the process was supposed to do will never be done, causing pylint to hang indefinitely. The concurrent.futures.ProcessPoolExecutor will raise a BrokenProcessPool exception in that case, avoiding the hang.
This comment has been minimized.
This comment has been minimized.
This look promising, thank you for opening the PR ! |
Thank you for your work maintaining this great tool 🙏 The tests were failing due to a silly mistake, should be fixed. |
Pull Request Test Coverage Report for Build 3675169755
💛 - Coveralls |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for fixing the tests :) Would you mind adding a changelog's fragment please ? Also I don't know if it's feasible but could we imagine an automated test where we kill a job to test the original issue?
…esn't lead to a deadlock.
I added the changelog fragment and also added a test that an error in the initializer for the parallel jobs, doesn't lead to a deadlock (as it would have, before this PR). However, I'm not yet sure how to design the test where we kill a job to test the original issue. One could start a linting run that needs a little bit longer (What would be the best way to do that?) and then execute something like |
This comment has been minimized.
This comment has been minimized.
I don't know either. We're going to hit timing /concurrency issues and flakiness if we do it with an external command to kill a process launched inside our test. I never did that with pytest I was hoping a magical pytest fixture would exists to be honest 😄Maybe a manual test is enough ? |
@daniel-wer Do you understand the failing test? |
@DanielNoord I'm not entirely sure. My first suspicion was that it is timing-related. Since I configured the test to timeout after 1 second, it could happen that the time until the first pool process is initialized (and the However, the captured stderr indicates that the We could try to configure a more generous timeout of 5 seconds since the timeout will only be triggered if the test fails and won't affect normal test run times. This way we could rule out timing-related issues. What do you think? |
Let's try the higher timeout indeed! Sorry for the last response! |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did a check for references to multiprocessing
and saw that we have a couple of checks that test whether it is available. Is this also a concern for concurrency
?
Edit: It seems like it is, so we should probably change all those references and tests as well.
…mscripten and WASI.
@DanielNoord Thank you, I was not aware of that. Fixed. |
Although this is probably also important, I thought it might be better to remove those |
Could you give me a pointer to the relevant files? I had a look and from what I can see the The only thing I can see is that in |
The last thing was indeed the only thing I thinking of. Just to double check: does the determining of cpus with |
…ssing for single process fallback.
Ok great, I adapted the check.
The docs state that the "ProcessPoolExecutor uses the multiprocessing module", so yes that should be the case. Also, this kind of functionality was not duplicated for the
Using the latest version of this PR, I again confirmed that killing a random pylint worker process during a parallel linting no longer leads to a deadlock, but immediately exits with a |
This comment has been minimized.
This comment has been minimized.
I hope to find time to review this somewhere this weekend! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed two changes to docstrings.
Two last (for real this time) questions.
Sorry for taking so long. Parallelism is hard and I don't want to leave any stuff from the old implementation behind after merging this.
@@ -23,19 +23,23 @@ | |||
except ImportError: | |||
multiprocessing = None # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one use of this still, it is used to get the id of the process. Can we do that with concurrent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that's not possible with concurrent
. As I wrote in my last comment, "this kind of functionality was not duplicated for the concurrent.futures
module which is rather an alternative interface for interacting with a process pool but not changing the underlying process fundamentals like current_process()
or cpu_count()
". The concurrent
module is using multiprocessing
internally, it is not supposed to be a replacement.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. (I was worried that the backport would conflict with #7814 but we did not merge it)
Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
for more information, see https://pre-commit.ci
🤖 According to the primer, this change has no effect on the checked open source code. 🤖🎉 This comment was generated for commit 69729db |
Thanks for all the work here @daniel-wer |
* Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks. In a multiprocessing.pool, if a process terminates in a non-clean fashion (for example, due to OOM or a segmentation fault), the pool will silently replace said process, but the work that the process was supposed to do will never be done, causing pylint to hang indefinitely. The concurrent.futures.ProcessPoolExecutor will raise a BrokenProcessPool exception in that case, avoiding the hang. Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> (cherry picked from commit 5eca8ec)
* Replace multiprocessing.pool with concurrent.futures.ProcessPoolExecutor to avoid deadlocks. In a multiprocessing.pool, if a process terminates in a non-clean fashion (for example, due to OOM or a segmentation fault), the pool will silently replace said process, but the work that the process was supposed to do will never be done, causing pylint to hang indefinitely. The concurrent.futures.ProcessPoolExecutor will raise a BrokenProcessPool exception in that case, avoiding the hang. Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com> (cherry picked from commit 5eca8ec) Co-authored-by: Daniel <daniel.werner@scalableminds.com>
Type of Changes
Description
Replace
multiprocessing.pool
with concurrent.futures.ProcessPoolExecutor to avoid deadlocks.In a
multiprocessing.pool
, if a process terminates in a non-clean fashion (for example, due to OOM or a segmentation fault), the pool will silently replace said process, but the work that the process was supposed to do will never be done, causing pylint to hang indefinitely. Theconcurrent.futures.ProcessPoolExecutor
will raise a BrokenProcessPool exception in that case, avoiding the hang.See #3899 (comment) for reproduction instructions.
The hangs caused by this issue ate up quite a few CI minutes for us over time. Although this doesn't fix the underlying sporadic issue (which is likely due to OOM) at least pylint will fail early instead of hanging forever.
A followup could be to recover from this type of error by catching the
BrokenProcessPool
exception and then instantiating a newProcessPoolExecutor
to continue the linting.Closes #3899