-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Make test_proper_exit more robust #16249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
|
||
JOIN_TIMEOUT = 17.0 if (IS_WINDOWS or IS_PPC) else 11.0 | ||
JOIN_TIMEOUT = 17.0 if (IS_WINDOWS or IS_PPC) else 13.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic numbers in the air :o)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha yeah, but I doubt the problem is insufficient timeout. I suspect that there is some random system error happening. The improve error message should tell us more if it happens again.
self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception)) | ||
else: | ||
self.fail(fail_msg + ', and had no exception') | ||
_, alive = psutil.wait_procs(worker_psutil_p, timeout=(MP_STATUS_CHECK_INTERVAL + JOIN_TIMEOUT)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... I know this is not part of this patch, but I have to ask. Is worker_psutil_p
a list of PIDs? Don't we have a race condition here, where another process on the system could be allocated the same PID as the previous one, and then we hang indefinitely? (This shouldn't actually be the problem here, because you'd have to be allocating a lot of PIDs to wrap around)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worker_psutil_p is a list of psutil.Process objects. In _test_proper_exit, we ensure that the workers are alive when these objects are created. So they accurately track the status of worker processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, this list is created at line 766 above, and then tester_setup_event is set. In _test_proper_exit (the function that the loader process is running), the loader process waits on tester_setup_event, and then makes sure that the workers are still is_alive() at lines 359-363.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ezyang Did this fail to land..? |
Yes. Something push-blocking is broken in fbcode. Waiting on the rebase. |
Attempt to fix #14501
cc @ezyang