Skip to content

Conversation

ssnl
Copy link
Collaborator

@ssnl ssnl commented Jan 23, 2019

  1. Improve error message for better debugging info
  2. Increase timeout
  3. Also apply the windows worker failure detection mechanism on non-Windows platforms, for better robustness

Attempt to fix #14501

cc @ezyang



JOIN_TIMEOUT = 17.0 if (IS_WINDOWS or IS_PPC) else 11.0
JOIN_TIMEOUT = 17.0 if (IS_WINDOWS or IS_PPC) else 13.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic numbers in the air :o)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha yeah, but I doubt the problem is insufficient timeout. I suspect that there is some random system error happening. The improve error message should tell us more if it happens again.

self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception))
else:
self.fail(fail_msg + ', and had no exception')
_, alive = psutil.wait_procs(worker_psutil_p, timeout=(MP_STATUS_CHECK_INTERVAL + JOIN_TIMEOUT))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... I know this is not part of this patch, but I have to ask. Is worker_psutil_p a list of PIDs? Don't we have a race condition here, where another process on the system could be allocated the same PID as the previous one, and then we hang indefinitely? (This shouldn't actually be the problem here, because you'd have to be allocating a lot of PIDs to wrap around)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worker_psutil_p is a list of psutil.Process objects. In _test_proper_exit, we ensure that the workers are alive when these objects are created. So they accurately track the status of worker processes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, this list is created at line 766 above, and then tester_setup_event is set. In _test_proper_exit (the function that the loader process is running), the loader process waits on tester_setup_event, and then makes sure that the workers are still is_alive() at lines 359-363.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl
Copy link
Collaborator Author

ssnl commented Jan 23, 2019

@ezyang Did this fail to land..?

@ezyang
Copy link
Contributor

ezyang commented Jan 24, 2019

Yes. Something push-blocking is broken in fbcode. Waiting on the rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test_proper_exit intermittently fails

3 participants