[RPC] print exception message on workers that run python functions #46372
Currently, in `_run_function`, we catch any exception raised by the Python function being run and report it back to the master. However, in some large-scale training jobs, it would be valuable to also log the error on the trainer itself for faster debugging. Differential Revision: [D24324578](https://our.internmc.facebook.com/intern/diff/D24324578/) **NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D24324578/)!
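For readers without access to the diff, here is a minimal sketch of the behavior described above. It is not the code from this PR: `PythonUDF`, `RemoteException`, `run_function`, and the `worker_name` and `stream` parameters are hypothetical stand-ins chosen for illustration. The idea is simply that the worker prints the exception and traceback locally before packaging the same message for the caller.

```python
import sys
import traceback
from collections import namedtuple

# Hypothetical stand-ins for the RPC internals; names and fields are assumptions.
PythonUDF = namedtuple("PythonUDF", ["func", "args", "kwargs"])
RemoteException = namedtuple("RemoteException", ["msg", "exception_type"])


def run_function(python_udf, worker_name="worker0", stream=None):
    """Run a user function; on failure, log locally and return the error."""
    stream = sys.stderr if stream is None else stream
    try:
        return python_udf.func(*python_udf.args, **python_udf.kwargs)
    except Exception as e:
        # Exception repr plus the traceback, prefixed with the worker's name.
        except_str = f"On {worker_name}:\n{e!r}\n{traceback.format_exc()}"
        # Print on the worker itself so the error shows up in its own logs.
        print(except_str, file=stream)
        # Still hand the error back so the caller can report or re-raise it.
        return RemoteException(except_str, type(e))
```

In this sketch the error is visible in two places: in the worker's stderr (or whatever `stream` is passed in) and in the `RemoteException` returned to the caller, which matches the motivation of debugging directly on the trainer.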
ghstack-source-id: 114351298 Pull Request resolved: #46372
Pull Request resolved: #46372 ghstack-source-id: 114854333
Pull Request resolved: #46372 ghstack-source-id: 114897589
This pull request has been merged in 25dc005. |