[RPC] print exception message on workers that run python functions #46372

Closed: rohan-varma wants to merge 6 commits

rohan-varma (Member) commented Oct 15, 2020

Stack from ghstack:

Currently, in `_run_function`, we catch any exception raised by the Python
function being run and report it back to the master. However, in some
large-scale training jobs, it would be valuable to also log the error on the
trainer itself for faster debugging.

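Differential Revision: [D24324578](https://our.internmc.facebook.com/intern/diff/D24324578/)

NOTE FOR REVIEWERS: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D24324578/)!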
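
To make the behavior described above concrete, here is a minimal sketch of the pattern, not the actual PyTorch change. The names `run_function`, `RemoteError`, and the logger name are hypothetical and chosen only for illustration; the sketch assumes the worker wraps the user function, logs any failure locally, and still returns an error object that the caller can inspect or re-raise.

```python
import logging
import traceback
from typing import Any, Callable, NamedTuple

# Hypothetical logger name; a real worker would use its own logging setup.
logger = logging.getLogger("rpc_worker_sketch")


class RemoteError(NamedTuple):
    """Hypothetical stand-in for the error object sent back to the caller."""
    message: str
    exception_type: type


def run_function(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run func; on failure, log on this worker and return a RemoteError."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        # Exception repr plus the full traceback, joined into one string.
        message = f"{exc!r}\n{traceback.format_exc()}"
        # Log locally so the failure is visible on the worker itself.
        logger.error("Function raised on worker:\n%s", message)
        # Still hand the error back so the caller can report or re-raise it.
        return RemoteError(message, type(exc))
```

Under these assumptions, calling `run_function` with a function that raises returns a `RemoteError` carrying the traceback text and the exception type, and the same text appears in the worker's own log. A function that succeeds simply returns its result.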

facebook-github-bot (Contributor) commented Oct 15, 2020

💊 CI failures summary and remediations

As of commit 65dad6a (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job: CircleCI docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c
Step: Check if image should be built

This comment was automatically generated by Dr. CI.

This comment has been revised 3 times.

facebook-github-bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) Oct 15, 2020
rohan-varma added a commit that referenced this pull request Oct 15, 2020
ghstack-source-id: 114351298
Pull Request resolved: #46372
dr-ci bot commented Oct 15, 2020

💊 CI failures summary and remediations

As of commit 51b50a1 (more details on the Dr. CI page):

None of the CI failures appear to be your fault 💚

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:

This comment was automatically generated by Dr. CI.

This comment has been revised 22 times.

@zeeekhan77

These were caused by upstream breakages that are already fixed

rohan-varma added a commit that referenced this pull request Oct 21, 2020
Pull Request resolved: #46372

ghstack-source-id: 114854333
rohan-varma added a commit that referenced this pull request Oct 22, 2020
Pull Request resolved: #46372

ghstack-source-id: 114897589
facebook-github-bot (Contributor) commented

This pull request has been merged in 25dc005.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/186/head branch October 26, 2020 14:17
Labels: Merged, oncall: distributed