rohan-varma

@rohan-varma rohan-varma commented Jan 28, 2020

Closes #27368.
Previously, if a function `func` did not exist on worker A but existed on worker B, and the user ran `rpc.rpc_sync(A, func)`, A would crash with a segmentation fault because it could not find the function. B would eventually time out, since RPCs time out after 60s by default.

At the root this comes from an unhandled exception when trying to deserialize the PythonUDF to run.

This PR makes it so that we can recover from this error, and A reports back a RemoteException to B indicating that the function was not found. Now, A will no longer crash and B can handle the exception appropriately and with more information.
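
For readers who want to see the failure mode in isolation, here is a minimal sketch using only the standard `pickle` module, not the actual PyTorch RPC code. It assumes the function is serialized by reference, as plain `pickle` does; `rpc_demo_module`, `handle_request`, and the `("ok", ...)` / `("error", ...)` reply tuples are hypothetical names made up for this illustration. The "callee" catches the `AttributeError` raised during deserialization and returns an error reply instead of crashing.

```python
import pickle
import sys
import types

# Hypothetical stand-in module that plays the role of the user's script.
demo = types.ModuleType("rpc_demo_module")
sys.modules["rpc_demo_module"] = demo


def foo_add(a, b):
    return a + b


# Register the function under the stand-in module so pickle can
# serialize it by reference (module name + attribute name).
foo_add.__module__ = "rpc_demo_module"
foo_add.__qualname__ = "foo_add"
demo.foo_add = foo_add

# "Caller" side: serialize the function reference and its arguments.
payload = pickle.dumps((foo_add, (1, 2)))


def handle_request(data):
    """Hypothetical callee-side handler: run the request or report an error."""
    try:
        func, args = pickle.loads(data)
    except AttributeError as exc:
        # The function cannot be resolved on this side; reply with an
        # error instead of letting the exception propagate.
        return ("error", f"function not found on callee: {exc}")
    return ("ok", func(*args))


before = handle_request(payload)  # function still present on the "callee"

# Simulate a callee that does not define the function.
delattr(demo, "foo_add")
after = handle_request(payload)   # lookup fails, error reply is returned
```

In this sketch `before` carries the result of the call, while `after` carries an error message that names the missing function, which is roughly the information the caller needs to handle the failure.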

@rohan-varma rohan-varma changed the title from "[rpc] don't crash callee when function does not exist on it" to "[rpc] don't crash callee when function does not exist on it, instead return Exception" Jan 28, 2020
@rohan-varma rohan-varma requested a review from xush6528 January 29, 2020 01:46
@mrshenli mrshenli left a comment

LGTM! Thanks!

# Use delattr to remove the binding of a func on callee nodes
import sys
this_module = sys.modules[__name__]
delattr(this_module, "foo_add")

Question: do we have to explicitly delete the attr? Would it work if we defined foo_add as a local function within the test?


@mrshenli If we define the function locally, then on the caller side (worker0 in this case) the pickler fails with a Cannot pickle local function error. So it looks like that case already has error handling.
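
The caller-side failure described here can be reproduced with the standard `pickle` module alone. This is a sketch under the assumption that the RPC pickler behaves like plain `pickle` for local functions; `make_local_function` and `try_pickle` are hypothetical helpers, and the exact exception type and message vary across Python versions, so both `AttributeError` and `pickle.PicklingError` are caught.

```python
import pickle


def make_local_function():
    # foo_add is local to this function, so it has no importable name.
    def foo_add(a, b):
        return a + b
    return foo_add


def try_pickle(func):
    """Return a description of the pickling error, or None on success."""
    try:
        pickle.dumps(func)
    except (AttributeError, pickle.PicklingError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


local_func = make_local_function()
pickle_error = try_pickle(local_func)
print(pickle_error)
```

Because the error is raised while serializing on the caller, the request never reaches the callee, which is why this case does not exercise the missing-function path.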

@mrshenli

There is a conflict in torch/testing/_internal/distributed/rpc/rpc_test.py that needs to be resolved before landing.

@facebook-github-bot facebook-github-bot left a comment

@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…return Exception (pytorch#32726)

Summary:
Closes pytorch#27368.
Previously, if a function `func` did not exist on worker A but existed on worker B, and the user ran `rpc.rpc_sync(A, func)`, A would crash with a segmentation fault because it could not find the function. B would eventually time out, since RPCs time out after 60s by default.

At the root this comes from an unhandled exception when trying to deserialize the `PythonUDF` to run.

This PR makes it so that we can recover from this error, and A reports back a `RemoteException` to B indicating that the function was not found. Now, A will no longer crash and B can handle the exception appropriately and with more information.
Pull Request resolved: pytorch#32726

Differential Revision: D19648825

Pulled By: rohan-varma

fbshipit-source-id: 53847f4bfb68187db41c61d69ddac13613e814b4
@facebook-github-bot facebook-github-bot deleted the handle_attribute_error branch July 13, 2020 17:57
Successfully merging this pull request may close these issues.

Improve error message if Python function is not found on callee instead of crashing
4 participants