-
Notifications
You must be signed in to change notification settings - Fork 25.4k
[rpc] don't crash callee when function does not exist on it, instead return Exception #32726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💊 CircleCI build failures summary and remediationsAs of commit d5f6bf3: Commit d5f6bf3 was recently pushed. Waiting for builds... This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 14 times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
# Use delattr to remove the binding of a func on callee nodes | ||
import sys | ||
this_module = sys.modules[__name__] | ||
delattr(this_module, "foo_add") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: I wonder do we have to explicitly delete the attr? Does it work if we define foo_add
as a local function within the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrshenli If we define the function locally, then on the caller side (worker0 in this case) the pickler fails with a Cannot pickle local function
error. So it looks like that case already has error handling.
There is a conflict in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…return Exception (pytorch#32726) Summary: Closes pytorch#27368. Previously, if a function `'func` did not exist on worker A but existed in B, and the user ran `rpc.rpc_sync(A, func)`, A would crash with a segmentation fault since it is not able to find the function. B would eventually timeout since RPCs by default time out in 60s. At the root this comes from an unhandled exception when trying to deserialize the `PythonUDF` to run. This PR makes it so that we can recover from this error, and A reports back a `RemoteException` to B indicating that the function was not found. Now, A will no longer crash and B can handle the exception appropriately and with more information. Pull Request resolved: pytorch#32726 Differential Revision: D19648825 Pulled By: rohan-varma fbshipit-source-id: 53847f4bfb68187db41c61d69ddac13613e814b4
Closes #27368.
Previously, if a function
'func
did not exist on worker A but existed in B, and the user ranrpc.rpc_sync(A, func)
, A would crash with a segmentation fault since it is not able to find the function. B would eventually timeout since RPCs by default time out in 60s.At the root this comes from an unhandled exception when trying to deserialize the
PythonUDF
to run.This PR makes it so that we can recover from this error, and A reports back a
RemoteException
to B indicating that the function was not found. Now, A will no longer crash and B can handle the exception appropriately and with more information.