
[RPC Framework] Forbid process group backend in RPC #56854

Closed · wants to merge 2 commits

Conversation

wayi1
Contributor

@wayi1 wayi1 commented Apr 24, 2021

Stack from ghstack:

To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the TorchScript remote_module_template.

Otherwise, we would need to check the current RPC backend to decide whether to move the forward output back to CPU (a sketch of that check follows the list below):

  1. If the RPC backend is process group, move the forward output on CUDA back to CPU, because the process group backend does not support CUDA tensors.
  2. If the RPC backend is TensorPipe, the forward output on CUDA does not need to be moved, as long as a device map is provided.
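
For illustration only, a rough sketch of the backend-dependent check described above; the helper name and the way the backend is passed in are assumptions (and `BackendType.PROCESS_GROUP` refers to the backend as it existed at the time), not the actual `remote_module_template` code:

```python
import torch
import torch.distributed.rpc as rpc

def _maybe_move_output_to_cpu(output: torch.Tensor, backend) -> torch.Tensor:
    # Hypothetical helper: the process group agent only carries CPU tensors,
    # so a CUDA result would have to be copied back before being returned
    # over RPC. TensorPipe can keep it on GPU when a device map is set.
    if backend == rpc.BackendType.PROCESS_GROUP and output.is_cuda:
        return output.cpu()
    return output
```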

Differential Revision: D27984658

To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the TorchScript `remote_module_template`.

Differential Revision: [D27984658](https://our.internmc.facebook.com/intern/diff/D27984658/)

[ghstack-poisoned]
@facebook-github-bot
Contributor

facebook-github-bot commented Apr 24, 2021

💊 CI failures summary and remediations

As of commit 36f1a3f (more details on the Dr. CI page):


  • 7/7 failures possibly* introduced in this PR
    • 1/7 non-scanned failure(s)

🕵️ 6 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build mypy (1/6)

Step: "Run mypy" (full log | diagnosis details | 🔁 rerun)

2021-04-24T06:22:11.8312914Z torch/testing/_int...pe]" has no attribute "TENSORPIPE" [attr-defined]
2021-04-24T06:21:01.8062288Z env:
2021-04-24T06:21:01.8062867Z   pythonLocation: /opt/hostedtoolcache/Python/3.8.9/x64
2021-04-24T06:21:01.8063658Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.9/x64/lib
2021-04-24T06:21:01.8064227Z   LOCAL_FILES: 
2021-04-24T06:21:01.8064649Z ##[endgroup]
2021-04-24T06:21:01.8151644Z + for CONFIG in mypy*.ini
2021-04-24T06:21:01.8153617Z + mypy --config=mypy-strict.ini
2021-04-24T06:21:21.6558103Z Success: no issues found in 77 source files
2021-04-24T06:21:22.3732526Z + for CONFIG in mypy*.ini
2021-04-24T06:21:22.3733605Z + mypy --config=mypy.ini
2021-04-24T06:22:11.8312914Z torch/testing/_internal/dist_utils.py:94: error: "Type[BackendType]" has no attribute "TENSORPIPE"  [attr-defined]
2021-04-24T06:22:42.3090191Z Found 1 error in 1 file (checked 1316 source files)
2021-04-24T06:22:43.5308930Z ##[error]Process completed with exit code 1.
2021-04-24T06:22:43.5413992Z Post job cleanup.
2021-04-24T06:22:43.6538305Z [command]/usr/bin/git version
2021-04-24T06:22:43.6598033Z git version 2.31.1
2021-04-24T06:22:43.6644014Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2021-04-24T06:22:43.6688439Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2021-04-24T06:22:43.6979564Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2021-04-24T06:22:43.7023956Z http.https://github.com/.extraheader
2021-04-24T06:22:43.7033465Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (2/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 08:22:37 AssertionError: "RPC backend on...ation worker3, but found tensor on device: cuda:0"
Apr 24 08:22:37   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
Apr 24 08:22:37     return func(*args, **kwargs)
Apr 24 08:22:37   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 100, in new_test_method
Apr 24 08:22:37     return_value = old_test_method(self, *arg, **kwargs)
Apr 24 08:22:37   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4302, in test_cuda
Apr 24 08:22:37     rpc.rpc_sync(dst, torch.add, args=(t1, t2))
Apr 24 08:22:37   File "/opt/conda/lib/python3.6/unittest/case.py", line 217, in __exit__
Apr 24 08:22:37     expected_regex.pattern, str(exc_value)))
Apr 24 08:22:37   File "/opt/conda/lib/python3.6/unittest/case.py", line 135, in _raiseFailure
Apr 24 08:22:37     raise self.test_case.failureException(msg)
Apr 24 08:22:37 AssertionError: "RPC backend only supports CPU tensors.*Found tensor on device: cuda:0" does not match "TensorPipe RPC backend only supports CPU tensors by default, please move your tensors to CPU before sending them over RPC, or call `set_device_map` on `TensorPipeRpcBackendOptions` to explicitly configure device mapping. Request device mapping is not available for destination worker3, but found tensor on device: cuda:0"
Apr 24 08:22:37 
Apr 24 08:22:37 
Apr 24 08:22:37 
Apr 24 08:22:37 ----------------------------------------------------------------------
Apr 24 08:22:37 Ran 16 tests in 35.574s
Apr 24 08:22:37 
Apr 24 08:22:37 FAILED (errors=1, skipped=9)
Apr 24 08:22:37 
Apr 24 08:22:37 Generating XML reports...
Apr 24 08:22:37 Generated XML report: test-reports/dist-gloo/distributed.rpc.cuda.test_process_group_agent/TEST-ProcessGroupCudaDistAutogradTestWithSpawn-20210424082201.xml

See CircleCI build pytorch_macos_10_13_py3_test (3/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Apr 24 07:22:37 ERROR [1.932s]: test_timeout_in...on (__main__.FaultyJitFaultyAgentRpcTestWithSpawn)
Apr 24 07:22:37     fn()
Apr 24 07:22:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/dist_utils.py", line 97, in new_test_method
Apr 24 07:22:37     rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method=self.init_method),
Apr 24 07:22:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 120, in init_rpc
Apr 24 07:22:37     raise TypeError("Argument backend must be a member of BackendType")
Apr 24 07:22:37 TypeError: Argument backend must be a member of BackendType
Apr 24 07:22:37 
Apr 24 07:22:37 
Apr 24 07:22:37 
Apr 24 07:22:37 ======================================================================
Apr 24 07:22:37 ERROR [1.932s]: test_timeout_in_torchscript_function (__main__.FaultyJitFaultyAgentRpcTestWithSpawn)
Apr 24 07:22:37 ----------------------------------------------------------------------
Apr 24 07:22:37 Traceback (most recent call last):
Apr 24 07:22:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
Apr 24 07:22:37     self._join_processes(fn)
Apr 24 07:22:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 567, in _join_processes
Apr 24 07:22:37     self._check_return_codes(elapsed_time)
Apr 24 07:22:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 610, in _check_return_codes
Apr 24 07:22:37     raise RuntimeError(error)
Apr 24 07:22:37 RuntimeError: Process 0 exited with error code 10 and exception:
Apr 24 07:22:37 Traceback (most recent call last):

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_test2 (4/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 08:10:49 AssertionError: 'agent.num_pend...us': '0.000000', 'agent.client_active_calls': '0'}
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 376, in wrapper
Apr 24 08:10:49     fn()
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/dist_utils.py", line 100, in new_test_method
Apr 24 08:10:49     return_value = old_test_method(self, *arg, **kwargs)
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4175, in test_process_group_debug_info
Apr 24 08:10:49     self.assertIn("agent.num_pending_requests", info)
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/unittest/case.py", line 1179, in assertIn
Apr 24 08:10:49     self.fail(self._formatMessage(msg, standardMsg))
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/unittest/case.py", line 753, in fail
Apr 24 08:10:49     raise self.failureException(msg)
Apr 24 08:10:49 AssertionError: 'agent.num_pending_requests' not found in {'agent.num_idle_threads': '16', 'agent.server_active_calls': '0', 'agent.thread_pool_size': '16', 'agent.server_active_async_calls': '0', 'agent.gil_average_wait_time_us': '0.000000', 'agent.client_active_calls': '0'}
Apr 24 08:10:49 
Apr 24 08:10:49 
Apr 24 08:10:49 
Apr 24 08:10:49 ======================================================================
Apr 24 08:10:49 ERROR [2.964s]: test_process_group_set_default_timeout (__main__.ProcessGroupProcessGroupAgentRpcTestWithSpawn)
Apr 24 08:10:49 ----------------------------------------------------------------------
Apr 24 08:10:49 Traceback (most recent call last):
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
Apr 24 08:10:49     self._join_processes(fn)
Apr 24 08:10:49   File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 567, in _join_processes

See CircleCI build pytorch_linux_bionic_py3_6_clang9_noarch_test (5/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 07:10:27 ERROR [1.836s]: test_timeout_in...on (__main__.FaultyJitFaultyAgentRpcTestWithSpawn)
Apr 24 07:10:27     fn()
Apr 24 07:10:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 97, in new_test_method
Apr 24 07:10:27     rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method=self.init_method),
Apr 24 07:10:27   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/__init__.py", line 120, in init_rpc
Apr 24 07:10:27     raise TypeError("Argument backend must be a member of BackendType")
Apr 24 07:10:27 TypeError: Argument backend must be a member of BackendType
Apr 24 07:10:27 
Apr 24 07:10:27 
Apr 24 07:10:27 
Apr 24 07:10:27 ======================================================================
Apr 24 07:10:27 ERROR [1.836s]: test_timeout_in_torchscript_function (__main__.FaultyJitFaultyAgentRpcTestWithSpawn)
Apr 24 07:10:27 ----------------------------------------------------------------------
Apr 24 07:10:27 Traceback (most recent call last):
Apr 24 07:10:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
Apr 24 07:10:27     self._join_processes(fn)
Apr 24 07:10:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 567, in _join_processes
Apr 24 07:10:27     self._check_return_codes(elapsed_time)
Apr 24 07:10:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 610, in _check_return_codes
Apr 24 07:10:27     raise RuntimeError(error)
Apr 24 07:10:27 RuntimeError: Process 0 exited with error code 10 and exception:
Apr 24 07:10:27 Traceback (most recent call last):

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (6/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 08:00:03 ERROR [2.149s]: test_timeout_in...on (__main__.FaultyJitFaultyAgentRpcTestWithSpawn)
Apr 24 08:00:03     fn()
Apr 24 08:00:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 97, in new_test_method
Apr 24 08:00:03     rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method=self.init_method),
Apr 24 08:00:03   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/__init__.py", line 120, in init_rpc
Apr 24 08:00:03     raise TypeError("Argument backend must be a member of BackendType")
Apr 24 08:00:03 TypeError: Argument backend must be a member of BackendType
Apr 24 08:00:03 
Apr 24 08:00:03 
Apr 24 08:00:03 
Apr 24 08:00:03 ======================================================================
Apr 24 08:00:03 ERROR [2.149s]: test_timeout_in_torchscript_function (__main__.FaultyJitFaultyAgentRpcTestWithSpawn)
Apr 24 08:00:03 ----------------------------------------------------------------------
Apr 24 08:00:03 Traceback (most recent call last):
Apr 24 08:00:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
Apr 24 08:00:03     self._join_processes(fn)
Apr 24 08:00:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 567, in _join_processes
Apr 24 08:00:03     self._check_return_codes(elapsed_time)
Apr 24 08:00:03   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 610, in _check_return_codes
Apr 24 08:00:03     raise RuntimeError(error)
Apr 24 08:00:03 RuntimeError: Process 0 exited with error code 10 and exception:
Apr 24 08:00:03 Traceback (most recent call last):

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your pull requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed labels Apr 24, 2021
To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the TorchScript `remote_module_template`.

Differential Revision: [D27984658](https://our.internmc.facebook.com/intern/diff/D27984658/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Apr 24, 2021
Pull Request resolved: #56854

To resolve #51670, forbid the process group RPC backend. This avoids the need to check the current backend in the TorchScript `remote_module_template`.
ghstack-source-id: 127345847

Differential Revision: [D27984658](https://our.internmc.facebook.com/intern/diff/D27984658/)
Member

@rohan-varma rohan-varma left a comment


I was under the impression that in 1.9, PG backend will stay in a deprecated state and be the last release where it exists, as per #55616. So afaik we probably want to keep supporting it with RPC framework for now? cc @mrshenli

@wayi1 wayi1 requested a review from lw April 26, 2021 17:29
@mrshenli
Contributor

I was under the impression that in 1.9, PG backend will stay in a deprecated state and be the last release where it exists, as per #55616. So afaik we probably want to keep supporting it with RPC framework for now?

Yep the plan is to keep PG RPC backend in v1.9.

If the goal is to make RemoteModule work only with the TensorPipe backend, I think it is fine, as RemoteModule is still a beta release and we have the flexibility to make changes. However, that means the check needs to be done in RemoteModule, instead of disabling the PG backend for RPC.
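
As a rough illustration of that alternative, a RemoteModule-level guard might look like the sketch below; it uses the private `_get_current_rpc_agent()` helper, and the agent class name and error message are assumptions rather than code from any actual PR:

```python
import torch.distributed.rpc as rpc

def _check_backend_supports_cuda() -> None:
    # Hypothetical guard inside RemoteModule: reject the process group agent
    # here instead of forbidding it globally in init_rpc.
    agent = rpc.api._get_current_rpc_agent()
    if type(agent).__name__ == "ProcessGroupAgent":
        raise RuntimeError(
            "RemoteModule with CUDA outputs requires the TensorPipe RPC "
            "backend; the process group backend only supports CPU tensors."
        )
```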

@wayi1
Contributor Author

wayi1 commented Apr 26, 2021

I was under the impression that in 1.9, PG backend will stay in a deprecated state and be the last release where it exists, as per #55616. So afaik we probably want to keep supporting it with RPC framework for now?

Yep the plan is to keep PG RPC backend in v1.9.

If the goal is to make RemoteModule work only with the TensorPipe backend, I think it is fine, as RemoteModule is still a beta release and we have the flexibility to make changes. However, that means the check needs to be done in RemoteModule, instead of disabling the PG backend for RPC.

Yes, I understand that this PR is not completely deprecating PG backend for RPC. This is just the minimum work to support a follow-up PR. I plan to discuss with @lw and create a separate PR to completely remove the PG backend.

@mrshenli
Contributor

mrshenli commented Apr 26, 2021

Yes, I understand that this PR is not completely deprecating PG backend for RPC. This is just the minimum work to support a follow-up PR.

I might miss sth. IIUC, this PR will throw a ValueError when the RPC backend is PG in init_rpc. So, from user's perspective, we are completely deprecating PG backend for RPC?

  1. If the RPC backend is process group, move the forward output on CUDA back to CPU, because the process group backend does not support CUDA tensors.

If this is what you would like to avoid, is it acceptable to let RemoteModule error out when the backend is PG?

@mrshenli
Contributor

  1. If the RPC backend is TensorPipe, the forward output on CUDA does not need to be moved, as long as a device map is provided.

Curious, how does RemoteModule know whether the device map is provided? Read from the RPC agent state?

@wayi1
Contributor Author

wayi1 commented Apr 26, 2021

Yes, I understand that this PR is not completely deprecating PG backend for RPC. This is just the minimum work to support a follow-up PR.

I might miss sth. IIUC, this PR will throw a ValueError when the RPC backend is PG in init_rpc. So, from user's perspective, we are completely deprecating PG backend for RPC?

  1. If the RPC backend is process group, move the forward output on CUDA back to CPU, because the process group backend does not support CUDA tensors.

If this is what you would like to avoid, is it acceptable to let RemoteModule error out when the backend is PG?

Yes, per discussion with @pritamdamania87, we plan to have RemoteModule error out on the PG backend.

@lw
Contributor

lw commented Apr 26, 2021

I agree with @mrshenli and @rohan-varma: we should not forbid the PG backend in 1.9. If some component requires the RPC agent to support CUDA tensors, we can add runtime checks for this in that component. If it helps, I think we could consider adding a supportsCuda() method to the RPC agent API to help with this kind of check.
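
If such a method were added, a component could gate itself on it. The sketch below assumes a hypothetical Python-side `supports_cuda()` accessor mirroring the proposed C++ `supportsCuda()`; it is not part of the existing API:

```python
import torch.distributed.rpc as rpc

def _require_cuda_capable_agent() -> None:
    # Hypothetical: treat a missing accessor as "no CUDA support".
    agent = rpc.api._get_current_rpc_agent()
    if not getattr(agent, "supports_cuda", lambda: False)():
        raise RuntimeError(
            "This component requires an RPC agent that can transport CUDA tensors."
        )
```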

@wayi1
Contributor Author

wayi1 commented Apr 26, 2021

I agree with @mrshenli and @rohan-varma: we should not forbid the PG backend in 1.9. If some component requires the RPC agent to support CUDA tensors, we can add runtime checks for this in that component. If it helps, I think we could consider adding a supportsCuda() method to the RPC agent API to help with this kind of check.

Thanks for the suggestion! I'm not sure it's worthwhile to add a new method temporarily just for version 1.9. If we don't want to forbid the PG backend before 1.9, we can probably bypass this issue for now.

I created #56943, which always moves the forward output back to CPU for now, and added some TODOs to fix this after 1.9.

@wayi1 wayi1 changed the title [RPC Framework] Forbid process group backend in RPC [DO NOT LAND BEFORE 1.9] [RPC Framework] Forbid process group backend in RPC Apr 26, 2021
@wayi1
Contributor Author

wayi1 commented Apr 26, 2021

  1. If the RPC backend is TensorPipe, the forward output on CUDA does not need to be moved, as long as a device map is provided.

Curious, how does RemoteModule know whether the device map is provided? Read from the RPC agent state?

Same thought, but none of the getters in the RpcAgent class can provide a device map at this time.

@lw: could you provide a get_device_map() method in RpcAgent class?

@lw
Contributor

lw commented Apr 27, 2021

@lw: could you provide a get_device_map() method in RpcAgent class?

@mrshenli @pritamdamania87 Doesn't dist autograd also need to retrieve the device maps from the RPC agent in order to store them for use in the backward pass? How does it do so? If there's already some code to do that, we should reuse it. If not, I'm fine with adding that method.

@mrshenli
Contributor

Yes, per discussion with @pritamdamania87, we plan to have RemoteModule error out on the PG backend.

This will break the promise we made in: #55615

@wayi1
Contributor Author

wayi1 commented Apr 27, 2021

Yes, per discussion with @pritamdamania87, we plan to have RemoteModule error out on the PG backend.

This will break the promise we made in: #55615

@mrshenli Thanks for referring to this issue! I don't think there is a need to forbid PG backend at this time.

Now I realize that I don't really need to check the RPC backend in RemoteModule. Since the TensorPipe backend can only send GPU tensors over the wire with a device map, all I need to check in RemoteModule is whether a device map is set.

We can close this PR once @lw's comment #56854 (comment) is addressed.

@pritamdamania87
Contributor

@mrshenli @pritamdamania87 Doesn't dist autograd also need to retrieve the device maps from the RPC agent in order to store them for use in the backward pass? How does it do so? If there's already some code to do that, we should reuse it. If not, I'm fine with adding that method.

Yes, I added support for this in #44859.

@wayi1
Contributor Author

wayi1 commented Apr 27, 2021

@mrshenli @pritamdamania87 Doesn't dist autograd also need to retrieve the device maps from the RPC agent in order to store them for use in the backward pass? How does it do so? If there's already some code to do that, we should reuse it. If not, I'm fine with adding that method.

Yes, I added support for this in #44859.

Thanks for the pointer!

@lw So it seems that all we need now is to expose a Python API for the C++ getDeviceMap.
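
Once such a binding exists, the RemoteModule-side check could be as small as the sketch below; `_get_device_map` is a hypothetical Python name for the C++ `getDeviceMap` and may not match what #57179 actually exposed:

```python
import torch.distributed.rpc as rpc

def _is_device_map_set(dst_worker_name: str) -> bool:
    # Hypothetical: ask the current agent for the device map configured for
    # the destination worker; an empty map means GPU tensors cannot be sent
    # directly, so outputs should stay on (or be moved to) CPU.
    agent = rpc.api._get_current_rpc_agent()
    device_map = agent._get_device_map(rpc.get_worker_info(dst_worker_name))
    return len(device_map) > 0
```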

wayi1 pushed a commit that referenced this pull request Apr 28, 2021
Expose a Python API to get the device map and unblock RemoteModule work.

See: #56854 (comment)

Additionally, add a const decorator for the C++ getter.

Differential Revision: [D28070160](https://our.internmc.facebook.com/intern/diff/D28070160/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Apr 28, 2021
Expose a Python API to get the device map and unblock RemoteModule work.

See: #56854 (comment)

Additionally, add a const decorator for the C++ getter.

#Original PR issue: #51670

Differential Revision: [D28070160](https://our.internmc.facebook.com/intern/diff/D28070160/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Apr 28, 2021
Pull Request resolved: #57179

Expose a Python API to get the device map and unblock RemoteModule work.

See: #56854 (comment)

Additionally, add a const decorator for the C++ getter.

#Original PR issue: #51670
ghstack-source-id: 127684266

Differential Revision: [D28070160](https://our.internmc.facebook.com/intern/diff/D28070160/)
facebook-github-bot pushed a commit that referenced this pull request Apr 29, 2021
Summary:
Pull Request resolved: #57179

Expose a Python API to get the device map and unblock RemoteModule work.

See: #56854 (comment)

Additionally, add a const decorator for the C++ getter.

#Original PR issue: #51670
ghstack-source-id: 127684266

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D28070160

fbshipit-source-id: 624d14552d82b99487f72e16428fa75c7a47f61f
@wayi1
Contributor Author

wayi1 commented May 1, 2021

Abandoning this PR.

There is no need to forbid the process group RPC backend. Whether GPU tensors can be sent over the wire is now determined by whether a device map for the remote worker is set.
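
For reference, configuring a device map with the TensorPipe backend looks roughly like this; the worker names, ranks, and `init_method` URL are placeholders:

```python
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions(
    init_method="tcp://localhost:29500",  # placeholder rendezvous address
)
# Map this worker's cuda:0 to cuda:0 on "worker1" so CUDA tensors can be
# sent over the wire instead of being copied back to CPU first.
options.set_device_map("worker1", {0: 0})

rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=options,
)
```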

@wayi1 wayi1 closed this May 1, 2021
@wayi1 wayi1 changed the title [DO NOT LAND BEFORE 1.9] [RPC Framework] Forbid process group backend in RPC [RPC Framework] Forbid process group backend in RPC May 1, 2021
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021

Summary:
Pull Request resolved: pytorch#57179

Expose a Python API to get the device map and unblock RemoteModule work.

See: pytorch#56854 (comment)

Additionally, add a const decorator for the C++ getter.

#Original PR issue: pytorch#51670
ghstack-source-id: 127684266

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D28070160

fbshipit-source-id: 624d14552d82b99487f72e16428fa75c7a47f61f
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/114/head branch May 31, 2021 14:17
Labels
cla signed · oncall: distributed

6 participants