Conversation

mrshenli
Contributor

@mrshenli mrshenli commented Apr 23, 2021

Stack from ghstack:

Differential Revision: D27959592

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 23, 2021

facebook-github-bot commented Apr 23, 2021

💊 CI failures summary and remediations

As of commit 8801144 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mrshenli added a commit that referenced this pull request Apr 23, 2021
ghstack-source-id: 1758ab5
Pull Request resolved: #56757
Comment on lines +5720 to +5722
@skip_if_lt_x_gpu(2)
def test_rref_forward_synchronization4(self):
self._test_rref_forward_synchronization("cuda:1", "cuda:1")
Member

Wondering if something like below could be used to replace all 4 test cases.

import itertools

def test_rref_forward_synchronization(self):
    devices = ["cuda:0", "cuda:1"]
    for device1, device2 in itertools.product(devices, repeat=2):
        with self.subTest(device1=device1, device2=device2):
            self._test_rref_forward_synchronization(device1, device2)

Probably not, because we wouldn't be re-initializing RPC workers for each test, right?
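For reference, `itertools.product` with `repeat=2` over the two devices yields exactly the four `(device1, device2)` combinations that the four explicit `test_rref_forward_synchronization*` cases cover; a minimal standalone check:

```python
import itertools

devices = ["cuda:0", "cuda:1"]
# Cartesian product of the device list with itself: 2 * 2 = 4 ordered pairs,
# one per explicit test case in the PR.
pairs = list(itertools.product(devices, repeat=2))
print(pairs)
```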

Contributor Author


Yep, we currently do not support modifying device maps, and re-initializing RPC is not well tested either. Another reason is that we might want to keep each test small and specific.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in acca89e.

out_relay,
TensorPipeAgentCudaRpcTest._rref_relay,
args=(rref_out,)
).to_here()
Contributor

@rohan-varma rohan-varma Apr 23, 2021

I think the issue @mrzzd was running into is that the RRef is forwarded to another worker (in this case out_relay), and when to_here() is then called on that worker, the synchronization has issues?

So would it be useful if this test also tested calling to_here() on out_relay node?

Nevermind, it looks like this is what _rref_relay tests.
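`_rref_relay` itself is not shown in this excerpt. A hypothetical sketch of the pattern rohan-varma describes — the relay worker receives the forwarded RRef and materializes it there with `to_here()` — could look like the following, where `FakeRRef` is a purely local stand-in for an RPC RRef:

```python
class FakeRRef:
    """Local stand-in for torch.distributed.rpc.RRef (illustration only)."""

    def __init__(self, value):
        self._value = value
        self.to_here_calls = 0  # tracks how often the value was fetched

    def to_here(self):
        # In real RPC this blocks until the owner's value (and, on CUDA,
        # the streams that produced it) is ready on the calling worker,
        # which is the synchronization point under discussion.
        self.to_here_calls += 1
        return self._value


def _rref_relay(rref):
    # Runs on the relay worker: calling to_here() on the forwarded RRef
    # exercises RRef synchronization from a worker other than the caller.
    return rref.to_here()


rref_out = FakeRRef(42)
relayed = _rref_relay(rref_out)
```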

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/313/head branch April 27, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary: Pull Request resolved: pytorch#56757

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27959592

Pulled By: mrshenli

fbshipit-source-id: b72c873bcaef4515b0fc8d48ae539477e1850a40
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue