Conversation

mrshenli
Contributor

@mrshenli mrshenli commented Apr 23, 2021

Stack from ghstack:

Differential Revision: D27959592

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 23, 2021

facebook-github-bot commented Apr 23, 2021

💊 CI failures summary and remediations

As of commit 8801144 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mrshenli added a commit that referenced this pull request Apr 23, 2021
ghstack-source-id: 1758ab5
Pull Request resolved: #56757
Comment on lines +5720 to +5722
@skip_if_lt_x_gpu(2)
def test_rref_forward_synchronization4(self):
self._test_rref_forward_synchronization("cuda:1", "cuda:1")
Member

Wondering if something like below could be used to replace all 4 test cases.

import itertools

def test_rref_forward_synchronization(self):
    devices = ["cuda:0", "cuda:1"]
    for device1, device2 in itertools.product(devices, repeat=2):
        with self.subTest(device1=device1, device2=device2):
            self._test_rref_forward_synchronization(device1, device2)

Probably not, because we wouldn't be re-initializing RPC workers for each test, right?
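For reference, `itertools.product` with `repeat=2` over the two devices yields exactly the four `(device1, device2)` combinations that the four explicit `test_rref_forward_synchronization*` cases cover; a minimal standalone check:

```python
import itertools

devices = ["cuda:0", "cuda:1"]
# Cartesian product of the device list with itself: 2 * 2 = 4 ordered pairs,
# one per explicit test case in the PR.
pairs = list(itertools.product(devices, repeat=2))
print(pairs)
```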

Contributor Author


Yep, we currently do not support modifying device maps, and re-initializing RPC is not well tested either. Another reason is that we might want to keep each test small and specific.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in acca89e.

out_relay,
TensorPipeAgentCudaRpcTest._rref_relay,
args=(rref_out,)
).to_here()
Contributor

@rohan-varma rohan-varma Apr 23, 2021

I think the issue @mrzzd was running into is that the RRef is forwarded to another worker (in this case out_relay), and when to_here() is then called on that worker, the synchronization has issues?

So would it be useful if this test also tested calling to_here() on out_relay node?

Nevermind, it looks like this is what _rref_relay tests.
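`_rref_relay` itself is not shown in this excerpt. A hypothetical sketch of the pattern rohan-varma describes — the relay worker receives the forwarded RRef and materializes it there with `to_here()` — could look like the following, where `FakeRRef` is a purely local stand-in for an RPC RRef:

```python
class FakeRRef:
    """Local stand-in for torch.distributed.rpc.RRef (illustration only)."""

    def __init__(self, value):
        self._value = value
        self.to_here_calls = 0  # tracks how often the value was fetched

    def to_here(self):
        # In real RPC this blocks until the owner's value (and, on CUDA,
        # the streams that produced it) is ready on the calling worker,
        # which is the synchronization point under discussion.
        self.to_here_calls += 1
        return self._value


def _rref_relay(rref):
    # Runs on the relay worker: calling to_here() on the forwarded RRef
    # exercises RRef synchronization from a worker other than the caller.
    return rref.to_here()


rref_out = FakeRRef(42)
relayed = _rref_relay(rref_out)
```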

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/313/head branch April 27, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary: Pull Request resolved: pytorch#56757

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27959592

Pulled By: mrshenli

fbshipit-source-id: b72c873bcaef4515b0fc8d48ae539477e1850a40
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue