Add more RRef CUDA RPC tests #56757
Conversation
Differential Revision: [D27959592](https://our.internmc.facebook.com/intern/diff/D27959592)
```python
@skip_if_lt_x_gpu(2)
def test_rref_forward_synchronization4(self):
    self._test_rref_forward_synchronization("cuda:1", "cuda:1")
```
Wondering if something like the following could be used to replace all 4 test cases:

```python
def test_rref_forward_synchronization(self):
    devices = ["cuda:0", "cuda:1"]
    for device1, device2 in itertools.product(devices, repeat=2):
        with self.subTest(device1=device1, device2=device2):
            self._test_rref_forward_synchronization(device1, device2)
```

Probably not, because we wouldn't be re-initializing the RPC workers for each case, right?
Yep, we currently do not support modifying device maps, and re-initializing RPC is not well tested either. Another reason is that we might want to keep each test small and specific.
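For context, the `subTest`/`itertools.product` pattern suggested above can be exercised with plain `unittest`. This is a minimal self-contained sketch; the `_check` helper here is a hypothetical stand-in for `_test_rref_forward_synchronization`, which needs a real multi-GPU RPC setup:

```python
import itertools
import unittest

class DevicePairTest(unittest.TestCase):
    def _check(self, device1, device2):
        # Hypothetical stand-in for _test_rref_forward_synchronization.
        self.assertTrue(device1.startswith("cuda"))
        self.assertTrue(device2.startswith("cuda"))

    def test_all_device_pairs(self):
        devices = ["cuda:0", "cuda:1"]
        # subTest reports each failing (device1, device2) pair separately
        # instead of aborting the whole test at the first failure.
        for device1, device2 in itertools.product(devices, repeat=2):
            with self.subTest(device1=device1, device2=device2):
                self._check(device1, device2)
```

The trade-off noted in the reply still applies: all four pairs would share one RPC initialization, whereas separate test methods keep each case small and isolated.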
```python
    out_relay,
    TensorPipeAgentCudaRpcTest._rref_relay,
    args=(rref_out,)
).to_here()
```
I think the issue @mrzzd was running into is that the RRef is forwarded to another worker (in this case `out_relay`), and then when `to_here()` is called on that worker, the synchronization has issues. So would it be useful if this test also called `to_here()` on the `out_relay` node?

Nevermind, it looks like this is what `_rref_relay` tests.
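The forwarding pattern under discussion can be sketched in a single-process, CPU-only setup. This is an illustrative reduction, not the test's actual code: the worker name, helpers, and payload are made up, and with only one worker there is no cross-device synchronization to stress:

```python
import os
import torch
import torch.distributed.rpc as rpc

def make_tensor():
    return torch.ones(2)

def relay(rref):
    # Resolve the forwarded RRef on the receiving worker. In the real
    # test this is the step where CUDA stream synchronization matters.
    return rref.to_here() + 1

def main():
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("worker0", rank=0, world_size=1)
    # Create an RRef, then forward it as an argument to a second remote
    # call, which resolves it with to_here() -- the relay pattern.
    rref = rpc.remote("worker0", make_tensor)
    out = rpc.rpc_sync("worker0", relay, args=(rref,))
    rpc.shutdown()
    return out
```

In the actual test, the owner and the relay are different workers on different CUDA devices, which is what makes the synchronization non-trivial.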
Summary: Pull Request resolved: pytorch#56757

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27959592

Pulled By: mrshenli

fbshipit-source-id: b72c873bcaef4515b0fc8d48ae539477e1850a40
Stack from ghstack:
Differential Revision: D27959592