[DDP] Test inference works with eval() and no_grad() #59666
Conversation
Tests that inference with a DDP model won't hang when the user sets eval() or no_grad(). Note that if the model has a SyncBN layer, both eval() and no_grad() are needed, since eval() makes SyncBN work like a regular BN layer. Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)
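For context, a minimal sketch of the inference pattern this test exercises, assuming an initialized process group; the module, shapes, and function name here are illustrative rather than the test's actual code:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_inference(rank: int) -> torch.Tensor:
    # SyncBatchNorm issues collectives in training mode, so a model that
    # contains it needs both eval() and no_grad(): eval() makes SyncBN
    # behave like a regular BN layer, and no_grad() keeps DDP from
    # preparing for a backward pass.
    model = nn.Sequential(nn.Conv2d(2, 4, 3), nn.SyncBatchNorm(4)).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    ddp_model.eval()
    with torch.no_grad():
        inp = torch.randn(10, 2, 4, 4, device=rank)
        return ddp_model(inp)
```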
💊 CI failures summary and remediations
As of commit 3364263 (more details on the Dr. CI page and at hud.pytorch.org/pr/59666):
1 failure not recognized by patterns:
ci.pytorch.org: 1 failed
This comment was automatically generated by Dr. CI.
cc @mrshenli
```python
if self.rank == 0:
    with torch.no_grad():
        for _ in range(6):
            ddp_out = model(inp)
```
Nit: If you rename `model` to `ddp_model`, then you can just use one line here: `self.assertEqual(ddp_model(inp), local_model(inp))`. Just more concise; it's optional.
```python
# or eval setting and there is no hang.
rank = self.rank
torch.cuda.set_device(rank)
model = Net().cuda()
```
Nit: I believe you can just create a for loop over two `(model, input)` tuples, where the models are `Net().cuda()` and `nn.SyncBatchNorm`. This would save some duplicate code and improve readability.
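A rough sketch of that suggestion; `Net`, `rank`, and the input shapes are taken from the test's context and assumed here:

```python
# Iterate over (model, input) pairs instead of duplicating the test body.
models_and_inputs = [
    (Net().cuda(rank), torch.randn(10, 2, device=rank)),
    (nn.SyncBatchNorm(2).cuda(rank), torch.randn(10, 2, 4, 4, device=rank)),
]
for model, inp in models_and_inputs:
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    ddp_model.eval()
    with torch.no_grad():
        ddp_model(inp)
```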
```python
self.assertEqual(ddp_out, local_out)
torch.cuda.synchronize()
self._barrier(timeout=30)
```
Are all three barriers here necessary, even after `torch.cuda.synchronize`? Why do they need a non-default, higher timeout here?
I don't think the calls to `synchronize` are actually necessary; I will remove those.
The calls to barrier are there because we're only running inference on rank 0. If the inference unexpectedly takes too long, the default barrier, which I believe has a timeout of 10s, can time out, leading to a false-positive failure.
In this test the inference does not take nearly the full 30s, but I wanted plenty of buffer to avoid flakiness.
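For illustration, the resulting structure looks roughly like this (names reused from the diff above; the ~10s default and the 30s buffer are the values discussed in this thread):

```python
if self.rank == 0:
    # Only rank 0 runs inference; eval()/no_grad() keep DDP and SyncBN
    # from issuing collectives, so this should not hang.
    with torch.no_grad():
        for _ in range(6):
            ddp_out = ddp_model(inp)
# All ranks meet here. The default barrier timeout could expire if
# inference on rank 0 is unexpectedly slow, so 30s adds headroom
# against flaky false positives.
self._barrier(timeout=30)
```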
```python
    device_ids=[rank]
)
inp = torch.randn(10, 2, device=rank)
inp_syncbn = torch.randn(10, 2, 4, 4, device=rank)
```
What is "inp" an abbreviation for?
It is short for "input".
This pull request has been merged in 0131a59.
Stack from ghstack:
Tests that inference with a DDP model won't hang when the user sets eval()
or no_grad(). Note that if the model has a SyncBN layer, both eval()
and no_grad() are needed, since eval() makes SyncBN work like a regular BN layer.
Differential Revision: D28974146