Conversation

rohan-varma
Contributor

@rohan-varma rohan-varma commented Jun 8, 2021

Stack from ghstack:

Tests that inference with a DDP model won't hang when the user sets eval()
or no_grad(). Note that if the model has a SyncBN layer, both eval()
and no_grad() are needed, as eval() makes SyncBN work like a regular BN layer.

Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)

[ghstack-poisoned]
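A minimal single-process sketch of the inference pattern this PR tests. The real test wraps the model in DistributedDataParallel; that setup is omitted here since it needs an initialized process group and GPUs, and a plain BatchNorm1d stands in for SyncBN.

```python
import torch
import torch.nn as nn

# Sketch of the pattern under test: eval() switches BN to running stats
# (for SyncBN, this is what makes it act like a regular BN layer), and
# no_grad() is also needed under DDP so no rank waits on grad collectives.
model = nn.Sequential(nn.Linear(2, 2), nn.BatchNorm1d(2))
model.eval()
with torch.no_grad():
    out = model(torch.randn(10, 2))
```

With both in effect, no autograd graph is built, so `out.requires_grad` is False.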
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 8, 2021

💊 CI failures summary and remediations

As of commit 3364263 (more details on the Dr. CI page and at hud.pytorch.org/pr/59666):


  • 3/3 failures possibly* introduced in this PR
    • 2/3 non-scanned failure(s)

1 failure not recognized by patterns:

Job: GitHub Actions / Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / calculate-docker-image · Step: Unknown · Action: 🔁 rerun

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@facebook-github-bot facebook-github-bot added the oncall: distributed and cla signed labels Jun 8, 2021
rohan-varma added a commit that referenced this pull request Jun 8, 2021
Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)

ghstack-source-id: 130878421
Pull Request resolved: #59666
@rohan-varma
Contributor Author

cc @mrshenli

rohan-varma added a commit that referenced this pull request Jun 11, 2021
Pull Request resolved: #59666

ghstack-source-id: 131262081

Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)
if self.rank == 0:
    with torch.no_grad():
        for _ in range(6):
            ddp_out = model(inp)
@wayi1 wayi1 Jun 12, 2021
Nit: If you rename model to ddp_model, then you can just use one line here:
self.assertEqual(ddp_model(inp), local_model(inp))

Just more concise. It's optional.
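A runnable single-process analog of the suggestion: once the wrapper is named ddp_model, the check collapses to a single comparison against the local model. Here a weight-synced plain module stands in for the DDP wrapper, which would need a process group.

```python
import torch
import torch.nn as nn

local_model = nn.Linear(2, 2)
ddp_model = nn.Linear(2, 2)
# DDP replicas hold the same weights as the local model; mimic that here.
ddp_model.load_state_dict(local_model.state_dict())

inp = torch.randn(10, 2)
with torch.no_grad():
    # One-line equivalence check, as in self.assertEqual(ddp_model(inp), local_model(inp))
    torch.testing.assert_close(ddp_model(inp), local_model(inp))
```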

# or eval setting and there is no hang.
rank = self.rank
torch.cuda.set_device(rank)
model = Net().cuda()
Contributor

Nit: I believe you can just create a for loop over two tuples of <model, input>, where the models are Net().cuda() and nn.SyncBatchNorm. This can save some duplicate code and improve the readability.
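A sketch of the suggested refactor: iterate over (model, input) pairs instead of duplicating the test body. CPU stand-ins replace Net().cuda() and nn.SyncBatchNorm here, since those need GPUs and an initialized process group.

```python
import torch
import torch.nn as nn

# Pairs of <model, input>; in the test these would be Net().cuda() with a
# (10, 2) input and a SyncBatchNorm model with a (10, 2, 4, 4) input.
models_and_inputs = [
    (nn.Linear(2, 2), torch.randn(10, 2)),
    (nn.BatchNorm2d(2), torch.randn(10, 2, 4, 4)),
]

outs = []
for model, inp in models_and_inputs:
    model.eval()
    with torch.no_grad():
        outs.append(model(inp))
```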

self.assertEqual(ddp_out, local_out)
torch.cuda.synchronize()

self._barrier(timeout=30)
Contributor
Are all the 3 barriers here necessary, even after cuda.sync? Why do they need a non-default higher timeout here?

Contributor Author

I don't think the calls to synchronize are actually necessary; I'll remove those.

The calls to barrier are there because we're only running inference on rank 0. If the inference unexpectedly takes too long, the default barrier, which I believe has a timeout of 10s, can time out, leading to a false-positive failure.

In this test the inference does not take nearly the full 30s, but I wanted plenty of buffer to avoid flakiness.
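A single-process analog of this reasoning, using a thread barrier in place of the distributed _barrier: only one worker does the slow work, so the other waits at the barrier, and a timeout shorter than that work would trip a false failure. The durations here are illustrative, not from the test.

```python
import threading
import time

barrier = threading.Barrier(2)
results = []

def worker(rank):
    if rank == 0:
        time.sleep(0.2)  # stand-in for inference that runs on rank 0 only
        results.append("done")
    # Generous timeout relative to the work above, mirroring the test's
    # choice of a 30s barrier timeout to avoid flaky false positives.
    barrier.wait(timeout=5)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```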

rohan-varma added a commit that referenced this pull request Jun 16, 2021
Pull Request resolved: #59666

ghstack-source-id: 131561892

Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)
rohan-varma added a commit that referenced this pull request Jun 17, 2021
Pull Request resolved: #59666

ghstack-source-id: 131723578

Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)
@rohan-varma rohan-varma requested a review from wayi1 June 17, 2021 17:30
rohan-varma added a commit that referenced this pull request Jun 17, 2021
Pull Request resolved: #59666

ghstack-source-id: 131749203

Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)
device_ids=[rank]
)
inp = torch.randn(10, 2, device=rank)
inp_syncbn = torch.randn(10, 2, 4, 4, device=rank)
Contributor

What is "inp" an abbreviation for?

Contributor Author

It is short for "input".

rohan-varma added a commit that referenced this pull request Jun 19, 2021
Pull Request resolved: #59666

ghstack-source-id: 131906625

Differential Revision: [D28974146](https://our.internmc.facebook.com/intern/diff/D28974146/)
@facebook-github-bot
Contributor

This pull request has been merged in 0131a59.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/324/head branch June 24, 2021 14:17

Labels

cla signed · Merged · oncall: distributed
