
Conversation

averywang21
Contributor

@averywang21 averywang21 commented Sep 14, 2021

Context: #59338

Summary:
Added an optional logger parameter to the non-member functions `compute_bucket_assignment_by_size` and `verify_replica0_across_processes`. If a logger is provided, the `TORCH_CHECK` assertions are replaced with a wrapper that logs the error to the DDP reducer's logger before calling `TORCH_CHECK`. If no logger is provided, `TORCH_CHECK` is called as before.
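
In pseudocode, the wrapper behaves as follows (a minimal Python sketch of the C++ pattern; `log_and_check` and the logger interface are illustrative names, not the actual C++ symbols):

```python
from typing import Optional


class ReducerLogger:
    """Illustrative stand-in for the DDP reducer's logger."""
    def set_error_and_log(self, msg: str) -> None:
        print(f"[ddp reducer] {msg}")


def log_and_check(cond: bool, msg: str,
                  logger: Optional[ReducerLogger] = None) -> None:
    """Sketch of the wrapper: log to the reducer's logger (if any),
    then fail the check, mirroring TORCH_CHECK's raise-on-failure."""
    if cond:
        return
    if logger is not None:
        # With a logger: record the error before raising, so DDP's
        # logging infrastructure captures it.
        logger.set_error_and_log(msg)
    # Without a logger this degenerates to a plain check.
    raise RuntimeError(msg)
```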

Modified the Python-side calls to `_compute_bucket_assignment_by_size` and `_verify_model_across_ranks` to include a logger whenever possible. A notable exception is when these non-member functions are called in DDP's constructor: we cannot pass in a logger there because it may not have been initialized yet.

We also added 4 new tests: `test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger`, which exercise `_compute_bucket_assignment_by_size` to ensure that sparse tensors are rejected and the errors are logged, and `test_verify_model_across_rank_{with, without}_logger`, which call `_verify_model_across_ranks` to ensure that ill-formed models (where some rank has a different number of parameters than rank 0) are rejected and the errors are logged. The test `test_ddp_model_diff_across_ranks` remains unchanged: while it does construct an ill-formed DDP instance that triggers the error in `_verify_model_across_ranks`, we cannot check the logger because the error occurs in the constructor.
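
The with-logger variants assert both behaviors at once, along these lines (a self-contained sketch; the real tests use DDP's reducer logger rather than a mock, and the error message here is illustrative):

```python
from unittest import mock


def fail_with(msg, logger=None):
    """Stand-in for a checked call such as _compute_bucket_assignment_by_size
    rejecting a sparse tensor; names and message are illustrative."""
    if logger is not None:
        logger.set_error_and_log(msg)
    raise RuntimeError(msg)


logger = mock.Mock()
try:
    fail_with("No support for sparse tensors.", logger)
except RuntimeError:
    pass
# The error must be raised *and* recorded on the logger.
logger.set_error_and_log.assert_called_once_with("No support for sparse tensors.")
```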

Lastly, cleaned up the `test_ddp_model_diff_across_ranks` function so that the logic for choosing the context manager and error message is clearer.
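
The cleanup amounts to selecting the assertion context once up front, along these lines (a runnable sketch of the pattern; the real condition depends on the backend and environment):

```python
import contextlib
import unittest


class CtxSelectionDemo(unittest.TestCase):
    """Demonstrates the context-manager selection pattern from the
    test_ddp_model_diff_across_ranks cleanup (condition is illustrative)."""

    def _run(self, expects_error: bool) -> None:
        if expects_error:
            expected_err = "Caught collective operation timeout"
            ctx = self.assertRaisesRegex(RuntimeError, expected_err)
        else:
            # No error expected: a no-op context keeps the body identical.
            ctx = contextlib.suppress()

        with ctx:
            if expects_error:
                # Stand-in for the DDP construction that fails across ranks.
                raise RuntimeError("Caught collective operation timeout")

    def test_error_path(self):
        self._run(expects_error=True)

    def test_clean_path(self):
        self._run(expects_error=False)


if __name__ == "__main__":
    unittest.main()
```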

Test Plan:
**Build commands**
`buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --keep-going`

`buck build mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --keep-going`

**Test commands**
Test for `_compute_bucket_assignment_by_size` (Python) / `compute_bucket_assignment_by_size` (C++):
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger`

Test for `_verify_model_across_ranks` (Python) / `verify_replica0_across_processes` (C++):
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_verify_model_across_ranks_{with, without}_logger`

Test that constructs an ill-formed DDP instance (only cleanup of this function):
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_ddp_model_diff_across_ranks`

Differential Revision: D30924790

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @cbalioglu @gcramer23

@facebook-github-bot
Contributor

facebook-github-bot commented Sep 14, 2021

💊 CI failures summary and remediations

As of commit 72c929b (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D30924790

@rohan-varma
Contributor

note to reviewers - Longer term, we'll need to refactor the reducer so that we can call things like `_verify_model_across_ranks` with a valid logger, which will increase the usefulness of error logging. However, that is out of scope for this PR, which can serve as the basic building blocks for adjusting the reducer logic appropriately when a logger is present.

@codecov

codecov bot commented Sep 15, 2021

Codecov Report

Merging #65023 (72c929b) into master (96cb05b) will decrease coverage by 0.02%.
The diff coverage is 37.17%.

@@            Coverage Diff             @@
##           master   #65023      +/-   ##
==========================================
- Coverage   66.36%   66.34%   -0.03%     
==========================================
  Files         730      730              
  Lines       93604    93674      +70     
==========================================
+ Hits        62123    62146      +23     
- Misses      31481    31528      +47     

with ctx:
    net = torch.nn.parallel.DistributedDataParallel(
        net.to(self.rank),
        device_ids=[self.rank],
        process_group=group_to_use,
    )


remove empty line

    expected_err = "Caught collective operation timeout"
    ctx = self.assertRaisesRegex(RuntimeError, expected_err)
else:
    expected_err = ""

Don't need to set this in this branch right?


@rohan-varma
Contributor

rohan-varma left a comment

Great work! Please wait for all internal and OSS tests to pass before landing.

@skip_if_lt_x_gpu(2)
@skip_if_rocm
def test_compute_bucket_assignment_by_size_sparse_error_with_logger(self):
    self._test_compute_bucket_assignment_by_size(use_logger=True)
Awesome work on these tests!

@rohan-varma
Contributor

CI failures are unrelated.


@facebook-github-bot
Contributor

This pull request has been merged in 0a51490.

Labels
cla signed, fb-exported, Merged, oncall: distributed