Added logging for the Reducer's non-member functions. #65023
Conversation
Dr. CI summary as of commit 72c929b: 💚 Looks good so far! There are no failures yet.
This pull request was exported from Phabricator. Differential Revision: D30924790
Note to reviewers: longer term, we'll need to refactor the reducer so that we can call things like …
Codecov Report

```diff
@@            Coverage Diff             @@
##           master   #65023      +/-   ##
==========================================
- Coverage   66.36%   66.34%   -0.03%
==========================================
  Files         730      730
  Lines       93604    93674     +70
==========================================
+ Hits        62123    62146     +23
- Misses      31481    31528     +47
```
```python
with ctx:
    net = torch.nn.parallel.DistributedDataParallel(
        net.to(self.rank),
        device_ids=[self.rank],
        process_group=group_to_use,
    )
```
remove empty line
```python
    expected_err = "Caught collective operation timeout"
    ctx = self.assertRaisesRegex(RuntimeError, expected_err)
else:
    expected_err = ""
```
Don't need to set this in this branch right?
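The cleanup mentioned in the PR summary addresses exactly this: choose the assertion context once, so the success branch never has to assign a dummy `expected_err`. A minimal sketch of that pattern (`expect_timeout` and `_make_ctx` are hypothetical stand-ins, not the actual test code):

```python
import contextlib
import unittest


class _Sketch(unittest.TestCase):
    def _make_ctx(self, expect_timeout: bool):
        # Choose the assertion context up front. The success path returns
        # a no-op context manager instead of setting expected_err = "".
        if expect_timeout:
            return self.assertRaisesRegex(
                RuntimeError, "Caught collective operation timeout"
            )
        return contextlib.nullcontext()
```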
This pull request was exported from Phabricator (Differential Revision: D30924790) and force-pushed: 84672d9 → bddad08 → 66d6cd9 → d37f4a7.
Great work! Please wait for all internal and OSS tests to pass before landing.
```python
@skip_if_lt_x_gpu(2)
@skip_if_rocm
def test_compute_bucket_assignment_by_size_sparse_error_with_logger(self):
    self._test_compute_bucket_assignment_by_size(use_logger=True)
```
Awesome work on these tests!
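For readers skimming the diff, the behavior these tests pin down can be sketched as follows. Treat this as a rough illustration only: `_compute_bucket_assignment_by_size` is a private binding, and the import path and argument list shown here are assumptions based on the PR summary, not verified API.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch: exact module path and signature are assumptions.
sparse = torch.sparse_coo_tensor(
    torch.tensor([[0, 2]]),    # indices
    torch.tensor([1.0, 2.0]),  # values
    (4,),                      # dense shape
)
try:
    # Bucket assignment is expected to reject sparse tensors; with a
    # logger attached, the error should also be logged before raising.
    dist._compute_bucket_assignment_by_size([sparse], [400])
except RuntimeError as err:
    print(f"rejected as expected: {err}")
```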
CI failures are unrelated.
Summary: Pull Request resolved: pytorch#65023

Added an optional logging parameter for the non-member functions `compute_bucket_assignment_by_size` and `verify_replica0_across_processes`. If a logger is provided, `TORCH_CHECK` assertions are replaced with a wrapper that logs the error to the DDP reducer's logger before calling `TORCH_CHECK`. If a logger is not provided, `TORCH_CHECK` is still called.

Modified Python-side calls to `_compute_bucket_assignment_by_size` and `_verify_model_across_ranks` to include a logger whenever possible. A notable exception is when these non-member functions are called in DDP's constructor: we cannot pass in a logger there because it may not have been initialized yet.

We also added 4 new tests: `test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger`, which test the `_compute_bucket_assignment_by_size` function to ensure that sparse tensors are rejected and the errors are logged, and `test_verify_model_across_rank_{with, without}_logger`, which call `_verify_model_across_ranks` to ensure that ill-formed models (where a rank has a different number of parameters than rank 0) are rejected and the errors are logged. The test `test_ddp_model_diff_across_ranks` remains unchanged: while it does construct an ill-formed DDP instance that triggers the error in `_verify_model_across_ranks`, we cannot check the logger because the error occurs in the constructor.

Lastly, did some cleanup of the `test_ddp_model_diff_across_ranks` function so that the logic for choosing the context manager and error message is clearer.

Test Plan:

Build commands:

```
buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --keep-going
buck build mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --keep-going
```

Test for `_compute_bucket_assignment_by_size` (Python) / `compute_bucket_assignment_by_size` (C++):

```
BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger
```

Test for `_verify_model_across_ranks` (Python) / `verify_replica0_across_processes` (C++):

```
BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_verify_model_across_ranks_{with, without}_logger
```

Test that constructs an ill-formed DDP instance (cleanup only):

```
BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_ddp_model_diff_across_ranks
```

Reviewed By: rohan-varma

Differential Revision: D30924790

fbshipit-source-id: b65a103d6121a4211549fc3922fe00510194c2b8
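As a rough Python analogue of the "log then check" behavior the summary describes (the real implementation is C++ built around `TORCH_CHECK`; `DDPLoggerStub` and `log_and_check` are illustrative names, not PyTorch API):

```python
from typing import Optional


class DDPLoggerStub:
    """Illustrative stand-in for the DDP reducer's logger."""

    def set_error_and_log(self, msg: str) -> None:
        print(f"[ddp logger] {msg}")


def log_and_check(cond: bool, msg: str,
                  logger: Optional[DDPLoggerStub] = None) -> None:
    # If a logger is supplied, record the error before raising; otherwise
    # behave like a plain check that raises on failure.
    if not cond:
        if logger is not None:
            logger.set_error_and_log(msg)
        raise RuntimeError(msg)
```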
This pull request was exported from Phabricator (Differential Revision: D30924790) and force-pushed: d37f4a7 → 72c929b.
This pull request has been merged in 0a51490.
Context: #59338
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @cbalioglu @gcramer23