
Conversation

averywang21
Contributor

@averywang21 averywang21 commented Sep 14, 2021

Context: #59338

Summary:
Added an optional logger parameter to the non-member functions `compute_bucket_assignment_by_size` and `verify_replica0_across_processes`. If a logger is provided, the `TORCH_CHECK` assertions are replaced with a wrapper that logs the error to the DDP reducer's logger before calling `TORCH_CHECK`. If no logger is provided, `TORCH_CHECK` is called as before.
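
In pseudocode, the wrapper behaves as follows (a minimal Python sketch of the C++ pattern; `log_and_check` and the logger interface are illustrative names, not the actual C++ symbols):

```python
from typing import Optional


class ReducerLogger:
    """Illustrative stand-in for the DDP reducer's logger."""
    def set_error_and_log(self, msg: str) -> None:
        print(f"[ddp reducer] {msg}")


def log_and_check(cond: bool, msg: str,
                  logger: Optional[ReducerLogger] = None) -> None:
    """Sketch of the wrapper: log to the reducer's logger (if any),
    then fail the check, mirroring TORCH_CHECK's raise-on-failure."""
    if cond:
        return
    if logger is not None:
        # With a logger: record the error before raising, so DDP's
        # logging infrastructure captures it.
        logger.set_error_and_log(msg)
    # Without a logger this degenerates to a plain check.
    raise RuntimeError(msg)
```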

Modified the Python-side calls to `_compute_bucket_assignment_by_size` and `_verify_model_across_ranks` to include a logger whenever possible. A notable exception is when these non-member functions are called in DDP's constructor: we cannot pass in a logger there because it may not have been initialized yet.

We also added 4 new tests: `test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger`, which exercise `_compute_bucket_assignment_by_size` to ensure that sparse tensors are rejected and the errors are logged, and `test_verify_model_across_rank_{with, without}_logger`, which call `_verify_model_across_ranks` to ensure that ill-formed models (where some rank has a different number of parameters than rank 0) are rejected and the errors are logged. The test `test_ddp_model_diff_across_ranks` remains unchanged: while it does construct an ill-formed DDP instance that triggers the error in `_verify_model_across_ranks`, we cannot check the logger because the error occurs in the constructor.
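
The with-logger variants assert both behaviors at once, along these lines (a self-contained sketch; the real tests use DDP's reducer logger rather than a mock, and the error message here is illustrative):

```python
from unittest import mock


def fail_with(msg, logger=None):
    """Stand-in for a checked call such as _compute_bucket_assignment_by_size
    rejecting a sparse tensor; names and message are illustrative."""
    if logger is not None:
        logger.set_error_and_log(msg)
    raise RuntimeError(msg)


logger = mock.Mock()
try:
    fail_with("No support for sparse tensors.", logger)
except RuntimeError:
    pass
# The error must be raised *and* recorded on the logger.
logger.set_error_and_log.assert_called_once_with("No support for sparse tensors.")
```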

Lastly, cleaned up the `test_ddp_model_diff_across_ranks` function so that the logic for choosing the context manager and error message is clearer.
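
The cleanup amounts to selecting the assertion context once up front, along these lines (a runnable sketch of the pattern; the real condition depends on the backend and environment):

```python
import contextlib
import unittest


class CtxSelectionDemo(unittest.TestCase):
    """Demonstrates the context-manager selection pattern from the
    test_ddp_model_diff_across_ranks cleanup (condition is illustrative)."""

    def _run(self, expects_error: bool) -> None:
        if expects_error:
            expected_err = "Caught collective operation timeout"
            ctx = self.assertRaisesRegex(RuntimeError, expected_err)
        else:
            # No error expected: a no-op context keeps the body identical.
            ctx = contextlib.suppress()

        with ctx:
            if expects_error:
                # Stand-in for the DDP construction that fails across ranks.
                raise RuntimeError("Caught collective operation timeout")

    def test_error_path(self):
        self._run(expects_error=True)

    def test_clean_path(self):
        self._run(expects_error=False)


if __name__ == "__main__":
    unittest.main()
```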

Test Plan:
**Build commands**
`buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --keep-going`

`buck build mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --keep-going`

**Test commands**
Test for `_compute_bucket_assignment_by_size` (Python) / `compute_bucket_assignment_by_size` (C++):
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_compute_bucket_assignment_by_size_sparse_error_{with, without}_logger`

Test for `_verify_model_across_ranks` (Python) / `verify_replica0_across_processes` (C++):
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_verify_model_across_ranks_{with, without}_logger`

Test that constructs an ill-formed DDP instance (only cleanup of this function):
`BACKEND={nccl, gloo} WORLD_SIZE=2 ../buck-out/dev/gen/caffe2/test/distributed/distributed_{nccl, gloo}_spawn#binary.par -r test_ddp_model_diff_across_ranks`

Differential Revision: D30924790

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @cbalioglu @gcramer23

@facebook-github-bot
Contributor

facebook-github-bot commented Sep 14, 2021

💊 CI failures summary and remediations

As of commit 72c929b (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D30924790

@rohan-varma
Contributor

note to reviewers - Longer term, we'll need to refactor the reducer so that we can call things like `_verify_model_across_ranks` with a valid logger, which will increase the usefulness of error logging. However, that is out of scope for this PR, which can serve as the basic building blocks for adjusting the reducer logic appropriately when a logger is present.

@codecov

codecov bot commented Sep 15, 2021

Codecov Report

Merging #65023 (72c929b) into master (96cb05b) will decrease coverage by 0.02%.
The diff coverage is 37.17%.

@@            Coverage Diff             @@
##           master   #65023      +/-   ##
==========================================
- Coverage   66.36%   66.34%   -0.03%     
==========================================
  Files         730      730              
  Lines       93604    93674      +70     
==========================================
+ Hits        62123    62146      +23     
- Misses      31481    31528      +47     

with ctx:
    net = torch.nn.parallel.DistributedDataParallel(
        net.to(self.rank),
        device_ids=[self.rank],
        process_group=group_to_use,
    )


remove empty line

    expected_err = "Caught collective operation timeout"
    ctx = self.assertRaisesRegex(RuntimeError, expected_err)
else:
    expected_err = ""

Don't need to set this in this branch right?


@rohan-varma
Contributor

rohan-varma left a comment

Great work! Please wait for all internal and OSS tests to pass before landing.

@skip_if_lt_x_gpu(2)
@skip_if_rocm
def test_compute_bucket_assignment_by_size_sparse_error_with_logger(self):
    self._test_compute_bucket_assignment_by_size(use_logger=True)
Awesome work on these tests!

@rohan-varma
Contributor

CI failures are unrelated.


@facebook-github-bot
Contributor

This pull request has been merged in 0a51490.

Labels
cla signed, fb-exported, Merged, oncall: distributed