Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Gradient Compression] Replace the key of error_dict in PowerSGD stat…
…e with bucket index (#48867) Summary: Pull Request resolved: #48867 Previously the key of error_dict is the hashcode of tensor. Now replaced with bucket index. Bucket index can have a few advantages over the hashcode of tensor. 1) Error dict in the state never removes any key. If the bucket rebuild process occurs frequently, the size of error dict can increase. For now, such rebuild process is infrequent, so it is probably fine. 2) Integer index has a better readability than hashcode, and it can facilitate debugging. If the user wants to debug the tensor values, usually only a specific bucket needs to be targeted. It's easy to specify such condition (e..g, bucket_index = 0), but it's hard to specify a hashcode in advance, as it can only be determined at runtime. Note that sometimes the buckets can be rebuilt in the forward pass. In this case, the shape of the bucket with the same index will not be consistent with the one in the previous iteration, and hence the error tensor will be re--initialized as a zero tensor of the new shape. Therefore, `and state.error_dict[bucket_index].shape[0] == padded_total_length` is added to the condition of applying the local error from the previous iteration. Deleted the arg type of `dist._GradBucket` in powerSGD_hook.py, because somehow test_run_mypy - TestTypeHints failed: AssertionError: mypy failed: torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:128: error: "_GradBucket" has no attribute "get_index" [attr-defined] Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 ghstack-source-id: 117951402 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl Reviewed By: rohan-varma Differential Revision: D25346347 fbshipit-source-id: 8348aa103002ec1c69e3ae759504b431140b3b0d
- Loading branch information