
[PowerSGD] Add orthogonalization with QR factorization #72043

Closed

wants to merge 2 commits into from

Conversation

younik
Contributor

@younik younik commented Jan 30, 2022

🚀 The feature, motivation and pitch

Following the discussion in #65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because torch.linalg.qr doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt remains faster when the rank is lower than 3.
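As a rough illustration of the approach (a minimal sketch, not the exact code added to powerSGD_hook.py; the function name, the rank threshold of 3, and the eps value are assumptions for this example), the hook can use QR when the input is full precision and fall back to Gram-Schmidt otherwise:

```python
import torch

def orthogonalize(matrix: torch.Tensor, eps: float = 1e-8) -> None:
    """Orthogonalize the columns of `matrix` in place (illustrative sketch)."""
    rank = matrix.shape[1]
    if rank >= 3 and matrix.dtype in (torch.float32, torch.float64):
        # QR path: keep only the orthonormal factor Q.
        q, _ = torch.linalg.qr(matrix)
        matrix.copy_(q)
    else:
        # Gram-Schmidt fallback: supports half precision and is faster at
        # very low rank. Normalize each column, then subtract its component
        # from the remaining columns.
        for i in range(rank):
            col = matrix[:, i : i + 1]
            col /= torch.norm(col) + eps
            if i + 1 < rank:
                rest = matrix[:, i + 1 :]
                rest -= (col.t() @ rest) * col
```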

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

Alternatives

Use torch.orgqr(*torch.geqrf(matrix)). In my tests, its performance is similar to torch.linalg.qr.
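For reference, a small sketch of what this alternative means (the test tensor and tolerance are illustrative): torch.geqrf computes the Householder representation of the factorization and torch.orgqr materializes the orthonormal factor Q from it, so the result should match the Q returned by torch.linalg.qr.

```python
import torch

m = torch.randn(1024, 4)

# Reduced QR via torch.linalg.qr.
q_ref, _ = torch.linalg.qr(m)

# Alternative: geqrf returns the Householder factorization, and orgqr
# (also exposed as torch.linalg.householder_product) builds Q from it.
q_alt = torch.orgqr(*torch.geqrf(m))

print(torch.allclose(q_ref, q_alt, atol=1e-6))  # expected: True
```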

Additional context

No response
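For readers unfamiliar with PowerSGD: the rank discussed in this PR is the matrix_approximation_rank of the PowerSGD state. A typical hook registration looks roughly like the sketch below (illustrative only; the model, sizes, and settings are assumptions, and a process group and CUDA device are assumed to be available).

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes dist.init_process_group(...) has already been called.
model = DDP(nn.Linear(1024, 1024).cuda())

state = powerSGD.PowerSGDState(
    process_group=None,           # use the default process group
    matrix_approximation_rank=2,  # the "rank" discussed in this PR
    start_powerSGD_iter=10,
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
```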

@pytorch-bot

pytorch-bot bot commented Jan 30, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/younik/pytorch/blob/c154a76332e2c1d730d2f30e8321c99947137a8a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Triggered Workflows

| Workflow | Labels | Status |
| --- | --- | --- |
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | ✅ triggered |
| linux-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| linux-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| linux-binary-manywheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk, ciflow/xla | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| windows-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| windows-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| windows-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |

Skipped Workflows

| Workflow | Labels | Status |
| --- | --- | --- |
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-bionic-rocm4.5-py3.7 | ciflow/linux, ciflow/rocm | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 30, 2022


💊 CI failures summary and remediations

As of commit 488124f (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge) (1/1)

Step: "Test"

AttributeError: 'TestModels' object has no attribute 'onnx_shape_inference'

test/onnx/test_models.py::TestModels.test_srresnet s
――――――――――――――――――――――― TestModels.test_super_resolution ―――――――――――――――――――――――
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/onnx/test_models.py", line 102, in test_super_resolution
    self.exportTest(toC(SuperResolutionNet(upscale_factor=3)), toC(x), atol=1e-6)
  File "/var/lib/jenkins/workspace/test/onnx/test_models_onnxruntime.py", line 17, in exportTest
    input=inputs, rtol=rtol, atol=atol)
  File "/var/lib/jenkins/workspace/test/onnx/test_pytorch_onnx_onnxruntime.py", line 171, in run_model_test
    onnx_shape_inference=self.onnx_shape_inference)
AttributeError: 'TestModels' object has no attribute 'onnx_shape_inference'
test/onnx/test_models.py::TestModels.test_super_resolution ⨯
test/onnx/test_models.py::TestModels.test_vgg16 s
test/onnx/test_models.py::TestModels.test_vgg16_bn s

1 failure not recognized by patterns:

| Job | Step |
| --- | --- |
| GitHub Actions linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) | Test |

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 30, 2022
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 1, 2022
@cbalioglu
Contributor

Overall looks good to me. One question though: as we discussed offline previously, Thijs' experiments showed a performance improvement with a rank of 2 as well. Do you know why you couldn't observe the same effect in your experiments?

@younik
Contributor Author

younik commented Feb 3, 2022

Overall looks good to me. One question though: as we discussed offline previously, Thijs' experiments showed a performance improvement with a rank of 2 as well. Do you know why you couldn't observe the same effect in your experiments?

Great question; I get results similar to Thijs' when timing only the _orthogonalize method. On the other hand, I obtain the results above when timing the whole powerSGD_hook, and I am not sure why that happens.
Nevertheless, even in the first case, the improvement from the QR method at rank 2 is negligible.
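As an aside, a hedged sketch of how such a micro-benchmark could be set up (the helper, sizes, and iteration count are hypothetical, not the benchmark actually used): CUDA kernels launch asynchronously, so synchronizing before reading the clock matters whether one times only _orthogonalize or the whole powerSGD_hook.

```python
import time
import torch

def time_gpu(fn, iters: int = 100) -> float:
    """Hypothetical helper: average wall-clock seconds per call on the GPU."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Example: time QR orthogonalization at a given approximation rank.
rank = 2
m = torch.randn(2048, rank, device="cuda")
qr_time = time_gpu(lambda: torch.linalg.qr(m))
print(f"rank={rank}: torch.linalg.qr takes {qr_time * 1e6:.1f} us per call")
```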

@cbalioglu
Contributor

Sounds good! @younik I believe once you add the assertion statement suggested by @tvogels, the PR is good to go.

@younik
Contributor Author

younik commented Feb 4, 2022

It should be there now.

@cbalioglu cbalioglu self-requested a review February 7, 2022 14:12
@facebook-github-bot
Contributor

@cbalioglu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 7, 2022
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in #65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt remains faster when the rank is lower than 3.

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: #72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
@github-actions

github-actions bot commented Feb 7, 2022

Hey @younik.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@cbalioglu cbalioglu added release notes: distributed (ddp) release notes category topic: performance topic category labels Feb 7, 2022
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 9, 2022
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in pytorch/pytorch#65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt remains faster when the rank is lower than 3.

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: pytorch/pytorch#72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839aad795fc0ad12da15fa2e9a0decf5ab)
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 9, 2022
Labels: cla signed, oncall: distributed, open source, release notes: distributed (ddp), topic: performance, triaged