
[PowerSGD] Add orthogonalization with QR factorization #72043

Closed

wants to merge 2 commits into from

Conversation

younik
Contributor

@younik younik commented Jan 30, 2022

🚀 The feature, motivation and pitch

Following the discussion in #65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because torch.linalg.qr doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt remains faster when the rank is lower than 3.
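As a rough illustration of the approach (a minimal sketch, not the exact code added to powerSGD_hook.py; the function name, the rank threshold of 3, and the eps value are assumptions for this example), the hook can use QR when the input is full precision and fall back to Gram-Schmidt otherwise:

```python
import torch

def orthogonalize(matrix: torch.Tensor, eps: float = 1e-8) -> None:
    """Orthogonalize the columns of `matrix` in place (illustrative sketch)."""
    rank = matrix.shape[1]
    if rank >= 3 and matrix.dtype in (torch.float32, torch.float64):
        # QR path: keep only the orthonormal factor Q.
        q, _ = torch.linalg.qr(matrix)
        matrix.copy_(q)
    else:
        # Gram-Schmidt fallback: supports half precision and is faster at
        # very low rank. Normalize each column, then subtract its component
        # from the remaining columns.
        for i in range(rank):
            col = matrix[:, i : i + 1]
            col /= torch.norm(col) + eps
            if i + 1 < rank:
                rest = matrix[:, i + 1 :]
                rest -= (col.t() @ rest) * col
```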

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

Alternatives

Use torch.orgqr(*torch.geqrf(matrix)). In my tests, its performance is similar to torch.linalg.qr.
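For reference, a small sketch of what this alternative means (the test tensor and tolerance are illustrative): torch.geqrf computes the Householder representation of the factorization and torch.orgqr materializes the orthonormal factor Q from it, so the result should match the Q returned by torch.linalg.qr.

```python
import torch

m = torch.randn(1024, 4)

# Reduced QR via torch.linalg.qr.
q_ref, _ = torch.linalg.qr(m)

# Alternative: geqrf returns the Householder factorization, and orgqr
# (also exposed as torch.linalg.householder_product) builds Q from it.
q_alt = torch.orgqr(*torch.geqrf(m))

print(torch.allclose(q_ref, q_alt, atol=1e-6))  # expected: True
```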

Additional context

No response
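For readers unfamiliar with PowerSGD: the rank discussed in this PR is the matrix_approximation_rank of the PowerSGD state. A typical hook registration looks roughly like the sketch below (illustrative only; the model, sizes, and settings are assumptions, and a process group and CUDA device are assumed to be available).

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes dist.init_process_group(...) has already been called.
model = DDP(nn.Linear(1024, 1024).cuda())

state = powerSGD.PowerSGDState(
    process_group=None,           # use the default process group
    matrix_approximation_rank=2,  # the "rank" discussed in this PR
    start_powerSGD_iter=10,
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
```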

@pytorch-bot

pytorch-bot bot commented Jan 30, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/younik/pytorch/blob/c154a76332e2c1d730d2f30e8321c99947137a8a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Triggered Workflows

| Workflow | Labels | Status |
| --- | --- | --- |
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, ciflow/default | ✅ triggered |
| linux-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| linux-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| linux-binary-manywheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk, ciflow/xla | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win | ✅ triggered |
| windows-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| windows-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, ciflow/default | ✅ triggered |
| windows-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, ciflow/default | ✅ triggered |

Skipped Workflows

| Workflow | Labels | Status |
| --- | --- | --- |
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-bionic-rocm4.5-py3.7 | ciflow/linux, ciflow/rocm | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 30, 2022


💊 CI failures summary and remediations

As of commit 488124f (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge) (1/1)

Step: "Test"

AttributeError: 'TestModels' object has no attribute 'onnx_shape_inference'

test/onnx/test_models.py::TestModels.test_srresnet s
――――――――――――――――――――――― TestModels.test_super_resolution ―――――――――――――――――――――――
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/onnx/test_models.py", line 102, in test_super_resolution
    self.exportTest(toC(SuperResolutionNet(upscale_factor=3)), toC(x), atol=1e-6)
  File "/var/lib/jenkins/workspace/test/onnx/test_models_onnxruntime.py", line 17, in exportTest
    input=inputs, rtol=rtol, atol=atol)
  File "/var/lib/jenkins/workspace/test/onnx/test_pytorch_onnx_onnxruntime.py", line 171, in run_model_test
    onnx_shape_inference=self.onnx_shape_inference)
AttributeError: 'TestModels' object has no attribute 'onnx_shape_inference'
test/onnx/test_models.py::TestModels.test_super_resolution ⨯
test/onnx/test_models.py::TestModels.test_vgg16 s
test/onnx/test_models.py::TestModels.test_vgg16_bn s

1 failure not recognized by patterns:

| Job | Step |
| --- | --- |
| GitHub Actions linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) | Test |

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 30, 2022
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 1, 2022
@cbalioglu
Contributor

Overall looks good to me. One question though: as we discussed offline previously, Thijs' experiments showed a performance improvement with a rank of 2 as well. Do you know why you couldn't observe the same effect in your experiments?

@younik
Contributor Author

younik commented Feb 3, 2022

Overall looks good to me. One question though: as we discussed offline previously, Thijs' experiments showed a performance improvement with a rank of 2 as well. Do you know why you couldn't observe the same effect in your experiments?

Great question; I get results similar to Thijs' when timing only the _orthogonalize method. On the other hand, I obtain the results above when timing the whole powerSGD_hook, and I am not sure why that happens.
Nevertheless, even in the first case, the improvement from the QR method at rank 2 is negligible.
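As an aside, a hedged sketch of how such a micro-benchmark could be set up (the helper, sizes, and iteration count are hypothetical, not the benchmark actually used): CUDA kernels launch asynchronously, so synchronizing before reading the clock matters whether one times only _orthogonalize or the whole powerSGD_hook.

```python
import time
import torch

def time_gpu(fn, iters: int = 100) -> float:
    """Hypothetical helper: average wall-clock seconds per call on the GPU."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Example: time QR orthogonalization at a given approximation rank.
rank = 2
m = torch.randn(2048, rank, device="cuda")
qr_time = time_gpu(lambda: torch.linalg.qr(m))
print(f"rank={rank}: torch.linalg.qr takes {qr_time * 1e6:.1f} us per call")
```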

@cbalioglu
Contributor

Sounds good! @younik I believe once you add the assertion statement suggested by @tvogels, the PR is good to go.

@younik
Contributor Author

younik commented Feb 4, 2022

It should be there now.

@cbalioglu cbalioglu self-requested a review February 7, 2022 14:12
@facebook-github-bot
Contributor

@cbalioglu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 7, 2022
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in #65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt remains faster when the rank is lower than 3.

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: #72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
@github-actions

github-actions bot commented Feb 7, 2022

Hey @younik.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@cbalioglu cbalioglu added release notes: distributed (ddp) release notes category topic: performance topic category labels Feb 7, 2022
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 9, 2022
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in pytorch/pytorch#65813, I added the QR factorization to powerSGD_hook.py.
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half-precision. Moreover, in my tests, Gram-Schmidt remains faster when the rank is lower than 3.

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: pytorch/pytorch#72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839aad795fc0ad12da15fa2e9a0decf5ab)
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 9, 2022
Labels: cla signed, oncall: distributed, open source, release notes: distributed (ddp), topic: performance, triaged