quantize_tensor_per_channel ARM implementation #46018
Conversation
Summary: Currently on mobile devices `quantize_tensor` has a vectorized implementation using ARM intrinsics; however, `quantize_tensor_per_channel` does not.

Test Plan: Build and push to a mobile device:

```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
adb push build_android/bin/quantize_per_channel /data/local/tmp
```

and then run the benchmark binary over adb shell.

Reviewers:
Subscribers:
Tasks:
Tags:

[ghstack-poisoned]
Can you post benchmarking results here?
Closing because after some testing it appears that the compiler is performing these optimizations automatically. Edit: I was wrong; I was evidently benchmarking the wrong function, because `randint` returns a float tensor unless otherwise specified.
Codecov Report
```
@@            Coverage Diff             @@
##           gh/ajliu/2/base   #46018   +/- ##
================================================
  Coverage          60.81%   60.81%
================================================
  Files               2748     2748
  Lines             254038   254038
================================================
+ Hits              154496   154500       +4
+ Misses             99542    99538       -4
```
## Summary:

Currently on mobile devices `quantize_tensor` has a vectorized implementation using ARM intrinsics; however, `quantize_tensor_per_channel` does not.

## Test Plan:

Build for ARM Neon:

```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```

Build for ARM64:

```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```

Then run the benchmark binary over adb shell. Note that the Android CPU is not frequency-locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU:

```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```

Resulting benchmarks are located [here](https://gist.github.com/AJLiu/d1711bb6a5e93b3338eca2c14c8aec9f). Google spreadsheet comparing results [here](https://docs.google.com/spreadsheets/d/1Ky-rEu2CqOqex2a84b67hB1VLAlfEDgAN2ZXe8IlGF8/edit?usp=sharing).

**Overall results:**

- aarch64
  - 2x slowdown on 2d tensor benchmarks
  - 4x speed up on 4d contiguous tensor benchmarks
  - 4x speed up on 4d channels last tensor benchmarks
- neon
  - 2x speed up on 2d tensor benchmarks
  - 20x speed up on 4d contiguous tensor benchmarks
  - 20x speed up on 4d channels last tensor benchmarks

I suspect that this is an overall performance boost; however, it adds a small amount of overhead that becomes very noticeable when quantizing small tensors.

Reviewers: kimishpatel
Subscribers:
Tasks: T76832258
Tags:
Differential Revision: [D24286528](https://our.internmc.facebook.com/intern/diff/D24286528)

[ghstack-poisoned]
The latest commit reduces the overhead slightly. Now, when compiling with Neon, we see an overall improvement compared to before, but ARM64 still has a bit of a slowdown on small tensors. New benchmark outputs: https://gist.github.com/AJLiu/814232ac153b8d8029a8d37dc074320c

Overall results:
aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp
Left a few comments.
Also, can you add this test, https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/test/quantized_test.cpp, to the mobile tests and run it through to make sure it passes?
💊 CI failures summary: as of commit b8958d0 (more details on the Dr. CI page), 1 extra GitHub check failed.
Addressed the requested changes and enabled the test for mobile builds. However, it looks like none of the tests in that file actually test `quantize_tensor_per_channel`, so I added two new unit tests (covering contiguous and channels_last formats). Both new tests pass on both mobile architectures.
Thanks for these changes. Looks good overall. I am a little perplexed as to why 2D tensors are so slow. The only difference seems to be the new allocations for inv scale and offset, and writing those values out, which may be adding extra memory overhead.
## Summary:

Currently on mobile devices `quantize_tensor` has a vectorized implementation using ARM intrinsics; however, `quantize_tensor_per_channel` does not.

## Test Plan:

Build for ARM Neon:

```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```

Build for ARM64:

```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```

Run the test binary over adb shell with the commands below.

```
adb push build_android/bin/quantized_test /data/local/tmp
adb shell "/data/local/tmp/quantized_test"
```

Run the benchmark binary over adb shell with the commands below.

```
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```

Note that the Android CPU is not frequency-locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU:

```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
```

Resulting benchmarks are located [here](https://gist.github.com/AJLiu/d1711bb6a5e93b3338eca2c14c8aec9f). Google spreadsheet comparing results [here](https://docs.google.com/spreadsheets/d/1Ky-rEu2CqOqex2a84b67hB1VLAlfEDgAN2ZXe8IlGF8/edit?usp=sharing).

**Overall results:**

- aarch64
  - 2x slowdown on 2d tensor benchmarks
  - 4x speed up on 4d contiguous tensor benchmarks
  - 4x speed up on 4d channels last tensor benchmarks
- neon
  - 2x speed up on 2d tensor benchmarks
  - 20x speed up on 4d contiguous tensor benchmarks
  - 20x speed up on 4d channels last tensor benchmarks

I suspect that this is an overall performance boost; however, it adds a small amount of overhead that becomes very noticeable when quantizing small tensors.

Reviewers: kimishpatel
Subscribers:
Tasks: T76832258
Tags:
Differential Revision: [D24286528](https://our.internmc.facebook.com/intern/diff/D24286528)

[ghstack-poisoned]
Looks good. Make sure CI is clean. I see some lint errors.
Reran the benchmarks and the overhead is gone.
The lint errors are unrelated to this PR; it looks like b3eb0c8 fixes them.
Sounds good. |
This pull request has been merged in b0e954f. |
Stack from ghstack:
Summary:
Currently on mobile devices `quantize_tensor` has a vectorized implementation using ARM intrinsics; however, `quantize_tensor_per_channel` does not.
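For context, per-channel quantization applies the standard affine mapping with a separate scale $s_c$ and zero point $z_c$ for each channel $c$, clamped to the quantized type's range (e.g. $[-128, 127]$ for qint8):

```latex
Q(x, c) = \operatorname{clamp}\left( \operatorname{round}\!\left( \frac{x}{s_c} \right) + z_c,\; q_{\min},\; q_{\max} \right)
```

This is the loop the new ARM kernel vectorizes per channel.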
Test Plan:
Build for ARM Neon
Build for ARM64
Run the test binary over adb shell with the commands below.
Run the benchmark binary over adb shell with the commands below.
Note that the Android CPU is not frequency-locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU.
Resulting benchmarks are located here
Google spreadsheet comparing results here
Overall results:
I suspect that this is an overall performance boost; however, it adds a small amount of overhead that becomes very noticeable when quantizing small tensors.
Reviewers: kimishpatel
Subscribers:
Tasks: T76832258
Tags:
Differential Revision: D24286528