
quantize_tensor_per_channel ARM implementation #46018

Closed · wants to merge 6 commits

Conversation

ajliu
Member

@ajliu ajliu commented Oct 8, 2020

Stack from ghstack:

## Summary:

Currently on mobile devices quantize_tensor has a vectorized implementation using ARM intrinsics; however, quantize_tensor_per_channel does not.
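For context, below is a minimal sketch of the kind of NEON inner loop such an implementation uses. The function name, memory layout, and loop structure are illustrative only, not the PR's actual code, and it assumes aarch64/ARMv8, where the round-to-nearest convert `vcvtnq_s32_f32` is available.

```cpp
// Illustrative sketch only (not the PR's code): per-channel affine
// quantization q = clamp(round(x * inv_scale[c]) + zero_point[c]) of a
// contiguous float tensor laid out as [batch][channel][elements].
#include <arm_neon.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

void quantize_per_channel_sketch(
    const float* in, int8_t* out,
    const float* inv_scales, const int32_t* zero_points,
    int64_t batches, int64_t channels, int64_t elems_per_channel) {
  for (int64_t b = 0; b < batches; ++b) {
    for (int64_t c = 0; c < channels; ++c) {
      const float32x4_t vinv_scale = vdupq_n_f32(inv_scales[c]);
      const int32x4_t vzp = vdupq_n_s32(zero_points[c]);
      int64_t e = 0;
      for (; e + 8 <= elems_per_channel; e += 8) {
        float32x4_t vin0 = vld1q_f32(in);
        float32x4_t vin1 = vld1q_f32(in + 4);
        // round(x * inv_scale) + zero_point, in int32 lanes
        int32x4_t vq0 = vaddq_s32(vcvtnq_s32_f32(vmulq_f32(vin0, vinv_scale)), vzp);
        int32x4_t vq1 = vaddq_s32(vcvtnq_s32_f32(vmulq_f32(vin1, vinv_scale)), vzp);
        // saturating narrows (int32 -> int16 -> int8) clamp to [-128, 127]
        int16x8_t v16 = vcombine_s16(vqmovn_s32(vq0), vqmovn_s32(vq1));
        vst1_s8(out, vqmovn_s16(v16));
        in += 8;
        out += 8;
      }
      for (; e < elems_per_channel; ++e) {  // scalar tail
        float q = nearbyintf(*in++ * inv_scales[c]) + zero_points[c];
        *out++ = static_cast<int8_t>(std::min(127.0f, std::max(-128.0f, q)));
      }
    }
  }
}
```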

## Test Plan:

Build for ARM NEON:
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```

Build for ARM64:
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```

Run the test binary over adb shell with the commands below.
```
adb push build_android/bin/quantized_test /data/local/tmp
adb shell "/data/local/tmp/quantized_test"
```

Run the benchmark binary over adb shell with the commands below.
```
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```

Note that the Android CPU is not frequency locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU.
```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
```

Resulting benchmarks are located [here](https://gist.github.com/AJLiu/d1711bb6a5e93b3338eca2c14c8aec9f)
Google spreadsheet comparing results [here](https://docs.google.com/spreadsheets/d/1Ky-rEu2CqOqex2a84b67hB1VLAlfEDgAN2ZXe8IlGF8/edit?usp=sharing)

**Overall results:**
- aarch64
    - 2x slowdown on 2d tensor benchmarks
    - 4x speed up on 4d contiguous tensor benchmarks
    - 4x speed up on 4d channels last tensor benchmarks
- neon
    - 2x speed up on 2d tensor benchmarks
    - 20x speed up on 4d contiguous tensor benchmarks
    - 20x speed up on 4d channels last tensor benchmarks

I suspect that this is an overall performance boost; however, it adds a small amount of overhead that becomes very noticeable when quantizing small tensors.

Reviewers: kimishpatel

Subscribers:

Tasks: T76832258

Tags:

Differential Revision: [D24286528](https://our.internmc.facebook.com/intern/diff/D24286528)

ajliu added a commit that referenced this pull request Oct 8, 2020
@kimishpatel
Contributor

Can you post benchmarking results here?

@ajliu ajliu closed this Oct 9, 2020
@ajliu
Member Author

ajliu commented Oct 9, 2020

Closing because, after some testing, it appears that the compiler is performing these optimizations automatically.

Edit: I was wrong; evidently I was benchmarking the wrong function, because randint returns a float tensor unless otherwise specified.
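For reference, a sketch of the pitfall being described, using ATen's C++ API (the shapes here are arbitrary): unlike the Python wrapper, `at::randint` falls back to the global default dtype (float) unless a dtype is passed explicitly.

```cpp
// Sketch of the benchmarking pitfall: at::randint produces a float tensor
// by default, so a benchmark can silently exercise the float code path.
#include <ATen/ATen.h>

void randint_dtype_example() {
  auto f = at::randint(0, 100, {64, 64});            // kFloat by default
  auto c = at::randint(0, 100, {64, 64}, at::kChar); // explicitly int8
}
```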

ajliu added a commit to ajliu/pytorch that referenced this pull request Oct 13, 2020
ajliu added a commit that referenced this pull request Oct 13, 2020
@ajliu ajliu reopened this Oct 13, 2020
@codecov

codecov bot commented Oct 13, 2020

Codecov Report

Merging #46018 into gh/ajliu/2/base will increase coverage by 0.00%.
The diff coverage is n/a.

```
@@               Coverage Diff                @@
##           gh/ajliu/2/base   #46018   +/-   ##
================================================
  Coverage            60.81%   60.81%
================================================
  Files                 2748     2748
  Lines               254038   254038
================================================
+ Hits                154496   154500    +4
+ Misses               99542    99538    -4
```

ajliu added a commit that referenced this pull request Oct 14, 2020
@ajliu
Member Author

ajliu commented Oct 14, 2020

The latest commit reduces the overhead slightly. Now when compiling with NEON we see an overall improvement compared to before, but ARM64 still has a bit of a slowdown on small tensors.

New benchmark outputs: https://gist.github.com/AJLiu/814232ac153b8d8029a8d37dc074320c

Overall results:

- aarch64
    - 1.4x slowdown on 2d tensor benchmarks
    - 4x speed up on 4d contiguous tensor benchmarks
    - 4x speed up on 4d channels last tensor benchmarks
- neon
    - 3x speed up on 2d tensor benchmarks
    - 20x speed up on 4d contiguous tensor benchmarks
    - 20x speed up on 4d channels last tensor benchmarks

Contributor

@kimishpatel kimishpatel left a comment


Left a few comments. Also, can you add this test https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/test/quantized_test.cpp to the mobile tests and run it through to make sure it passes?

@dr-ci

dr-ci bot commented Oct 17, 2020

💊 CI failures summary and remediations

As of commit b8958d0 (more details on the Dr. CI page):

- 1/1 failures possibly* introduced in this PR
    - 1/1 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ajliu added a commit that referenced this pull request Oct 17, 2020
@ajliu
Member Author

ajliu commented Oct 17, 2020

Addressed the requested changes and enabled pytorch/aten/src/ATen/test/quantized_test.cpp for mobile.

However, it looks like none of the tests in that file actually test quantize_tensor_per_channel, so I added two new unit tests (to check the contiguous and channels_last formats). Both new tests pass on both mobile architectures.
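As a rough illustration of the kind of check such a test performs (the test name, shapes, scales, and tolerance below are invented for this sketch; the actual tests live in aten/src/ATen/test/quantized_test.cpp):

```cpp
// Hypothetical sketch: quantize a channels_last tensor per channel, then
// verify the dequantized round-trip stays within half the largest quantum.
#include <gtest/gtest.h>
#include <ATen/ATen.h>

TEST(QuantizedSketch, QuantizePerChannelChannelsLast) {
  auto x = at::rand({2, 3, 4, 4}).contiguous(at::MemoryFormat::ChannelsLast);
  auto scales = at::tensor({0.1, 0.2, 0.3}, at::kDouble);
  auto zero_points = at::tensor({0, 5, 10}, at::kLong);
  // Quantize along the channel axis, then round-trip through dequantize.
  auto qx = at::quantize_per_channel(x, scales, zero_points, /*axis=*/1, at::kQInt8);
  // Per-element error is bounded by half the largest scale (0.3 / 2).
  ASSERT_TRUE(at::allclose(qx.dequantize(), x, /*rtol=*/0.0, /*atol=*/0.151));
}
```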

@kimishpatel
Contributor

> Addressed the requested changes and enabled pytorch/aten/src/ATen/test/quantized_test.cpp for mobile.
>
> However, it looks like none of the tests in that file actually test quantize_tensor_per_channel, so I added two new unit tests (to check the contiguous and channels_last formats). Both new tests pass on both mobile architectures.

Thanks for these changes. Looks good overall. I am a little perplexed as to why 2D tensors are so slow. The only thing that seems different is the new allocations for the inverse scales and zero points, and writing those values out, which may be adding extra memory overhead.
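A sketch of the kind of per-channel precomputation being referred to (names invented; the PR's actual code differs): one divide per channel is hoisted out of the hot loop, at the cost of filling two channel-sized temporaries before any element is quantized, which is pure overhead for small tensors.

```cpp
// Invented sketch of the suspected overhead: precomputing inverse scales
// and zero points into temporary buffers ahead of the vectorized loop.
#include <cstdint>
#include <vector>

void precompute_channel_params(
    const double* scales, const int64_t* zero_points, int64_t channels,
    std::vector<float>& inv_scales, std::vector<int32_t>& zps) {
  inv_scales.resize(channels);
  zps.resize(channels);
  for (int64_t c = 0; c < channels; ++c) {
    inv_scales[c] = 1.0f / static_cast<float>(scales[c]);
    zps[c] = static_cast<int32_t>(zero_points[c]);
  }
}
```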

ajliu added a commit that referenced this pull request Nov 2, 2020
Contributor

@kimishpatel kimishpatel left a comment


Looks good. Make sure CI is clean. I see some lint errors.

@ajliu
Member Author

ajliu commented Nov 2, 2020

Reran the benchmarks and the overhead is gone.

Overall results:

- aarch64
    - 1.4x speed up on 2d tensor benchmarks
    - 4x speed up on 4d contiguous tensor benchmarks
    - 4x speed up on 4d channels last tensor benchmarks
- neon
    - 3x speed up on 2d tensor benchmarks
    - 20x speed up on 4d contiguous tensor benchmarks
    - 20x speed up on 4d channels last tensor benchmarks

@ajliu
Member Author

ajliu commented Nov 2, 2020

The lint errors are unrelated to this PR; it looks like b3eb0c8 fixes them.

@kimishpatel
Contributor

Sounds good.

@facebook-github-bot
Contributor

This pull request has been merged in b0e954f.
