Optimize 3-bit packing #1029
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1029
Note: Links to docs will display an error until the docs builds have been completed.
No failures as of commit af0ea95 with merge base dec0313. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D64010666
Summary:
Optimizes 3-bit packing as outlined here: T199311618

Before change:

Benchmark                                     Time       CPU   Iterations
-------------------------------------------------------------------------
benchmark_pack_uint_values<3>/128/8        47.0 ns   46.4 ns     15106555
benchmark_pack_uint_values<3>/128/64       6.94 ns   6.90 ns    101226284
benchmark_pack_uint_values<3>/128/128      3.27 ns   3.24 ns    215022716
benchmark_unpack_uint_values<3>/128/8      22.0 ns   21.9 ns     32585572
benchmark_unpack_uint_values<3>/128/64     6.02 ns   5.98 ns    116910230
benchmark_unpack_uint_values<3>/128/128    2.74 ns   2.73 ns    257088291

After change:

Benchmark                                     Time       CPU   Iterations
-------------------------------------------------------------------------
benchmark_pack_uint_values<3>/128/8        19.5 ns   19.5 ns     36050883
benchmark_pack_uint_values<3>/128/64       3.90 ns   3.87 ns    181151919
benchmark_pack_uint_values<3>/128/128      1.57 ns   1.57 ns    447247194
benchmark_unpack_uint_values<3>/128/8      20.5 ns   20.4 ns     34490914
benchmark_unpack_uint_values<3>/128/64     3.19 ns   3.11 ns    228019714
benchmark_unpack_uint_values<3>/128/128    1.71 ns   1.70 ns    408587338

Unpacking perf for 128 values is 1.60x faster (2.74/1.71).

Reviewed By: digantdesai
Differential Revision: D64010666
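For readers unfamiliar with what the benchmarks above measure, here is a minimal scalar sketch of 3-bit packing: eight values in [0, 7] fit into three bytes, and unpacking reverses that. This is not the kernel from this PR, which is not shown on this page. The function names `pack_8_uint3` and `unpack_8_uint3` and the bit layout (value i stored at bits 3*i to 3*i+2 of a 24-bit little-endian word) are assumptions chosen for illustration only.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: pack eight 3-bit values into three bytes.
// Assumed layout: value i occupies bits [3*i, 3*i + 3) of a 24-bit word,
// which is then stored least-significant byte first.
inline std::array<std::uint8_t, 3> pack_8_uint3(
    const std::array<std::uint8_t, 8>& unpacked) {
  std::uint32_t word = 0;
  for (int i = 0; i < 8; ++i) {
    // Mask to 3 bits so out-of-range inputs cannot corrupt neighbours.
    word |= static_cast<std::uint32_t>(unpacked[i] & 0x7u) << (3 * i);
  }
  return {static_cast<std::uint8_t>(word & 0xFFu),
          static_cast<std::uint8_t>((word >> 8) & 0xFFu),
          static_cast<std::uint8_t>((word >> 16) & 0xFFu)};
}

// Hypothetical helper: inverse of pack_8_uint3 under the same layout.
inline std::array<std::uint8_t, 8> unpack_8_uint3(
    const std::array<std::uint8_t, 3>& packed) {
  const std::uint32_t word = static_cast<std::uint32_t>(packed[0]) |
                             (static_cast<std::uint32_t>(packed[1]) << 8) |
                             (static_cast<std::uint32_t>(packed[2]) << 16);
  std::array<std::uint8_t, 8> unpacked{};
  for (int i = 0; i < 8; ++i) {
    unpacked[i] = static_cast<std::uint8_t>((word >> (3 * i)) & 0x7u);
  }
  return unpacked;
}
```

Under this assumed layout, the values 0 through 7 pack to the bytes 0x88, 0xC6, 0xFA. An optimized kernel would typically process many values per call with SIMD instructions and may use a different bit layout; the sketch only fixes the idea of 8 values per 3 bytes.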
Force-pushed from 5387317 to af0ea95.
Differential Revision: D64010666 Pull Request resolved: #1029